This topic describes how to submit a Spark SQL job in the AnalyticDB for MySQL or Data Management (DMS) console.

Background information

Spark is a fast and general-purpose computing engine designed for large-scale data processing. AnalyticDB for MySQL can be equipped with a built-in Apache Spark 3.0.1 engine. This way, you can use Spark SQL to directly read, write, and analyze data in the same AnalyticDB for MySQL cluster, and perform offline computing and processing of large-scale data on the same data storage. You can submit a Spark SQL job in the AnalyticDB for MySQL console or the DMS console for development and debugging.

Submit a Spark SQL job in the AnalyticDB for MySQL console

For information about how to submit a Spark SQL job in the AnalyticDB for MySQL console, see Test jobs.

After you submit a Spark SQL job, you can also view the details of the job on the Yarn Cluster UI or Detailed Monitoring UI page in the AnalyticDB for MySQL console. For more information, see Manage clusters.

On the Yarn Cluster UI and Detailed Monitoring UI pages, you can view the following information:
  • Yarn Cluster UI
    You can view information such as the ID and the status of a Spark SQL job.
  • Detailed Monitoring UI
    You can view information such as the ID, start time, and end time of a completed Spark SQL job.
    You can also click the app ID of a Spark SQL job in the APP ID column to view the detailed execution information of the job. For example, you can click stdout in the Logs column on the Executors tab to view the detailed execution result of the job.

Submit a Spark SQL job in the DMS console

Note
  • By default, AnalyticDB for MySQL clusters use the Xihe engine to execute SQL statements in the DMS console. To execute SQL statements by using the Spark engine, you must add the /*+query_type=spark*/ hint to the SQL statements or set the query_type parameter to spark.
  • When you use the Spark engine, DMS supports only INSERT statements. Standalone SELECT statements are not supported. For example, you can use the Spark engine to execute the INSERT INTO target_table SELECT col1, col2, col3 FROM source_table; statement in the DMS console. However, you cannot execute the SELECT * FROM target_table; statement.
  • In this example, a Spark SQL job is executed to export data from the source_table table to the target_table table and populate an additional column (hi) whose values are generated by the concat function.
  1. Use DMS to connect to an AnalyticDB for MySQL cluster. For more information, see Use DMS to connect to an AnalyticDB for MySQL cluster.
  2. (Optional) If no database has been created, create a database on the SQLConsole tab in the DMS console. In this example, the database is named adb_demo. For more information, see Create a database.
  3. On the SQLConsole tab, execute the CREATE TABLE statement to create a source table and a destination table, and then execute the INSERT INTO statement to insert test data into the source table. Sample statements:
    CREATE TABLE source_table(k int, name varchar, PRIMARY KEY (k)) DISTRIBUTED BY HASH(k);
    CREATE TABLE target_table(k int, name varchar, hi varchar, PRIMARY KEY (k)) DISTRIBUTED BY HASH(k);
    INSERT INTO source_table VALUES(1, 'spark');
    INSERT INTO source_table VALUES(2, 'world');
    After data is inserted, you can separately execute the following statements to view the current data in the source_table and target_table tables:
    SELECT * FROM source_table;
    SELECT * FROM target_table;
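    Based on the test data that you inserted, the first query is expected to return results similar to the following. The second query is expected to return an empty result set, because no data has been written to the target_table table yet:
    +---+-------+
    | k | name  |
    +---+-------+
    | 1 | spark |
    | 2 | world |
    +---+-------+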
  4. Use a hint or SET statements to specify that the Spark engine is used to execute SQL statements. Sample statements:
    • Hint
      /*+query_type=spark*/INSERT INTO target_table SELECT k, name, concat('Hi ', name) FROM source_table;
    • SET
      Note On the SQLConsole tab, you must select the SET statements together with the INSERT INTO statement and execute them at the same time.
      SET query_type=spark;
      SET spark.driver.resourceSpec=small;
      SET spark.executor.resourceSpec=small;
      SET spark.executor.instances=1;
      INSERT INTO target_table SELECT k, name, concat('Hi ', name) FROM source_table;
      Table 1. Parameters
      Parameter | Required | Description
      query_type | Yes | The type of the engine that is used to execute SQL statements. Set this parameter to spark to use the Spark engine.
      spark.driver.resourceSpec | No | The resource specifications of the Spark driver. Default value: small. Valid values: small (1 vCPU and 4 GB memory), medium (2 vCPUs and 8 GB memory), large (4 vCPUs and 16 GB memory), and xlarge (8 vCPUs and 32 GB memory).
      spark.executor.resourceSpec | No | The resource specifications of each Spark executor. Default value: small. The valid values are the same as those of spark.driver.resourceSpec.
      spark.executor.instances | No | The number of Spark executors. The value must be a positive integer. The maximum value is limited by the overall resources of the AnalyticDB for MySQL cluster.
      Note The system dynamically allocates executor resources based on actual computing requirements. In most cases, you do not need to specify the preceding resource parameters, unless your SQL jobs have specific resource requirements.
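      For reference, the following sketch shows how you might allocate more resources to the same job. The values shown are illustrative, not recommendations; they must be among the valid values listed in Table 1 and within the overall resources of your cluster:
      SET query_type=spark;
      SET spark.driver.resourceSpec=medium;
      SET spark.executor.resourceSpec=medium;
      SET spark.executor.instances=2;
      INSERT INTO target_table SELECT k, name, concat('Hi ', name) FROM source_table;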
  5. After data is written to the target_table table, you can execute the following statement to view and verify the data in this table:
    SELECT * FROM target_table;
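    Based on the sample data, the query is expected to return results similar to the following. The hi column is populated with the values that are generated by the concat('Hi ', name) expression:
    +---+-------+----------+
    | k | name  | hi       |
    +---+-------+----------+
    | 1 | spark | Hi spark |
    | 2 | world | Hi world |
    +---+-------+----------+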

References

Use UDFs in Spark SQL