Spark is a computing engine designed to process large amounts of data. The cloud-native data warehouse AnalyticDB for MySQL Edition can work with the Spark engine to analyze data in AnalyticDB for MySQL Edition databases. This topic describes how to create and manage Spark clusters.

Precautions

  • The Spark feature is in canary release. To use this feature, submit a ticket.
  • You can create Spark clusters only after you create an AnalyticDB for MySQL Edition cluster. For more information about how to create AnalyticDB for MySQL Edition clusters, see Create a cluster.

Create a Spark cluster

  1. Log on to the AnalyticDB for MySQL console with your Alibaba Cloud account.
  2. In the upper-left corner of the page, select the region where clusters reside.
  3. In the left-side navigation pane, click Clusters.
  4. On the V3.0 Clusters tab, click the target Cluster ID.
  5. In the left-side navigation pane, choose Spark > Clusters.
  6. In the Note message, click Create to go to the buy page.
    Note
    • Up to 200 compute nodes can be selected for a Spark cluster. You are charged CNY 5.6 per hour for each compute node.
    • It takes about 10 minutes to create a Spark cluster. While a Spark cluster is being created, all features related to Spark are disabled.

Manage clusters

  1. Log on to the AnalyticDB for MySQL console with your Alibaba Cloud account.
  2. In the upper-left corner of the page, select the region where clusters reside.
  3. In the left-side navigation pane, click Clusters.
  4. On the V3.0 Clusters tab, click the target Cluster ID.
  5. In the left-side navigation pane, choose Spark > Clusters.
  6. On the Clusters page, perform the following operations:
    • Release Cluster
      1. In the Node Information section, click Release Cluster in the upper-right corner.
      2. In the message that appears, click OK to release the Spark cluster.
    • Set UI Access Password
      1. In the UI Access section, click Set UI Access Password in the upper-right corner.
      2. In the Set UI Access Password panel, set the following parameters.
        • Account Name: The name of the account that is used for UI access. The name must meet the following requirements:
          • The name must start with a lowercase letter and end with a lowercase letter or a digit.
          • The name can contain lowercase letters, digits, and underscores (_).
          • The name must be 2 to 16 characters in length.
        • Password: The password for the account.
        • Confirm Password: Enter the password again to confirm it.
      Note You can also click Yarn in the UI Access section to view the Yarn Cluster UI or click History Server to view the Detailed Monitoring UI.

Manage resources

  1. Log on to the AnalyticDB for MySQL console with your Alibaba Cloud account.
  2. In the upper-left corner of the page, select the region where clusters reside.
  3. In the left-side navigation pane, click Clusters.
  4. On the V3.0 Clusters tab, click the target Cluster ID.
  5. In the left-side navigation pane, choose Spark > Resources.
  6. On the Resources page, perform the following operations:
    • Click Upload File to upload JAR files that contain Spark jobs to an Object Storage Service (OSS) directory.
    • Click Upload Directory to batch upload JAR files from a directory.
    • Click Create Folder to create a folder to manage JAR files that you upload.
    • Click Delete Resource to delete JAR files that are no longer needed.
    Note Uploaded files can be previewed, deleted, or downloaded. You can click Copy Path in the Actions column corresponding to a file to copy its OSS directory. The directory can be used to set parameters in Airflow scripts. For more information, see Airflow clusters.
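    For example, after you click Copy Path for an uploaded JAR file, you can paste the copied OSS path into the file parameter of a DataFrame API-based job on the Job Test page. The following snippet is a minimal sketch, and oss://testBucketName/jars/spark-examples.jar is a placeholder for the path that you copy:
    {
        "className" : "org.apache.spark.examples.SparkPi",
        "file" : "oss://testBucketName/jars/spark-examples.jar"
    }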

Test jobs

  1. Log on to the AnalyticDB for MySQL console with your Alibaba Cloud account.
  2. In the upper-left corner of the page, select the region where clusters reside.
  3. In the left-side navigation pane, click Clusters.
  4. On the V3.0 Clusters tab, click the target Cluster ID.
  5. In the left-side navigation pane, choose Spark > Job Test.
  6. On the Job Test page, submit a Spark SQL job or a DataFrame API-based Spark job (packaged as a JAR file) in the command-line interface (CLI). The following sections describe the syntax and parameters:
    • Submit a Spark SQL job
      Syntax:
      {
          "sql" : "insert into testspark.target_table select * from testspark.source_table",
          "conf" : {
               "spark.adb.userName" : "userName",
               "spark.adb.password" : "password"
          }
      }
      Table 1. Parameter description
      • sql (String, required): The SQL statement that is used to submit the Spark job.
        Note You must prefix each table name in the SQL statement with the name of its database. For example, in the SQL statement INSERT INTO testspark.target_table SELECT * FROM testspark.source_table, testspark is the name of the database to which target_table and source_table belong.
      • conf (Map, required): The parameters that are used to configure the Spark job. The parameters are the same as those configured in Apache Spark and are specified as key-value pairs.
        Note If the SQL statement accesses a table or data in AnalyticDB for MySQL Edition, the spark.adb.userName and spark.adb.password parameters are required.
      • name (String, optional): The name of the Spark job. We recommend that you specify a name that is easy to identify to facilitate subsequent job management.
      • driverMemory (String, optional): The memory size of each driver process. Default value: 4. Unit: GB.
      • driverCores (Integer, optional): The number of CPU cores of each driver process. Default value: 1. Unit: vcore.
      • executorMemory (String, optional): The memory size of each executor process. Default value: 4. Unit: GB.
      • executorCores (Integer, optional): The number of CPU cores of each executor process. Default value: 1. Unit: vcore.
      • numExecutors (Integer, optional): The number of executors that are used by the Spark job. By default, this value is dynamically assigned based on the size of the job.
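      The following configuration is a minimal sketch that adds some of the optional parameters from the preceding table to the required sql and conf parameters. The job name spark_sql_demo and the core and executor counts are illustrative values rather than defaults:
      {
          "sql" : "insert into testspark.target_table select * from testspark.source_table",
          "name" : "spark_sql_demo",
          "driverCores" : 1,
          "executorCores" : 2,
          "numExecutors" : 4,
          "conf" : {
               "spark.adb.userName" : "userName",
               "spark.adb.password" : "password"
          }
      }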
    • Submit a DataFrame API-based Spark job
      Syntax:
      {
          "className":"org.apache.spark.examples.SparkPi",
          "args" : ["10"],
          "name":"JavaSparkPi",
          "file":"oss://xxx/jars/xxx.jar",
          "conf": {
               "spark.adb.userName" : "username",
               "spark.adb.password" : "password"
          }
      }
      Table 2. Parameter description
      • className (String, required): The Java or Scala entry class of the DataFrame API-based Spark job. Example: org.apache.spark.examples.SparkPi.
      • args (List<String>, optional): The arguments that are passed to the Spark job.
      • name (String, optional): The name of the Spark job. We recommend that you specify a name that is easy to identify to facilitate subsequent job management.
      • file (String, required): The directory in which the JAR files of the Spark job are stored. The JAR files contain the entry class. For more information about how to obtain the directory of a JAR file, see Manage resources.
      • conf (Map, optional): The parameters that are used to configure the Spark job. The parameters are the same as those configured in Apache Spark and are specified as key-value pairs.
        Note If the Spark job accesses a table or data in AnalyticDB for MySQL Edition, the spark.adb.userName and spark.adb.password parameters are required.
      • driverMemory (String, optional): The memory size of each driver process. Default value: 4. Unit: GB.
      • driverCores (Integer, optional): The number of CPU cores of each driver process. Default value: 1. Unit: vcore.
      • executorMemory (String, optional): The memory size of each executor process. Default value: 4. Unit: GB.
      • executorCores (Integer, optional): The number of CPU cores of each executor process. Default value: 1. Unit: vcore.
      • numExecutors (Integer, optional): The number of executors that are used by the Spark job. By default, this value is dynamically assigned based on the size of the job.
  7. Click Test Running to check whether the uploaded JAR files run as expected.
    Note After the test is complete, you can also view the test result of the JAR files from the Yarn Cluster UI. For more information about how to view the Yarn Cluster UI, see Manage clusters.