
AnalyticDB: Use Notebook to develop Spark SQL jobs

Last Updated: May 07, 2025

Notebook is an interactive data analysis and development platform that provides features such as job editing, data analysis, and data visualization. This topic describes how to use the Notebook feature of Data Management (DMS) to develop Spark SQL jobs.

Prerequisites

Limits

The Notebook feature is supported only in the China (Hangzhou) region.

Procedure

  1. Create and go to a workspace.

    1. Log on to the DMS console V5.0.
    2. Move the pointer over the icon in the upper-left corner and choose All Features > Data+AI > Notebook.

      Note

      If you use the DMS console in normal mode, choose Data+AI > Notebook in the top navigation bar.

    3. Click Create Workspace. In the Create Workspace dialog box, configure the Workspace Name and Region parameters and click OK.

    4. Click Go to Workspace in the Actions column of the workspace.

  2. (Optional) Add workspace members. If a workspace has multiple users, you must perform this step to assign different roles to each user.

  3. Configure the code storage.

    1. In the workspace, click Storage Management.

    2. Configure the OSS path in the Code Storage section.
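
      The code storage path is an OSS directory. The following sample value is only an illustration: the bucket name adb-spark-notebook and the directory notebook-code/ are placeholders that you replace with your own bucket and directory.

      oss://adb-spark-notebook/notebook-code/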

  4. Add a resource.

    1. In the workspace, click Resource Configuration.

    2. Click Add Resource. In the Add Resource panel, configure the parameters described below.

      Resource Name (required): The name of the resource. You can enter a custom name.

      Resource Introduction (required): The description of the resource. You can enter the purpose of the resource.

      Image (required): Only Spark3.5+Python3.9 is supported.

      AnalyticDB Instance (required): The ID of the AnalyticDB for MySQL cluster.

        Note: If you cannot find the desired cluster, check whether the cluster has been registered with DMS.

      AnalyticDB Resource Group (required): The name of the job resource group.

      Executor Spec (required): The resource specifications of Spark executors. In this example, the default medium specification is used. Different resource types correspond to different specifications. For more information, see the Type column in the Spark application configuration parameters topic.

      Max Executors and Min Executors (required): The number of Spark executors. After you select the Spark3.5+Python3.9 image, the Min Executors parameter is automatically set to 2 and the Max Executors parameter is automatically set to 8.

      Notebook Spec (required): The Notebook specifications. In this example, General_Tiny_v1 (1 core, 4 GB) is used.

      VPC ID (required): The virtual private cloud (VPC) in which the AnalyticDB for MySQL cluster resides, which ensures that Notebook can communicate with the AnalyticDB for MySQL cluster.

        Important: If you change the VPC and vSwitch of the AnalyticDB for MySQL cluster, you must change the VPC ID and VSwitch ID parameters of the resource to the new VPC and vSwitch. Otherwise, job submissions may fail.

      Zone ID (required): The zone in which the AnalyticDB for MySQL cluster resides.

      VSwitch ID (required): The vSwitch to which the AnalyticDB for MySQL cluster is connected.

      Security Group ID (required): An available security group, which ensures that Notebook can communicate with the AnalyticDB for MySQL cluster.

      Release Resource (required): The period of time for which the resource remains idle before it is automatically released.

      Dependent Jars (optional): The OSS storage path of the JAR package. Specify this parameter only if you submit a job in Python and use a JAR package.

      SparkConf (optional): The configuration parameters that are similar to those of Apache Spark. The parameters must be in the key: value format. For information about the configuration parameters that are different from those of Apache Spark or the configuration parameters that are specific to AnalyticDB for MySQL, see Spark application configuration parameters. For a sample configuration, see the snippet after this step.

    3. Click Save.

    4. Click Start in the Actions column of the resource to start the resource.
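
    The SparkConf parameter accepts one entry per line in the key: value format. The following sample uses standard open source Apache Spark settings purely as an illustration; the keys and values are not required by this procedure, so replace them with the settings that your job actually needs.

      spark.sql.shuffle.partitions: 200
      spark.sql.adaptive.enabled: true
      spark.executor.memoryOverhead: 1g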

  5. Initialize data.

    1. Move the pointer over the icon in the upper-left corner and choose All Features > Data Assets > Instances.

    2. Click +New. In the Add Instance dialog box, configure the parameters described below.

      Data Source: On the Alibaba Cloud tab, select OSS.

      Basic Information:

        File and Log Storage: This parameter is automatically set to OSS.

        Instance Region: The region in which the AnalyticDB for MySQL cluster resides.

        Connection Method: This parameter is automatically set to Connection String Address.

        Connection String Address: Set the value to oss-cn-hangzhou.aliyuncs.com.

        Bucket: The name of the bucket.

        Access mode: The access method. In this example, Security Hosting - Manual is used.

        AccessKey ID: The AccessKey ID of an Alibaba Cloud account or a RAM user that has permissions to access OSS. For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.

        AccessKey Secret: The AccessKey secret of an Alibaba Cloud account or a RAM user that has permissions to access OSS. For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.

      Advanced Information: The optional parameters. For more information, see the "Advanced Information" section of the Register an Alibaba Cloud database instance topic.

    3. After you configure the preceding parameters, click Test Connection in the lower-left corner.

      Note

      If the connection test fails, check the instance information based on the error message.

    4. After the Successful connection message is displayed, click Submit.

    5. Go back to the workspace.

    6. On the Data Lake Data tab, click Add OSS and select the bucket that you specified in Step b.

  6. Create a notebook.

    In the workspace, click the add icon and select Notebook.


  7. Develop a Spark SQL job on the Notebook page.

    Note

    For information about the buttons on the Notebook page, see the "Notebook UI" section of the Use Notebook to query and analyze data topic.

    1. Run the following command to install the Python dependencies:

      pip install delta-spark
    2. Switch the cell type to SQL and execute the following statement to create a database.

      Note

      The db_delta database created in Step b and the sample_data external table created in Step c are automatically displayed in the AnalyticDB for MySQL cluster. You can analyze the sample_data table in the AnalyticDB for MySQL console.


      CREATE DATABASE db_delta 
      LOCATION 'oss://testBucketName/db_delta/';    -- Specify the storage path for data in the db_delta database.
    3. Switch the cell type to Code and run the following code to create an external table named sample_data and insert data into the table. The data of the sample_data external table is stored in the OSS path specified in Step b.

      # -*- coding: utf-8 -*-
      
      import pyspark
      from delta import *
      from pyspark.sql.types import *
      from pyspark.sql.functions import *
      
      
      print("Starting Delta table creation")
      
      data = [
          ("Robert", "Baratheon", "Baratheon", "Storms End", 48),
          ("Eddard", "Stark", "Stark", "Winterfell", 46),
          ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29),
          ("Robert", "Baratheon", "Baratheon", "Storms End", 48),
          ("Eddard", "Stark", "Stark", "Winterfell", 46),
          ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29),
          ("Robert", "Baratheon", "Baratheon", "Storms End", 48),
          ("Eddard", "Stark", "Stark", "Winterfell", 46),
          ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29)
      ]
      
      schema = StructType([
          StructField("firstname", StringType(), True),
          StructField("lastname", StringType(), True),
          StructField("house", StringType(), True),
          StructField("location", StringType(), True),
          StructField("age", IntegerType(), True)
      ])
      
      sample_dataframe = spark.createDataFrame(data=data, schema=schema)
      
      # Write the DataFrame to OSS as a Delta table that is registered in the db_delta database.
      sample_dataframe.write.format('delta').mode("overwrite").option('mergeSchema', 'true').saveAsTable("db_delta.sample_data")
    4. Switch the cell type to SQL and execute the following statement to query data from the sample_data table:

      SELECT * FROM db_delta.sample_data;
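
      To check the result from a Code cell instead of a SQL cell, you can run a minimal PySpark snippet such as the following. It assumes only the spark session that is already available in the notebook.

      # Read back the Delta table created in the previous step.
      df = spark.table("db_delta.sample_data")
      # Display a few rows and the total row count.
      df.show(10)
      print("Row count:", df.count())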
  8. To use Spark SQL to analyze the sample_data table in the AnalyticDB for MySQL console, perform the following steps:

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Clusters page, click an edition tab. Find the cluster that you want to manage and click the cluster ID.

    2. In the left-side navigation pane, choose Job Development > SQL Development. On the page that appears, select the Spark engine and an interactive resource group.

    3. Query data from the sample_data table.

      SELECT * FROM db_delta.sample_data LIMIT 1000;

References

Notebook: provides an overview of the Notebook feature.