Data Science Workshop (DSW) is a cloud-based Integrated Development Environment (IDE) for machine learning provided by PAI. It supports multiple languages and development environments. You can connect to an AnalyticDB for MySQL cluster from a DSW instance and use IDEs, such as Notebook and Terminal, to write PySpark scripts and submit Spark jobs. This topic describes how to submit a Spark job from a DSW instance.
Prerequisites
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
A Job resource group is created in the AnalyticDB for MySQL cluster, and the Spark parameter spark.adb.version=3.2 is configured for the resource group.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you need to only create a privileged account.
If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
AnalyticDB for MySQL is authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources.
The log storage path of Spark applications is configured for the AnalyticDB for MySQL cluster.
Note: Log on to the AnalyticDB for MySQL console. Find the cluster that you want to manage and click the cluster ID. In the left-side navigation pane, choose . Click Log Settings. In the dialog box that appears, select the default path or specify a custom storage path. You cannot set the custom storage path to the root directory of OSS. Make sure that the custom storage path contains at least one layer of folders.
Step 1: Create and configure a PAI DSW instance
Activate PAI and create a workspace. For more information, see Activate PAI and Create and manage workspaces.
The PAI workspace must be in the same region as the AnalyticDB for MySQL cluster.
Create a DSW instance.
You can use one of the following methods to create a DSW instance:
You can create a DSW instance in the console. For more information, see Create a DSW instance.
You must set Image config to Image Address and enter the Livy image URL for AnalyticDB for MySQL Spark: registry.cn-hangzhou.aliyuncs.com/adb-public-image/adb-spark-public-image:livy.0.5.pre. You can configure other parameters as needed.
Alternatively, in Tutorials, click Open In DSW, and then select a DSW instance that meets the requirements or create a new one. For more information, see Create a DSW instance.
On the DSW instance creation page, the image URL and DSW instance type are pre-filled. You only need to enter an Instance Name and click Yes to create the DSW instance.
Access the DSW instance. For more information, see Access from the console.
In the top menu bar, click Terminal and run the following commands to start the Apache Livy proxy.
cd /root/proxy
python app.py --db <ClusterID> --rg <Resource Group Name> --e <URL> -i <AK> -k <SK> -t <STS> &
Parameters:
Parameter
Required
Description
ClusterID
Yes
The ID of the AnalyticDB for MySQL cluster.
Resource Group Name
Yes
The name of the Job resource group in the AnalyticDB for MySQL cluster.
URL
Yes
The service endpoint of the AnalyticDB for MySQL cluster.
For information about how to view the service endpoint of an AnalyticDB for MySQL cluster, see Service endpoints.
AK, SK
Conditionally required
The AccessKey ID and AccessKey secret of an Alibaba Cloud account or a RAM user that has the permissions to access AnalyticDB for MySQL.
For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.
Note: You need to specify AK and SK only when you use an Alibaba Cloud account or a RAM user.
STS
Conditionally required
The temporary identity credential of a RAM role, which is the Security Token Service (STS) token.
An authorized RAM user can use an AccessKey pair to call the AssumeRole operation. This way, the RAM user obtains an STS token of a RAM role and can use the STS token to access Alibaba Cloud resources.
Note: You need to specify STS only when you use a RAM role.
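For example, with a cluster ID of amv-example, a Job resource group named spark_rg, and the China (Hangzhou) service endpoint, the startup command might look like the following sketch. All values here are placeholders chosen for illustration only; replace them with your own cluster ID, resource group name, endpoint, and AccessKey pair, and append -t <STS> only if you use a RAM role.
cd /root/proxy
python app.py --db amv-example --rg spark_rg --e adb.cn-hangzhou.aliyuncs.com -i <yourAccessKeyId> -k <yourAccessKeySecret> &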
If the following information is returned, the proxy has started successfully:
2024-11-15 11:04:52,125-ADB-INFO: ADB Client Init
2024-11-15 11:04:52,125-ADB-INFO: Aliyun ADB Proxy is ready
Check whether a process is listening on port 5000.
After the proxy is started, you can run the netstat -anlp | grep 5000 command to check whether a process is listening on port 5000.
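Optionally, because sparkmagic communicates with the proxy over the Livy REST protocol on this port, you can also send a simple HTTP request to the sessions endpoint from the same terminal to confirm that the proxy responds. This is only a supplementary check that assumes the proxy exposes the standard Livy /sessions API; the exact response body depends on the proxy implementation, and the returned session list should be empty because no session exists yet.
curl http://localhost:5000/sessions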
Step 2: Develop a PySpark application
Access the DSW instance. For more information, see Access from the console.
In the top navigation bar, click Notebook to open the Notebook page.
In the top menu bar, choose . In the Select Kernel dialog box, select Python 3 (ipykernel) and click Select.
Run the following statements in sequence to install and load sparkmagic.
!pip install sparkmagic
%load_ext sparkmagic.magics
Run the %manage_spark statement. After you run the statement, the Create Session tab appears.
On the Create Session tab, set Language to python and then click Create Session.
Important: Click Create Session only once. Do not click it repeatedly.
After you click Create Session, the status at the bottom of the Notebook page changes to Busy. The session is created when the status changes to Idle and the session ID appears on the Manage Session tab.
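Before you run your own code, you can verify the session with a minimal test cell. The following sketch assumes only what the preceding steps establish: code in a %%spark cell is executed by the remote AnalyticDB for MySQL Spark engine, where the SparkSession is available as spark.
%%spark
# Print the version of the remote Spark engine and run a trivial job on the cluster
print(spark.version)
print(spark.range(10).count())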

Run the PySpark script.
When you run the PySpark script, you must add the %%spark command before the service code to specify that the remote Spark instance is used.
%%spark
db_sql = """
CREATE DATABASE IF NOT EXISTS test_db comment 'demo db'
location 'oss://testBucketName/test'
WITH dbproperties(k1='v1', k2='v2')
"""
tb_sql = """
CREATE TABLE IF NOT EXISTS test_db.test_tbl(id int, name string, age int)
using parquet
location 'oss://testBucketName/test/test_tbl/'
tblproperties ('parquet.compress'='SNAPPY');
"""
insert_sql = """
INSERT INTO test_db.test_tbl VALUES(1, 'adb', 10);
"""
select_sql = """
SELECT * FROM test_db.test_tbl;
"""
spark.sql(db_sql).show()
spark.sql(tb_sql).show()
spark.sql(insert_sql).show()
spark.sql(select_sql).show()
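If you want to continue working with the query result in the local notebook kernel, sparkmagic can copy a remote DataFrame back to the local Python process as a pandas DataFrame. The following two cells are a sketch based on the standard sparkmagic -o option and %%local magic: result is an arbitrary variable name chosen for this example, and the transfer assumes that the Livy proxy supports the output-capture statements that sparkmagic issues. Run them as separate cells in sequence.
%%spark -o result
# Build the result on the remote Spark engine; -o copies it to the local kernel as a pandas DataFrame named result
result = spark.sql("SELECT * FROM test_db.test_tbl")
%%local
# The copied pandas DataFrame can now be used with local Python libraries
print(result.head())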