Data Science Workshop (DSW) is a cloud-based Integrated Development Environment (IDE) for machine learning provided by PAI. It supports multiple languages and development environments. You can connect to an AnalyticDB for MySQL cluster from a DSW instance and use IDEs, such as Notebook and Terminal, to write PySpark scripts and submit Spark jobs. This topic describes how to submit a Spark job from a DSW instance.
Prerequisites
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
A Job resource group is created in the AnalyticDB for MySQL cluster, and the Spark parameter spark.adb.version=3.2 is configured for the resource group.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you need to only create a privileged account.
If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
AnalyticDB for MySQL is authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources.
The log storage path of Spark applications is configured for the AnalyticDB for MySQL cluster.
Note: Log on to the AnalyticDB for MySQL console. Find the cluster that you want to manage and click the cluster ID. In the left-side navigation pane, choose . Click Log Settings. In the dialog box that appears, select the default path or specify a custom storage path. You cannot set the custom storage path to the root directory of OSS. Make sure that the custom storage path contains at least one layer of folders.
Step 1: Create and configure a PAI DSW instance
Activate PAI and create a workspace. For more information, see Activate PAI and Create and manage workspaces.
The PAI workspace must be in the same region as the AnalyticDB for MySQL cluster.
Create a DSW instance.
You can use one of the following methods to create a DSW instance:
You can create a DSW instance in the console. For more information, see Create a DSW instance.
You must set Image config to Image Address and enter the Livy image URL for AnalyticDB for MySQL Spark: registry.cn-hangzhou.aliyuncs.com/adb-public-image/adb-spark-public-image:livy.0.5.pre. You can configure other parameters as needed.
Alternatively, in Tutorials, click Open In DSW, and then select a DSW instance that meets the requirements or create a new one. For more information, see Create a DSW instance.
On the DSW instance creation page, the image URL and DSW instance type are pre-filled. You only need to enter an Instance Name and click Yes to create the DSW instance.
Access the DSW instance. For more information, see Access from the console.
In the top menu bar, click Terminal and run the following commands to start the Apache Livy proxy.
cd /root/proxy
python app.py --db <ClusterID> --rg <Resource Group Name> --e <URL> -i <AK> -k <SK> -t <STS> &
Parameters:
Parameter
Required
Description
ClusterID
Yes
The ID of the AnalyticDB for MySQL cluster.
Resource Group Name
Yes
The name of the Job resource group in the AnalyticDB for MySQL cluster.
URL
Yes
The service endpoint of the AnalyticDB for MySQL cluster.
For information about how to view the service endpoint of an AnalyticDB for MySQL cluster, see Service endpoints.
AK, SK
Conditionally required
The AccessKey ID and AccessKey secret of an Alibaba Cloud account or a RAM user that has the permissions to access AnalyticDB for MySQL.
For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.
Note: You need to specify AK and SK only when you use an Alibaba Cloud account or a RAM user.
STS
Conditionally required
The temporary identity credential of a RAM role, which is the Security Token Service (STS) token.
An authorized RAM user can use an AccessKey pair to call the AssumeRole operation. This way, the RAM user obtains an STS token of a RAM role and can use the STS token to access Alibaba Cloud resources.
Note: You need to specify STS only when you use a RAM role.
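For example, with a cluster ID of amv-example, a Job resource group named spark_rg, and the China (Hangzhou) service endpoint, the startup command might look like the following sketch. All values here are placeholders chosen for illustration only; replace them with your own cluster ID, resource group name, endpoint, and AccessKey pair, and append -t <STS> only if you use a RAM role.
cd /root/proxy
python app.py --db amv-example --rg spark_rg --e adb.cn-hangzhou.aliyuncs.com -i <yourAccessKeyId> -k <yourAccessKeySecret> &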
If the following information is returned, the proxy has started successfully:
2024-11-15 11:04:52,125-ADB-INFO: ADB Client Init
2024-11-15 11:04:52,125-ADB-INFO: Aliyun ADB Proxy is ready
Check whether a process is listening on port 5000.
After the proxy is started, you can run the netstat -anlp | grep 5000 command to check whether a process is listening on port 5000.
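Optionally, because sparkmagic communicates with the proxy over the Livy REST protocol on this port, you can also send a simple HTTP request to the sessions endpoint from the same terminal to confirm that the proxy responds. This is only a supplementary check that assumes the proxy exposes the standard Livy /sessions API; the exact response body depends on the proxy implementation, and the returned session list should be empty because no session exists yet.
curl http://localhost:5000/sessions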
Step 2: Develop a PySpark application
Access the DSW instance. For more information, see Access from the console.
In the top navigation bar, click Notebook to open the Notebook page.
In the top menu bar, choose . In the Select Kernel dialog box, select Python 3 (ipykernel) and click Select.
Run the following statements in sequence to install and load sparkmagic.
!pip install sparkmagic
%load_ext sparkmagic.magics
Run the %manage_spark statement. After you run the statement, the Create Session tab appears.
On the Create Session tab, set Language to python and then click Create Session.
Important: Click Create Session only once. Do not click it repeatedly.
After you click Create Session, the status at the bottom of the Notebook page changes to Busy. The session is created when the status changes to Idle and the session ID appears on the Manage Session tab.
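Before you run your own code, you can verify the session with a minimal test cell. The following sketch assumes only what the preceding steps establish: code in a %%spark cell is executed by the remote AnalyticDB for MySQL Spark engine, where the SparkSession is available as spark.
%%spark
# Print the version of the remote Spark engine and run a trivial job on the cluster
print(spark.version)
print(spark.range(10).count())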

Run the PySpark script.
When you run the PySpark script, you must add the %%spark command before the service code to specify that the remote Spark instance is used.
%%spark
db_sql = """
CREATE DATABASE IF NOT EXISTS test_db comment 'demo db'
location 'oss://testBucketName/test'
WITH dbproperties(k1='v1', k2='v2')
"""
tb_sql = """
CREATE TABLE IF NOT EXISTS test_db.test_tbl(id int, name string, age int)
using parquet
location 'oss://testBucketName/test/test_tbl/'
tblproperties ('parquet.compress'='SNAPPY');
"""
insert_sql = """
INSERT INTO test_db.test_tbl VALUES(1, 'adb', 10);
"""
select_sql = """
SELECT * FROM test_db.test_tbl;
"""
spark.sql(db_sql).show()
spark.sql(tb_sql).show()
spark.sql(insert_sql).show()
spark.sql(select_sql).show()
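If you want to continue working with the query result in the local notebook kernel, sparkmagic can copy a remote DataFrame back to the local Python process as a pandas DataFrame. The following two cells are a sketch based on the standard sparkmagic -o option and %%local magic: result is an arbitrary variable name chosen for this example, and the transfer assumes that the Livy proxy supports the output-capture statements that sparkmagic issues. Run them as separate cells in sequence.
%%spark -o result
# Build the result on the remote Spark engine; -o copies it to the local kernel as a pandas DataFrame named result
result = spark.sql("SELECT * FROM test_db.test_tbl")
%%local
# The copied pandas DataFrame can now be used with local Python libraries
print(result.head())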