You can use VS Code, Lingma, and the Serverless Spark spark-submit tool to quickly generate and submit Spark jobs. This topic describes how to submit a Serverless Spark job using these tools.
Prerequisites
You have created a workspace. For more information, see Manage workspaces.
You have activated OSS and created a bucket. For more information, see Quick Start and Create buckets.
Step 1: Prepare the environment
Install VS Code and the Lingma extension
Download and install Visual Studio Code (version 1.68 or later is recommended).
Open VS Code. In the Extensions view, search for and install the following extensions. For detailed instructions, see Install Lingma in Visual Studio Code.
Python (from Microsoft) for syntax highlighting and debugging
Tongyi Lingma, the official AI programming extension from Alibaba Cloud
After the installation is complete, log on to Lingma with your Alibaba Cloud account.
In the status bar, click the Lingma icon, and then switch to Agent Mode in the dialog box that appears.
Configure a supported shell for VS Code.
Press Cmd + Shift + P (macOS) or Ctrl + Shift + P (Windows/Linux).
Enter Terminal: Select Default Profile and select it.
Select a supported shell:
Linux/macOS: bash, fish, pwsh, zsh
Windows: Git Bash, pwsh
Completely exit and then reopen VS Code.
Install ossutil and the spark-submit tool
Download and install ossutil. For detailed instructions, see Install ossutil.
Click emr-serverless-spark-tool-0.9.0-SNAPSHOT-bin.zip to download the spark-submit tool, and then unzip the file.
Go to the conf/ directory. Open the connection.properties file in VS Code and configure the following parameters:

```
accessKeyId=<ALIBABA_CLOUD_ACCESS_KEY_ID>
accessKeySecret=<ALIBABA_CLOUD_ACCESS_KEY_SECRET>
regionId=cn-hangzhou
endpoint=emr-serverless-spark.cn-hangzhou.aliyuncs.com
workspaceId=w-xxxxxxxxxxxx
```

Important: The Resource Access Management (RAM) user or RAM role that is associated with the AccessKey must be granted RAM authorization and added to the Serverless Spark workspace.
For RAM authorization, see Grant permissions to a RAM user.
To manage users and roles in a Serverless Spark workspace, see Manage users and roles.
The parameters are described in the following table.
| Parameter | Required | Description |
| --- | --- | --- |
| accessKeyId | Yes | The AccessKey ID of the Alibaba Cloud account or RAM user that runs the Spark job. |
| accessKeySecret | Yes | The AccessKey secret of the Alibaba Cloud account or RAM user that runs the Spark job. |
| regionId | Yes | The region ID. This topic uses the China (Hangzhou) region as an example. |
| endpoint | Yes | The endpoint of EMR Serverless Spark. For more information about endpoints, see Service endpoints. This topic uses the public endpoint in the China (Hangzhou) region as an example: emr-serverless-spark.cn-hangzhou.aliyuncs.com. Note: If the ECS instance does not have public network access, use the VPC endpoint. |
| workspaceId | Yes | The ID of the EMR Serverless Spark workspace. |

Important: When you configure the accessKeyId and accessKeySecret parameters, make sure that the user corresponding to the AccessKey has read and write permissions on the OSS bucket attached to the workspace. To view the attached OSS bucket, go to the Spark page and click Details in the Actions column of the workspace.
Step 2: Generate sample data and job code
Generate sample data
Create a new project folder on your local machine, such as spark-with-lingma, and open the folder in VS Code.

Open the Lingma input box and enter the following prompt:

```
Create a new file named employees.csv. Generate 20 rows of data in CSV format. The file should include columns for employee name, department name, and salary. Use common English names. The departments are Engineering, Marketing, Sales, HR, and Finance. The salary should be between 5000 and 30000.
```

Lingma automatically generates the file content. Click Accept to save it as employees.csv in the current project folder.
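For reference, the generated file might look like the following. These rows are purely illustrative: the column names and values are assumptions based on the prompt, and the data that Lingma produces will differ.

```
name,department,salary
Alice Johnson,Engineering,18500
Brian Smith,Marketing,9200
Carol Davis,Finance,21300
```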
In the Lingma input box, enter the following prompt:
```
Upload the employees.csv file to oss://spark-demo
```

Lingma uploads employees.csv to the oss://spark-demo path.
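Behind the scenes, Lingma typically invokes the ossutil tool that you installed earlier. For reference, a manual upload would look similar to the following sketch, assuming ossutil is already configured with your credentials and the spark-demo bucket exists:

```
ossutil cp employees.csv oss://spark-demo/employees.csv
```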
Generate the job code
Generate PySpark code to calculate the average salary for each department.
In the Lingma input box, enter the following natural language instruction:
```
Generate an avg_salary_by_dept.py file with the following content: 1. Use PySpark to read the CSV file from the OSS path oss://spark-demo/employees.csv. The file has a header and the data types should be inferred. Define a clear data structure. (The data content is the same as the employees.csv file you just generated.) 2. Show some sample data. 3. Calculate the average salary for each department. When calculating the average salary, exclude the header data (the column name is department). 4. Print the aggregated results. 5. Add the necessary import statements.
```

Lingma generates code similar to the following. Click Accept to save the code as avg_salary_by_dept.py in the current project folder.
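The following is a minimal sketch of what the generated script might look like. It is not the literal Lingma output: the column names department and salary are assumptions based on the CSV prompt, and your generated code may be structured differently.

```python
# avg_salary_by_dept.py -- a minimal sketch; the actual Lingma-generated code may differ.
# Assumes the CSV columns are name, department, and salary, per the earlier prompt.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create the SparkSession for the job.
spark = SparkSession.builder.appName("AvgSalaryByDept").getOrCreate()

# Read the CSV file from OSS; the file has a header row and column types are inferred.
df = spark.read.csv("oss://spark-demo/employees.csv", header=True, inferSchema=True)

# Show some sample data.
df.show(5)

# Filter out any stray header rows, then compute the average salary per department.
avg_salary = (
    df.filter(F.col("department") != "department")
      .groupBy("department")
      .agg(F.avg("salary").alias("avg_salary"))
)

# Print the aggregated results.
avg_salary.show()

spark.stop()
```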
Use Lingma to upload avg_salary_by_dept.py to the oss://spark-demo path.
Step 3: Submit the job using spark-submit
Use the spark-submit tool to submit the job to the Serverless Spark workspace.
Build the submission command
Use Lingma to build the spark-submit command.
Enter the following prompt in Lingma:
```
The following is a spark-submit example. Use it as a reference to give me a command to submit a job:
./bin/spark-submit --name SparkPi \
--queue dev_queue \
--num-executors 5 \
--driver-memory 1g \
--executor-cores 2 \
--executor-memory 2g \
--class org.apache.spark.examples.SparkPi \
oss://<yourBucket>/path/to/spark-examples_2.12-3.3.1.jar \
10000
My tool is in /yourPath. The job name is AvgSalaryJob. Use the root_queue queue. The job file path is oss://spark-demo/avg_salary_by_dept.py.
```

Lingma provides the following output:
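The generated output should be similar to the following command, which is the one you run in the next step:

```
/yourPath/emr-serverless-spark-tool-0.9.0-SNAPSHOT/bin/spark-submit --name AvgSalaryJob --queue root_queue --num-executors 5 --driver-memory 1g --executor-cores 2 --executor-memory 2g oss://spark-demo/avg_salary_by_dept.py
```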
Submit the job
Use Lingma to run the job submission command.
Enter the following prompt in Lingma:

```
Run the command /yourPath/emr-serverless-spark-tool-0.9.0-SNAPSHOT/bin/spark-submit --name AvgSalaryJob --queue root_queue --num-executors 5 --driver-memory 1g --executor-cores 2 --executor-memory 2g oss://spark-demo/avg_salary_by_dept.py
```

Lingma generates the corresponding terminal command. Click Run to execute it.
View the execution result.
Step 4: Monitor the job execution status
You can check the job status using the spark-submit tool and Lingma.
Check the job status:
```
/yourPath/emr-serverless-spark-tool-0.9.0-SNAPSHOT/bin/spark-submit --status <jr-xxxxxxxxxxxx>
```

View job details:

```
/yourPath/emr-serverless-spark-tool-0.9.0-SNAPSHOT/bin/spark-submit --detail <jr-xxxxxxxxxxxx>
```

(Optional) Stop the job:

```
/yourPath/emr-serverless-spark-tool-0.9.0-SNAPSHOT/bin/spark-submit --kill <jr-xxxxxxxxxxxx>
```

Replace <jr-xxxxxxxxxxxx> with the job run ID that is returned when you submit the job.
You can also log on to the Serverless Spark workspace to view detailed logs and the job status in the job history list.
References
For more information about how to use Lingma Agent Mode, see Agent.
For answers to frequently asked questions about using Lingma, see the FAQ.
For more information about how to use the spark-submit tool, see Submit a job using spark-submit.