AnalyticDB for MySQL Spark allows you to use a Docker image to start the interactive JupyterLab development environment. This environment helps you connect to AnalyticDB for MySQL Spark and perform interactive testing and computing based on elastic resources.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a Data Lakehouse Edition cluster.
A job resource group is created in the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. For more information, see Create a resource group.
A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.
If you use an Alibaba Cloud account, you must create a privileged account. For more information, see the "Create a privileged account" section of the Create a database account topic.
If you use a Resource Access Management (RAM) user, you must create both a privileged account and a standard account and associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
AnalyticDB for MySQL is authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources. For more information, see Perform authorization.
Usage notes
AnalyticDB for MySQL Spark supports interactive Jupyter jobs only in Python 3.7 or Scala 2.12.
If an interactive Jupyter job remains idle for a time-to-live (TTL) period of 1,200 seconds after the last code snippet is executed, the job is automatically released. You can use the spark.adb.sessionTTLSeconds parameter to specify the TTL period for interactive Jupyter jobs.
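For example, the following %%configure cell (the %%configure statement is described in the Modify Spark application configuration parameters section of this topic) is an illustrative sketch that shortens the TTL period to 600 seconds. The value is only an example, and spark.dynamicAllocation.enabled is set to false because this is required when you specify custom configuration parameters.
%%configure -f
{
    "spark.adb.sessionTTLSeconds": "600",
    "spark.dynamicAllocation.enabled": "false"
}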
Start the interactive JupyterLab development environment
Install and start Docker. For more information, see the Docker documentation.
Pull the Jupyter image of AnalyticDB for MySQL. Sample command:
docker pull registry.cn-hangzhou.aliyuncs.com/adb-public-image/adb-spark-public-image:livy.0.5.pre
Start the interactive JupyterLab development environment.
Command syntax:
docker run -it -p {Host port}:8888 -v {Host file path}:{Docker file path} registry.cn-hangzhou.aliyuncs.com/adb-public-image/adb-spark-public-image:livy.0.5.pre -d {AnalyticDB for MySQL instance ID} -r {Resource group name} -e {API endpoint} -i {AccessKey ID} -k {AccessKey secret}
The following list describes the parameters.
-p
Required: No
Maps a host port to a container port. Specify the parameter in the -p {Host port}:{Container port} format. Specify a custom value for the host port and set the container port to 8888. Example: -p 8888:8888.
-v
Required: No
Mounts a host file path to the Docker container. If you do not mount a host path, the files that you edit may be lost when the Docker container is stopped. When the Docker container is stopped, the container attempts to terminate all interactive Spark jobs that are running. You can use one of the following methods to prevent loss of the files that you edit:
Method 1: When you start the interactive JupyterLab development environment, mount a host file path to the Docker container and store the job files in the corresponding file path. Specify the parameter in the -v {Host file path}:{Docker file path} format. Specify a custom value for the file path of the Docker container. Recommended value: /root/jupyter.
Method 2: Before you stop the Docker container, make sure that all files are copied and stored elsewhere.
Example: -v /home/admin/notebook:/root/jupyter. In this example, the host files that are stored in the /home/admin/notebook path are mounted to the /root/jupyter path of the Docker container.
Note: Save the notebook files that you edit to the mounted file path of the Docker container (/root/jupyter in this example). After you stop the Docker container, you can view the corresponding files in the /home/admin/notebook path of the host. After you restart the Docker container, you can modify and execute the files. For more information, see Volumes.
-d
Required: Yes
The ID of the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.
You can log on to the AnalyticDB for MySQL console and go to the Clusters page to view cluster IDs.
-r
Required: Yes
The name of the resource group in the AnalyticDB for MySQL cluster.
You can log on to the AnalyticDB for MySQL console, choose Cluster Management > Resource Management in the left-side navigation pane, and then click the Resource Groups tab to view resource group names.
-e
Required: Yes
The endpoint of the AnalyticDB for MySQL cluster.
For more information, see Endpoints.
-i
Required: Yes
The AccessKey ID of your Alibaba Cloud account or RAM user.
For information about how to view the AccessKey ID, see Accounts and permissions.
-k
Required: Yes
The AccessKey secret of your Alibaba Cloud account or RAM user.
For information about how to view the AccessKey secret, see Accounts and permissions.
Example:
docker run -it -p 8888:8888 -v /home/admin/notebook:/root/jupyter registry.cn-hangzhou.aliyuncs.com/adb-public-image/adb-spark-public-image:livy.0.5.pre -d amv-bp164l3xt9y3**** -r test -e adb.aliyuncs.com -i LTAI55stlJn5GhpBDtN8**** -k DlClrgjoV5LmwBYBJHEZQOnRF7****
After you start the interactive JupyterLab development environment, the following information is returned. You can copy and paste the http://127.0.0.1:8888/lab?token=1e2caca216c1fd159da607c6360c82213b643605f11ef291 URL to your browser and use JupyterLab to connect to AnalyticDB for MySQL Spark.
[I 2023-11-24 09:55:09.852 ServerApp] nbclassic | extension was successfully loaded.
[I 2023-11-24 09:55:09.852 ServerApp] sparkmagic extension enabled!
[I 2023-11-24 09:55:09.853 ServerApp] sparkmagic | extension was successfully loaded.
[I 2023-11-24 09:55:09.853 ServerApp] Serving notebooks from local directory: /root/jupyter
[I 2023-11-24 09:55:09.853 ServerApp] Jupyter Server 1.24.0 is running at:
[I 2023-11-24 09:55:09.853 ServerApp] http://419e63fc7821:8888/lab?token=1e2caca216c1fd159da607c6360c82213b643605f11ef291
[I 2023-11-24 09:55:09.853 ServerApp]  or http://127.0.0.1:8888/lab?token=1e2caca216c1fd159da607c6360c82213b643605f11ef291
[I 2023-11-24 09:55:09.853 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Note: If an error message is returned when you start the interactive JupyterLab development environment, you can view the proxy_{timestamp}.log file for troubleshooting.
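After JupyterLab is connected, you can run a short snippet in a notebook cell to check that an interactive Spark session starts and executes code. The following PySpark cell is a minimal sketch for illustration only; it assumes that the interactive environment provides the spark session object.
# Run this in a notebook cell after JupyterLab connects to AnalyticDB for MySQL Spark.
# The spark session object is assumed to be provided by the interactive environment.
df = spark.range(10)   # create a small test DataFrame with 10 rows
print(df.count())      # expected output: 10
df.show()              # display the rows to confirm that executors run the job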
Modify Spark application configuration parameters
After you use JupyterLab to connect to AnalyticDB for MySQL Spark, you can directly run a Spark job on the notebook development page of JupyterLab. The following default configuration parameters are used to run the Spark job:
{
"kind": "pyspark",
"heartbeatTimeoutInSecond": "60",
"spark.driver.resourceSpec": "medium",
"spark.executor.resourceSpec": "medium",
"spark.executor.instances": "1",
"spark.dynamicAllocation.shuffleTracking.enabled": "true",
"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.minExecutors": "0",
"spark.dynamicAllocation.maxExecutors": "1",
"spark.adb.sessionTTLSeconds": "1200"
}
To modify the Spark application configuration parameters, execute the %%configure statement.
Restart the kernel.
Use JupyterLab to connect to AnalyticDB for MySQL Spark.
In the top navigation bar, choose Kernel > Restart Kernel. Make sure that no running Spark applications are displayed on the notebook development page of JupyterLab.
Specify custom Spark application configuration parameters in the code editor.
Important: When you specify custom Spark application configuration parameters, you must set the spark.dynamicAllocation.enabled parameter to false.
%%configure -f
{
    "spark.driver.resourceSpec": "small",
    "spark.sql.hive.metastore.version": "adb",
    "spark.executor.resourceSpec": "small",
    "spark.adb.executorDiskSize": "100Gi",
    "spark.executor.instances": "1",
    "spark.dynamicAllocation.enabled": "false",
    "spark.network.timeout": "30000",
    "spark.memory.fraction": "0.75",
    "spark.memory.storageFraction": "0.3"
}
For more information about Spark application configuration parameters, see Spark application configuration parameters.
Click the Run button to execute the cell.
If the returned results indicate that the configuration takes effect, the Spark application configuration parameters are modified.
After you close the notebook development page of JupyterLab, the specified custom configuration parameters no longer take effect. If you do not specify Spark application parameters after you re-open the notebook development page of JupyterLab, the default configuration parameters are used to run a Spark job.
When you run a Spark job on the notebook development page of JupyterLab, all configurations of the job are written directly to the top level of a JSON structure, instead of being nested in the conf object of the JSON structure that is required when you submit a batch job.