Alibaba Cloud Data Lake Analytics (DLA) supports the Spark read-eval-print loop (REPL) feature. You can install JupyterLab and the Livy proxy of DLA on your on-premises machine, or use a Docker image to start JupyterLab quickly. This way, you can connect JupyterLab to the serverless Spark engine of DLA. After the connection is established, you can perform interactive testing and data computing by using the elastic resources of DLA.
DLA is discontinued. AnalyticDB for MySQL Data Lakehouse Edition supports the features of DLA and provides more features and better performance. For more information about how to develop an interactive Jupyter job by using AnalyticDB for MySQL Spark, see Develop an interactive Jupyter job.
Usage notes
The serverless Spark engine of DLA supports interactive Jupyter jobs that are programmed in Python 3.0 or Scala 2.11.
JupyterLab of the latest version supports Python 3.6 and later.
To develop an interactive Jupyter job, we recommend that you use a Docker image to start JupyterLab. For more information, see the Use a Docker image to start JupyterLab section of this topic.
Interactive Jupyter jobs are automatically released after they are idle for a specific period of time. By default, an interactive Jupyter job is released 1,200 seconds after the last code block of the job finishes running. You can use the `spark.dla.session.ttl` parameter to specify how long an interactive Jupyter job can remain idle before it is automatically released.
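For example, to shorten the idle time to 10 minutes, you can set `spark.dla.session.ttl` in the session configuration by using a `%%configure` cell. This is a sketch: the parameter name comes from the description above, but the `10m` duration format is an assumption; check the accepted value format for your cluster.

```
%%configure -f
{
  "conf": {
    "spark.dla.session.ttl": "10m"
  }
}
```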
Install JupyterLab and the Livy proxy of DLA on your on-premises machine
Install the Livy proxy of DLA.
Install Alibaba Cloud SDK for Python.
Note: The version of Alibaba Cloud SDK for Python must be 2.0.4 or later.

Run the following command to install the Livy proxy of DLA:

```shell
pip install aliyun-dla-livy-proxy-0.0.5.zip
```

Note: You must install the Livy proxy of DLA as the root user. Otherwise, the `dlaproxy` command may not be registered to the system PATH. After the Livy proxy of DLA is installed, the `dlaproxy` command is available on the CLI.

Start the Livy proxy of DLA.
The Livy proxy of DLA translates the DLA interface into the Apache Livy interface that Sparkmagic requires. After you start the Livy proxy of DLA, a local HTTP proxy listens on a port and forwards requests. By default, port 5000 is used.

```shell
# View the dlaproxy command.
$ dlaproxy -h
usage: dlaproxy [-h] --vcname VCNAME -i AK -k SECRET --region REGION
                [--host HOST] [--port PORT] [--loglevel LOGLEVEL]

Proxy AliYun DLA as Livy

optional arguments:
  -h, --help            show this help message and exit
  --vcname VCNAME       Virtual Cluster Name
  -i AK, --access-key-id AK
                        Aliyun Access Key Id
  -k SECRET, --access-key-secret SECRET
                        Aliyun Access Key Secret
  --region REGION       Aliyun Region Id
  --host HOST           Proxy Host Ip
  --port PORT           Proxy Host Port
  --loglevel LOGLEVEL   python standard log level

# Start the Livy proxy of DLA.
dlaproxy --vcname <vcname> -i akid -k aksec --region <regionid>
```
The following table describes the parameters in the preceding code.

| Parameter | Description |
| --- | --- |
| --vcname | The name of the Spark virtual cluster in DLA. To view the cluster name, log on to the DLA console, go to the Virtual Cluster management page, find the cluster that you want to manage, and then click Details in the Actions column. |
| -i | The AccessKey ID of the Resource Access Management (RAM) user. If you have created an AccessKey pair for the RAM user, you can view the AccessKey ID and AccessKey secret in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair. |
| -k | The AccessKey secret of the RAM user. See the description of the -i parameter. |
| --region | The ID of the region in which the cluster is deployed. For more information, see Regions and Zones. |
| --host | The IP address on which the proxy listens. Default value: 127.0.0.1, which accepts only local requests. You can change the value to 0.0.0.0 or another address to listen for requests from the Internet or an internal network. We recommend that you use the default value. |
| --port | The listening port. Default value: 5000. We recommend that you use the default value. |
| --loglevel | The log level. Valid values: ERROR, WARNING, INFO, and DEBUG. Default value: INFO. We recommend that you use the default value. |
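Because the proxy exposes an Apache Livy-style interface, Sparkmagic drives it through Livy's REST API. The sketch below only builds the `POST /sessions` request that creates a PySpark session; the host, port, and session settings are assumptions that match the defaults above, and nothing is sent over the network.

```python
import json


def build_create_session_request(host="127.0.0.1", port=5000, kind="pyspark"):
    """Build the Livy-style POST /sessions request that Sparkmagic issues."""
    url = "http://{}:{}/sessions".format(host, port)
    # Session-level Spark settings travel in "conf", like session_configs below.
    payload = {"kind": kind, "conf": {"spark.dla.connectors": "oss"}}
    return url, json.dumps(payload)


url, body = build_create_session_request()
print(url)   # http://127.0.0.1:5000/sessions
print(body)
```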
Install JupyterLab.
Optional. Install a virtual environment.
Note: We recommend that you install JupyterLab in a virtual environment. This prevents subsequent installations from affecting your system-wide Python environment.
Run the following commands to install JupyterLab:
```shell
pip install jupyterlab   # Install JupyterLab.
jupyter lab              # Check whether JupyterLab is installed.
```

If the boot log of JupyterLab is displayed, JupyterLab is installed.
Perform the following steps to install Sparkmagic:
Install the Sparkmagic library.

```shell
pip install sparkmagic
```

Enable nbextension and make sure that ipywidgets can be used.

```shell
jupyter nbextension enable --py --sys-prefix widgetsnbextension
```

If you use JupyterLab, install the JupyterLab labextension.

```shell
jupyter labextension install "@jupyter-widgets/jupyterlab-manager"
```
Run the `pip show sparkmagic` command to find the path in which Sparkmagic is installed. Then, install the kernels in that path:

```shell
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
```

Modify the configuration file config.json in the ~/.sparkmagic/ path. For more information about sample configurations, see example_config.json.

Enable Sparkmagic.

```shell
jupyter serverextension enable --py sparkmagic
```
After you install Sparkmagic, you must manually create the configuration file config.json in the ~/.sparkmagic path and point the url fields to the local proxy server. The following sample code provides an example of the config.json file:

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://127.0.0.1:5000",
    "auth": "None"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "http://127.0.0.1:5000",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:5000"
  },
  "logging_config": {
    "version": 1,
    "formatters": {
      "magicsFormatter": {
        "format": "%(asctime)s\t%(levelname)s\t%(message)s",
        "datefmt": ""
      }
    },
    "handlers": {
      "magicsHandler": {
        "class": "hdijupyterutils.filehandler.MagicsFileHandler",
        "formatter": "magicsFormatter",
        "home_path": "~/.sparkmagic"
      }
    },
    "loggers": {
      "magicsLogger": {
        "handlers": ["magicsHandler"],
        "level": "DEBUG",
        "propagate": 0
      }
    }
  },
  "wait_for_idle_timeout_seconds": 15,
  "livy_session_startup_timeout_seconds": 600,
  "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
  "ignore_ssl_errors": false,
  "session_configs": {
    "conf": {
      "spark.dla.connectors": "oss"
    }
  },
  "use_auto_viz": true,
  "coerce_dataframe": true,
  "max_results_sql": 2500,
  "pyspark_dataframe_encoding": "utf-8",
  "heartbeat_refresh_seconds": 30,
  "livy_server_heartbeat_timeout_seconds": 0,
  "heartbeat_retry_seconds": 10,
  "server_extension_default_kernel_name": "pysparkkernel",
  "custom_headers": {},
  "retry_policy": "configurable",
  "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
  "configurable_retry_policy_max_retries": 8
}
```
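Because a malformed config.json or a mistyped url field is a common source of connection failures, you can sanity-check the file with the Python standard library. This is a sketch: the path and the expected URL are assumptions based on the setup above.

```python
import json
import os

# Path used by Sparkmagic, per the setup above.
config_path = os.path.expanduser("~/.sparkmagic/config.json")


def check_config(path, expected_url="http://127.0.0.1:5000"):
    """Return the kernel credential keys whose url does not match expected_url."""
    with open(path) as f:
        config = json.load(f)  # raises ValueError if the JSON is malformed
    mismatched = []
    for key, creds in config.items():
        if key.endswith("_credentials"):
            # Strip accidental whitespace, a common copy-paste error.
            url = creds.get("url", "").strip()
            if url != expected_url:
                mismatched.append(key)
    return mismatched

# Usage: check_config(config_path) returns [] if every kernel points at the proxy.
```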
Note: The value of the `session_configs` parameter in the sample code is the same as that of the `conf` parameter in the configuration of a job that you submit to the serverless Spark engine of DLA. If you want to load the JAR packages of the job, you must use the serverless Spark engine to access the metadata service of DLA. For more information, see Configure a Spark job.

After you start the Livy proxy of DLA, the default URL for listening is http://127.0.0.1:5000. If you change the host IP address or port number, you must also change the value of the url parameter in the configuration file. For example, if you set the --host parameter to 192.168.1.3 and the --port parameter to 8080 when you start the Livy proxy of DLA, change the value of the url parameter to http://192.168.1.3:8080. Only the url fields change; the rest of the file is the same as the preceding example:

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://192.168.1.3:8080",
    "auth": "None"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "http://192.168.1.3:8080",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://192.168.1.3:8080"
  },
  ...
}
```
Start JupyterLab.
```shell
# Restart JupyterLab.
jupyter lab
# Start the Livy proxy of DLA.
dlaproxy --vcname <vcname> -i akid -k aksec --region <regionid>
```

After you start JupyterLab, the URL that is used to access JupyterLab is displayed in the boot log of JupyterLab.
If the "Aliyun DLA Proxy is ready" message appears, the Livy proxy of DLA is started. After the Livy proxy of DLA is started, you can use JupyterLab. For more information about how to use JupyterLab, see JupyterLab Documentation.

When you run a Jupyter job, DLA automatically creates a Spark job. To view and manage the Spark job, log on to the DLA console and open the Spark job list in the left-side navigation pane. Spark jobs whose names start with `notebook_` are interactive Jupyter jobs.

After you start JupyterLab, you can still modify the configuration of a Spark job by using the `%%configure` magic command. If you run the magic command, the new configuration overwrites the original configuration, and JupyterLab restarts the Spark job based on the new configuration.

```
%%configure -f
{
  "conf": {
    "spark.sql.hive.metastore.version": "dla",
    "spark.dla.connectors": "oss"
  }
}
```
Use the following configuration format for custom dependencies. The pyFiles field specifies a .zip archive of Python modules:

```
%%configure -f
{
  "conf": {
    ...
  },
  "pyFiles": "oss://{your bucket name}/{path}/*.zip"
}
```
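The pyFiles entry above expects a .zip archive of Python modules stored in OSS. The following sketch shows one way to produce such an archive locally with the standard library; the module directory name is hypothetical, and uploading the archive to OSS is a separate step.

```python
import os
import zipfile


def zip_modules(module_dir, zip_path):
    """Package every .py file under module_dir into zip_path, preserving paths."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(module_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Store paths relative to the parent directory so that
                    # "import <module_dir>" works inside the Spark job.
                    arcname = os.path.relpath(full, os.path.dirname(module_dir))
                    zf.write(full, arcname)

# Example (hypothetical local module):
# zip_modules("mymodule", "mymodule.zip")
# Then upload mymodule.zip to OSS and reference it in "pyFiles".
```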
Terminate a Jupyter job.
In the main menu bar of JupyterLab, choose Kernel > Restart Kernel.
Use a Docker image to start JupyterLab
You can use a Docker image provided by DLA to quickly start JupyterLab. For more information about how to install and use a Docker image, see official Docker documentation.
After you install and start Docker, run the following command to pull the JupyterLab image of DLA:
```shell
docker pull registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.5
```
After you pull the image, run the following command to view the help information about the image:
```shell
docker run -ti registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.5

Used to run jupyter lab for Aliyun DLA
Usage example:
  docker run -it -p 8888:8888 dla-jupyter:0.1 -i akid -k aksec -r cn-hanghzou -c spark-vc -l INFO
  -i Aliyun AkId
  -k Aliyun AkSec
  -r Aliyun Region Id
  -c Aliyun DLA Virtual cluster name
  -l LogLevel
```
The parameters in the preceding code are similar to those of the Livy proxy of DLA. The following table describes the parameters.

| Parameter | Description |
| --- | --- |
| -c | The name of the Spark virtual cluster in DLA. To view the cluster name, log on to the DLA console, go to the Virtual Cluster management page, find the cluster that you want to manage, and then click Details in the Actions column. |
| -i | The AccessKey ID of the RAM user. If you have created an AccessKey pair for the RAM user, you can view the AccessKey ID and AccessKey secret in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair. |
| -k | The AccessKey secret of the RAM user. See the description of the -i parameter. |
| -r | The ID of the region in which the cluster is deployed. For more information, see Regions and Zones. |
| -l | The log level. Valid values: ERROR, WARNING, INFO, and DEBUG. Default value: INFO. We recommend that you use the default value. |
After you set the parameters to appropriate values, run the following command to start JupyterLab:
```shell
docker run -it -p 8888:8888 registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.5 -i {AkId} -k {AkSec} -r {RegionId} -c {VcName}
```
If the boot log indicates that JupyterLab is started, copy the access URL from the log and paste it into the address bar of a browser to connect to DLA by using JupyterLab.
Usage notes
Check the dlaproxy.log file for troubleshooting. If the log file shows that the Livy proxy of DLA is ready, JupyterLab is started.

You must mount a host path to a path inside the Docker container. Otherwise, the system automatically deletes the notebooks that are in Edit mode when you terminate the Docker image. When you terminate the Docker image, the system also automatically attempts to terminate all interactive Jupyter jobs that are running. You can use one of the following solutions:

Before you terminate the Docker image, copy all files and keep them secure.

Mount a host path to a path inside the Docker container and save job files to that path.

For example, to mount the host path /home/admin/notebook to the container path /root/notebook in Linux, run the following command:

```shell
docker run -it --privileged=true -p 8888:8888 -v /home/admin/notebook:/root/notebook registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.5 -i {AkId} -k {AkSec} -r {RegionId} -c {VcName}
```

Save the notebooks that are in Edit mode to the mounted container path, /root/notebook in this example. This ensures that you can view the related files in the /home/admin/notebook path on the host and continue to use the notebooks the next time you start the Docker image.

Note: For more information, see Volumes in the official Docker documentation.
FAQ
Q: What do I do if JupyterLab fails to start and the following error messages appear?

```
[C 09:53:15.840 LabApp] Bad config encountered during initialization:
[C 09:53:15.840 LabApp] Could not decode '\xe6\x9c\xaa\xe5\x91\xbd\xe5\x90\x8d' for unicode trait 'untitled_notebook' of a LargeFileManager instance.
```

A: Run the `LANG=zn jupyter lab` command.
Q: What do I do if the following error message appears?

```
$ jupyter nbextension enable --py --sys-prefix widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: problems found:
        - require?  X jupyter-js-widgets/extension
```

A: Run the `jupyter nbextension install --py widgetsnbextension --user` and `jupyter nbextension enable widgetsnbextension --user --py` commands.

Q: What do I do if the following error message appears?

```
ValueError: Please install nodejs >=12.0.0 before continuing. nodejs may be installed using conda or directly from the nodejs website.
```

A: Run the `conda install nodejs` command. For more information about how to install Conda, see the official Miniconda documentation.
Q: What do I do if Sparkmagic fails to be installed?

A: Install Rust, and then install Sparkmagic again.
Q: What do I do if I fail to create a chart by using Matplotlib after I run the `%matplotlib inline` command?

A: If you use PySpark in the cloud, run the `%matplot plt` command together with the `plt.show()` function to create a chart.
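For example, a notebook cell like the following renders a chart through the `%matplot plt` magic instead of `%matplotlib inline`. This is a sketch: the plotted data is illustrative, and the cell must run in a PySpark kernel that supports the `%matplot` magic.

```
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [4, 5, 6])
plt.show()
%matplot plt
```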