Data Lake Analytics - Deprecated:Develop an interactive Jupyter job

Last Updated:Feb 19, 2024

Alibaba Cloud Data Lake Analytics (DLA) supports the Spark read-eval-print loop (REPL) feature. You can install JupyterLab and the Livy proxy of DLA on your on-premises machine, or use a Docker image to start JupyterLab quickly, and then connect JupyterLab to the serverless Spark engine of DLA. After the connection is established, you can perform interactive testing and data computing by using the elastic resources of DLA.

Important

DLA is discontinued. AnalyticDB for MySQL Data Lakehouse Edition supports the features of DLA and provides more features and better performance. For more information about how to develop an interactive Jupyter job by using AnalyticDB for MySQL Spark, see Develop an interactive Jupyter job.

Usage notes

  • The serverless Spark engine of DLA supports interactive Jupyter jobs that are written in Python 3 or Scala 2.11.

  • The latest version of JupyterLab requires Python 3.6 or later.

  • To develop an interactive Jupyter job, we recommend that you use a Docker image to start JupyterLab. For more information, see the Use a Docker image to start JupyterLab section of this topic.

  • Interactive Jupyter jobs are automatically released after they are idle for a specific period of time. By default, an interactive Jupyter job is released 1,200 seconds after the last code block of the job is run. You can use the spark.dla.session.ttl parameter to change this idle timeout, as shown in the sketch after this list.
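    For example, a hedged sketch of shortening the idle timeout by using the %%configure magic command that is described later in this topic. Placing the parameter in conf follows the other examples in this topic; the plain-seconds value format is an assumption.

    %%configure -f
    {
        "conf": {
            "spark.dla.session.ttl": "600"
        }
    }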

Install JupyterLab and the Livy proxy of DLA on your on-premises machine

  1. Install the Livy proxy of DLA.

    1. Install Alibaba Cloud SDK for Python.

      Note

      The version of Alibaba Cloud SDK for Python must be 2.0.4 or later.
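      A hedged installation command follows. The package name aliyun-python-sdk-core is an assumption; check the DLA documentation for the exact SDK package that the Livy proxy requires.

      # Install the Alibaba Cloud SDK for Python core package (assumed name).
      pip install --upgrade "aliyun-python-sdk-core>=2.0.4"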

    2. Run the following command to install the Livy proxy of DLA:

      pip install aliyun-dla-livy-proxy-0.0.5.zip
      Note

      You must install the Livy proxy of DLA as the root user. Otherwise, the dlaproxy command may not be registered to the system PATH. After the Livy proxy of DLA is installed, the dlaproxy command is available on the CLI.
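      For example, on a Linux machine the installation and a quick check might look like the following (sudo is assumed for root privileges):

      # Install the Livy proxy of DLA as root, then verify that the dlaproxy
      # command is registered to the PATH.
      sudo pip install aliyun-dla-livy-proxy-0.0.5.zip
      which dlaproxy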

    3. Start the Livy proxy of DLA.

      The Livy proxy of DLA translates the DLA API into the Apache Livy interface that Sparkmagic requires. When you start the Livy proxy of DLA, it deploys a local HTTP proxy that listens on a port and forwards requests. By default, port 5000 is used.

      # View the dlaproxy command. 
      $dlaproxy -h
      usage: dlaproxy [-h] --vcname VCNAME -i AK -k SECRET --region REGION [--host HOST] [--port PORT] [--loglevel LOGLEVEL]
      
      Proxy AliYun DLA as Livy
      
      optional arguments:
        -h, --help            show this help message and exit
        --vcname VCNAME       Virtual Cluster Name
        -i AK, --access-key-id AK
                              Aliyun Access Key Id
        -k SECRET, --access-key-secret SECRET
                              Aliyun Access Key Secret
        --region REGION       Aliyun Region Id
        --host HOST           Proxy Host Ip
        --port PORT           Proxy Host Port
        --loglevel LOGLEVEL   python standard log level
        
      # Start the Livy proxy of DLA. 
      dlaproxy --vcname <vcname> -i akid -k aksec --region <regionid>

      The following list describes the parameters in the preceding command.

      • --vcname: The name of the Spark virtual cluster in DLA.

        Note: To view the cluster name, log on to the DLA console. In the left-side navigation pane, click Virtual Cluster management. On the Virtual Cluster management page, find the cluster that you want to manage and click Details in the Actions column.

      • -i: The AccessKey ID of the Resource Access Management (RAM) user.

        Note: If you have created an AccessKey pair for the RAM user, you can view the AccessKey ID and AccessKey secret in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair.

      • -k: The AccessKey secret of the RAM user.

        Note: If you have created an AccessKey pair for the RAM user, you can view the AccessKey ID and AccessKey secret in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair.

      • --region: The ID of the region in which the cluster is deployed. For more information, see Regions and Zones.

      • --host: The IP address on which the proxy listens. Default value: 127.0.0.1, which accepts only local requests. To listen for requests from the Internet or an internal network, change the value to 0.0.0.0 or a specific IP address. We recommend that you use the default value.

      • --port: The listening port. Default value: 5000. We recommend that you use the default value.

      • --loglevel: The log level. Valid values: ERROR, WARNING, INFO, and DEBUG. Default value: INFO. We recommend that you use the default value.

  2. Install JupyterLab.

    1. Optional. Create a virtual environment.

      Note

      We recommend that you install JupyterLab in a virtual environment. This prevents subsequent installations from affecting the system Python environment on your machine.
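      For example, a minimal sketch that uses Python's built-in venv module (the environment name dla-jupyter is a placeholder):

      # Create and activate a virtual environment for JupyterLab and Sparkmagic.
      python3 -m venv ~/dla-jupyter
      source ~/dla-jupyter/bin/activate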

    2. Run the following commands to install JupyterLab:

      pip install jupyterlab # Install JupyterLab. 
      jupyter lab  # Check whether JupyterLab is installed. If the boot log of JupyterLab is displayed, JupyterLab is installed.
    3. Perform the following steps to install Sparkmagic:

      1. Install the Sparkmagic library.

         pip install sparkmagic
      2. Enable nbextension and make sure that ipywidgets can be used.

         jupyter nbextension enable --py --sys-prefix widgetsnbextension
      3. If you use JupyterLab, install JupyterLab labextension.

         jupyter labextension install "@jupyter-widgets/jupyterlab-manager"
      4. Run the pip show sparkmagic command to query the path in which Sparkmagic is installed. Then, install kernels in the same path.

         # Change to the directory in which Sparkmagic is installed. The directory
         # is shown in the Location field of the `pip show sparkmagic` output.
         cd "$(pip show sparkmagic | grep Location | awk '{print $2}')"
         jupyter-kernelspec install sparkmagic/kernels/sparkkernel
         jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
         jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
      5. Modify the configuration file config.json in the ~/.sparkmagic/ path. For more information about sample configurations, see example_config.json.

      6. Enable Sparkmagic.

         jupyter serverextension enable --py sparkmagic

    After you install Sparkmagic, you must manually create the configuration file config.json in the ~/.sparkmagic path and direct the URL to the local proxy server. The following sample code provides an example of the config.json file:

    {
      "kernel_python_credentials" : {
        "username": "",
        "password": "",
        "url": "http://127.0.0.1:5000",
        "auth": "None"
      },
    
      "kernel_scala_credentials" : {
        "username": "",
        "password": "",
        "url": " http://127.0.0.1:5000",
        "auth": "None"
      },
      "kernel_r_credentials": {
        "username": "",
        "password": "",
        "url": "http://localhost:5000"
      },
    
      "logging_config": {
        "version": 1,
        "formatters": {
          "magicsFormatter": { 
            "format": "%(asctime)s\t%(levelname)s\t%(message)s",
            "datefmt": ""
          }
        },
        "handlers": {
          "magicsHandler": { 
            "class": "hdijupyterutils.filehandler.MagicsFileHandler",
            "formatter": "magicsFormatter",
            "home_path": "~/.sparkmagic"
          }
        },
        "loggers": {
          "magicsLogger": { 
            "handlers": ["magicsHandler"],
            "level": "DEBUG",
            "propagate": 0
          }
        }
      },
    
      "wait_for_idle_timeout_seconds": 15,
      "livy_session_startup_timeout_seconds": 600,
    
      "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
    
      "ignore_ssl_errors": false,
    
      "session_configs": {
        "conf": {
          "spark.dla.connectors": "oss"
        }
      },
    
      "use_auto_viz": true,
      "coerce_dataframe": true,
      "max_results_sql": 2500,
      "pyspark_dataframe_encoding": "utf-8",
      
      "heartbeat_refresh_seconds": 30,
      "livy_server_heartbeat_timeout_seconds": 0,
      "heartbeat_retry_seconds": 10,
    
      "server_extension_default_kernel_name": "pysparkkernel",
      "custom_headers": {},
      
      "retry_policy": "configurable",
      "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
      "configurable_retry_policy_max_retries": 8
    }
    Note

    The session_configs parameter in the sample code corresponds to the conf parameter in the configuration of a job that you submit to the serverless Spark engine of DLA. If you want to load JAR packages for the job or use the serverless Spark engine to access the metadata service of DLA, see Configure a Spark job.
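    For example, a hedged sketch of a session_configs section that loads a JAR package and enables access to the DLA metadata service. The OSS path is a placeholder, the jars key follows the standard Livy session format, and the spark.sql.hive.metastore.version value is taken from the %%configure example later in this topic.

    "session_configs": {
      "jars": ["oss://your-bucket-name/path/your-dependency.jar"],
      "conf": {
        "spark.dla.connectors": "oss",
        "spark.sql.hive.metastore.version": "dla"
      }
    }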

    After you start the Livy proxy of DLA, the default URL for listening is http://127.0.0.1:5000. If you change the host IP address or port number in the default URL, you must change the value of the url parameter in the configuration file. For example, if you set the --host parameter to 192.168.1.3 and the --port parameter to 8080 when you start the Livy proxy of DLA, you must change the value of the url parameter to http://192.168.1.3:8080. The following sample code provides an example of the modified config.json file:

    { 
      "kernel_python_credentials" : {
        "username": "",
        "password": "",
        "url": "http://192.168.1.3:8080",
        "auth": "None"
      },
      "kernel_scala_credentials" : {
        "username": "",
        "password": "",
        "url": "http://192.168.1.3:8080",
        "auth": "None"
      },
      "kernel_r_credentials": {
        "username": "",
        "password": "",
        "url": "http://192.168.1.3:8080"
      },
     "logging_config": {
        "version": 1,
        "formatters": {
          "magicsFormatter": { 
            "format": "%(asctime)s\t%(levelname)s\t%(message)s",
            "datefmt": ""
          }
        },
        "handlers": {
          "magicsHandler": { 
            "class": "hdijupyterutils.filehandler.MagicsFileHandler",
            "formatter": "magicsFormatter",
            "home_path": "~/.sparkmagic"
          }
        },
        "loggers": {
          "magicsLogger": { 
            "handlers": ["magicsHandler"],
            "level": "DEBUG",
            "propagate": 0
          }
        }
      },
    
      "wait_for_idle_timeout_seconds": 15,
      "livy_session_startup_timeout_seconds": 600,
    
      "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
    
      "ignore_ssl_errors": false,
    
      "session_configs": {
        "conf": {
          "spark.dla.connectors": "oss"
        }
      },
    
      "use_auto_viz": true,
      "coerce_dataframe": true,
      "max_results_sql": 2500,
      "pyspark_dataframe_encoding": "utf-8",
      
      "heartbeat_refresh_seconds": 30,
      "livy_server_heartbeat_timeout_seconds": 0,
      "heartbeat_retry_seconds": 10,
    
      "server_extension_default_kernel_name": "pysparkkernel",
      "custom_headers": {},
      
      "retry_policy": "configurable",
      "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
      "configurable_retry_policy_max_retries": 8
    }
  3. Start JupyterLab.

    # Restart JupyterLab. 
    jupyter lab
    
    # Start the Livy proxy of DLA. 
    dlaproxy --vcname vcname -i akid -k aksec --region <regionid>

    After you start JupyterLab, the URL that is used to access JupyterLab is displayed in the boot log of JupyterLab. The following figure shows an example.

    If the Aliyun DLA Proxy is ready message appears, the Livy proxy of DLA is started. After the Livy proxy of DLA is started, you can use JupyterLab. For more information about how to use JupyterLab, see JupyterLab Documentation.
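    For example, a minimal sketch of a first cell in the PySpark kernel. The spark variable is the SparkSession that the Livy session provides; this follows standard Sparkmagic and Livy behavior and is not a DLA-specific API.

    # Runs on the serverless Spark engine of DLA through the Livy proxy.
    df = spark.range(10)
    df.show()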

    When you run a Jupyter job, DLA automatically creates a Spark job. To view and manage the Spark job, log on to the DLA console and choose Serverless Spark > Submit job in the left-side navigation pane. The Spark jobs whose names start with notebook_ are interactive Jupyter jobs. The following figure shows an example.

    After you start JupyterLab, you can still modify the configurations of a Spark job by using the magic command. If you run the magic command, the new configurations overwrite the original configurations. Then, JupyterLab restarts the Spark job based on the new configurations.

    %%configure -f
    { 
        "conf": {
          "spark.sql.hive.metastore.version": "dla",
          "spark.dla.connectors": "oss"
        }
    }

    Use the following configuration format for custom dependencies:

    %%configure -f
    { 
        "conf": {
          ...
        },
        "pyFiles": "oss://{your bucket name}/{path}/*.zip" # module
    }
  4. Terminate a Jupyter job.

    In the main menu bar of JupyterLab, choose Kernel > Restart Kernel.

Use a Docker image to start JupyterLab

You can use a Docker image provided by DLA to quickly start JupyterLab. For more information about how to install and use Docker, see the official Docker documentation.

  1. After you install and start Docker, run the following command to pull the JupyterLab image of DLA:

    docker pull registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.5
  2. After you pull the image, run the following command to view the help information about the image:

    docker run -ti registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.5
    
    Used to run jupyter lab for Aliyun DLA 
    Usage example: docker run -it -p 8888:8888 dla-jupyter:0.1 -i akid -k aksec -r cn-hanghzou -c spark-vc -l INFO 
        -i Aliyun AkId
        -k Aliyun AkSec
        -r Aliyun Region Id
        -c Aliyun DLA Virtual cluster name
        -l LogLevel

    The parameters are similar to those of the Livy proxy of DLA. The following list describes the parameters.

    • -c: The name of the Spark virtual cluster in DLA.

      Note: To view the cluster name, log on to the DLA console. In the left-side navigation pane, click Virtual Cluster management. On the Virtual Cluster management page, find the cluster that you want to manage and click Details in the Actions column.

    • -i: The AccessKey ID of the RAM user.

      Note: If you have created an AccessKey pair for the RAM user, you can view the AccessKey ID and AccessKey secret in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair.

    • -k: The AccessKey secret of the RAM user.

      Note: If you have created an AccessKey pair for the RAM user, you can view the AccessKey ID and AccessKey secret in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair.

    • -r: The ID of the region in which the cluster is deployed. For more information, see Regions and Zones.

    • -l: The log level. Valid values: ERROR, WARNING, INFO, and DEBUG. Default value: INFO. We recommend that you use the default value.

  3. After you set the parameters to appropriate values, run the following command to start JupyterLab:

     docker run -it -p 8888:8888 registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.5 -i {AkId} -k {AkSec} -r {RegionId} -c {VcName}                           

    If the information in the following figure is displayed, JupyterLab is started. You can copy the URL that is framed in red in the following figure and paste it to the address bar of a browser to connect to DLA by using JupyterLab.

Usage notes

  • To troubleshoot issues, check the dlaproxy.log file. If the information in the following figure appears in the log file, JupyterLab is started.

  • Mount a host path to a path in the Docker container. Otherwise, the system automatically deletes the notebooks that you are editing when the container is terminated. When you terminate the container, the system also automatically attempts to terminate all interactive Jupyter jobs that are running. You can use one of the following solutions:

    • Before you terminate the container, copy all files that you want to keep to a secure location.

    • Mount a host path to a path in the container and save your notebook files to that path.

      For example, to mount the host path /home/admin/notebook to the container path /root/notebook on Linux, run the following command:

       docker run -it --privileged=true -p 8888:8888  -v /home/admin/notebook:/root/notebook registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.5 -i {AkId} -k {AkSec} -r {RegionId} -c {VcName}                           

      Save the notebooks that you are editing to the mounted container path (/root/notebook in this example). This ensures that you can view the related files in the /home/admin/notebook path on the host and continue to use the notebooks the next time you start the container.

      Note

      For more information, see Volumes in official Docker documentation.

FAQ

  • Q: What do I do if JupyterLab fails to start and the following error messages appear?

      [C 09:53:15.840 LabApp] Bad config encountered during initialization:
      [C 09:53:15.840 LabApp] Could not decode '\xe6\x9c\xaa\xe5\x91\xbd\xe5\x90\x8d' for unicode trait 'untitled_notebook' of a LargeFileManager instance.

    A: Run the LANG=zn jupyter lab command.

  • Q: What do I do if the following error message appears when I run the jupyter nbextension enable --py --sys-prefix widgetsnbextension command?

      Enabling notebook extension jupyter-js-widgets/extension...
            - Validating: problems found:
              - require?  X jupyter-js-widgets/extension

    A: Run the jupyter nbextension install --py widgetsnbextension --user and jupyter nbextension enable widgetsnbextension --user --py commands.

  • Q: What do I do if the error message ValueError: Please install nodejs >=12.0.0 before continuing. nodejs may be installed using conda or directly from the nodejs website appears?

    A: Run the conda install nodejs command. For more information about how to install Conda, see the official Miniconda documentation.

  • Q: What do I do if Sparkmagic fails to be installed and the error message in the following figure appears?

    A: Install Rust.

  • Q: What do I do if I fail to create a chart by using Matplotlib and the error message in the following figure appears after I run the %matplotlib inline command?


    A: If you use PySpark in the cloud, call the plt.show() function and then run the %matplot plt magic command to render the chart, as shown in the sketch below.

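    A minimal sketch of such a cell in the PySpark kernel (the sample data is a placeholder; matplotlib must be available on the Spark driver):

    # Build the chart on the Spark driver, then use the Sparkmagic %matplot
    # magic to render the current figure in the notebook.
    import matplotlib.pyplot as plt

    plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
    plt.show()
    %matplot plt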