To support the Spark read-eval-print loop (REPL) feature, Alibaba Cloud Data Lake Analytics (DLA) provides a solution that connects a local JupyterLab server to the serverless Spark engine of DLA. This solution allows you to interactively test and process data by using the elastic resources of DLA.

Usage notes

  • The serverless Spark engine of DLA supports JupyterLab interactive jobs that are written in Python 3 or Scala 2.11.
  • The latest version of JupyterLab requires Python 3.6 or later.

Procedure

  1. Install the Livy proxy of DLA.
    1. Install aliyun-python-sdk-openanalytics-open 2.0.5.
      Note The SDK for Python must be 2.0.4 or later.
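      For example, you can install the SDK by using pip. The version pin matches the release named in the step above:
      pip install aliyun-python-sdk-openanalytics-open==2.0.5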
    2. Run the following command to install the Livy proxy of DLA:
      pip install aliyun-dla-livy-proxy-0.0.2.zip
      Note You must install the Livy proxy of DLA as the root user. If you install it as a non-root user, the dlaproxy command may not be registered to a directory in the system PATH. After the Livy proxy of DLA is installed, the dlaproxy command is available on the command line.
    3. Start the Livy proxy of DLA.
      The Livy proxy of DLA translates the API of DLA into the Apache Livy interface that Sparkmagic requires. It deploys a local HTTP proxy that listens on a port and forwards requests. The default port is 5000.
      # View the usage of the dlaproxy command. 
      $ dlaproxy -h
      usage: dlaproxy [-h] --vcname VCNAME -i AK -k SECRET --region REGION [--host HOST] [--port PORT] [--loglevel LOGLEVEL]
      
      Proxy AliYun DLA as Livy
      
      optional arguments:
        -h, --help            show this help message and exit
        --vcname VCNAME       Virtual Cluster Name
        -i AK, --access-key-id AK
                              Aliyun Access Key Id
        -k SECRET, --access-key-secret SECRET
                              Aliyun Access Key Secret
        --region REGION       Aliyun Region Id
        --host HOST           Proxy Host Ip
        --port PORT           Proxy Host Port
        --loglevel LOGLEVEL   python standard log level
        
      # Start the Livy proxy of DLA. 
      dlaproxy --vcname <vcname> -i <akid> -k <aksec> --region <regionid>

      The following table describes the parameters in the preceding code.

      Parameter   Description
      --vcname    The name of the Spark virtual cluster (VC) in DLA.
                  Note To view the name, log on to the DLA console. In the left-side navigation pane, choose Virtual Cluster management. On the Virtual Cluster management page, find the VC and click Details in the Actions column.
      -i          The AccessKey ID of the RAM user.
                  Note If an AccessKey pair is created, you can view the AccessKey ID and AccessKey secret in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair for a RAM user.
      -k          The AccessKey secret of the RAM user. See the note for the -i parameter.
      --region    The ID of the region where DLA resides. For more information, see Regions and zones.
      --host      The IP address on which the proxy listens. Default value: 127.0.0.1, which accepts only local requests. To listen for requests from the Internet or an internal network, change the value to 0.0.0.0 or another address. We recommend that you use the default value.
      --port      The port on which the proxy listens. Default value: 5000. You can change the value to another port. We recommend that you use the default value.
      --loglevel  The log level. Default value: INFO. Valid values: ERROR, WARNING, INFO, and DEBUG. We recommend that you use the default value.
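
      For example, the following command starts the proxy for a hypothetical virtual cluster named my-spark-vc in the China (Hangzhou) region (region ID cn-hangzhou). Replace the AccessKey placeholders with your own values:
      dlaproxy --vcname my-spark-vc -i <access_key_id> -k <access_key_secret> --region cn-hangzhou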
  2. Install JupyterLab.
    1. Optional: Install a virtual environment.
      Note We recommend that you install the entire environment in a virtual environment so that the subsequent installation does not affect the public Python environment on your machine.
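      For example, a minimal setup that uses the standard venv module (the environment name dla-jupyter is arbitrary):
      python3 -m venv dla-jupyter        # Create a virtual environment named dla-jupyter.
      source dla-jupyter/bin/activate    # Activate the virtual environment.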
    2. Run the following commands to install JupyterLab:
      pip install jupyterlab # Install JupyterLab. 
      jupyter lab  # Check whether JupyterLab is installed. If the installation succeeded, the startup log is displayed. 
    3. Install Sparkmagic. For more information about how to install Sparkmagic, see Sparkmagic installation method.
      Note You must perform all the optional steps that are described in the preceding document.
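      For reference, a typical installation that follows the Sparkmagic documentation, including the optional kernel registration steps, looks similar to the following. The kernel paths are relative to the directory that pip show sparkmagic reports as Location:
      pip install sparkmagic
      jupyter nbextension enable --py --sys-prefix widgetsnbextension
      # Register the PySpark and Scala kernels. Run these commands from the
      # Sparkmagic installation directory reported by `pip show sparkmagic`.
      jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
      jupyter-kernelspec install sparkmagic/kernels/sparkkernel
      jupyter serverextension enable --py sparkmagic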
    After Sparkmagic is installed, you must manually create the config.json configuration file in the ~/.sparkmagic directory and set the url fields to point to the local proxy server. Sample config.json file:
    {
      "kernel_python_credentials" : {
        "username": "",
        "password": "",
        "url": "http://127.0.0.1:5000",
        "auth": "None"
      },
    
      "kernel_scala_credentials" : {
        "username": "",
        "password": "",
        "url": " http://127.0.0.1:5000",
        "auth": "None"
      },
      "kernel_r_credentials": {
        "username": "",
        "password": "",
        "url": "http://localhost:5000"
      },
    
      "logging_config": {
        "version": 1,
        "formatters": {
          "magicsFormatter": { 
            "format": "%(asctime)s\t%(levelname)s\t%(message)s",
            "datefmt": ""
          }
        },
        "handlers": {
          "magicsHandler": { 
            "class": "hdijupyterutils.filehandler.MagicsFileHandler",
            "formatter": "magicsFormatter",
            "home_path": "~/.sparkmagic"
          }
        },
        "loggers": {
          "magicsLogger": { 
            "handlers": ["magicsHandler"],
            "level": "DEBUG",
            "propagate": 0
          }
        }
      },
    
      "wait_for_idle_timeout_seconds": 15,
      "livy_session_startup_timeout_seconds": 600,
    
      "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
    
      "ignore_ssl_errors": false,
    
      "session_configs": {
        "conf": {
          "spark.sql.hive.metastore.version": "dla",
          "spark.dla.connectors": "oss"
        }
      },
    
      "use_auto_viz": true,
      "coerce_dataframe": true,
      "max_results_sql": 2500,
      "pyspark_dataframe_encoding": "utf-8",
      
      "heartbeat_refresh_seconds": 30,
      "livy_server_heartbeat_timeout_seconds": 0,
      "heartbeat_retry_seconds": 10,
    
      "server_extension_default_kernel_name": "pysparkkernel",
      "custom_headers": {},
      
      "retry_policy": "configurable",
      "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
      "configurable_retry_policy_max_retries": 8
    }
    Note The session_configs settings in the sample code are passed as the conf settings of the job that is submitted to the serverless Spark engine of DLA. If you need to load JAR packages or access the metadata service of DLA, see Configure a Spark job.
  3. Run JupyterLab.
    # Restart JupyterLab. 
    jupyter lab
    
    # Start the Livy proxy of DLA. 
    dlaproxy --vcname <vcname> -i <akid> -k <aksec> --region <regionid>
    The URL for accessing JupyterLab is displayed in the startup log of JupyterLab.
    If the message Aliyun DLA Proxy is ready appears, the Livy proxy of DLA is started and you can use JupyterLab. For more information about how to use JupyterLab, see the JupyterLab official documentation.
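    To quickly verify that the proxy is serving the Livy-compatible API that Sparkmagic expects, you can query the standard Livy sessions endpoint. This check assumes that the proxy listens on the default address and port and exposes the Livy GET /sessions route:
    curl http://127.0.0.1:5000/sessions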
    When you run a JupyterLab task, DLA automatically creates a Spark job. To view and manage the Spark job, log on to the DLA console. In the left-side navigation pane, choose Serverless Spark > Submit job. Spark jobs whose names start with notebook_ are JupyterLab interactive jobs.
    While JupyterLab is running, you can dynamically modify the configuration of the Spark job by using magic commands. After you run a magic command, the new configuration overwrites the original configuration, and JupyterLab restarts the Spark job based on the new configuration. For example:
    %%configure -f
    {
        "jars": "oss://test/test.jar",
        "conf": {
            "spark.sql.hive.metastore.version": "dla",
            "spark.dla.connectors": "oss"
        }
    }

FAQ

  • Problem description: JupyterLab fails to start and the following error messages appear:
    • [C 09:53:15.840 LabApp] Bad config encountered during initialization:
    • [C 09:53:15.840 LabApp] Could not decode '\xe6\x9c\xaa\xe5\x91\xbd\xe5\x90\x8d' for unicode trait 'untitled_notebook' of a LargeFileManager instance.

    Solution: This error occurs when the current locale cannot decode the UTF-8-encoded default notebook name 未命名 (Unnamed). Start JupyterLab with a locale that supports UTF-8. For example, run LANG=zh_CN.UTF-8 jupyter lab.

  • Problem description: When you run jupyter nbextension enable --py --sys-prefix widgetsnbextension, the following error message appears:
    Enabling notebook extension jupyter-js-widgets/extension...
        - Validating: problems found:
            - require?  X jupyter-js-widgets/extension

    Solution: Run the following commands:
    jupyter nbextension install --py widgetsnbextension --user
    jupyter nbextension enable widgetsnbextension --user --py

  • Problem description: The error message ValueError: Please install nodejs >=12.0.0 before continuing. nodejs may be installed using conda or directly from the nodejs website. appears.

    Solution: Run the conda install nodejs command. For more information about how to install Conda, see the Conda official documentation.

  • Problem description: Sparkmagic fails to be installed and an error message appears.

    Solution: Install Rust.
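    One common way to install Rust, per the official rustup installation instructions, is to run:
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh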