Upload the compressed package of a virtual environment

Last Updated: May 28, 2021

This topic describes how to use the serverless Spark engine of Data Lake Analytics (DLA) to upload the compressed package of a virtual environment or data files, such as machine learning model files.

Background information

When you run a Spark job in a self-managed cluster, you can upload the compressed package of a virtual environment to the environment in which the job runs. When the job runs, the files in the compressed package can be used directly. The archives parameter specifies the compressed packages that you want to upload.

Notice

The format of the compressed package can be ZIP, TGZ, TAR, or TAR.GZ.
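
As a quick local check before an upload, a helper like the following could confirm that a package name ends in one of the formats listed above. This is a minimal sketch: the function name is hypothetical and is not part of DLA, and the case-insensitive match is an assumption.

```python
# Hypothetical helper, not part of DLA: checks a package name against
# the formats listed in the notice above (ZIP, TGZ, TAR, TAR.GZ).
SUPPORTED_SUFFIXES = (".zip", ".tgz", ".tar", ".tar.gz")


def is_supported_archive(name):
    """Return True if the file name ends with a supported archive suffix."""
    # Drop an optional "#folder" alias before checking the suffix.
    path = name.split("#", 1)[0]
    # Assumption: suffixes are compared case-insensitively.
    return path.lower().endswith(SUPPORTED_SUFFIXES)
```

For example, `is_supported_archive("oss://test/venv.zip#PY3")` returns `True`, while a name such as `venv.rar` is rejected.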

Run a custom virtual environment for PySpark

  1. Package a Python virtual environment in Linux.

    You can use a tool, such as virtualenv or conda, to package a virtual environment. Before you package a virtual environment, select an appropriate tool and install the tool in Linux.

    Notice
    • The serverless Spark engine of DLA supports Python 3.7 and earlier Python major versions.

    • The serverless Spark engine runs on CentOS 7. We recommend that you use CentOS 7 on which Docker is installed to package a virtual environment.

    The following example demonstrates how to use virtualenv to generate a compressed package of a virtual environment. In this example, the package is venv.zip and contains a specific version of scikit-spark.

    # Create the venv directory in the current path with Python 3.7
    virtualenv --copies --download --python python3.7 venv

    # Activate the environment
    source venv/bin/activate

    # Install third-party modules
    pip install scikit-spark==0.4.0

    # Check the result
    pip list

    # Compress the environment
    zip -r venv.zip venv

    If you want to use conda to generate a compressed package of a virtual environment, follow instructions in Managing environments to perform this operation.
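
Before you upload the package, it may be worth confirming that the interpreter sits where the job configuration expects it. The following is a minimal sketch that assumes the package was created with zip -r venv.zip venv as in the virtualenv example, so that the interpreter is expected at venv/bin/python3. The function name and the default member path are illustrative.

```python
import zipfile


def has_interpreter(archive, member="venv/bin/python3"):
    """Return True if the ZIP archive contains the given interpreter path.

    `archive` is a path to a ZIP file or a file-like object.
    `member` is an assumed location based on the virtualenv example.
    """
    with zipfile.ZipFile(archive) as zf:
        return member in zf.namelist()
```

A call such as `has_interpreter("venv.zip")` returns `False` if the package was zipped from inside the venv directory, in which case the path in the job configuration would not resolve.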

  2. Run the virtual environment in the serverless Spark engine of DLA.

    When you submit a Spark job, use the following configuration to run the virtual environment. The spark.pyspark.python parameter specifies the Python executable file in the uploaded compressed package. For more information about the parameters, see Configure a Spark job.

Note

venv.zip#PY3 has the same semantics as in open source Spark: the package is decompressed to the PY3 folder under the working directory of each compute node for local access. If you do not use the number sign (#) to specify a folder name, the file name of the compressed package is used as the folder name.

{
    "name": "venv example",
    "archives": [
        "oss://test/venv.zip#PY3"
    ],
    "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.dla.connectors": "oss",
        "spark.executor.instances": 1,
        "spark.dla.job.log.oss.uri": "oss://test/spark-logs",
        "spark.pyspark.python": "./PY3/venv/bin/python3",
        "spark.executor.resourceSpec": "medium"
    },
    "file": "oss://test/example.py"
}
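
The relationship between the archives entry and the spark.pyspark.python value in the preceding configuration can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the interpreter location venv/bin/python3 is taken from the example above, and the fallback to the archive file name when no number sign (#) is present assumes the open source Spark behavior.

```python
import json
import posixpath


def split_archive_spec(spec):
    """Split 'uri#folder' into (uri, folder)."""
    uri, sep, alias = spec.partition("#")
    # Assumption: without '#', the folder name defaults to the archive's
    # file name, as in open source Spark.
    folder = alias if sep and alias else posixpath.basename(uri)
    return uri, folder


def interpreter_path(spec, relative_python="venv/bin/python3"):
    """Build a spark.pyspark.python value from an archives entry."""
    _, folder = split_archive_spec(spec)
    return "./" + posixpath.join(folder, relative_python)


archive = "oss://test/venv.zip#PY3"
job = {
    "name": "venv example",
    "archives": [archive],
    "conf": {"spark.pyspark.python": interpreter_path(archive)},
    "file": "oss://test/example.py",
}
print(json.dumps(job, indent=4))
```

Deriving the interpreter path from the archives entry keeps the two values in sync if the folder name after the number sign (#) changes.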