Create a virtual environment and upload the compressed package of the virtual environment to a Spark job

Last Updated: Mar 15, 2021

This topic describes how to specify the archives parameter to upload the compressed package of a virtual environment or data files such as machine learning model files to the execution environment of a Spark job in the serverless Spark engine of Data Lake Analytics (DLA).

Background information

If you run a Spark job in a self-managed cluster, you can specify the archives parameter to upload a compressed package to the execution environment of the job. The serverless Spark engine of DLA supports the same capability.

Notice

You can specify the archives parameter to upload packages of the following formats: ZIP, TGZ, TAR, and TAR.GZ.

Run a custom virtual environment by using PySpark

Notice

This operation must be performed on Linux.

1. Create and package a virtual environment in a Python execution environment on Linux.

You can use a tool such as virtualenv or conda to create and package the virtual environment. Before you package the environment, install the tool on Linux by following its official documentation.

Notice

The latest major version of Python supported by the serverless Spark engine of DLA is Python 3.7.

The following example shows how to use virtualenv to generate a compressed package of a virtual environment. In this example, the package is venv.zip and contains a specific version of scikit-spark.

# Create a virtual environment named venv in the current directory by using Python 3.7.
virtualenv --copies --download --python python3.7 venv

# Activate the virtual environment.
source venv/bin/activate

# Install third-party modules.
pip install scikit-spark==0.4.0

# Check the installation result.
pip list

# Compress the virtual environment into a package.
zip -r venv.zip venv

For more information about how to use conda to generate a package of a virtual environment, see Managing environments in the conda official documentation.

2. Upload the compressed package of the virtual environment to the execution environment of the Spark job and use the environment to run the job.

When you submit the Spark job, you can use the following configurations to run the virtual environment. In the following sample code, spark.pyspark.python specifies the Python interpreter inside the package that you uploaded. The venv.zip#PY3 notation has the same semantics as in open source Apache Spark: the package is decompressed to the PY3 folder under the working directory of each compute node so that its files can be accessed by using local paths.

{
    "name": "venv example",
    "archives": [
        "oss://test/venv.zip#PY3"
    ],
    "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.dla.connectors": "oss",
        "spark.executor.instances": 1,
        "spark.dla.job.log.oss.uri": "oss://test/spark-logs",
        "spark.pyspark.python": ". /PY3/venv/bin/python3",
        "spark.executor.resourceSpec": "medium"
    },
    "file": "oss://test/example.py"
}
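The file oss://test/example.py is referenced in the configuration above, but its content is not shown in this topic. The following minimal sketch shows what such a script might look like: it prints the interpreter path on the driver and imports the third-party module that was installed in the virtual environment on an executor to confirm that the uploaded environment is in use. The import name skspark for the scikit-spark package is an assumption made for illustration.

# example.py: a minimal sketch that checks whether the uploaded virtual
# environment is in use. The import name skspark is an assumption.
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("venv example").getOrCreate()
    sc = spark.sparkContext

    # With the configuration above, this path should point into the
    # decompressed archive, for example ./PY3/venv/bin/python3.
    print("driver python: %s" % sys.executable)

    def check(_):
        # Runs on an executor. The import succeeds only if the executor
        # uses the Python interpreter from the uploaded virtual environment.
        import sys
        import skspark  # scikit-spark is assumed to be imported as skspark
        return "executor python: %s" % sys.executable

    print(sc.parallelize([0], 1).map(check).collect())
    spark.stop()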

If you do not use the number sign (#) to rename the package, it is decompressed to a folder in the working directory that is named after the full package name (venv.zip). In this case, you must include the package name in the path that you specify for spark.pyspark.python.

{
    "name": "venv example",
    "archives": [
        "oss://test/venv.zip"
    ],
    "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.dla.connectors": "oss",
        "spark.executor.instances": 1,
        "spark.dla.job.log.oss.uri": "oss://test/spark-logs",
        "spark.pyspark.python": "./venv.zip/venv/bin/python3",
        "spark.executor.resourceSpec": "medium"
    },
    "file": "oss://test/example.py"
}
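As mentioned at the beginning of this topic, you can also specify the archives parameter to upload data files such as machine learning model files. The following sketch assumes a hypothetical additional entry "oss://test/model.zip#MODEL" in the archives list, where model.zip contains model/model.pkl; both names are assumptions made for illustration. Because each compute node decompresses the archive into the MODEL folder under its working directory, tasks can read the file by using an ordinary local path.

# A minimal sketch, assuming "oss://test/model.zip#MODEL" is also listed in
# "archives" and that model.zip contains model/model.pkl (hypothetical names).
import pickle

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("archive data example").getOrCreate()
    sc = spark.sparkContext

    def score(value):
        # Runs on an executor. The archive is decompressed to ./MODEL in the
        # working directory, so the model file can be opened with a local path.
        with open("./MODEL/model/model.pkl", "rb") as f:
            model = pickle.load(f)  # assumed to be a scikit-learn-style model
        return float(model.predict([[value]])[0])

    print(sc.parallelize([1.0, 2.0, 3.0], 1).map(score).collect())
    spark.stop()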