This topic describes how to configure PySpark in DataWorks Notebook. It covers custom configurations, OSS storage access, third-party Python packages, JAR resources, and Livy parameters.
Use Custom Configurations
If you run %%maxcompute_spark and see a message indicating that the dataworks-magic version must be upgraded, upgrade it first. Otherwise, the following features may not work.
Set SparkConf at startup. For example, set the maximum idle time for Livy (`config` contains Livy-related parameters) and enable schema-level SQL syntax (`spark_conf` contains Spark-related parameters). Apply the same pattern for other parameters.

```
%%maxcompute_spark
{
    "config": {
        # CPU and memory default to 1 vCPU and 4 GiB
        "cpu": 1,
        "memory": "4096M",
        "livy.server.max.idle.time": "10m"
    },
    "quota": "XXX",
    "spark_conf": {
        "spark.sql.catalog.odps.enableNamespaceSchema": "true"
    }
}
```

To change the configuration, run the cell again. This restarts the Spark session, which clears the previous session state.
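As the example shows, Livy-related keys belong under `config` while `spark.*` properties belong under `spark_conf`. A minimal sketch of routing settings accordingly (the helper name is an assumption for illustration, not a DataWorks API):

```python
import json

def magic_cell_body(settings: dict) -> str:
    """Route settings into a %%maxcompute_spark cell body:
    spark.* properties go under "spark_conf"; everything else
    (Livy-related keys, cpu, memory, ...) goes under "config"."""
    body = {"config": {}, "spark_conf": {}}
    for key, value in settings.items():
        section = "spark_conf" if key.startswith("spark.") else "config"
        body[section][key] = value
    return json.dumps(body, indent=2)
```

Note that the real magic cell also accepts top-level keys such as `quota`, as shown in the example above.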
Use OSS Storage
Configure OSS access parameters:
- Use an internal endpoint for `oss_endpoint`, such as `oss-cn-shanghai-internal.aliyuncs.com`.
- Do not set `spark.hadoop.fs.oss.impl` or `spark.hadoop.fs.AbstractFileSystem.oss.impl` in Notebook. Doing so may prevent OSS access.
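The first rule can be sketched as a quick check (the helper name is illustrative, not part of DataWorks):

```python
def is_internal_oss_endpoint(endpoint: str) -> bool:
    """Return True for internal OSS endpoints such as
    oss-cn-shanghai-internal.aliyuncs.com (the region id varies)."""
    return (
        endpoint.startswith("oss-")
        and "-internal" in endpoint
        and endpoint.endswith(".aliyuncs.com")
    )
```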
```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.cupid.trusted.services.access.list": "<bucket-name>.<oss-endpoint>",
        "spark.hadoop.fs.oss.accessKeyId": "***",
        "spark.hadoop.fs.oss.accessKeySecret": "***",
        "spark.hadoop.fs.oss.endpoint": "<oss-endpoint>"
    }
}
```

Use Third-Party Python Packages
Pack dependencies with pyodps-pack, specifying Python 3.11. Livy Spark sessions use Python 3.11 by default, so matching versions avoids compatibility issues. For more information, see PySpark Python Version and Dependencies.
```shell
# Install pyodps
pip install pyodps
# Prepare requirements.txt, then pack
pyodps-pack -r requirements.txt --python-version=3.11 -o <package-name>
```

Upload the package to your MaxCompute project:

```sql
-- Run in odpscmd
ADD archive /path/to/<package-name> -f;
```

Configure Spark to use the package:
Note
- Separate multiple `cupid.resources` values with commas (,). For details, see Data Interoperability Configuration.
- Separate multiple `PYTHONPATH` entries with colons (:). In this example, `PYTHONPATH` includes the `packages` subdirectory because pyodps-pack adds it automatically. If you use another packing method, verify the directory structure before setting `PYTHONPATH`. For more information about `PYTHONPATH`, see Reference User-Defined Python Packages.
```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.cupid.resources": "<your_project>.<package-name>",
        "spark.executorEnv.PYTHONPATH": "./<your_project>.<package-name>/packages",
        "spark.yarn.appMasterEnv.PYTHONPATH": "./<your_project>.<package-name>/packages"
    }
}
```
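The separator rules above can be sketched as a helper that builds the `spark_conf` entries for one or more packed archives (the function name and package names are placeholders, not part of DataWorks):

```python
def spark_package_conf(project: str, packages: list) -> dict:
    """Build spark_conf entries for packed Python archives:
    comma-separated cupid.resources values and colon-separated
    PYTHONPATH entries, each pointing at the packages/ subdirectory
    that pyodps-pack creates inside the archive."""
    resources = ",".join(f"{project}.{p}" for p in packages)
    pythonpath = ":".join(f"./{project}.{p}/packages" for p in packages)
    return {
        "spark.hadoop.odps.cupid.resources": resources,
        "spark.executorEnv.PYTHONPATH": pythonpath,
        "spark.yarn.appMasterEnv.PYTHONPATH": pythonpath,
    }
```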
Use JAR or Other Resource Packages
Upload a resource, such as a tar.gz file:
```sql
add archive /path/to/<package-name> -f;
```

Configure Spark to load resources. Separate multiple resources with commas (,):
```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.cupid.resources": "<your_project>.<package-name>"
    }
}
```
You can combine this with Python package configuration by adding both to the spark_conf parameter.
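For example, a single cell might reference both a packed Python archive and a JAR resource (the resource names here are placeholders):

```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.cupid.resources": "<your_project>.<python-package>,<your_project>.<jar-package>",
        "spark.executorEnv.PYTHONPATH": "./<your_project>.<python-package>/packages",
        "spark.yarn.appMasterEnv.PYTHONPATH": "./<your_project>.<python-package>/packages"
    }
}
```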
Advanced Configuration
Specify a Spark Version
Supported versions:
- spark-3.1.1-odps0.47.0
- spark-3.4.2-odps0.48.0 (default)
- spark-3.5.2-odps0.49.0
Configuration method:
```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.spark.version": "spark-3.4.2-odps0.48.0"
    }
}
```
Disable the Default Python 3.11 Environment
To use a custom Python environment:
```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.spark.alinux3.enabled": "false"
    }
}
```

Use Matplotlib for Plotting
Pack and upload Matplotlib first. See Use Third-Party Python Packages. Then plot in Notebook:
```
%%spark
import matplotlib
import matplotlib.pyplot as plt

# Optional: force a non-interactive backend
matplotlib.use('Agg')

# Create the figure first (required step)
fig = plt.figure()

# Example plot
x = [1, 2, 3, 4]
y = [20, 22, 19, 23]
plt.plot(x, y, marker='o', linestyle='--', color='b')
plt.title("Temperature Change Over Time")
plt.xlabel("Time")
plt.ylabel("Temperature")

# Render the plot in Notebook (Livy magic command)
%matplot plt
```

Livy Configuration
The following table lists configurable Livy parameters:
| Parameter Name | Default Value | Description |
| --- | --- | --- |
|  | true | Enables access control. Enabled by default; do not disable. |
|  | 3d | Maximum lifetime of the Livy server. Units: s, m, h, d. |
| livy.server.max.idle.time | 1h | Maximum idle time for the Livy server. The server shuts down if no Spark session runs during this time. |
|  | 12h | Maximum idle time for a Spark session. The session shuts down if no Spark task runs during this time. |
|  | 12h | Maximum retention time for an expired Spark session. The session is destroyed permanently if it is not restarted within this time. |
|  | 50 | Maximum number of concurrent sessions per Livy server. |
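Duration values in this table use single-letter unit suffixes (s, m, h, d). A sketch of how such strings map to seconds (the helper is illustrative, not part of Livy):

```python
def livy_duration_to_seconds(value: str) -> int:
    """Convert a Livy-style duration such as "10m", "1h", or "3d"
    into a number of seconds."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    number, unit = value[:-1], value[-1]
    if unit not in units:
        raise ValueError(f"unknown duration unit in {value!r}")
    return int(number) * units[unit]
```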