This topic describes how to configure PySpark in DataWorks Notebook. It covers custom configurations, OSS storage access, third-party Python packages, JAR resources, and Livy parameters.
Use Custom Configurations
If you run %%maxcompute_spark and see a message indicating that the dataworks-magic version must be upgraded, upgrade it first. Otherwise, the following features may not work.
Set SparkConf at startup. For example, set the maximum idle time for Livy (`config` contains Livy-related parameters) and enable schema-level SQL syntax (`spark_conf` contains Spark-related parameters). Apply the same pattern for other parameters.

```
%%maxcompute_spark
{
    "config": {
        # CPU and memory default to 1 vCPU and 4 GiB
        "cpu": 1,
        "memory": "4096M",
        "livy.server.max.idle.time": "10m"
    },
    "quota": "XXX",
    "spark_conf": {
        "spark.sql.catalog.odps.enableNamespaceSchema": "true"
    }
}
```

To change the configuration, run the cell again. This restarts the Spark session, which clears the previous session state.
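As the example shows, Livy-related keys belong under `config` while `spark.*` properties belong under `spark_conf`. A minimal sketch of routing settings accordingly (the helper name is an assumption for illustration, not a DataWorks API):

```python
import json

def magic_cell_body(settings: dict) -> str:
    """Route settings into a %%maxcompute_spark cell body:
    spark.* properties go under "spark_conf"; everything else
    (Livy-related keys, cpu, memory, ...) goes under "config"."""
    body = {"config": {}, "spark_conf": {}}
    for key, value in settings.items():
        section = "spark_conf" if key.startswith("spark.") else "config"
        body[section][key] = value
    return json.dumps(body, indent=2)
```

Note that the real magic cell also accepts top-level keys such as `quota`, as shown in the example above.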
Use OSS Storage
Configure OSS access parameters:
- Use an internal endpoint for `oss_endpoint`, such as `oss-cn-shanghai-internal.aliyuncs.com`.
- Do not set `spark.hadoop.fs.oss.impl` or `spark.hadoop.fs.AbstractFileSystem.oss.impl` in Notebook. Doing so may prevent OSS access.
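The first rule can be sketched as a quick check (the helper name is illustrative, not part of DataWorks):

```python
def is_internal_oss_endpoint(endpoint: str) -> bool:
    """Return True for internal OSS endpoints such as
    oss-cn-shanghai-internal.aliyuncs.com (the region id varies)."""
    return (
        endpoint.startswith("oss-")
        and "-internal" in endpoint
        and endpoint.endswith(".aliyuncs.com")
    )
```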
```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.cupid.trusted.services.access.list": "<bucket-name>.<oss-endpoint>",
        "spark.hadoop.fs.oss.accessKeyId": "***",
        "spark.hadoop.fs.oss.accessKeySecret": "***",
        "spark.hadoop.fs.oss.endpoint": "<oss-endpoint>"
    }
}
```

Use Third-Party Python Packages
Pack dependencies with pyodps-pack, specifying Python 3.11. Livy Spark sessions use Python 3.11 by default, so matching versions avoids compatibility issues. For more information, see PySpark Python Version and Dependencies.
```shell
# Install pyodps
pip install pyodps
# Prepare requirements.txt, then pack
pyodps-pack -r requirements.txt --python-version=3.11 -o <package-name>
```

Upload the package to your MaxCompute project:

```sql
-- Run in odpscmd
ADD archive /path/to/<package-name> -f;
```

Configure Spark to use the package:
Note
- Separate multiple `cupid.resources` values with commas (,). For details, see Data Interoperability Configuration.
- Separate multiple `PYTHONPATH` entries with colons (:). In this example, `PYTHONPATH` includes the `packages` subdirectory because pyodps-pack adds it automatically. If you use another packing method, verify the directory structure before setting `PYTHONPATH`. For more information about `PYTHONPATH`, see Reference User-Defined Python Packages.
```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.cupid.resources": "<your_project>.<package-name>",
        "spark.executorEnv.PYTHONPATH": "./<your_project>.<package-name>/packages",
        "spark.yarn.appMasterEnv.PYTHONPATH": "./<your_project>.<package-name>/packages"
    }
}
```
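The separator rules above can be sketched as a helper that builds the `spark_conf` entries for one or more packed archives (the function name and package names are placeholders, not part of DataWorks):

```python
def spark_package_conf(project: str, packages: list) -> dict:
    """Build spark_conf entries for packed Python archives:
    comma-separated cupid.resources values and colon-separated
    PYTHONPATH entries, each pointing at the packages/ subdirectory
    that pyodps-pack creates inside the archive."""
    resources = ",".join(f"{project}.{p}" for p in packages)
    pythonpath = ":".join(f"./{project}.{p}/packages" for p in packages)
    return {
        "spark.hadoop.odps.cupid.resources": resources,
        "spark.executorEnv.PYTHONPATH": pythonpath,
        "spark.yarn.appMasterEnv.PYTHONPATH": pythonpath,
    }
```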
Use JAR or Other Resource Packages
Upload a resource, such as a tar.gz file:
```sql
add archive /path/to/<package-name> -f;
```

Configure Spark to load resources. Separate multiple resources with commas (,):
```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.cupid.resources": "<your_project>.<package-name>"
    }
}
```
You can combine this with Python package configuration by adding both to the spark_conf parameter.
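For example, a single cell might reference both a packed Python archive and a JAR resource (the resource names here are placeholders):

```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.cupid.resources": "<your_project>.<python-package>,<your_project>.<jar-package>",
        "spark.executorEnv.PYTHONPATH": "./<your_project>.<python-package>/packages",
        "spark.yarn.appMasterEnv.PYTHONPATH": "./<your_project>.<python-package>/packages"
    }
}
```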
Advanced Configuration
Specify a Spark Version
Supported versions:
- spark-3.1.1-odps0.47.0
- spark-3.4.2-odps0.48.0 (default)
- spark-3.5.2-odps0.49.0
Configuration method:
```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.spark.version": "spark-3.4.2-odps0.48.0"
    }
}
```
Disable the Default Python 3.11 Environment
To use a custom Python environment:
```
%%maxcompute_spark
{
    "spark_conf": {
        "spark.hadoop.odps.spark.alinux3.enabled": "false"
    }
}
```

Use Matplotlib for Plotting
Pack and upload Matplotlib first. See Use Third-Party Python Packages. Then plot in Notebook:
```
%%spark
import matplotlib
import matplotlib.pyplot as plt

# Optional: force a non-interactive backend
matplotlib.use('Agg')

# Create the figure first (required step)
fig = plt.figure()

# Example plot
x = [1, 2, 3, 4]
y = [20, 22, 19, 23]
plt.plot(x, y, marker='o', linestyle='--', color='b')
plt.title("Temperature Change Over Time")
plt.xlabel("Time")
plt.ylabel("Temperature")

# Render the plot in Notebook (Livy magic command)
%matplot plt
```

Livy Configuration
The following table lists configurable Livy parameters:
| Parameter Name | Default Value | Description |
| --- | --- | --- |
|  | true | Enables access control. Enabled by default; do not disable. |
|  | 3d | Maximum lifetime of the Livy server. Units: s, m, h, d. |
| livy.server.max.idle.time | 1h | Maximum idle time for the Livy server. The server shuts down if no Spark session runs during this time. |
|  | 12h | Maximum idle time for a Spark session. The session shuts down if no Spark task runs during this time. |
|  | 12h | Maximum retention time for an expired Spark session. The session is destroyed permanently if it is not restarted within this time. |
|  | 50 | Maximum number of concurrent sessions per Livy server. |
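Duration values in this table use single-letter unit suffixes (s, m, h, d). A sketch of how such strings map to seconds (the helper is illustrative, not part of Livy):

```python
def livy_duration_to_seconds(value: str) -> int:
    """Convert a Livy-style duration such as "10m", "1h", or "3d"
    into a number of seconds."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    number, unit = value[:-1], value[-1]
    if unit not in units:
        raise ValueError(f"unknown duration unit in {value!r}")
    return int(number) * units[unit]
```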