
MaxCompute:DataWorks Notebook User Guide

Last Updated: Mar 13, 2026

This topic describes how to configure PySpark in DataWorks Notebook. It covers custom configurations, OSS storage access, third-party Python packages, JAR resources, and Livy parameters.

Use Custom Configurations

Important

If you run %%maxcompute_spark and see a message indicating that the dataworks-magic version must be upgraded, upgrade it first. Otherwise, the following features may not work.

  1. Set SparkConf at startup. The config object holds Livy-related parameters, such as the maximum idle time for Livy, and the spark_conf object holds Spark parameters, such as enabling schema-level SQL syntax. Apply the same pattern for other parameters. CPU and memory default to 1 vCPU and 4 GiB.

    %%maxcompute_spark
    {
      "config": {
        "cpu": 1,
        "memory": "4096M",
        "livy.server.max.idle.time": "10m"
      },
      "quota": "XXX",
      "spark_conf": {
        "spark.sql.catalog.odps.enableNamespaceSchema": "true"
      }
    }
  2. To change configurations, run the cell again. This restarts the Spark session, which clears the previous session state.
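For example, to extend the Livy idle time after the session has started, re-run the startup cell with the updated value (the value below is illustrative; include any other parameters your session needs):

    %%maxcompute_spark
    {
      "config": {
        "livy.server.max.idle.time": "30m"
      }
    }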

Use OSS Storage

Configure OSS access parameters:

Important
  • Use an internal endpoint for oss_endpoint, such as oss-cn-shanghai-internal.aliyuncs.com.

  • Do not set spark.hadoop.fs.oss.impl or spark.hadoop.fs.AbstractFileSystem.oss.impl in Notebook. Doing so may prevent OSS access.

%%maxcompute_spark
{
  "spark_conf": {
    "spark.hadoop.odps.cupid.trusted.services.access.list": "<bucket-name>.<oss-endpoint>",
    "spark.hadoop.fs.oss.accessKeyId": "***",
    "spark.hadoop.fs.oss.accessKeySecret": "***",
    "spark.hadoop.fs.oss.endpoint": "<oss-endpoint>"
  }
}
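Once the session starts with these settings, you can read OSS data directly in a %%spark cell. The bucket name and file path below are placeholders:

    %%spark

    # Read a CSV file from OSS (bucket and path are placeholders)
    df = spark.read.csv("oss://<bucket-name>/path/to/data.csv", header=True)
    df.show()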

Use Third-Party Python Packages

  1. Pack dependencies with pyodps-pack, specifying Python 3.11. Livy Spark sessions use Python 3.11 by default, so matching versions avoids compatibility issues. For more information, see PySpark Python Version and Dependencies.

    # Install pyodps
    pip install pyodps
    
    # Prepare requirements.txt, then pack
    pyodps-pack -r requirements.txt --python-version=3.11 -o <package-name>
  2. Upload the package to your MaxCompute project:

    -- Run in odpscmd
    ADD archive /path/to/<package-name> -f;
  3. Configure Spark to use the package:

    Note
    • Separate multiple cupid.resources values with commas (,). For details, see Data Interoperability Configuration.

    • Separate multiple PYTHONPATH entries with colons (:). In this example, PYTHONPATH includes the packages subdirectory because pyodps-pack adds it automatically. If you use another packing method, verify the directory structure before setting PYTHONPATH. For more information about PYTHONPATH, see Reference User-Defined Python Packages.

    %%maxcompute_spark
    {
      "spark_conf": {
        "spark.hadoop.odps.cupid.resources": "<your_project>.<package-name>",
        "spark.executorEnv.PYTHONPATH": "./<your_project>.<package-name>/packages",
        "spark.yarn.appMasterEnv.PYTHONPATH": "./<your_project>.<package-name>/packages"
      }
    }
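To confirm the packed dependencies are visible to the session, import one of them in a %%spark cell. Here pandas stands in for any library listed in your requirements.txt:

    %%spark

    # pandas is just an example of a packed library
    import pandas
    print(pandas.__version__)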

Use JAR or Other Resource Packages

  1. Upload a resource, such as a tar.gz file, in odpscmd:

    ADD archive /path/to/<package-name> -f;

  2. Configure Spark to load resources. Separate multiple resources with commas (,):

    %%maxcompute_spark
    {
      "spark_conf": {
        "spark.hadoop.odps.cupid.resources": "<your_project>.<package-name>"
      }
    }

You can combine this with Python package configuration by adding both to the spark_conf parameter.
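For example, a single spark_conf can reference both a JAR resource and a Python package at once. All names below are placeholders:

    %%maxcompute_spark
    {
      "spark_conf": {
        "spark.hadoop.odps.cupid.resources": "<your_project>.<jar-package>,<your_project>.<python-package>",
        "spark.executorEnv.PYTHONPATH": "./<your_project>.<python-package>/packages",
        "spark.yarn.appMasterEnv.PYTHONPATH": "./<your_project>.<python-package>/packages"
      }
    }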

Advanced Configuration

Specify a Spark Version

  • Supported versions:

    • spark-3.1.1-odps0.47.0

    • spark-3.4.2-odps0.48.0 (default)

    • spark-3.5.2-odps0.49.0

  • Configuration method:

    %%maxcompute_spark
    {
      "spark_conf": {
        "spark.hadoop.odps.spark.version": "spark-3.4.2-odps0.48.0"
      }
    }
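To confirm which version the session picked up, print it in a %%spark cell:

    %%spark

    print(spark.version)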

Disable the Default Python 3.11 Environment

To use a custom Python environment:

%%maxcompute_spark
{
  "spark_conf": {
    "spark.hadoop.odps.spark.alinux3.enabled": "false"
  }
}

Use Matplotlib for Plotting

Pack and upload Matplotlib first. See Use Third-Party Python Packages. Then plot in Notebook:

%%spark

import matplotlib
# Select the non-interactive Agg backend before importing pyplot
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# Create figure first (required step)
fig = plt.figure()

# Example plot
x = [1, 2, 3, 4]
y = [20, 22, 19, 23]

plt.plot(x, y, marker='o', linestyle='--', color='b')
plt.title("Temperature Change Over Time")
plt.xlabel("Time")
plt.ylabel("Temperature")

# Render plot in Notebook (Livy magic command)
%matplot plt
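If %matplot is not available in your environment, a common workaround is to render the figure to an in-memory PNG and base64-encode it for transport as text. This is a generic matplotlib sketch, not a DataWorks-specific API:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([1, 2, 3, 4], [20, 22, 19, 23], marker="o")
plt.title("Temperature Change Over Time")

# Serialize the figure to PNG bytes in memory, then base64-encode
buf = io.BytesIO()
fig.savefig(buf, format="png")
encoded = base64.b64encode(buf.getvalue()).decode("ascii")

# Every PNG starts with the same magic bytes, so the prefix is fixed
print(encoded[:10])  # iVBORw0KGg
```

The encoded string can then be embedded in an HTML img tag or passed back to the notebook frontend.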

Livy Configuration

The following table lists configurable Livy parameters:

| Parameter Name | Default Value | Description |
| --- | --- | --- |
| livy.server.access-control.enabled | true | Enables access control. Enabled by default. Do not disable. |
| livy.server.max.alive.time | 3d | Maximum lifetime of the Livy server. Units: s, m, h, d. |
| livy.server.max.idle.time | 1h | Maximum idle time for the Livy server. The server shuts down if no Spark session runs during this time. |
| livy.server.session.timeout | 12h | Maximum idle time for a Spark session. The session shuts down if no Spark task runs during this time. |
| livy.server.session.state-retain.sec | 12h | Maximum retention time for an expired Spark session. The session is destroyed permanently if not restarted within this time. |
| livy.server.session.max-creation | 50 | Maximum number of concurrent sessions per Livy server. |
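These parameters go in the config object of %%maxcompute_spark. For example, to extend the server lifetime and idle windows (values are illustrative; check your environment's limits):

    %%maxcompute_spark
    {
      "config": {
        "livy.server.max.alive.time": "7d",
        "livy.server.max.idle.time": "2h",
        "livy.server.session.timeout": "24h"
      }
    }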