Use Magic Commands to Connect Spark in DataWorks Notebooks

This guide covers how to improve development efficiency through engineering practices such as code reuse, dataset mounting, and parameter management, as well as best practices and debugging techniques for connecting to compute engines like MaxCompute Spark, EMR Serverless Spark, and AnalyticDB for Spark.

Note

We recommend reading Basic notebook development first.

Development and production environments

DataWorks Notebook is a schedulable development and analysis tool. This means it operates in two distinct runtime environments:

Development environment: On the notebook node editing page in DataStudio, you can run cells to execute code directly in a personal development environment instance. This environment is for rapid validation and debugging of code logic.
Production environment: After you commit and publish a notebook node, its execution is triggered by periodic scheduling, data backfill, or other similar actions. The code runs in an isolated, ephemeral task instance. This environment is for stable and reliable production runs.

Understanding the significant feature differences between these two environments is key to efficient development.

Quick reference: development vs. production environments

Feature	Development environment (running a cell)	Production environment (scheduled runs)
Referencing project resources (`.py`)	Initial reference: Automatically downloaded and takes effect. After an update: Click the Restart button in the toolbar to reload the updated `.py` module. Important: In the DataStudio settings, set `Dataworks › Notebook › Resource Reference: Download Strategy` to `autoOverwrite` to handle resource conflicts.	Takes effect automatically.
Reading and writing datasets (OSS/NAS)	Mount the dataset in the personal development environment.	Mount the dataset in the scheduling configuration.
Referencing workspace parameters (`${...}`)	Supported. Text substitution is automatically performed before code execution.	Supported. Text substitution is automatically performed before task execution.
Spark session management	By default, the Spark session is automatically released after 2 hours of inactivity.	A short-lived session is created for each task instance and destroyed along with it.

Reuse code and data in production

Reference project resources (.py files)

To make your code more modular, reusable, and maintainable, group common functions or classes in standalone .py files. You can then reference these files as MaxCompute resources using the ##@resource_reference{"custom_name.py"} syntax.

Create and publish a Python resource
1. In the left navigation pane of DataWorks DataStudio, go to Resource Management.
2. In the Resource Management directory tree, right-click the target directory or click + in the upper-right corner. Select New Resource > MaxCompute Python and name the file my_utils.py.
3. In the Document Content section, click Online Editing, paste your utility function code into the code editor, and click Save.
```
# my_utils.py
def greet(name):
    return f"Hello, {name} from resource file!"
```
4. In the toolbar, click Save and then Publish the resource. This makes the resource available to tasks in both the development and production environments.

Reference the resource in a notebook

In the first line of a Python cell in your notebook, use the ##@resource_reference syntax to reference the published resource.

```
##@resource_reference{"my_utils.py"}
# If the resource is in a subdirectory, such as my_folder/my_utils.py, reference it by filename only, without the directory path: ##@resource_reference{"my_utils.py"}
from my_utils import greet

message = greet('DataWorks')
print(message)
```

Debug in the development environment
Run the Python cell. The output is:
```
Hello, DataWorks from resource file!
```
Important
During debugging in the development environment, the system detects the ##@resource_reference declaration and automatically downloads the corresponding file from Resource Management to the workspace/_dataworks/resource_references path in your personal directory, making it accessible to your code.
If a ModuleNotFoundError occurs, click the Restart button in the editor toolbar to reload the resource, and then try again.
Publish to the production environment and verify
After you Save and Publish the notebook node, navigate to Operation and Maintenance Center > Auto Triggered Task and click Test to run the task. After the task succeeds, the output Hello, DataWorks from resource file! appears in the logs.
Important
If the error There is no file with id ... occurs, make sure that you have published the Python resource to the production environment.

For more information, see MaxCompute resources and functions.
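When an import fails with ModuleNotFoundError, it can help to check whether each declared resource file has actually been downloaded. The following is a minimal sketch and not part of DataWorks: `find_missing_resources` is a hypothetical helper, and the default directory is an assumption based on the `/mnt/workspace/_dataworks/resource_references` path that appears in the Lindorm Ray example in this guide.

```python
import re
from pathlib import Path

# Assumed download location; adjust it if your environment uses another path.
RESOURCE_DIR = Path("/mnt/workspace/_dataworks/resource_references")

# Matches declarations such as ##@resource_reference{"my_utils.py"}
_DECLARATION = re.compile(r'##@resource_reference\{"([^"]+)"\}')


def find_missing_resources(cell_source, resource_dir=RESOURCE_DIR):
    """Return the declared resource names that are not present in resource_dir."""
    declared = _DECLARATION.findall(cell_source)
    return [name for name in declared if not (Path(resource_dir) / name).is_file()]


cell = '##@resource_reference{"my_utils.py"}\nfrom my_utils import greet\n'
# An empty list means every declared file is already in the directory.
print(find_missing_resources(cell))
```

If the list is not empty, restarting the kernel as described above is the first thing to try.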

Read and write datasets (OSS/NAS)

Notebook tasks can read from and write to large files stored on OSS or NAS by mounting them as datasets.

Development environment debugging

Mount the dataset: On your personal development environment's details page, navigate to Storage Settings > Datasets to configure it.

Access the dataset in your code: The dataset is mounted to a specific mount path in your personal development environment. You can read from or write to this path directly in your code.

```
# Assume the dataset is mounted to the /mnt/data/dataset path.
import pandas as pd

# Use the mount path directly.
file_path = '/mnt/data/dataset/testfile.csv'
df = pd.read_csv(file_path)

# Use PyODPS to write data to MaxCompute.
o = %odps
o.write_table('mc_test_table', df, overwrite=True)
print("Successfully wrote data to the MaxCompute table mc_test_table.")
```

Production environment deployment

Mount the dataset: On the notebook node editing page, navigate to Scheduling Settings > Scheduling Policy in the right-hand navigation pane and add the same dataset.

Access the dataset in your code: After you commit and publish the node, the dataset is mounted in the production environment. Use the same mount path in your code to access it.

```
# Assume the dataset is mounted to the /mnt/data/dataset path.
import pandas as pd

# Use the mount path directly.
file_path = '/mnt/data/dataset/testfile.csv'
df = pd.read_csv(file_path)

# Use PyODPS to write data to MaxCompute.
o = %odps
o.write_table('mc_test_table', df, overwrite=True)
print("Successfully wrote data to the MaxCompute table mc_test_table.")
```

For more information, see Use datasets in a personal development environment.
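Because the same mount path must exist in both environments, a missing mount usually surfaces as a plain FileNotFoundError. The sketch below shows one way to fail early with a more useful message. `read_mounted_csv` is a hypothetical helper, not a DataWorks API, and it uses the standard `csv` module only to stay self-contained; `pd.read_csv` can be applied to the same path.

```python
import csv
from pathlib import Path


def read_mounted_csv(mount_root, relative_name):
    """Read a CSV file under a dataset mount path and return its rows as dicts."""
    root = Path(mount_root)
    if not root.is_dir():
        # The mount path is missing, so the dataset is probably not mounted here.
        raise FileNotFoundError(
            f"Mount path {root} not found. Mount the dataset in the personal "
            "development environment (development) or in the scheduling "
            "configuration (production)."
        )
    with open(root / relative_name, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

With the mount path assumed in the examples above, a call would look like `rows = read_mounted_csv('/mnt/data/dataset', 'testfile.csv')`.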

Use workspace parameters

Important

This feature is available only in DataWorks Professional Edition or higher.

DataWorks provides workspace parameters, which extend existing scheduling parameters. These parameters enable the reuse of global configurations and environment isolation across tasks and nodes. You can reference a workspace parameter in a SQL cell or Python cell using the ${workspace.param} format, where param is the name of your workspace parameter.

1. Create a workspace parameter: Before you start, navigate to Operation and Maintenance Center > Scheduling Settings > Workspace Parameters in DataWorks to create the required parameters.

2. Reference a workspace parameter:

Reference a workspace parameter in a SQL cell.
```
SELECT '${workspace.param}';
```
When the cell runs, it prints the resolved value of the workspace parameter.
Reference a workspace parameter in a Python cell.
```
print('${workspace.param}')
```
When the cell runs, it prints the resolved value of the workspace parameter.

For more details, see Use workspace parameters.
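Because the reference is resolved by text substitution before the code runs, the placeholder behaves like a literal pasted into the cell. The following sketch imitates that step to make the behavior concrete. It is an illustration only: `substitute_workspace_params` is a hypothetical function, and the pattern for valid parameter names is an assumption, not the rule DataWorks applies.

```python
import re

# Assumed naming rule: letters, digits, and underscores, not starting with a digit.
_PLACEHOLDER = re.compile(r"\$\{workspace\.([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_workspace_params(source, params):
    """Replace every ${workspace.name} placeholder in source with params[name]."""

    def replace(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"Workspace parameter not defined: {name}")
        return str(params[name])

    return _PLACEHOLDER.sub(replace, source)


cell = "print('${workspace.param}')"
print(substitute_workspace_params(cell, {"param": "prod_db"}))  # print('prod_db')
```

One consequence of plain text substitution is that a string value still needs its surrounding quotes in Python code, as in the `print('${workspace.param}')` example above.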

Use magic commands with compute engines

Magic commands are special commands prefixed with % or %% that simplify interactions between a Python cell and various compute resources.

Connect to MaxCompute

Note

Before you connect to a MaxCompute compute resource, make sure that you have bound a MaxCompute compute resource.

%odps: Get a PyODPS entry object
This command returns an authenticated PyODPS object bound to the current MaxCompute project. This method is recommended for interacting with MaxCompute because it avoids hard-coding AccessKeys in your code.
1. Use a magic command to create a MaxCompute connection. Enter %odps. A MaxCompute compute resource selector appears in the lower-right corner and automatically selects a compute resource. Click the MaxCompute project name in the lower-right corner to switch projects.
```
o = %odps
```
2. Use the retrieved MaxCompute compute resource to run a PyODPS script.
  For example, to get all tables in the current project:
```
with o.execute_sql('show tables').open_reader() as reader:
    print(reader.raw)
```

%maxframe: Establish a MaxFrame connection

This command creates a MaxFrame session, which provides distributed, pandas-like data processing capabilities for MaxCompute.

```
# Connect to and access a MaxCompute MaxFrame Session
mf_session = %maxframe

df = mf_session.read_odps_table('your_mc_table')
print(df.head())

# After development and debugging, manually destroy the session to release resources
mf_session.destroy()
```
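If a cell fails before it reaches `destroy()`, the session stays open. One possible pattern is to wrap the session in a context manager so that cleanup always runs. The sketch below assumes only that the session object exposes a `destroy()` method, as in the example above. `destroying` and `FakeSession` are hypothetical names, and `FakeSession` exists only so that the sketch runs outside a notebook.

```python
from contextlib import contextmanager


@contextmanager
def destroying(session):
    """Yield the session and always call session.destroy() afterwards."""
    try:
        yield session
    finally:
        session.destroy()


class FakeSession:
    """Stand-in for the object returned by %maxframe, for illustration only."""

    def __init__(self):
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


fake = FakeSession()
with destroying(fake) as session:
    pass  # In a notebook, work with the real session here.
print(fake.destroyed)  # True
```

In a notebook, the same pattern would be written as `with destroying(mf_session) as session:` followed by the processing code.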

Connect to Spark compute resources

DataWorks Notebook supports connections to multiple Spark engines. These engines differ in connection method, execution context, and resource management.

Important

A single notebook node can use a magic command to connect to only one type of compute resource at a time.

Engine comparison

Feature	MaxCompute Spark	EMR Serverless Spark	AnalyticDB for Spark
Command	`%maxcompute_spark`	`%emr_serverless_spark`	`%adb_spark add`
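Execution context	Subsequent Python cells must start with `%%spark` to run PySpark code.	After you run the command, the execution context of the entire notebook kernel switches to the remote PySpark environment. You can then write PySpark code directly in subsequent cells.	You can write PySpark code directly in subsequent Python cells.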
Prerequisites	Bind a MaxCompute compute resource.	Bind an EMR compute resource and create a Livy Gateway.	Bind an ADB Spark compute resource.
Development mode	Automatically creates or reuses a Livy session.	Connects to an existing Livy Gateway and creates a session.	Automatically creates or reuses a Spark Connect Server.
Production mode	Livy mode: Submits Spark jobs through the Livy service.	spark-submit batch mode: Pure batch processing; session state is not retained.	Spark Connect Server mode: Interacts through the Spark connection service.
Resource release	The system automatically releases the session after the task instance ends.	The system automatically cleans up resources after the task instance ends.	The system automatically releases resources after the task instance ends.
Use cases	General-purpose batch processing and ETL tasks that are tightly integrated with the MaxCompute ecosystem.	Complex analysis tasks that require flexible configurations and interaction with open-source big data ecosystems, such as Hudi and Iceberg.	High-performance interactive queries and analysis on C-Store tables in AnalyticDB for MySQL.

MaxCompute Spark

Note

Before you connect to a MaxCompute compute resource, make sure that you have bound a MaxCompute compute resource.

Connect to the Spark engine built into a MaxCompute project through Livy.

Establish connection: Run the following command in a Python cell. The system automatically creates or reuses a Spark session.
```
# Create a Spark Session.
%maxcompute_spark
```

Execute PySpark code: After the connection is established, use the %%spark cell magic in a new Python cell to execute PySpark code.

```
%%spark
# When using MaxCompute Spark, the Python cell must start with %%spark.

df = spark.sql("SELECT * FROM your_mc_table LIMIT 10")
df.show()
```

Manually release the connection: After you finish debugging, you can manually stop or delete the session. When running in a production environment, the system automatically stops and deletes the Livy session for the current task instance, so no manual action is required.
```
# Clean up the Spark Session and stop Livy.
%maxcompute_spark stop

# Clean up the Spark Session, stop Livy, and then delete Livy.
%maxcompute_spark delete
```

EMR Serverless Spark

Note

Before you establish a connection to the compute resource, bind an EMR Serverless Spark compute resource to your workspace and create a Livy Gateway.

Connect to a pre-existing Livy Gateway to interact with EMR Serverless Spark.

Establish a connection: Before running the command, select the EMR compute resource and Livy Gateway in the lower-right corner of the cell.
```
# Basic connection
%emr_serverless_spark

# Or, pass custom Spark parameters when connecting. Note that two percent signs (%%)
# are required when you pass custom Spark parameters.
%%emr_serverless_spark
{
  "spark_conf": {
    "spark.emr.serverless.environmentId": "<EMR Serverless Spark runtime environment ID>",
    "spark.emr.serverless.network.service.name": "<EMR Serverless Spark network connection ID>",
    "spark.driver.cores": "1",
    "spark.driver.memory": "8g",
    "spark.executor.cores": "1",
    "spark.executor.memory": "2g",
    "spark.driver.maxResultSize": "32g"
  }
}
```
Note
Relationship between custom parameters and global configuration
- Default behavior: Custom parameters defined here apply only to the current connection (session) and are not saved for later sessions. If you do not provide custom parameters, the system uses the global parameters configured in the Admin Center.
- Recommended usage: For configurations that need to be reused across multiple tasks or by multiple users, configure them globally in Admin Center > Serverless Spark > SPARK parameters to ensure consistency and simplify management.
- Priority rule: When the same parameter is set in both custom parameters and the global configuration, which setting takes effect depends on the Global Configuration Priority option in the Admin Center.
  - Selected: The global configuration overrides the custom parameters for this session.
  - Not selected: The custom parameters for this session override the global configuration.
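The priority rule above can be pictured as a dictionary merge in which the later source wins. This is a sketch of the described behavior, not the actual implementation. `resolve_spark_conf` is a hypothetical function, and it assumes that keys set in only one place always carry through to the session.

```python
def resolve_spark_conf(global_conf, custom_conf, global_priority):
    """Merge two Spark config dicts according to Global Configuration Priority."""
    if global_priority:
        # Selected: global values win for keys that are set in both places.
        return {**custom_conf, **global_conf}
    # Not selected: the custom values for this session win.
    return {**global_conf, **custom_conf}


global_conf = {"spark.executor.memory": "4g", "spark.executor.cores": "2"}
custom_conf = {"spark.executor.memory": "2g"}

print(resolve_spark_conf(global_conf, custom_conf, True)["spark.executor.memory"])   # 4g
print(resolve_spark_conf(global_conf, custom_conf, False)["spark.executor.memory"])  # 2g
```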
(Optional) Reconnect: If an administrator accidentally deletes the token from the Livy Gateway page, use this command to recreate it.
```
# Reconnect and refresh the Livy token for the current personal development environment.
%emr_serverless_spark refresh_token
```
Execute PySpark or SQL code: After the connection is established, the kernel switches. You can write PySpark code directly in a Python cell or write SQL in an EMR Spark SQL cell.
1. Submit and execute SQL code via an EMR Spark SQL cell
  After you establish a connection with %emr_serverless_spark, you can write SQL statements directly in an EMR Spark SQL cell without selecting a compute resource in the cell.
  The EMR Spark SQL cell reuses the %emr_serverless_spark connection to submit the job to the target compute resource for execution.
2. Submit and execute PySpark code via a Python cell
  After you establish a connection with %emr_serverless_spark, you can submit and execute PySpark code in a new Python cell. You do not need to add the %%spark prefix to the cell.
Manually release the connection
Important
If multiple users share a Livy Gateway, the stop or delete command affects all users who are using that gateway. Use these commands with caution.
```
# Clean up the Spark Session and stop Livy.
%emr_serverless_spark stop

# Clean up the Spark Session, stop Livy, and then delete Livy.
%emr_serverless_spark delete
```

AnalyticDB for Spark

Note

Before you establish a connection to the compute resource, bind an AnalyticDB for Spark compute resource to your workspace.

Connect to the AnalyticDB for Spark engine by creating a Spark Connect Server.

Establish a connection: To ensure network connectivity, you must correctly configure the vSwitch ID and security group ID in the connection parameters. Before you run the command, select the ADB Spark compute resource in the lower-right corner of the cell.
```
# You must configure the vSwitch ID and security group ID to establish a network connection.
%adb_spark add \
 --spark-conf spark.adb.version=3.5 \
 --spark-conf spark.adb.eni.enabled=true \
 --spark-conf spark.adb.eni.vswitchId=<vSwitch ID of ADB> \
 --spark-conf spark.adb.eni.securityGroupId=<security group ID of the personal development environment>
```
How do I find the vSwitch ID and security group ID?
- vSwitch ID (vswitchId): Go to the Alibaba Cloud AnalyticDB for MySQL console. On the instance details page, view the vSwitch ID under Network Information.
- Security Group ID (securityGroupId): Go to Network Settings on the details page of your personal development environment to find the ID of the selected Security Group. The ID is the one that starts with sg-.
  Important
  To ensure network connectivity, we recommend that you select the same VPC and vSwitch as your AnalyticDB for Spark instance when you create your personal development environment.
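A mistyped ID is a common cause of connection failures, so it can be convenient to assemble the command text from checked values before pasting it into a cell. The sketch below only builds a string and does not run the magic command. `build_adb_spark_add` is a hypothetical helper, the IDs in the example are made-up placeholders, and the only validation is the `sg-` prefix mentioned above.

```python
def build_adb_spark_add(vswitch_id, security_group_id, version="3.5"):
    """Return the multi-line %adb_spark add command text for the given IDs."""
    if not security_group_id.startswith("sg-"):
        raise ValueError(f"Security group ID should start with 'sg-': {security_group_id}")
    conf = {
        "spark.adb.version": version,
        "spark.adb.eni.enabled": "true",
        "spark.adb.eni.vswitchId": vswitch_id,
        "spark.adb.eni.securityGroupId": security_group_id,
    }
    lines = ["%adb_spark add"] + [f" --spark-conf {key}={value}" for key, value in conf.items()]
    # Join with a trailing backslash so the command continues on the next line.
    return " \\\n".join(lines)


# Placeholder IDs for illustration only.
print(build_adb_spark_add("vsw-example", "sg-example"))
```

The printed text follows the same layout as the command shown above and can be copied into a Python cell.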
Execute PySpark code: After the connection is established, execute PySpark code in a new Python cell.
```
# You can run operations only on C-Store tables.
df = spark.sql("SELECT * FROM my_adb_cstore_table LIMIT 10")
df.show()
```
Note: The AnalyticDB for Spark engine can currently process only C-Store tables that have the 'storagePolicy'='COLD' attribute.
Manually release the connection: After you finish debugging in the development environment, manually clean up the connection session to save resources. When running in a production environment, the system automatically cleans up resources.
```
%adb_spark cleanup
```

Connect to Lindorm Ray

The RAY resource group of the Lindorm compute engine provides distributed computing services and supports end-to-end AI workloads. You can use a magic command to seamlessly connect to Lindorm Ray resources in a notebook for interactive development and debugging, and then publish the notebook as a scheduled production task.

Prerequisites

When you purchase a Lindorm instance, select Yes under Enable Compute Engine.
Add your Lindorm cluster as a DataWorks compute resource. For more information, see Bind a Lindorm compute resource.
In the Lindorm console, enable a RAY resource group for your cluster. When you create the resource group, make sure that you specify the correct image in Advanced Settings to ensure a consistent environment.
How do I configure the Ray resource group image?
When you create a RAY resource group in the Lindorm console, find Advanced Settings and enter the following JSON content. You must replace region in the image address with the region where your Lindorm cluster is located. For example, replace beijing with shanghai.
```
{
  "IMAGE": "spark-repo-beijing-registry-vpc.cn-beijing.cr.aliyuncs.com/lindorm-compute/ray:2.39.0-0.7.0-py311-cpu"
}
```
Make sure that your personal development environment, Serverless resource group network configuration, and Lindorm cluster are in the same VPC to ensure network connectivity.

Establish a connection: Run the %lindorm_ray command in a Python cell. A compute resource selector appears in the lower-right corner of the cell. Select your Lindorm compute resource and the created RAY resource group.
```
# Connect to the specified Lindorm Ray resource group.
%lindorm_ray
```
Important
- After you connect to a Lindorm Ray compute resource, you can no longer run SQL cells in the same notebook. The Lindorm Ray engine exclusively executes Python and Ray code.
- If you run the same code cell multiple times, the system automatically terminates the previous Ray job and starts a new one. This helps prevent resource waste and task conflicts.
Execute Ray code: After the connection is established, you can write and execute Ray code directly in a new Python cell. Logs stream back to the cell's output area in real time, which facilitates interactive debugging.
The following example defines a simple remote task (using the @ray.remote decorator) that executes on a Ray cluster and returns the logs and final result to the output area of the cell.
```
import ray
import time

@ray.remote
def hello_world():
    print("Hello from Lindorm Ray!")
    time.sleep(5)
    return "Task finished."

# Submit the remote task
result_ref = hello_world.remote()
print(ray.get(result_ref))
```
(Optional) Specify custom startup parameters: If you need to specify additional configurations for the Ray environment, such as installing third-party Python packages or uploading local code files, use the %%lindorm_ray command to establish the connection.
- Example 1: Install dependencies
  Use the pip parameter to install the jieba package in the Ray environment.
```
%%lindorm_ray
{
  "runtime_env": {
    "pip": ["jieba"]
  }
}
```
  Once the environment is ready, import and use the package in subsequent Ray jobs. The following example shows how to call jieba in a remote function to perform Chinese word segmentation:
```
import ray 

@ray.remote
def do_work(x):
    import jieba

    return "/".join(jieba.cut(x))

print(ray.get(do_work.remote("Welcome to the DataWorks+LindormRay solution")))
```
- Example 2: Upload and use DataWorks resources
  The working_dir parameter is used to upload resources from DataWorks Resource Management to a Ray cluster so that they can be imported and called in tasks.
  Important
  - When you use working_dir to upload resources, files are uploaded directly from your development environment to the Ray cluster and are subject to a 100 MB size limit. If a resource package is too large, the upload may fail or the Ray nodes may become unstable.
  - For large files or dependencies (>100 MB), upload them to OSS and access them from your code, or package them into a custom image. This approach provides better stability and performance.
```
# Reference a resource uploaded in DataStudio Resource Management and declare its path.
%%lindorm_ray
{
    "runtime_env": {
        "working_dir": "/mnt/workspace/_dataworks/resource_references"
    }
}
```
  Suppose you upload a ray_resource.py file to Resource Management in DataStudio. When you write and execute the following cell, the system automatically parses the ##@resource_reference declaration in the subsequent code and downloads the corresponding resource to the /mnt/workspace/_dataworks/resource_references path.
  Sample code for ray_resource.py:
```
def fun():
    print("This is a test function in ray_resource.py")
```
  Important
  In a development environment, after you execute the cell that contains ##@resource_reference, you must rerun the %%lindorm_ray cell above to upload the downloaded resources in the working_dir to the Ray cluster. In a production environment, you do not need to rerun.
```
import ray 

##@resource_reference{"ray_resource.py"}

@ray.remote
def do_work(x):
    print('Ray says:', x)

    from ray_resource import fun
    fun()
    return x

worker = do_work.remote("Welcome to the DataWorks+LindormRay solution")
print(ray.get(worker))
```
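Given the 100 MB limit on `working_dir` uploads noted in Example 2, it may be worth checking the directory size before you rerun the `%%lindorm_ray` cell. The following sketch uses only the standard library. `directory_size` and `fits_working_dir_limit` are hypothetical helpers, and treating 100 MB as 100 × 1024 × 1024 bytes is an assumption.

```python
import os

# Assumption: 100 MB is interpreted as 100 * 1024 * 1024 bytes.
MAX_WORKING_DIR_BYTES = 100 * 1024 * 1024


def directory_size(path):
    """Return the total size in bytes of regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total


def fits_working_dir_limit(path, limit=MAX_WORKING_DIR_BYTES):
    """Return True if the directory contents are within the upload limit."""
    return directory_size(path) <= limit
```

For example, `fits_working_dir_limit('/mnt/workspace/_dataworks/resource_references')` returning False suggests moving large files to OSS or into a custom image, as recommended above.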
Production scheduling and O&M: After development and debugging, you can commit and publish the notebook node. It will then be periodically scheduled as a Lindorm Ray node in a DAG.
- Parameterization: Your code can use standard DataWorks scheduling parameters, such as ${bizdate}.
- Log viewing: In the production environment, to prevent excessive logs from affecting performance, the system loads only the first 1 MB of logs by default. If logs are truncated, the output includes a link that directs you to the Lindorm console to view the complete task logs.
- Resource release: After a scheduled production task ends, the Lindorm Ray task enters a terminal state and no longer occupies resources. During interactive development, you can terminate the Lindorm Ray task by restarting the kernel or closing the notebook.

Appendix: Magic command quick reference

Magic command	Description	Compute engine
`o = %odps`	Get a PyODPS entry object	MaxCompute
`mf_session = %maxframe`	Establish a MaxFrame connection	MaxCompute
`%maxcompute_spark`	Create a Spark session	MaxCompute Spark
`%maxcompute_spark stop`	Clean up the Spark session and stop Livy.	MaxCompute Spark
`%maxcompute_spark delete`	Clean up the Spark session, then stop and delete Livy.	MaxCompute Spark
`%%spark`	In a Python cell, connect to an established Spark compute resource.	MaxCompute Spark
`%emr_serverless_spark`	Create a Spark session	EMR Serverless Spark
`%emr_serverless_spark info`	View detailed information about the Livy Gateway.	EMR Serverless Spark
`%emr_serverless_spark stop`	Clean up the Spark session and stop Livy.	EMR Serverless Spark
`%emr_serverless_spark delete`	Clean up the Spark session, then stop and delete Livy.	EMR Serverless Spark
`%emr_serverless_spark refresh_token`	Refresh the Livy token for the personal development environment.	EMR Serverless Spark
`%adb_spark add`	Create and connect to a reusable ADB Spark session.	AnalyticDB for Spark
`%adb_spark info`	View Spark session information.	AnalyticDB for Spark
`%adb_spark cleanup`	Stop and clean up the current Spark connection session.	AnalyticDB for Spark
`%lindorm_ray`	Establish a Lindorm Ray connection.	Lindorm Ray
`%%lindorm_ray`	Establish a Lindorm Ray connection and configure a custom runtime environment, such as by installing dependencies or uploading code.	Lindorm Ray

FAQ

Q: Why do I receive a ModuleNotFoundError or There is no file with id ... error when referencing a workspace resource?
A: Follow these steps to troubleshoot the issue:
- Go to Data Development > Resource Management to ensure the MaxCompute Python resource has been saved. If this error occurs in the production environment, verify that the resource is published to it.
- In the Notebook editor toolbar, click Restart to reload the resource.
Q: Why are the old resources still referenced after I update the workspace resources?
A: After you republish a modified resource, set Dataworks › Notebook › Resource Reference: Download Strategy to autoOverwrite in the DataStudio settings, and then click Restart in the notebook toolbar to reload the resource.
Q: Why do I receive a FileNotFoundError error in the development environment when referencing a dataset?
A: Ensure that the dataset is mounted in the currently selected personal development environment.
Q: Why does referencing a dataset work in the development environment but fail in the production environment with the error Execute mount dataset exception! Please check your dataset config?
A: Ensure that the dataset is mounted in the Scheduling Settings of the Notebook node and that you have granted the necessary permissions to the OSS dataset.
Q: How do I check the version of my personal development environment?
A: In your personal development environment, press Cmd+Shift+P and enter ABOUT to view the current version. If an update to version 0.5.69 or later is required, an upgrade prompt appears. Click One-click Upgrade to update your instance.
Q: Why does the connection to the Spark engine fail?
A: Follow these steps:
- General checks: On the compute resource list of the workspace details page, confirm that the target compute resource (MaxCompute, EMR, or ADB) is correctly bound to your workspace and that your account has the necessary permissions.
- EMR Serverless Spark: Verify that the Livy Gateway exists and is healthy.
- AnalyticDB for Spark: Focus on network issues. Confirm that the vswitchId and securityGroupId are correctly configured to ensure network connectivity between your personal development environment and the ADB Spark instance. Verify that your security group rules allow traffic on the required ports.