
DataWorks:Notebook development

Last Updated: Feb 02, 2026

DataWorks Notebook provides an interactive, modular environment for data analysis and development. You can use Python, SQL, and Markdown cells to connect to compute engines such as MaxCompute, EMR, and AnalyticDB. This lets you manage the entire workflow, from data processing and exploratory analysis to visualization and model development. This topic describes how to efficiently use Notebooks for data development and scheduling tasks.

Run your first Notebook

This section guides you through a simple process. You will create a Notebook, pass parameters from Python to SQL, and query data from a MaxCompute table.

Before you begin, make sure that you meet the following requirements:

  • The current workspace has the new Data Studio enabled.

  • A serverless resource group is created. Notebooks require a serverless resource group to run in the production environment.

  • A personal development environment instance is created. Notebooks require a personal development environment instance to run and debug in the development environment.

    If you have not created one, see Create a personal development environment instance.

Procedure:

  1. Create a Notebook node

    1. In Data Studio, create a Notebook node under Workspace Directories.

    2. Enter a name for the Notebook, such as hello_notebook, and then submit it.

  2. Select a personal development environment

    In the top navigation bar, click Personal Development Environment, and then select your personal development environment instance from the drop-down list.

  3. Write a Python cell to define parameters

    Add a Python cell and enter the following code. This step defines a city variable for the subsequent SQL query.

    # Define a variable for the subsequent SQL query
    city = 'Beijing'
    print(f"Defined city variable city = {city}")
  4. Write an SQL cell to query data

    1. Below the first cell, add an SQL cell.

    2. In the lower-right corner of the cell, switch the SQL type to MaxCompute SQL.

    3. Enter the following SQL code. The code references the city variable defined in the previous Python cell using the ${city} syntax.

      -- Query using the variable defined in Python  
      SELECT '${city}' AS city;
  5. Run the cells and view the results

    1. Click the Run All button on the toolbar at the top of the Notebook.

    2. Observe the execution of each cell:

      • The output below the Python cell shows Defined city variable city = Beijing.

      • The query results are displayed in a table below the SQL cell.

You have now successfully created and run a Notebook that includes Python and SQL interaction.

Core concepts

Understanding the following core concepts is key to ensuring consistent Notebook behavior between development and production environments.

Development environment vs. production environment

| Item | Development environment | Production environment |
| --- | --- | --- |
| Runtime environment | Personal development environment instance | Resource group and image specified in the scheduling configuration |
| Core difference | You use a dedicated development instance where you can freely install Python libraries for debugging. | Tasks run on the resource group specified in the scheduling configuration. This applies to tasks scheduled periodically from the Operation Center and to workflows manually triggered in Data Studio. The environment, including libraries and network settings, depends on the image and resource group that you select. |

How to ensure consistency: If you install Python packages in your personal development environment instance using methods such as pip install, you must create a DataWorks image from your personal development environment to ensure that the production environment has the same dependencies. Then, select this custom image in the scheduling configuration. For more information, see Create a DataWorks image from a personal development environment.

Important

Network connectivity: By default, a personal development environment not attached to a VPC gets a random public IP address with limited bandwidth for public network access. When a Notebook node is deployed, its network is determined by the resource group in its scheduling configuration. To ensure network consistency, attach the same resource group to your personal development environment.

Compute resources and kernels

  • Compute resource: The backend compute engine connected to the Notebook, such as MaxCompute or EMR Serverless Spark. An SQL cell must be bound to a compute resource to execute.

  • Python kernel: The backend environment that executes Python code. In a DataWorks Notebook, this is typically your personal development environment instance. You can use Magic Commands, such as %odps, in your Python code to connect to compute resources to submit tasks or manipulate data.

Folder types and scenarios

The location where you create a Notebook determines its collaboration model, permissions, and deployment flow.

| Folder type | Use cases | Collaboration and deployment |
| --- | --- | --- |
| Workspace Directories | Team collaboration and scheduled tasks. Nodes in this folder are shared within the workspace and follow the standard development, submission, and deployment flow. | Allows multiple users to collaborate on development. Nodes must be deployed to the production environment to be scheduled periodically. |
| Personal Directory | Personal development and debugging. This folder is not visible to other workspace members and is used for personal scripts and temporary tasks. | Visible only to you. To be scheduled, a node must first be submitted to the workspace directory and then deployed. |

Develop and debug a Notebook

Important
  • Data Studio does not automatically save your code by default. We recommend that you manually save your code frequently during development to prevent data loss. You can also enable auto-save in the Data Studio editor by navigating to Settings > Files: Auto Save.

  • If the Notebook lags or becomes unresponsive, you can click the Restart button on the top toolbar to restart the kernel.

Manage cells

  • Add cell: Hover over the top or bottom edge of an existing cell and click one of the buttons that appear, such as + SQL. You can also use the buttons on the top toolbar.

  • Switch cell type: Click the type identifier in the lower-right corner of a cell, such as Python, and select a new type from the menu, such as SQL or Markdown. When you switch the type, the code in the cell is retained. You must manually modify it to fit the new type.

  • Move a cell: Hover over the blue vertical line on the left of a cell, and then click and drag to change its order.

  • Run a cell:

    • Run a single cell: Click Run.

    • Run all cells: Click Run All in the Notebook toolbar.

Pass parameters

Passing Python variables to SQL

Variables defined in a Python cell can be directly referenced in a subsequent SQL cell using the ${variable_name} format.

Example:

  1. Python cell

    table_name = "dwd_user_info_d"
    limit_num = 10
  2. SQL cell

    SELECT * FROM ${table_name} LIMIT ${limit_num};
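Conceptually, the ${variable_name} reference is replaced with the variable's literal value before the SQL is submitted. The following standalone sketch mimics that substitution with Python's `string.Template`, which happens to use the same `${...}` syntax. This is only an illustration of the rendered SQL, not the DataWorks implementation:

```python
from string import Template

# Variables defined in the Python cell
table_name = "dwd_user_info_d"
limit_num = 10

# SQL text as written in the SQL cell
sql = Template("SELECT * FROM ${table_name} LIMIT ${limit_num};")

# Substitute the Python variables to obtain the statement that is executed
rendered = sql.substitute(table_name=table_name, limit_num=limit_num)
print(rendered)  # SELECT * FROM dwd_user_info_d LIMIT 10;
```

Because the substitution is textual, string values used in SQL literals must be quoted in the SQL text itself, as in the `'${city}'` example earlier in this topic.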

Pass results from SQL to Python

When an SQL cell executes a SELECT query, its result is automatically saved to a DataFrame variable. This variable can be used in subsequent Python cells.

Important

If there are multiple SQL statements, only the result of the last statement is saved to the DataFrame variable.

  • Variable naming: By default, the variable name starts with df_. You can click the variable name in the lower-left corner of the SQL cell to rename it.

  • Variable type:

    If multiple variable types are supported, click DataFrame in the lower-left corner of the cell to switch the type.
    • For MaxCompute SQL, Pandas DataFrame and MaxCompute MaxFrame objects are supported.

    • For AnalyticDB for Spark SQL, Pandas DataFrame and PySpark DataFrame objects are supported.

    • For other SQL types, a Pandas DataFrame object is generated.

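For example, once a SELECT result has been captured as a pandas DataFrame (here named df_1, following the default df_ prefix), a subsequent Python cell can work with it directly. The sketch below fabricates a small DataFrame to stand in for a real query result:

```python
import pandas as pd

# Stand-in for the result an SQL cell would save (default name starts with df_)
df_1 = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Beijing"],
    "order_cnt": [120, 95, 40],
})

# Analyze the query result in a subsequent Python cell
summary = df_1.groupby("city", as_index=False)["order_cnt"].sum()
print(summary)
```

From here you can continue with any pandas operations, such as plotting or writing the aggregated result back through another SQL cell.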

Copilot-assisted programming

DataWorks Copilot is a built-in intelligent programming assistant that helps you generate and explain code.

How to use:

  • Click the Copilot icon in the upper-left corner of the selected cell.

  • Right-click inside an SQL cell and select Copilot.

  • Use the keyboard shortcut Cmd+I (Mac) or Ctrl+I (Windows).

Schedule and deploy a Notebook

To run a Notebook on a recurring schedule, you must configure its scheduling settings and deploy it to the production environment.

1. Configure scheduling parameters

If you need the parameters in a Notebook to change dynamically for each scheduled run, such as processing data from different partitions daily, you can set up parameterized scheduling.

  1. Mark the parameter cell: In the Python cell that contains the core parameter definitions, click the ... menu in the upper-right corner and select Mark Cell as Parameters. A parameters tag is then added to the cell, marking it as the parameter entry point for the scheduling task.


  2. Configure scheduling parameters:

    1. In the right pane of the Notebook, click Scheduling.

    2. In the Scheduling Parameters area, assign values to the variables (such as var) defined in your code.


When the task is automatically executed by the scheduling system, the actual value of the var parameter in the code is dynamically replaced by the value configured in the scheduling parameters.
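In code, a parameters cell typically assigns a default that is convenient for interactive debugging; at scheduled run time, the assignment is overridden by the value configured in the Scheduling Parameters area. A sketch of the pattern (the variable name var and the date value are illustrative, not required names):

```python
# Parameters cell (marked via "Mark Cell as Parameters").
# During interactive debugging this default value is used; when the task is
# scheduled, the scheduling system replaces it with the configured value.
var = "2024-06-01"  # illustrative default, e.g. a partition date

# A later cell uses the parameter as an ordinary Python variable
print(f"Processing partition dt={var}")
```

Keeping a sensible default in the parameters cell lets the same Notebook run unchanged in both interactive and scheduled modes.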

2. Configure the runtime environment and resources

  1. Configure the image: In the scheduling configuration, select an image that contains all the dependencies required for the Notebook to run. This is key to ensuring successful execution in the production environment.

    Important

    If you install Python packages in your personal development environment instance using methods such as pip install, you must create a DataWorks image from your personal development environment to ensure that the production environment has the same dependencies. Then, select this custom image in the scheduling configuration. For more information, see Create a DataWorks image from a personal development environment.

  2. Configure the resource group: Select the resource group to use for executing the task. For a Serverless resource group, we recommend configuring no more than 16 CU to prevent task startup failures due to insufficient resources. A single task supports a maximum of 64 CU.

  3. Configure an associated role: For fine-grained permission control, you can associate a specific RAM role with the node. The node then runs with the identity of that role. For more information, see Configure an associated role for a node.

3. Deploy the node

Only nodes in the workspace directories can be deployed and scheduled periodically.

  • For Notebooks in the workspace directories: After completing the configuration, click the Deploy button on the top toolbar.

  • For Notebooks in the personal directory: You must click the Save button and then click Commit to Workspace Directory before deployment.

After the Notebook is deployed, you can monitor and manage your Notebook task on the Auto Triggered Nodes page in the Operation Center.

FAQ

  • Q: Why can my code access the public network during development but fails during a scheduled run?

    A: This is because the network policies for the development and production environments are different.

    • Development environment: For debugging convenience, a personal development environment instance that is not configured with a VPC provides limited public network access by default. This lets you temporarily install packages or call APIs.

    • Production environment: To ensure security and stability, tasks run in a virtual private cloud (VPC) by default and cannot directly access the public network. A task's network configuration is determined by the resource group that you select in the scheduling configuration. If the resource group's VPC is not configured with a NAT Gateway, the task cannot access the public network.

    • Solution: Ensure that the personal development environment instance and the Serverless resource group are set up in the same VPC.

  • Q: Why does my code run successfully in the development environment but fails to find third-party packages during a scheduled run?

    A: Ensure that you package all dependencies (such as Python libraries) into a custom image and specify this image in the Scheduling. For more information, see Create a DataWorks image from a personal development environment.

  • Q: How do I change the Python kernel version?

    A: You can manually install the required Python version in the terminal of your personal development environment instance. Then, use the kernel selector on the right side of the Notebook toolbar to switch to the new Python kernel. We do not recommend installing additional Python kernels: new kernels may lack the dependencies required by SQL cells, which can prevent those cells from functioning correctly.
