Data Studio is an intelligent data lakehouse development platform that incorporates 15 years of Alibaba's big data experience. It is compatible with various Alibaba Cloud compute services and provides intelligent extract, transform, and load (ETL), data catalog management, and cross-engine workflow orchestration. Data Studio supports Python development, Notebook analysis, and Git integration through personal development environments. It also features a rich plug-in ecosystem to integrate real-time and offline computing, data lakehouses, and big data with AI. This helps you manage the entire 'Data+AI' lifecycle.
Introduction to Data Studio
Data Studio is an intelligent data lakehouse development platform built on 15 years of Alibaba's big data methodologies. It is deeply integrated with various big data and AI compute services from Alibaba Cloud, such as MaxCompute, E-MapReduce, Hologres, Realtime Compute for Apache Flink, and PAI. It provides intelligent ETL development services for data warehouses, data lakes, and OpenLake data lakehouse architectures. Data Studio supports the following features:
Data catalog: A data catalog with metadata management capabilities for data lakehouses.
Workflow: A development model that supports the orchestration of workflows that contain real-time, offline, and AI nodes for various engine types.
Personal development environment: Provides support for Python node development and debugging, interactive analysis using Notebook, and integration with Git for code management and NAS or OSS for storage.
Notebook: An intelligent and interactive tool for data development and analysis. It supports SQL or Python analysis for various data engines, lets you run or debug code instantly, and provides visualized data results.
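The following is a minimal, hypothetical sketch of the kind of interactive analysis a Notebook cell supports. It uses a local pandas DataFrame purely for illustration; in Data Studio, you would instead run SQL or Python against a selected engine such as MaxCompute or Hologres.

```python
# Minimal sketch of an interactive Notebook-style analysis step.
# Local pandas data stands in for a query result from a real engine.
import pandas as pd

# Sample rows standing in for the result of a SQL cell.
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "amount": [120.0, 80.0, 95.5, 60.0],
})

# Run or tweak this aggregation and inspect the result immediately.
summary = sales.groupby("region", as_index=False)["amount"].sum()
print(summary)  # in a notebook, the result renders as a table you can chart
```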
Enable the Data Studio (new version)
You can enable the Data Studio (new version) in one of the following ways:
When you create a workspace, select Use Data Studio (New Version). For more information, see Create a workspace.
In the legacy DataStudio, click the Upgrade To New Version button at the top of the page. Follow the on-screen instructions to migrate your data to the Data Studio (new version).

The Data Studio (new version) is available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), and US (Virginia).
If you encounter problems when you use the Data Studio (new version), you can join the exclusive DingTalk group for DataWorks upgrade support.
Data in the Data Studio (new version) and the DataStudio (legacy version) is independent and not interoperable.
The upgrade from the DataStudio (legacy version) to the new version is an irreversible operation. You cannot roll back to the legacy version after a successful upgrade. Before you switch, we recommend that you create a test workspace with the Data Studio (new version) enabled. This lets you ensure that the new version meets your business requirements before you upgrade.
Starting from February 19, 2025, when an Alibaba Cloud account is used to activate DataWorks and create a workspace for the first time in a region that supports the Data Studio (new version), the new version is enabled by default. The legacy version will no longer be supported.
Go to Data Studio
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and choose the Data Studio entry in the Actions column.
This entry point is visible only for workspaces where the Use Data Studio (New Version) feature is enabled. For more information, see Enable the Data Studio (new version).
Data Studio is supported only on Chrome 69 or later on a PC.
Main features of Data Studio
The main features of Data Studio are described in this section. For more information, see Appendix: Data Studio concepts.
| Type | Description |
| --- | --- |
| Flow control | DataWorks Data Studio provides a workflow development model. A workflow is a new development method that provides a visualized directed acyclic graph (DAG) interface from a business perspective, which makes it easy to manage complex node projects. For more information, see Auto triggered workflows, Event-triggered workflows, and Manually triggered workflows. Note: Limits apply to the number of inner nodes and objects that can be created in each workspace. If the number of workflows and objects in your workspace reaches the limit, you cannot create new ones. |
| Task development | For more information about the node types that DataWorks supports, see Node development. |
| Task scheduling | For more information about scheduling, see Node scheduling configuration. |
| Quality control | Provides a standardized node publishing mechanism and various quality control methods for multiple scenarios. |
Data Studio interface
You can use the Data Studio feature guide to learn about the Data Studio interface and the features of each module.
Node development process
Data Studio in DataWorks supports the creation of real-time sync tasks, offline scheduling tasks (including offline sync tasks and offline processing tasks), and manually triggered tasks for various engine types. For more information about data synchronization, see Data Integration.
DataWorks workspaces are available in standard mode and basic mode. The node development process differs between the two modes. The following diagrams show the development processes for both modes.
Development process in a standard mode workspace
Development process in a basic mode workspace
Basic process: In a standard mode workspace, for example, the development process for a scheduled node includes development, debugging, scheduling configuration, publishing, and O&M. For more information about the general development process, see Data development process guide.
Flow control: During node development, you can use features such as the built-in code review in Data Studio, preset checks in Data Management, and custom logic validation using extension programs from the Open Platform to ensure that development nodes comply with your standards.
Data development methods
Data Studio lets you customize the development process. You can quickly build data processing flows using workflows, or you can manually create individual task nodes and then configure their dependencies.
For more information, see Workflow orchestration.
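To illustrate the ordering that dependencies impose, the following is a minimal sketch in plain Python (the node names and graph are hypothetical, and this is not a Data Studio API): a node runs only after all of its upstream nodes complete, which corresponds to a topological order over the DAG.

```python
# Hypothetical three-node flow: ingestion -> cleansing -> reporting.
# Each key maps a node to the set of upstream nodes it depends on.
from graphlib import TopologicalSorter

dependencies = {
    "ods_ingest": set(),           # data integration node, no upstream
    "dwd_clean": {"ods_ingest"},   # SQL processing node
    "ads_report": {"dwd_clean"},   # reporting node
}

# A valid execution order runs every node after all of its dependencies.
for node in TopologicalSorter(dependencies).static_order():
    print("run", node)
# Output: run ods_ingest / run dwd_clean / run ads_report
```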
Node types supported by Data Studio
Data Studio supports various node types, including data integration, MaxCompute, Hologres, EMR, Flink, Python, Notebook, and AnalyticDB for MySQL nodes. Many of these node types support recurring scheduling. You can select the appropriate nodes for your development operations as needed. For a list of nodes that DataWorks supports, see Supported node types.
Appendix: Data Studio concepts
Task development concepts
| Concept | Description |
| --- | --- |
| Workflow | A new development method that provides a visualized DAG interface from a business perspective, which makes it easy to manage complex node projects. A workflow supports orchestrating dozens of node types, such as data integration, MaxCompute, Hologres, EMR, Flink, Python, Notebook, and AnalyticDB for MySQL nodes. It also supports workflow-level scheduling configuration. Recurring and event-triggered workflows are supported. |
| Manually triggered workflow | A collection of nodes, tables, resources, and functions for a specific business requirement. The difference between a manually triggered workflow and a recurring workflow is that nodes in a manually triggered workflow must be triggered manually, whereas nodes in a recurring workflow are triggered on a schedule. |
| Task node | The basic execution unit in DataWorks. Data Studio provides various node types, including data integration nodes for data synchronization, compute engine nodes for data cleansing (such as ODPS SQL, Hologres SQL, and EMR Hive), and general-purpose nodes for complex logic processing (such as zero load nodes for managing multiple nodes and do-while nodes for looping code). You can combine these nodes to meet your data processing needs. |
Node scheduling concepts
| Concept | Description |
| --- | --- |
| Dependency | Dependencies between nodes define their execution order. If node B can run only after node A runs, A is an upstream dependency of B, or B depends on A. In a DAG, dependencies are represented by arrows between nodes. |
| Output name | The name of the output point for each task. It is a virtual entity used to connect upstream and downstream tasks when you set up dependencies within a single tenant (Alibaba Cloud account). When you set up an upstream or downstream dependency for a task, you must use the output name, not the node name or ID. After setup, the output name of a task also serves as the input name for its downstream nodes. |
| Output table name | We recommend that you set the output table name to the output table of the current node. Correctly specifying the output table name helps downstream nodes confirm whether the data comes from the expected ancestor table. We recommend that you do not manually modify the output table name if it is automatically parsed. The output table name is only an identifier. Modifying it does not affect the actual output table, which is determined by the SQL logic. Note: The output name of a node must be globally unique, but the output table name does not have this restriction. |
| Schedule resource group | The resource group used for node scheduling. |
| Scheduling parameter | Variables in code that are dynamically assigned values at runtime. If you want your code to obtain information from the runtime environment during repeated runs, such as the date or time, you can use the scheduling parameters defined by the DataWorks scheduling system to dynamically assign values to variables in your code. |
| Data timestamp | The date directly related to business activities, reflecting the actual time when the business data was generated. This concept is particularly important in offline computing scenarios. For example, in a retail business, you might need to calculate the turnover for October 10, 2024. This calculation often starts in the early morning of October 11, 2024, and the calculated data represents the turnover for October 10, 2024. In this case, October 10, 2024 is the data timestamp. |
| Scheduled time | The time point, accurate to the minute, that a user sets for a recurring task to run. Important: Many factors can affect when a node runs, so a node does not necessarily run immediately at its scheduled time. Before a node runs, DataWorks checks whether its upstream nodes have run successfully, whether the scheduled time has been reached, and whether schedule resources are sufficient. The node is triggered only after all these conditions are met. |
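As a concrete illustration of the relationship between the scheduled time and the data timestamp described above, the following sketch computes the data timestamp for the retail example. Daily offline scheduling is assumed, where the data timestamp is the day before the run date.

```python
# The turnover job for October 10, 2024 runs in the early morning of October 11.
from datetime import datetime, timedelta

scheduled_time = datetime(2024, 10, 11, 1, 0)  # scheduled run time
data_timestamp = (scheduled_time - timedelta(days=1)).date()

print(data_timestamp)  # 2024-10-10: the business date the run processes
```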