What is the new Data Development in DataWorks - DataWorks

Data Studio is an intelligent data lakehouse development platform from Alibaba Cloud, built on more than a decade of big data expertise. It supports a wide range of Alibaba Cloud computing services, offering capabilities for intelligent ETL, data catalog management, and cross-engine workflow orchestration. Data Studio provides personal development environment instances with support for Python, Notebook analysis, and Git integration. Complemented by a rich plugin ecosystem, it unifies real-time and offline processing, data lakehouse architectures, and big data with AI to power the entire Data+AI data management lifecycle.

Data Studio

Data Studio is an intelligent data lakehouse development platform that embeds Alibaba's big data best practices. It deeply integrates with dozens of Alibaba Cloud big data and AI computing services, including MaxCompute, E-MapReduce, Hologres, Flink, and PAI. It provides intelligent ETL development services for data warehouses, data lakes, and OpenLake data lakehouse architectures. Key features include:

Data lakehouse and multi-engine support
Seamlessly access data across data lakes (such as OSS) and data warehouses (such as MaxCompute) through a unified data catalog and a rich set of engine nodes to enable multi-engine hybrid development.
Flexible workflows and scheduling
Use a rich set of flow control nodes to visually orchestrate cross-engine tasks in a workflow. The platform supports both time-driven periodic scheduling and event-driven trigger-based scheduling.
Open Data+AI development environment
Build a flexible AI research and development workstation with a personal development environment that allows for custom dependencies, a Notebook that supports mixed SQL and Python coding, and features like dataset management and Git integration.
Intelligent assistance and AI engineering
The powerful built-in Copilot assists you throughout the code development process. Professional PAI algorithm nodes and large model nodes provide native support for end-to-end AI engineering.

Key concepts

Concept	Definition	Core value	Keywords
Workflow	A unit for organizing and orchestrating tasks	Enables dependency management and automated scheduling for complex tasks. A workflow serves as the container for development and scheduling.	Visual, DAG, periodic/trigger-based, orchestration
Node	The smallest execution unit in a workflow	Where you write code and implement specific business logic. A node is the atomic operation for data processing.	SQL, Python, Shell, Data Integration
Custom image	A standardized snapshot of an environment	Ensures environment extensibility, consistency, and reproducibility.	Environment freezing, standardization, replication, consistency
Scheduling	Rules for automatically triggering tasks	Automates data production by converting manual tasks into automated, production-ready workloads.	Periodic scheduling, trigger-based scheduling, dependency, automation
Data Catalog	A unified metadata workbench	Organizes and manages data assets (such as tables) and compute resources (such as functions and resources) in a structured manner.	Metadata, table management, data profiling
Dataset	A logical mapping to external storage	Bridges the connection to external unstructured data (images/documents) and serves as a key data bridge for AI development.	OSS/NAS integration, data mounting, unstructured data
Notebook	An interactive Data+AI development canvas	Combines SQL and Python code to accelerate data exploration and algorithm validation.	Interactive, multi-language, visual, exploratory analysis

Data Studio process guide

Data Studio provides processes for data warehouse development and AI development. The following section describes two common paths. You can explore more paths based on your actual needs.

Common path: Data warehouse development process (periodic ETL tasks)

This process is suitable for building enterprise-grade data warehouses with stable, automated batch data processing.

Target audience: Data engineers and ETL developers.
Core objective: Build a stable, standardized, and automatically scheduled enterprise-grade data warehouse for batch data processing and report generation.
Key technologies: Data Catalog, scheduled workflows, SQL nodes, and schedule settings.

Step	Phase	Core operations and objectives	Key path and references
1	Associate a compute engine	Associate one or more core compute engines (such as MaxCompute) with the workspace as the execution environment for all SQL tasks.	Console > Workspace Settings For more information, see Associate compute engines.
2	Data Catalog management	Create or explore the table structures required for each data warehouse layer (ODS, DWD, ADS, etc.) in the Data Catalog to define the inputs and outputs for data processing. We recommend that you use the Data Modeling module to build the data warehouse system.	Data Studio > Data Catalog For more information, see Data Catalog.
3	Create a scheduled workflow	Create a scheduled workflow in the project directory as a container for organizing and managing related ETL tasks.	Data Studio > Project Directory > Scheduled Workflow For more information, see Scheduled workflows.
4	Node development and debugging	Create ODPS SQL nodes or other nodes, write core ETL logic (data cleansing, transformation, and aggregation) in the editor, and debug the nodes.	Data Studio > Node Development > Node Editor Data Studio > Node Development > Run Configuration For more information, see Node development.
5	Copilot-assisted development	Use DataWorks Copilot capabilities to generate, correct, rewrite, and convert SQL and Python code.	Data Studio > Node Development > Copilot Data Studio > Copilot > Agent For more information, see Copilot.
6	Node orchestration and scheduling	In the DAG canvas of the workflow, define upstream and downstream dependencies between nodes by dragging and connecting them. Various flow control nodes are supported for complex orchestration. Configure the schedule settings for the production environment at the workflow or node level, including the schedule, time, and dependencies. The platform supports ultra-large-scale scheduling of tens of millions of instances per day.	Data Studio > Workflow > Workflow Canvas Data Studio > Node Development > Schedule Settings For more information, see General flow control nodes and Schedule settings.
7	Workflow/node deployment and O&M	Deployment: Deploy the debugged nodes or workflows to the production environment through the deployment process. O&M: Monitor, configure alerts for, backfill data for, and perform periodic validation on production tasks in Operation Center. Use intelligent baselines to ensure that tasks are completed on time, and use monitoring alerts to promptly handle task exceptions.	Data Studio > Node/Workflow Details > Node/Workflow Deployment Operation and Maintenance Center > Auto Triggered Task O&M > Auto Triggered Task For more information, see Deploy nodes and workflows and Operation Center.

Note

For a related getting-started tutorial, see Getting started tutorial.

Advanced path: Big data AI development process

This process is suitable for AI model development, data science exploration, and building real-time AI applications. It emphasizes environment flexibility and interactivity. The actual process may vary based on your needs.

Target audience: AI engineers, data scientists, and algorithm engineers.
Core objective: Perform data exploration, model training, and algorithm validation, or build real-time AI applications (such as RAG and real-time inference services).
Key technologies: Personal development environment, Notebook, trigger-based workflows, datasets, and custom images.

Step	Phase	Core operations and objectives	Key path and references
1	Create a personal development environment	Create an isolated, customizable cloud container instance as the environment for installing complex Python dependencies and performing professional AI development.	Data Studio > Personal Development Environment For more information, see Personal development environment.
2	Create a trigger-based workflow	Create an event-driven workflow in the project directory as an orchestration container for real-time AI applications.	Data Studio > Project Directory > Trigger-based Workflow For more information, see Trigger-based workflows.
3	Create and configure a trigger	Configure a trigger in Operation Center to define which external events (such as OSS events or Kafka message events) start the workflow.	Create: Operation Center > Trigger Management Use: Data Studio > Trigger-based Workflow > Schedule Settings For more information, see Trigger management and Trigger-based workflow schedule settings.
4	Create a Notebook node	Create the core development unit for writing AI/Python code. We recommend that you first explore in a Notebook in the personal directory.	Project Directory > Trigger-based Workflow > Notebook Node For more information, see Notebook node development.
5	Create and use datasets	Register unstructured data (images, documents, etc.) stored on OSS or NAS as datasets, and mount them to the development environment or tasks for code access.	Create: Data Map > Data Catalog > Dataset Use: Data Studio > Personal Development Environment > Dataset Configuration For more information, see Create a dataset and Use datasets.
6	Develop and debug Notebooks/nodes	Write algorithm logic in the interactive environment provided by the personal development environment. Perform data exploration, model validation, and rapid iteration.	Data Studio > Notebook Editor For more information, see Notebook.
7	Install custom dependency packages	In the terminal of the personal development environment or in a Notebook cell, use tools such as `pip` to install all the third-party Python libraries required by the model.	Data Studio > Personal Development Environment > Terminal For more information, see Install custom dependency packages.
8	Build a custom image	Freeze the personal development environment with all dependencies configured into a standardized image to ensure that the production environment is fully consistent with the development environment. If you have not installed custom dependency packages, skip this step.	Data Studio > Personal Development Environment > Manage Environment Console > Custom Image For more information, see Custom images.
9	Node schedule settings	In the schedule settings of the production node, you must specify the custom image created in the previous step as the runtime environment and mount the required datasets.	Data Studio > Notebook Node > Schedule Settings For more information, see Schedule settings.
10	Node/workflow deployment and O&M	Deployment: Deploy the configured trigger-based workflow to the production environment. O&M: Trigger a real event (such as uploading a file) to verify that the end-to-end process works correctly, and perform trigger validation.	Data Studio > Node/Workflow Details > Node/Workflow Deployment Operation and Maintenance Center > Manually Triggered Task O&M > Manually Triggered Task For more information, see Deploy nodes and workflows and Manually triggered tasks.

Data Studio core modules

Core module	Main capabilities
Workflow orchestration	Provides a visual DAG canvas that allows you to easily build and manage complex task projects by dragging and connecting nodes. Supports scheduled workflows, trigger-based workflows, and manual workflows to meet automation needs in different scenarios.
Execution environments and modes	Provides flexible, open development environments that improve development efficiency and collaboration. Execution environments: Supports the default development environment, personal development environments, and custom images to meet personalized development needs. Also supports Git integration for code version management , allowing you to use familiar tools. Development modes: Provides a project directory (team collaboration), a personal directory (personal development and testing), and a manual directory (ad hoc tasks) to effectively isolate and manage development assets.
Node development	Supports a wide range of node types and compute engines for flexible data processing and analysis. Compute engines: Seamlessly integrates with big data compute engines such as MaxCompute, EMR, Hologres, and Flink, as well as AI computing services such as PAI. Node types: Provides Data Integration, SQL, Python, Shell, Notebook, large model nodes, and various AI interactive nodes to meet diverse needs such as data synchronization, cleansing, processing, and AI training. For more information, see Compute engines and Node development.
Node scheduling	Provides powerful and flexible automated scheduling capabilities to ensure that tasks are executed on time and in order. Scheduling mechanism: Supports time-based periodic scheduling (by year, month, day, hour, minute, or second) and event-driven or OpenAPI-triggered scheduling. Scheduling dependencies: Supports complex same-cycle, cross-cycle, cross-workflow, and cross-workspace dependencies, as well as dependencies between tasks with different schedules and types. Scheduling policies: Supports advanced policies such as effective time ranges, rerun on failure, dry runs, and freezing. Scheduling parameters: Supports workflow parameters, workspace parameters, context parameters, and node parameters. For more information, see Schedule settings.
Development resource management	Provides unified management of various assets involved in the data development process. Data Catalog: Provides lakehouse metadata management capabilities, supporting the creation, viewing, and management of data tables. Functions and resources: Supports the management and referencing of UDFs and various resource files (such as JAR and Python files). Datasets: Supports mounting and managing datasets from external storage such as OSS and NAS. For more information, see Data Catalog, Functions and resources, and Datasets.
Quality control	Provides multiple built-in control mechanisms to ensure the standardization of data production processes and the accuracy of output data. Code review: Supports manual code review before task deployment to ensure code quality. Process control: Combines smoke testing, governance item checks, and extensions to perform automated validation during task submission and deployment. Data Quality: Associates data quality monitoring rules that are automatically triggered after task execution to detect data issues as early as possible. For more information, see Code review, Process control, Extensions, and Data quality rules configuration.
Openness and extensibility	Provides rich open interfaces and extension points for integration with external systems and secondary development. OpenAPI: Provides comprehensive API interfaces for programmatically managing and operating development tasks. Event messages: Supports subscribing to data development event messages for integration with external systems. For more information, see Open API, Open Event, and Extensions.

Data Studio billing

Charges on the DataWorks side (included in the DataWorks bill)
- Resource group fees: Node development and personal development environment usage rely on resource groups. Depending on the resource group type, you are charged serverless resource group fees or exclusive resource group for scheduling fees.
  If you use large model services, serverless resource group fees are also incurred.
- Task scheduling fees: If a task is deployed to the production environment for scheduled execution, task scheduling fees (when using a serverless resource group) or exclusive resource group for scheduling fees (when using an exclusive resource group) are incurred.
- Data Quality fees: If you configure quality monitoring for a scheduled task and an instance is successfully triggered, Data Quality instance fees are incurred.
- Intelligent baseline fees: If you configure an intelligent baseline for a scheduled task, an enabled intelligent baseline incurs intelligent baseline instance fees.
- Alert SMS and phone call fees: If you configure monitoring alerts for a scheduled task and SMS or phone alerts are successfully triggered, alert SMS and phone call fees are incurred.
  Note
  Modules involved: Data Studio, Data Quality, and Operation Center.
Charges on non-DataWorks side (not included in the DataWorks bill)
When you run data development node tasks, compute engine compute and storage fees (such as OSS storage fees) that may be incurred are not charged by DataWorks.

Get started with Data Studio

Create or enable the new Data Studio

When creating a workspace, select Use Data Studio (New Version). For more information, see Create a workspace.
In legacy Data Studio, you can click the Upgrade to New Version button at the top of the DataStudio page and follow the on-screen instructions to migrate your data to the new Data Studio. For more information, see Upgrade to the new Data Studio.

Open the new Data Studio

Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

FAQ

Q: How do I distinguish between the new Data Studio and legacy Data Studio?
A: The two have completely different page styles. The new Data Studio features a dark-themed top bar and a tree-structured directory on the left side, while legacy Data Studio uses a light-colored background and a traditional panel layout.
Q: After upgrading to the new Data Studio, can I roll back to legacy Data Studio?
A: Upgrading from legacy Data Studio to the new version is an irreversible operation. After the upgrade is complete, you cannot roll back to legacy Data Studio. Before upgrading, we recommend that you first create a workspace with the new Data Studio enabled for testing to ensure that it meets your business requirements. Additionally, data in the new Data Studio and legacy Data Studio is independent of each other.
Q: Why don't I see the Use Data Studio (New Version) option when creating a workspace?
A: If this option is not available on the page, your workspace already has the new Data Studio enabled by default.
Important
If you encounter any issues while using the new Data Studio, you can join the DataWorks Data Studio Upgrade Support Group on DingTalk for assistance.