The DataStudio service of DataWorks allows you to define the development and scheduling properties of auto triggered nodes. DataStudio works with Operation Center to provide a visualized development interface for nodes of various types of compute engines, such as MaxCompute, Hologres, and E-MapReduce (EMR). You can configure settings on the visualized development interface to perform intelligent code development, multi-engine node orchestration in workflows, and standardized node deployment. This way, you can build offline data warehouses, real-time data warehouses, and ad hoc analysis systems to ensure efficient and stable data production.

Go to the DataStudio page

  1. Log on to the DataWorks console.
  2. In the left-side navigation pane, click Workspaces. In the top navigation bar, select the region where the desired workspace resides.
  3. On the Workspaces page, find your workspace and click DataStudio in the Actions column. The DataStudio page appears.

Main features of DataStudio

The following figure shows the main features provided by DataStudio. For more information, see Terms related to data development.
Object organization and management: DataStudio provides a mechanism to organize and manage objects in DataWorks. For more information, see Create a workflow and Node organization and management modes.
  • Object organization: DataWorks provides a two-level management mode (solution > workflow). You can create the required objects in the directory tree of a workflow, or drag components on the workflow's configuration tab to build a data processing workflow. You can use solutions to manage related workflows.
  • Object management: You can create and manage nodes, tables, resources, and functions in a visualized manner.
Node development:
  • Various capabilities
    • DataStudio provides a wide range of compute engine nodes and fully encapsulates compute engine capabilities.
    • DataStudio provides general nodes. You can combine general nodes and nodes of a specific compute engine type in DataWorks to process complex business logic. For example, you can enable external scheduling systems to trigger the scheduling of nodes in DataWorks, check whether files exist, route results based on logical conditions, execute the code of specific nodes in loops, and pass output between nodes.
    • DataWorks allows you to develop custom nodes based on custom wrappers. This way, you can use more computing task types and access custom computing services.
  • Simple operations
    • DataStudio allows you to develop data on the configuration tab of a workflow. You can drag components to implement hybrid orchestration of different types of compute engine nodes.
    • DataStudio provides an intelligent SQL editor. The SQL editor provides features such as code hinting, display of the code structure by using SQL operators, and permission verification.
For information about the node types that are supported by DataWorks, see DataWorks nodes.
Node scheduling:
  • Trigger methods: The scheduling of nodes can be triggered by using an external scheduling system, based on events, or based on the output of ancestor nodes. The output of ancestor nodes is parsed based on inner lineage.
  • Dependencies: You can configure same-cycle and cross-cycle dependencies. You can also configure dependencies between different types of nodes whose scheduling frequencies are different.
  • Execution control: You can determine whether to rerun a node and manage the scheduling time of a node based on the output of its ancestor node. You can specify a validity period during which a node is automatically run as scheduled and the scheduling type of a node. For example, you can specify a node as a dry-run node or freeze a node. After you set a node as a dry-run node, the system returns a success response for the node without running the node. The scheduling of descendant nodes of the node is not blocked. After you freeze a node, the system does not run the node, and the scheduling of descendant nodes of the node is blocked.
  • Idempotence: DataStudio provides a rerun mechanism that you can use to customize rerun conditions and rerun times.
For more information about node scheduling, see Configure time properties and Scheduling dependency configuration guide.
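As a minimal illustration of the dry-run and freeze behavior described above, the following sketch models the scheduling semantics only. It is not DataWorks internals or an API; the status labels and function names are hypothetical.

```python
# Sketch of the dry-run and freeze semantics described above.
# Status labels and functions are illustrative, not a DataWorks API.

def run_node(status: str) -> str:
    """Return the result the scheduler records for a node."""
    if status == "dry-run":
        return "success"   # reported as success; the node code never runs
    if status == "frozen":
        return "blocked"   # not run; descendant scheduling is blocked
    return "success"       # a normal node that ran and succeeded

def descendants_can_run(upstream_result: str) -> bool:
    # Descendants proceed only when the ancestor reports success.
    return upstream_result == "success"

# A dry-run ancestor does not block descendants; a frozen one does.
assert descendants_can_run(run_node("dry-run")) is True
assert descendants_can_run(run_node("frozen")) is False
```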
Node debugging: You can debug a node or a workflow. For more information, see Debugging procedure.
Process control: DataStudio provides a standardized node deployment mechanism and various methods for process control. Operations that you can perform include but are not limited to the following:
  • Review code and perform smoke testing before a node is deployed. This helps block the execution of the process in which an error occurs in the production environment. For information about code review, see Code review.
  • Customize process control for node committing and for deployment to the production environment by combining governance items provided by Data Governance Center with verification logic customized based on extensions.
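The kind of custom verification logic an extension might apply before deployment can be sketched as follows. The function, patterns, and rejection rule are hypothetical examples, not a built-in DataWorks API or its actual governance checks.

```python
import re

# Hypothetical pre-deployment check: flag node code that contains
# destructive SQL statements. Patterns are examples, not DataWorks rules.
FORBIDDEN_PATTERNS = [r"\bDROP\s+TABLE\b", r"\bTRUNCATE\b"]

def review_node_code(sql: str) -> list:
    """Return the list of violated patterns; empty means the node may deploy."""
    return [p for p in FORBIDDEN_PATTERNS
            if re.search(p, sql, re.IGNORECASE)]
```

A deployment hook could then reject the commit whenever review_node_code returns a non-empty list.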
Other features:
  • Openness: DataWorks Open Platform provides various API operations and a large number of built-in extension points. You can subscribe to event messages related to data development on DataWorks Open Platform.
  • Permission control: You can manage the permissions on service modules of DataWorks and the data access permissions. For more information, see Manage permissions on workspace-level services.
  • Viewing of operation records: DataWorks is integrated with ActionTrail. This allows you to query recent DataWorks behavior events of your Alibaba Cloud account in ActionTrail. For more information, see View operation records on the DataStudio page.

Introduction to the DataStudio page

You can follow the instructions that are described in Features on the DataStudio page to use the features of each module on the DataStudio page.

Node development process

On the DataStudio page, you can create real-time synchronization nodes, batch synchronization nodes, batch processing nodes, and manually triggered nodes for different compute engine types. For more information about data synchronization, see Overview of Data Integration. The configuration requirements for nodes vary by compute engine type. Before you develop a node, take note of the precautions and instructions for its compute engine type.
  • Instructions on the development of nodes of different compute engine types: You can associate different compute engines with your DataWorks workspace to develop nodes in DataWorks. The configuration requirements on nodes of different compute engine types vary. For more information, see the following topics:
  • Common development process: The following two workspace modes are available: standard mode and basic mode. The node development process varies based on the workspace mode.
    Node development process in a workspace in standard mode
    Node development process in a workspace in basic mode
    • Basic process: For example, if you develop nodes in a workspace in standard mode, the development process includes the following stages: development, debugging, configuration of scheduling settings, node committing, node deployment, and O&M. For more information, see General development process.
    • Process control: During node development, you can perform operations such as code review and smoke testing in DataStudio, and use check items preset in Data Governance Center together with verification logic customized based on extensions in Open Platform, to ensure that node development meets the specified standards and requirements.
      Note: The available process control operations vary based on the workspace mode.
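The standard-mode stages listed above form a fixed order, which can be sketched as a simple pipeline. This is a toy illustration of the documented sequence, not a DataWorks API; the helper function is hypothetical.

```python
# Toy sketch of the standard-mode development flow described above.
# Stage names mirror the documentation; the helper is illustrative only.
STAGES = [
    "development",
    "debugging",
    "scheduling configuration",
    "node committing",
    "node deployment",
    "O&M",
]

def next_stage(current: str) -> str:
    """Return the stage that follows `current`; the last stage repeats."""
    i = STAGES.index(current)
    return STAGES[min(i + 1, len(STAGES) - 1)]
```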

Node organization and management modes

A workflow is a basic unit for code development and resource management. A workflow is an abstract business entity that allows you to develop code based on your business requirements. Workflows and nodes in different workspaces are separately developed. For more information about workflows, see Create a workflow.

Workflows can be displayed in a directory tree or in a panel. These display modes let you organize code from a business perspective and present resource classification and business logic more clearly.
  • The directory tree allows you to organize your code by node type.
  • The panel shows the business logic in a workflow.
Organizational structure

Appendix: Node types supported by DataStudio

The DataStudio service of DataWorks allows you to create various types of nodes. You can enable DataWorks to periodically schedule instances that are generated for nodes. You can also select a specific type of node to develop data based on your business requirements. For more information about the node types that are supported by DataWorks, see DataWorks nodes.

Appendix: Terms related to data development

  • Terms related to node development
    Solution: A collection of workflows dedicated to a specific business goal. A workflow can be added to multiple solutions. After you develop a solution and add a workflow to the solution, other users can reference and modify the workflow in their solutions for collaborative development.
    Workflow: An abstract business entity and a collection of nodes, tables, resources, and functions for a specific business requirement. Nodes in this type of workflow are triggered to run as scheduled.
    Manually triggered workflow: A collection of nodes, tables, resources, and functions for a specific business requirement. Nodes in this type of workflow are manually triggered to run.
    DAG: The abbreviation of directed acyclic graph. A DAG is used to display nodes and their dependencies. In DataStudio, all nodes in a workflow are displayed in the same DAG. This facilitates node development and dependency configuration.
    Task: A basic execution unit of DataWorks. DataWorks runs tasks in sequence based on the dependencies between the tasks.
    Node: A task in a DAG. DataWorks runs nodes in sequence based on the dependencies between the nodes.
  • Terms related to node scheduling
    Dependency: Defines the sequence in which nodes are run. If Node B can run only after Node A finishes running, Node A is the ancestor node of Node B, and Node B depends on Node A. In a DAG, dependencies are represented by arrows between nodes.
    Output name: The identifier that distinguishes the current node from other nodes. An output name is globally unique. A node can have multiple output names. Scheduling dependencies between nodes are configured based on output names.
    Resource group for scheduling: A group of Elastic Compute Service (ECS) instances on which nodes are scheduled. The following two types of resource groups for scheduling are supported:
    • Shared resource group for scheduling: This resource group is shared by all tenants in DataWorks. During peak hours, nodes may wait for resources. It is suitable for scenarios in which a small number of nodes need to be run and you do not have a high requirement for the timeliness of data output.
    • Exclusive resource group for scheduling: This type of resource group is tenant-specific and is suitable for scenarios in which a large number of nodes need to be run and you have a high requirement for the timeliness of data output. To schedule Shell nodes, you must use an exclusive resource group for scheduling.
    Scheduling parameter: A parameter configured for a node whose value is dynamically replaced at the scheduling time of the node. If code that runs repeatedly needs runtime information such as the date and time, you can dynamically assign values to variables in the code based on the scheduling parameter definitions in DataWorks.
    Data timestamp: The day before the scheduling time. In offline computing scenarios, a data timestamp represents the date on which a business transaction is conducted and is accurate to the day. For example, if you collect statistics on the previous day's turnover today, the previous day is the date of the business transaction and is the data timestamp.
    Scheduling time: The time at which you want a node to be scheduled to process business data, accurate to the second. The scheduling time can differ from the time at which the node actually runs, which is affected by multiple factors.
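The data-timestamp rule and scheduling-parameter substitution described above can be sketched as follows. The placeholder name ${bizdate} and the string-replacement logic are shown for illustration only; they are not DataWorks' actual substitution implementation.

```python
from datetime import datetime, timedelta

def data_timestamp(scheduling_time: datetime) -> str:
    """Data timestamp = the day before the scheduling time, accurate to the day."""
    return (scheduling_time - timedelta(days=1)).strftime("%Y%m%d")

# At scheduling time, a placeholder in node code is replaced with the value.
# The parameter name and replacement mechanism here are illustrative.
code = "SELECT * FROM sales WHERE ds = '${bizdate}';"
rendered = code.replace("${bizdate}", data_timestamp(datetime(2024, 5, 2, 0, 30)))
# rendered now filters on ds = '20240501'
```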