DataWorks: Usage notes for development of Hologres tasks in DataWorks

Last Updated: Feb 05, 2024

DataWorks allows you to easily build a real-time data warehouse and an ad hoc analysis system based on Hologres. In the DataWorks console, you can configure Hologres tasks, enable periodic scheduling of the tasks, and manage the metadata of the tasks to ensure that data is generated and managed in an efficient and stable manner. This topic describes the basic development process of Hologres tasks in DataWorks, billing, environment preparation, and permission management.

Prerequisites

Usage notes

The following table describes the usage notes for development of Hologres tasks in DataWorks.

Item | Description
Billing | If you develop Hologres tasks in DataWorks, you are charged not only for DataWorks resources but also for the resources of other Alibaba Cloud services.
Environment preparation | Before you develop Hologres tasks in DataWorks, you must purchase DataWorks of a specific edition and a resource group based on your business requirements, add a Hologres data source, and associate the data source with a workspace.
Permission management | DataWorks provides a comprehensive permission management system that you can use to manage product-level permissions and module-level permissions. You can grant different permissions to different users based on your business requirements to implement fine-grained management of permissions.

Get started with Data Integration

DataWorks Data Integration allows you to read data from and write data to Hologres. DataWorks provides a variety of data synchronization scenarios, such as batch synchronization, real-time synchronization, and full and incremental synchronization.

Get started with Data Modeling and Development

DataWorks provides the Data Modeling service that is used to structure and manage large volumes of unordered and complex data. DataWorks also provides the DataStudio service for development of nodes that are scheduled to run. After the nodes are developed, you can go to Operation Center to monitor and perform O&M operations on the nodes.

Get started with DataAnalysis

DataWorks DataAnalysis provides the Hologres data analysis and service sharing capabilities.

Get started with Data Governance

DataWorks allows you to manage Hologres metadata and govern Hologres data.

Get started with DataService Studio

DataWorks provides the DataService Studio service to help you manage API services for internal and external systems in a centralized manner.

Get started with Open Platform

DataWorks provides open capabilities that allow your application systems to quickly integrate with DataWorks. You can use DataWorks to manage data-related processes, govern data, perform O&M operations on data, and quickly respond to changes to the business status in the application systems.

Billing

DataWorks allows you to create Hologres synchronization tasks and data processing tasks in DataStudio and supports periodic scheduling of these tasks in Operation Center. You are charged not only for DataWorks resources but also for the resources of other Alibaba Cloud services. The following sections provide the details.

1. Fees for DataWorks resources

This section describes the fees that are included in your DataWorks bill. For information about the billable items of DataWorks, see Billing overview.

Fee | Description
Fees for the DataWorks edition that you use | You must activate DataWorks before you can develop tasks in DataWorks. If you activate DataWorks Standard Edition, DataWorks Professional Edition, or DataWorks Enterprise Edition, you are charged the fees for the edition when you purchase the edition.
Fees for the scheduling resources that you use to schedule tasks | After tasks are developed, scheduling resources are required to schedule the tasks. You can purchase resource groups for scheduling, such as subscription exclusive resource groups for scheduling and the pay-as-you-go shared resource group for scheduling, based on your business requirements, and pay for the resource groups.
Fees for the resources that you use to synchronize data | A data synchronization task consumes scheduling resources and synchronization resources. You can purchase resource groups for Data Integration, such as subscription exclusive resource groups for Data Integration and the pay-as-you-go shared resource group for Data Integration (debugging), based on your business requirements, and pay for the resource groups.

Note
  • You are not charged scheduling fees if you run tasks on nodes by clicking Run or Run with Parameters in the top toolbar on the DataStudio page.

  • You are not charged scheduling fees for failed tasks or dry-run tasks.

For more information that helps you understand the billing details, see Issuing logic of scheduling tasks in DataWorks.
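The exemptions in the preceding note can be read as a simple decision rule. The following Python sketch is only an illustration: the `TaskRun` class and its field names are hypothetical and are not part of any DataWorks API, and treating every non-exempt run as billable is an assumption rather than a statement of the actual billing logic.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    # Hypothetical fields that only model the cases listed in the note.
    started_from_datastudio: bool  # Run or Run with Parameters in DataStudio
    dry_run: bool
    succeeded: bool


def incurs_scheduling_fee(run: TaskRun) -> bool:
    """Return False for the three exempt cases in the note, True otherwise."""
    if run.started_from_datastudio:
        return False  # manual runs from the DataStudio toolbar
    if run.dry_run:
        return False  # dry-run tasks
    if not run.succeeded:
        return False  # failed tasks
    # Assumption: any other run counts toward scheduling fees.
    return True


if __name__ == "__main__":
    examples = {
        "scheduled, succeeded": TaskRun(False, False, True),
        "manual run in DataStudio": TaskRun(True, False, True),
        "scheduled dry run": TaskRun(False, True, True),
        "scheduled, failed": TaskRun(False, False, False),
    }
    for label, run in examples.items():
        print(f"{label}: {incurs_scheduling_fee(run)}")
```

For the authoritative rules, refer to the billing documentation that the note points to.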

2. Fees for resources of other Alibaba Cloud services

This section describes the fees that are not included in your DataWorks bill. You may also be charged for the resources of other Alibaba Cloud services that are used to develop and run tasks in DataWorks.

Important

You are charged for the resources of other Alibaba Cloud services based on the billing logic of the Alibaba Cloud services. For more information, see the billing documentation of the Alibaba Cloud services. For example, if you use a Hologres compute engine, you are charged based on the billing logic of Hologres. For more information, see Billable items of Hologres.

Fee | Description
Database fees | When you run data synchronization tasks to read data from and write data to databases, database fees may be generated.
Computing and storage fees | When you run tasks of a specific compute engine type, computing and storage fees of this type of compute engine may be generated. For example, if you create and run a Hologres SQL task to query Hologres data, you may be charged for computing and storage resources of a Hologres compute engine.
Network service fees | When you establish network connections between DataWorks and other related services, network service fees may be generated. For example, if you use services, such as Express Connect, Elastic IP Address (EIP), and Internet Shared Bandwidth, to establish network connections between DataWorks and other related services, you may be charged network service fees.

Environment preparation

Before you develop Hologres tasks in DataWorks, you must purchase DataWorks of a specific edition and a resource group based on your business requirements, associate a Hologres compute engine with a DataWorks workspace, and prepare the development environment. The following table provides the details.

1. Resource preparation

DataWorks provides Standard Edition, Professional Edition, and Enterprise Edition with rich product capabilities. DataWorks also provides resource groups shared by tenants and resource groups that are exclusive to specific tenants. You can select a resource group based on your business requirements.

Item | Description | References
Select a DataWorks edition | DataWorks Basic Edition allows you to perform the following basic operations during the development of Hologres data: migrate data to the cloud, develop data, schedule Hologres tasks, and govern data. If you want to use more advanced data governance and data security solutions, you can purchase DataWorks Standard Edition, DataWorks Professional Edition, or DataWorks Enterprise Edition. | Comparison among DataWorks Standard Edition, DataWorks Professional Edition, and DataWorks Enterprise Edition and their upgrade descriptions
Select a resource group | DataWorks provides shared resource groups for tenants to meet basic scheduling requirements. DataWorks also provides exclusive resource groups for tenants to meet their business requirements in the scenarios in which a large volume of business data needs to be processed in an efficient manner. Resource groups are classified into resource groups for scheduling, resource groups for Data Integration, and resource groups for DataService Studio based on use scenarios. You can plan and allocate resources based on your business requirements. | Overview

2. Development environment preparation

You must add a Hologres instance to a DataWorks workspace as a data source and associate the data source with DataStudio before you can develop data. You can add users to the workspace as members. This facilitates collaborative data development.

Item

Description

References

Prepare a data synchronization environment

Before you run Hologres synchronization tasks in DataWorks to synchronize data from or to Hologres, you must add a Hologres instance to a DataWorks workspace as a data source. You can configure the synchronization tasks for the data source only after the data source is added.

Add a Hologres data source

Prepare an environment for data development and analysis

Before you use DataWorks to schedule a Hologres task, you must add a Hologres instance to a DataWorks workspace as a data source and associate the data source with DataStudio. Then, you can perform operations, such as data development, data analysis, and periodic task scheduling, based on the data source.

Prepare a collaborative development environment

To ensure that RAM users can collaborate with each other to develop data in a workspace, you must perform the following operations:

  • Add the RAM users to the workspace as members and assign the Development role to the RAM users in the workspace.

  • Add workspace members to a Hologres compute engine instance and to a Hologres data source that you associate with a DataWorks workspace, and grant the required permissions on databases to a RAM user that you use to run tasks in the production environment.

Permission management

DataWorks provides a comprehensive permission management system that you can use to manage product-level permissions and module-level permissions. You can grant different permissions to different users based on your business requirements. The following sections provide the details.

1. Management of data access permissions

If you want to use a RAM user added to a DataWorks workspace to develop Hologres tasks in DataWorks, you must grant the RAM user permissions on the Hologres compute engine instance, permissions on the Hologres data source that you associate with the workspace, and permissions on related tables. For more information, see Permission management for Hologres.

2. Management of permissions on services and features

Before you develop data in DataWorks as a RAM user, you must assign a workspace-level role to the RAM user to grant the RAM user specific permissions. For more information, see Best practices for managing permissions of RAM users. DataWorks provides the following permission management systems:

  • You can use RAM policy-based authorization to manage permissions on DataWorks service modules, such as prohibiting DataWorks users from accessing Data Map, and to manage permissions of performing operations in the DataWorks console, such as allowing DataWorks users to delete a workspace.

  • You can use role-based access control (RBAC) to manage permissions on DataWorks workspace-level service modules, such as allowing DataWorks users to access DataStudio to perform development-related operations, and to manage permissions on DataWorks global-level service modules, such as prohibiting DataWorks users from accessing Data Security Guard.

Development process

Getting started

DataWorks provides multiple modules. You can develop tasks for which scheduling properties are configured in DataStudio. After the tasks are developed, you can go to Operation Center in the production environment to monitor and perform O&M operations on the tasks. DataWorks also provides process control for task development and deployment to standardize data development operations and ensure security of data development.

1. Data integration

DataWorks Data Integration allows you to read data from and write data to Hologres. You can synchronize data between a Hologres data source and another type of data source. In addition, DataWorks provides a variety of data synchronization scenarios, such as batch synchronization, real-time synchronization, and full and incremental synchronization. You can select one based on your business requirements. For more information, see Overview.
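To make the difference between full and incremental synchronization concrete, the following Python sketch copies rows between two in-memory lists. It is a conceptual illustration only and does not show how Data Integration is implemented. The `id` and `updated_at` columns and the watermark approach are assumptions chosen for the example, and real-time synchronization is not modeled.

```python
from typing import Dict, List

Row = Dict[str, int]


def full_sync(source: List[Row]) -> List[Row]:
    """Full synchronization: copy every source row to the destination."""
    return [dict(row) for row in source]


def incremental_sync(source: List[Row], destination: List[Row], watermark: int) -> int:
    """Copy only rows changed after the watermark and return the new watermark.

    Rows are upserted by the hypothetical primary key column "id".
    """
    by_id = {row["id"]: row for row in destination}
    new_watermark = watermark
    for row in source:
        if row["updated_at"] > watermark:
            by_id[row["id"]] = dict(row)
            new_watermark = max(new_watermark, row["updated_at"])
    # Replace the destination contents in place, ordered by primary key.
    destination[:] = sorted(by_id.values(), key=lambda r: r["id"])
    return new_watermark


# Initial load: everything is copied once.
source_rows = [
    {"id": 1, "updated_at": 10, "amount": 5},
    {"id": 2, "updated_at": 20, "amount": 7},
]
destination_rows = full_sync(source_rows)
last_synced = 20

# Later, one row changes and one row is added in the source.
source_rows[0] = {"id": 1, "updated_at": 30, "amount": 6}
source_rows.append({"id": 3, "updated_at": 40, "amount": 9})
last_synced = incremental_sync(source_rows, destination_rows, last_synced)
```

After the second step, only the changed row and the new row have been transferred, which is why incremental synchronization is usually cheaper than repeating a full copy.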

2. Data modeling and development

Module

Description

References

Data Modeling

Data Modeling is the first step in end-to-end data governance. It follows the modeling methodology of the Alibaba data mid-end and interprets the business data of an enterprise from a business perspective by using the data warehouse planning, data standard, dimensional modeling, and data metric modules. This allows personnel inside the enterprise to quickly understand and share a common way of measuring and interpreting business data in compliance with data warehousing specifications.

Data Modeling overview

DataStudio

DataWorks encapsulates capabilities of a Hologres compute engine and allows you to run Hologres data synchronization tasks and Hologres data development tasks.

  • Data synchronization: You can synchronize data between a Hologres data source and another type of data source. DataStudio supports only specific batch synchronization and real-time synchronization scenarios. For more information about use scenarios of data synchronization, access the Data Integration module.

  • Data development: You can develop different types of tasks in DataWorks and have the system schedule them periodically, without the need to use complex command lines.

You can use general nodes and nodes of a specific type of compute engine in DataWorks to process complex logic.

DataWorks supports the following types of general nodes:

  • Zero load nodes that are used to manage workflows

  • HTTP Trigger nodes that are used in the scenarios in which external scheduling systems are used to trigger scheduling of nodes in DataWorks, OSS object inspection nodes, and FTP Check nodes

  • Assignment nodes that are used to pass input parameters and output parameters for nodes, and parameter nodes

  • Do-while nodes that are used to execute node code in loops, for-each nodes that are used to traverse the outputs of assignment nodes in loops and judge the outputs, and branch nodes

  • Other nodes, such as common Shell nodes and MySQL database nodes

After tasks are developed based on nodes, you can perform the following operations based on your business requirements:

  • Configure scheduling properties for nodes

    If you want to enable DataWorks to periodically run your tasks on nodes, you must configure scheduling properties for the nodes, such as scheduling dependencies and scheduling parameters.

  • Debug nodes

    To ensure that tasks on nodes in the production environment are run in an efficient manner and prevent a waste of computing resources, we recommend that you debug and run the tasks before you deploy the tasks.

  • Deploy nodes

    The tasks on nodes can be scheduled to run only after they are deployed to the production environment. Therefore, after the tasks are developed, you must deploy the tasks to the production environment. After the tasks are deployed, you can view and manage the tasks on the Cycle Task page in Operation Center.

  • Manage nodes

    You can perform various operations on the tasks on nodes, such as deploying and undeploying the tasks, and modifying scheduling properties for multiple tasks at the same time.

  • Perform process management

    DataWorks provides process control for task development and deployment to ensure the accuracy and security of the operations that are performed on tasks. For example, DataWorks provides the code review, forceful smoke testing, and code review logic customization features.

Operation Center

Operation Center is an end-to-end big data O&M and monitoring platform. Operation Center allows you to view the status of nodes and perform O&M operations on nodes on which exceptions occur. For example, you can perform intelligent diagnostics and rerun nodes in Operation Center. Operation Center provides the intelligent baseline feature that you can use to resolve issues such as uncontrollable output time of important nodes and difficulties in monitoring of massive nodes. This feature helps you ensure the timeliness of node output.

Perform basic O&M operations on auto triggered nodes

Data Quality

Data Quality ensures data availability throughout the end-to-end data R&D process and provides reliable data for your business in an efficient manner. By running quality checks based on monitoring rules and associating the monitoring rules with node scheduling processes, Data Quality helps you identify data quality issues at the earliest opportunity and prevents the issues from escalating.

Data Quality overview
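The scheduling dependencies and scheduling parameters that are configured for DataStudio nodes, as described earlier in this section, can be pictured with a short Python sketch. The node names, the `orders` table, and the helper functions are hypothetical. The sketch assumes the common convention that a data timestamp parameter, referenced in node code as `${bizdate}`, resolves to the day before the scheduled run date in yyyymmdd format. It does not reproduce the DataWorks scheduler.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical nodes: each key depends on the nodes in its value set.
dependencies: Dict[str, Set[str]] = {
    "sync_orders_to_holo": {"workflow_start"},
    "holo_sql_daily_report": {"sync_orders_to_holo"},
    "publish_report": {"holo_sql_daily_report"},
}


def run_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every node runs after its ancestors."""
    return list(TopologicalSorter(deps).static_order())


def data_timestamp(scheduled_day: date) -> str:
    """Assumed meaning of the data timestamp: the day before the run date."""
    return (scheduled_day - timedelta(days=1)).strftime("%Y%m%d")


def render(code: str, scheduled_day: date) -> str:
    """Replace the ${bizdate} placeholder in node code with its value."""
    return code.replace("${bizdate}", data_timestamp(scheduled_day))


node_code = "SELECT count(*) FROM orders WHERE ds = '${bizdate}'"
order = run_order(dependencies)
rendered = render(node_code, date(2024, 2, 5))
```

Here `order` lists the ancestor nodes before their descendants, and `rendered` contains the query with the placeholder replaced by the data timestamp of the instance.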

3. Data analysis

The DataAnalysis service module of DataWorks helps you perform SQL-based analysis online, gain insight into business requirements, and edit and share data. You can also save query results as chart cards and quickly generate visualized data reports based on the chart cards for daily reporting. For more information, see DataAnalysis overview.

4. Data governance

After you associate a Hologres data source with a DataWorks workspace, DataWorks automatically collects the metadata of the data source. You can go to Data Map to view the metadata of the data source. You can also go to Data Governance Center to view data governance issues that are detected by DataWorks.

Module

Description

References

Data Map

Data Map is an enterprise-grade data management platform that provides management, sorting, quick search, and in-depth understanding capabilities for data objects based on the underlying unified metadata services.

Data Map overview

Security Center

Data Security Guard

Approval Center

Security Center is an end-to-end data security governance platform that covers classification of data assets, sensitive data identification, management on data-related authorization, masking of sensitive data, audit of access to sensitive data, and risk identification and response. Security Center helps you determine data security governance issues.

Data Governance Center

Data Governance Center automatically identifies items to be governed for multiple governance fields based on rules that come from experience in data-related fields, and provides governance and optimization solutions covering pre-event issue prevention and post-event issue resolution. Data Governance Center can help you actively and systematically complete data governance.

Data Governance Center overview

5. Data service

DataService Studio is designed to provide comprehensive data service and sharing capabilities for enterprises and helps enterprises manage API services for internal and external systems in a centralized manner. For more information, see DataService Studio overview.

6. Open Platform

DataWorks provides open capabilities that allow your application systems to quickly integrate with DataWorks. You can use DataWorks to manage data-related processes, govern data, perform O&M operations on data, and quickly respond to changes to the business status in the application systems.

Item | Description | References
OpenAPI | The OpenAPI module allows you to call DataWorks API operations so that you can integrate your applications with DataWorks. This can help facilitate big data processing, decrease manual operations and O&M operations, minimize data risks, and reduce costs for enterprises. | OpenAPI
OpenEvent | The OpenEvent module allows you to subscribe to DataWorks change events related to your applications so that you can detect and respond to the changes at the earliest opportunity. | OpenEvent overview
Extensions | You can use the OpenEvent module to subscribe to event messages that are generated in your DataWorks workspace. You can use the Extensions module to register your local program as an extension to manage extension point events and processes. | Extensions overview

Appendix: Relationship between DataWorks and Hologres

Note

If you use a workspace in basic mode, only the production environment is provided, and you can associate only one Hologres compute engine instance with the workspace. In this topic, a workspace in standard mode is used.

DataWorks provides some Hologres-related capabilities. For example, you can schedule Hologres batch synchronization tasks, manage Hologres metadata, govern Hologres data, and perform security control on Hologres data in DataWorks. Data computing and storage of the tasks are still performed in Hologres. If you use a workspace in standard mode, you can associate different Hologres instances with the workspace in the development and production environments. This way, items such as storage and resources are isolated between the development and production environments.

  • For information about how to add a Hologres data source to a DataWorks workspace and associate the data source with DataStudio, and how to view Hologres instances used in different environments, see Add a Hologres data source.

  • For information about the issuing logic of nodes that are scheduled to run in DataWorks, see Issuing logic of scheduling nodes in DataWorks.
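The relationship between workspace modes and Hologres instances described in this appendix can be summarized in a small Python sketch. The function name and the instance names are hypothetical, and requiring a separate development instance in standard mode is an assumption made to keep the example simple.

```python
from typing import Dict, Optional


def workspace_environments(
    mode: str, prod_instance: str, dev_instance: Optional[str] = None
) -> Dict[str, str]:
    """Map each environment of a workspace to a Hologres instance."""
    if mode == "basic":
        # Basic mode: only the production environment, one instance.
        if dev_instance is not None:
            raise ValueError("a basic-mode workspace has no development environment")
        return {"production": prod_instance}
    if mode == "standard":
        # Standard mode: development and production can use different instances.
        if dev_instance is None:
            raise ValueError("this sketch expects a development instance in standard mode")
        return {"development": dev_instance, "production": prod_instance}
    raise ValueError(f"unknown workspace mode: {mode!r}")


basic = workspace_environments("basic", "holo_prod")
standard = workspace_environments("standard", "holo_prod", "holo_dev")
```

In this picture, a task that is debugged in the development environment of a standard-mode workspace reads and writes `holo_dev`, and the same task runs against `holo_prod` after it is deployed.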