
MaxCompute: Use DataWorks

Last Updated: Nov 24, 2023

You can use the Data Integration service of DataWorks to export data from MaxCompute to other data sources in offline mode and then process the exported data. This topic describes how to export data from MaxCompute by using Data Integration.

Background information

You can use one of the following methods to export data:

  • Use the codeless user interface (UI). After you create a batch synchronization node in the DataWorks console, configure a source, a destination, and field mappings on the codeless UI to export data.

  • Use the code editor. After you create a batch synchronization node in the DataWorks console, switch to the code editor. Then, write code to configure a source, a destination, and field mappings to export data.

Prerequisites

Make sure that the following requirements are met:

  • A DataWorks workspace is created, and the MaxCompute data that you want to export is accessible from the workspace.

  • An exclusive resource group for Data Integration is created. For more information, see Create and use an exclusive resource group for Data Integration.

Limits

Each batch synchronization node can export data from only one table. If you want to export data from multiple tables, you must create multiple batch synchronization nodes.

Procedure

Perform the following steps to export MaxCompute data by using Data Integration:

  1. Add a MaxCompute data source

    Add a MaxCompute data source to DataWorks.

  2. Add the destination to DataWorks

    Add the data source that serves as the destination to the data source list of DataWorks.

  3. Create a workflow

    Create a workflow in the DataWorks console. The workflow is required when you create a batch synchronization node.

  4. Create a batch synchronization node

    Create a batch synchronization node based on the created workflow.

  5. Configure and run the batch synchronization node by using the codeless UI or the code editor

    Configure and run the batch synchronization node by using the codeless UI or code editor.

  6. Check the synchronization results

    Check the synchronization results on the destination.

Add a MaxCompute data source to DataWorks

For more information, see Add a MaxCompute data source of the new version.

Add the destination to DataWorks

Add the destination to the data source list of DataWorks based on the data source type. For more information about how to add a data source, see Add a data source.

Create a workflow

Create a workflow in the DataWorks console. The workflow is required when you create a batch synchronization node.

  1. Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. On the DataStudio page, move the pointer over the Create icon and select Create Workflow.

  3. In the Create Workflow dialog box, configure Workflow Name and Description.

  4. Click Create.

Create a batch synchronization node

Create a batch synchronization node based on the created workflow.

  1. Click the newly created workflow and right-click Data Integration.

  2. Choose Create Node > Offline Synchronization.

  3. In the Create Node dialog box, configure the Name parameter, and select a path from the Path drop-down list.

    Important

    The node name must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).

  4. Click Confirm.

Configure and run the batch synchronization node by using the codeless UI

  1. Configure network connections and a resource group.

    1. Select the source.

      Select MaxCompute(ODPS) from the Source drop-down list and select the created MaxCompute data source from the Data Source Name drop-down list.

    2. Select an exclusive resource group for Data Integration.

      Select an existing exclusive resource group for Data Integration. For more information, see Create and use an exclusive resource group for Data Integration.

    3. Select the destination.

      Specify Destination and Data Destination Name.

    4. Test connectivity.

      Test the network connectivity between the resource group and the source and between the resource group and the destination. Make sure that the exclusive resource group for Data Integration is connected to both data sources. Then, click Next.

  2. Configure a task.

    1. Configure the source and destination.

      In the Source and Destination sections, configure the table from which data is read, the table to which data is written, and the range of data to be synchronized. For more information, see Step 3: Configure the source and destination in the "Configure a batch synchronization node by using the codeless UI" topic.

    2. Configure field mappings.

      After the source and destination are configured, you must configure mappings between source fields and destination fields. After the mappings are configured, the batch synchronization node writes the values of the source fields to the destination fields of the same data type based on the mappings. For more information, see Step 4: Configure mappings between source fields and destination fields in the "Configure a batch synchronization node by using the codeless UI" topic.

    3. Configure channel control.

      You can configure channel control policies to define attributes for data synchronization. For more information, see Step 5: Configure channel control policies in the "Configure a batch synchronization node by using the codeless UI" topic.

  3. Configure scheduling properties.

    On the Properties tab, configure the scheduling properties of the node. You can use scheduling parameters to filter the data to be synchronized.

  4. In the top toolbar, click the Save icon to save the configurations and click the Run icon to run the batch synchronization node.

Configure and run the batch synchronization node by using the code editor

  1. Configure network connections and a resource group.

    Select the source, destination, and exclusive resource group for Data Integration, and then establish network connections between the resource group and the data sources. For more information, see Configure and run the batch synchronization node by using the codeless UI in this topic.

  2. Switch to the code editor and import a template.

    Click the Conversion script icon in the top toolbar to switch to the code editor. If no script is configured, you can apply a script template from the top toolbar of the configuration tab.

  3. Edit code in the code editor to configure the batch synchronization node.

    1. Configure a reader for the synchronization node.

      Configure the source and the source table.

      {
          "stepType": "odps",
          "parameter": {
              "partition": [],
              "datasource": "odps_first",
              "envType": 0,
              "column": [
                  "*"
              ],
              "table": ""
          },
          "name": "Reader",
          "category": "reader"
      }
      • stepType: the data source type of the source. Set this parameter to odps.

      • partition: the partition information of the source table. You can run the show partitions <table_name>; command to view the partition information of the table. For more information, see Table operations.

      • datasource: the name of the MaxCompute data source.

      • column: the name of the source column.

      • table: the name of the source table. You can run the show tables; command to view the table name.

      • name and category: Set name to Reader and category to reader. This way, the data source is configured as the source.
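
      For example, a reader that exports three columns from one daily partition might look like the following snippet. The table name sale_detail, the partition value pt=20231101, and the column names are placeholder values for illustration; replace them with the names that are used in your MaxCompute project.

      {
          "stepType": "odps",
          "parameter": {
              "partition": ["pt=20231101"],
              "datasource": "odps_first",
              "envType": 0,
              "column": [
                  "shop_name",
                  "customer_id",
                  "total_price"
              ],
              "table": "sale_detail"
          },
          "name": "Reader",
          "category": "reader"
      }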

    2. Configure a writer for the batch synchronization node.

      Configure the destination and the destination table.

      {
          "stepType": "oss",
          "parameter": {
              "partition": "",
              "truncate": true,
              "datasource": "",
              "column": [
                  "*"
              ],
              "table": ""
          },
          "name": "Writer",
          "category": "writer"
      }
      • stepType: the data source type of the destination.

      • partition: the partition information of the destination table.

      • datasource: the name of the destination.

      • column: the name of the destination column.

      • table: the name of the destination table.

      • name and category: Set name to Writer and category to writer. This way, the data source is configured as the destination.
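
      For example, a writer that fills in the skeleton above might look like the following snippet. The data source name and table name are placeholder values, and the exact parameters that a writer supports depend on the destination type; check the parameter description of the Writer plug-in for your destination before you run the node.

      {
          "stepType": "oss",
          "parameter": {
              "partition": "",
              "truncate": true,
              "datasource": "my_oss_datasource",
              "column": [
                  "*"
              ],
              "table": "sale_detail_export"
          },
          "name": "Writer",
          "category": "writer"
      }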

    3. Configure channel control policies, such as the maximum transmission rate and the maximum number of dirty data records allowed.

      "setting": {
              "errorLimit": {
                  "record": "1024"   
              },
              "speed": {
                  "throttle": false,   
                  "concurrent": 1   
              }
          },
      • record: the maximum number of dirty data records allowed.

      • throttle: specifies whether throttling is enabled.

      • concurrent: the maximum number of parallel threads that the batch synchronization node uses to read data from the source or write data to the destination.

    For more information, see Step 4: Edit the script of the batch synchronization node to configure the node in the "Configure a batch synchronization node by using the code editor" topic.
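
    The reader, writer, and setting snippets above are fragments of a single script. As a reference, the following sketch shows how they are typically assembled in the code editor. The surrounding frame (type, version, steps, and order) follows the standard script template; if the template that the code editor generates in your workspace differs, keep the generated frame and edit only the reader, writer, and setting sections.

    {
        "type": "job",
        "version": "2.0",
        "steps": [
            {
                "stepType": "odps",
                "parameter": {
                    "partition": [],
                    "datasource": "odps_first",
                    "envType": 0,
                    "column": [
                        "*"
                    ],
                    "table": ""
                },
                "name": "Reader",
                "category": "reader"
            },
            {
                "stepType": "oss",
                "parameter": {
                    "partition": "",
                    "truncate": true,
                    "datasource": "",
                    "column": [
                        "*"
                    ],
                    "table": ""
                },
                "name": "Writer",
                "category": "writer"
            }
        ],
        "setting": {
            "errorLimit": {
                "record": "1024"
            },
            "speed": {
                "throttle": false,
                "concurrent": 1
            }
        },
        "order": {
            "hops": [
                {
                    "from": "Reader",
                    "to": "Writer"
                }
            ]
        }
    }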

  4. Configure the properties of the synchronization node. For more information, see Supported formats of scheduling parameters. A sketch of how a scheduling parameter can be used to filter source partitions is shown after this procedure.

  5. In the top toolbar, click the Save icon to save the configurations and click the Run icon to run the batch synchronization node.
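
When you configure the scheduling properties, you can use scheduling parameters to limit each run of the node to the data of its data timestamp. The following sketch assumes that the source table has a daily partition column named pt and that you assign a parameter such as bizdate=$bizdate in the scheduling properties of the node; both names are assumptions for illustration. The partition setting of the reader then references the parameter:

    "partition": ["pt=${bizdate}"]

Each scheduled instance resolves ${bizdate} to its own data timestamp, so only the matching partition is read.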

Check the synchronization results

Go to the destination and check whether the data in the MaxCompute table is exported to the destination table:

  • If all data is exported, the synchronization is complete.

  • If no data is exported or some data failed to be exported, see Batch synchronization.