This topic describes how to import data from a specific data source to MaxCompute offline or in real time by using the Data Integration service of DataWorks.

Background information

You can use Data Integration to import data offline or in real time:
  • Import data offline

    You can use one of the following methods to import data offline:

    • Use the codeless user interface (UI). After you create a batch synchronization node in the DataWorks console, configure a source, a destination, and field mappings on the codeless UI to import data.
    • Use the code editor. After you create a batch synchronization node in the DataWorks console, switch to the code editor. Then, write code to configure a source, a destination, and field mappings to import data.
  • Import data in real time

    You can use one of the following methods to import data in real time:

    • Create a real-time synchronization node to import data in a single table.
    • Create a real-time synchronization node to import data in an entire database.
    • Configure a sync solution to import real-time data with one click.

Prerequisites

  • The data sources and tables from which you want to import data to MaxCompute are prepared.
  • A MaxCompute project is created.

    For more information about how to create a MaxCompute project, see Create a MaxCompute project.

Limits

In the offline import scenario, each batch synchronization node can import data in one or more tables to only one table in MaxCompute.

Import data offline

  1. Add a MaxCompute data source.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. Select the region where the required workspace resides, find the workspace, and then click Data Integration in the Actions column.
    4. In the left-side navigation pane, choose Data Source > Data Sources.
    5. On the Data Source page, click Add data source in the upper-right corner.
    6. In the Add data source dialog box, click MaxCompute.
    7. In the Add MaxCompute data source dialog box, configure the parameters.
      • Data Source Name: the name of the data source. The name can contain letters, digits, and underscores (_) and must start with a letter.
      • Data source description: the description of the data source. The description can be a maximum of 80 characters in length.
      • Environment: the environment in which the data source is used. Valid values: Development and Production.
        Note: This parameter is displayed only if the workspace is in standard mode.
      • ODPS Endpoint: the endpoint of the MaxCompute project. The endpoint is automatically generated by MaxCompute based on system configurations.
      • Tunnel Endpoint: the endpoint of the MaxCompute Tunnel service. For more information, see Endpoints.
      • ODPS project name: the name of the MaxCompute project.
      • AccessKey ID: the AccessKey ID of the account that you use to connect to the MaxCompute project. You can view the AccessKey ID on the Security Management page.
      • AccessKey Secret: the AccessKey secret that corresponds to the AccessKey ID. The AccessKey secret is equivalent to a logon password.
    8. Select Data Integration for Resource Group connectivity.
    9. In the resource group list, find the required resource groups and click Test connectivity.
      During data synchronization, one synchronization node uses only one resource group. To ensure that your synchronization node can be properly run, we recommend that you test the connectivity of all types of resource groups for Data Integration. If you want to test the connectivity of multiple types of resource groups for Data Integration at the same time, select the resource groups and click Batch test connectivity. For more information, see Select a network connectivity solution.
      Note
      • By default, only exclusive resource groups for Data Integration are displayed. To ensure the stability and performance of data synchronization, we recommend that you use exclusive resource groups for Data Integration.
      • If you want to test the connectivity of the shared resource group for Data Integration or custom resource groups, click Advanced below the resource group table. In the Warning message, click Confirm. Then, the shared resource group for Data Integration and custom resource groups are displayed.
    10. After the connectivity test succeeds, click Complete.
  2. Add the data source from which you want to import data to MaxCompute.
    Add the data source from which you want to import data to MaxCompute based on the data source type. For more information about how to add a data source, see Configure data sources.
  3. Create a workflow.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. Select the region where the required workspace resides, find the workspace, and then click Data Analytics in the Actions column.
    4. On the DataStudio page, move the pointer over the Create icon and select Workflow.
    5. In the Create Workflow dialog box, specify the Workflow Name and Description parameters.
      Notice The workflow name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
    6. Click Create.
  4. Create a batch synchronization node.
    1. Click the newly created workflow and right-click Data Integration.
    2. Choose Create > Batch Synchronization.
    3. In the Create Node dialog box, specify the Node Name and Location parameters.
      Notice The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
    4. Click Commit.
  5. Configure and run the batch synchronization node.
    • If you want to configure and run the batch synchronization node by using the codeless UI, go to Step 6.
    • If you want to configure and run the batch synchronization node by using the code editor, go to Step 7.
  6. Configure and run the batch synchronization node by using the codeless UI.
    1. Configure the source.
      Select the data source type of the source from the Connection drop-down list below Source and select the name of the source from the drop-down list on the right side of Connection. Then, select the source table from the Table drop-down list.
    2. Configure the destination.
      Select ODPS from the Connection drop-down list below Target and select the added MaxCompute data source from the drop-down list on the right side of Connection. Then, select the destination table from the Table drop-down list.
    3. Configure field mappings.
      Configure mappings between the fields in the source table and the fields in the destination table.
    4. Configure channel control.
      Configure channel control settings, such as the maximum number of parallel threads and the maximum number of dirty data records allowed.
    5. Configure scheduling properties.
      On the Properties tab, configure the scheduling properties of the node.
    6. In the top toolbar, click the Save icon to save the configurations and click the Run icon to run the batch synchronization node.
  7. Configure and run the batch synchronization node by using the code editor.
    1. Import a template.
      Select the data source type of the source from the Source Connection Type drop-down list and select the name of the source from the Connection drop-down list below Source Connection Type. Select ODPS from the Target Connection Type drop-down list and select the added MaxCompute data source from the Connection drop-down list below Target Connection Type. Then, click OK.
    2. Configure the source.
      Configure the source and the source table.
      {
          "stepType": "mysql",
          "parameter": {
              "partition": [],
              "datasource": "",
              "envType": 0,
              "column": [
                  "*"
              ],
              "table": ""
          },
          "name": "Reader",
          "category": "reader"
      },
      • stepType: the data source type of the source.
      • partition: the partition information of the source table.
      • datasource: the name of the source.
      • column: the names of the source columns. Make sure that the source columns have one-to-one mappings with the destination columns in the MaxCompute table.
      • table: the name of the source table.
      • name and category: Set name to Reader and category to reader. This way, the data source is configured as the source.
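      As a concrete sketch, a reader for a hypothetical MySQL source might look like the following. The datasource, table, and column values are example placeholders, not values from your workspace:
      {
          "stepType": "mysql",
          "parameter": {
              "partition": [],
              "datasource": "my_mysql_source",
              "envType": 0,
              "column": [
                  "id",
                  "name",
                  "gmt_create"
              ],
              "table": "orders"
          },
          "name": "Reader",
          "category": "reader"
      },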
    3. Configure the destination.
      Configure the destination and the destination table.
      {
          "stepType": "odps",
          "parameter": {
              "partition": "",
              "truncate": true,
              "datasource": "odps_first",
              "column": [
                  "*"
              ],
              "table": ""
          },
          "name": "Writer",
          "category": "writer"
      }
      • stepType: the data source type of the destination. Set this parameter to odps.
      • partition: the partition information of the destination table. You can run the show partitions <table_name>; command to view the partition information of the table. For more information, see View partition information.
      • datasource: the name of the MaxCompute data source.
      • column: the name of the destination column.
      • table: the name of the destination table. You can run the show tables; command to view the table name. For more information, see List tables and views in a project.
      • name and category: Set name to Writer and category to writer. This way, the data source is configured as the destination.
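      As a sketch for a partitioned destination, the partition parameter can reference a scheduling parameter so that each run writes to the matching partition. The table and column names below are example placeholders:
      {
          "stepType": "odps",
          "parameter": {
              "partition": "pt=${bizdate}",
              "truncate": true,
              "datasource": "odps_first",
              "column": [
                  "id",
                  "name",
                  "gmt_create"
              ],
              "table": "ods_orders"
          },
          "name": "Writer",
          "category": "writer"
      }
      In this sketch, ${bizdate} is replaced with the data timestamp of the scheduled instance at run time, and truncate is set to true so that existing data in the destination partition is cleared before each write.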
    4. Configure channel control.
      "setting": {
          "errorLimit": {
              "record": "1024"
          },
          "speed": {
              "throttle": false,
              "concurrent": 1
          }
      },
      • record: the maximum number of dirty data records allowed.
      • throttle: specifies whether throttling is enabled.
      • concurrent: the maximum number of parallel threads that the batch synchronization node uses to read data from the source or write data to the destination.
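      For example, to cap throughput instead of running unthrottled, set throttle to true and specify a rate limit. The mbps field below is an assumption for illustration; verify the exact rate-limit parameter against the channel control reference for your DataWorks version:
      "setting": {
          "errorLimit": {
              "record": "0"
          },
          "speed": {
              "throttle": true,
              "mbps": 1,
              "concurrent": 2
          }
      },
      With record set to "0", the node fails as soon as any dirty data record is detected.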
    5. Configure scheduling properties.
    6. In the top toolbar, click the Save icon to save the configurations and click the Run icon to run the batch synchronization node.
  8. Go to the MaxCompute data source and check whether the required data is imported to the MaxCompute table.
    • If all data is imported, the synchronization is complete.
    • If no data is imported or some data failed to be imported, see FAQ about batch synchronization.

Import data in a single table in real time

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. Select the region where the required workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. Move the pointer over the Create icon and choose Data Integration > Real-time synchronization.
    Alternatively, you can click the required workflow, right-click Data Integration, and then choose Create > Real-time synchronization.
  3. In the Create Node dialog box, set the Node Name and Location parameters.
    Notice The node name must be 1 to 128 characters in length. It can contain letters, digits, underscores (_), and periods (.).
  4. Click Commit.
  5. On the configuration tab of the real-time sync node, drag MaxCompute in the Output section to the canvas on the right. Connect the MaxCompute node to the configured reader or conversion node.
  6. Click the MaxCompute node. In the panel that appears, configure the following parameters:
    • Data source: the MaxCompute data source that you configured. You can select only a MaxCompute data source.
      If no data source is available, click New data source on the right to add a data source on the Data Source page. For more information, see Add a MaxCompute data source.
    • Table: the name of the MaxCompute table to which you want to write data.
      You can click Create Table on the right to create a table, or click Data preview to preview the selected table.
      Notice: Before you create a table, connect the MaxCompute node to a reader node and make sure that the output fields are specified for the reader node.
    • Mode: the mode in which data is written to the destination partitions of the MaxCompute table. Valid values: Partitioning by Time and Dynamic Partitioning by Field Value.
      • Partitioning by Time: Data is written to the destination partitions of the MaxCompute table based on the value of the _execute_time_ field. For more information, see Fields used for real-time synchronization.
      • Dynamic Partitioning by Field Value: Data is dynamically written to the destination partitions of the MaxCompute table based on the value of a specified field in the source table, after you define the mapping between that field and the partition field in the MaxCompute table.
    • Partition message: the partition information of the partitioned MaxCompute table.
    • Field Mapping: the field mappings between the source and the destination. Click Field Mapping to configure field mappings. The real-time sync node synchronizes data based on the field mappings.
    If you want to create a table, click Create Table next to Table. In the New data table dialog box, configure the following parameters:
    • Table name: the name of the MaxCompute table to which you want to write data in real time.
    • Life cycle: the lifecycle of the MaxCompute table to which you want to write data in real time. For more information, see Lifecycle.
    • Data field structure: the schema of the MaxCompute table to which you want to write data in real time. To add a field, click New field.
    • Partition settings: the partition information of the MaxCompute table to which you want to write data in real time. You can use one of the following modes:
      • Partitioning by Time: Data is written to the destination partitions of the MaxCompute table based on the value of the _execute_time_ field. For more information, see Fields used for real-time synchronization.
        Notice
        • You must configure at least two levels of partitions, which are yearly and monthly partitions. You can configure a maximum of five levels of partitions, which are yearly, monthly, daily, hourly, and minutely partitions.
        • For more information about MaxCompute tables, see Partition.
      • Dynamic Partitioning by Field Value: Data is dynamically written to the destination partitions of the MaxCompute table based on the value of a specified field in the source table, after you define the mapping between that field and the partition field in the MaxCompute table. For example, if Field A in the source table is mapped to the partition field, a record whose Field A value is aa is written to the aa partition, and a record whose Field A value is bb is written to the bb partition.
  7. Click the Save icon in the top toolbar.

Import data in an entire database in real time

  1. Create a real-time data sync node.
  2. Commit and deploy the real-time data sync node.
  3. Start the real-time data sync node.

Import real-time data with one click

  1. Configure a sync solution.
  2. Synchronize data to MaxCompute in real time.