
Tablestore: Full export

Last Updated: Oct 17, 2023

If you want to back up data in Tablestore to MaxCompute or migrate data from Tablestore to MaxCompute, you can create and configure batch synchronization tasks in the DataWorks console to export full data. After full data is exported to MaxCompute, you can use the DataAnalysis service of DataWorks to view and analyze the data exported to MaxCompute.

Usage note

The field names in Tablestore are case-sensitive. Make sure that the field names in MaxCompute are consistent with those in Tablestore.
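
For example, assume a hypothetical Tablestore table that contains columns named userId and orderId. When you configure the columns to be read, the names must match the Tablestore column names exactly, including case:

  "column": [
      {
          "name": "userId"   # "userid" would be treated as a different, nonexistent column
      },
      {
          "name": "orderId"
      }
  ]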

Step 1: Add a Tablestore data source

To add a Tablestore instance as the data source, perform the following steps.

  1. Go to the Management Center.

    1. Log on to the DataWorks console as the project administrator.

    2. In the left-side navigation pane, click Workspaces and select a region.

    3. On the Workspaces page, find the workspace that you want to manage, move the pointer over the icon in the Actions column, and click Workspace Settings to go to the Workspace management page.

      Note

      You can also click other modules in the Actions column, such as DataStudio or Data Integration. In the top navigation bar of the page that appears, click the icon to go to the Workspace management page.

  2. In the left-side navigation pane, click Data Source.

  3. On the Data Source page, click Add data source.

  4. In the NoSQL section of the Add data source dialog box, click OTS.

  5. In the Add OTS data source dialog box, configure the parameters in the following table.

    Parameter

    Description

    Data Source Name

    The name of the data source. The name can contain letters, digits, and underscores (_), and must start with a letter.

    Data Source Description

    The description of the data source. The description cannot exceed 80 characters in length.

    Endpoint

    The endpoint of the Tablestore instance. For more information, see Endpoints.

    If the Tablestore instance and the destination data source are in the same region, enter the classic network endpoint. If they are in different regions, enter the public endpoint. Do not enter a virtual private cloud (VPC) endpoint. Example endpoint formats are provided after this table.

    Table Store instance name

    The name of the Tablestore instance. For more information, see Instance.

    AccessKey ID

    The AccessKey ID of an Alibaba Cloud account or a RAM user. For information about how to obtain your AccessKey pair, see Create an AccessKey pair.

    AccessKey Secret

    The AccessKey secret of the Alibaba Cloud account or RAM user.
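
    For reference, the following example shows typical endpoint formats for a hypothetical instance named myinstance in the China (Hangzhou) region. Verify the actual endpoints of your instance in the Tablestore console.

      https://myinstance.cn-hangzhou.ots.aliyuncs.com            # Public endpoint
      https://myinstance.cn-hangzhou.ots-internal.aliyuncs.com   # Classic network endpoint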

  6. Test the network connectivity between the data source and the resource group that you select.

    To ensure that your synchronization nodes run as expected, you need to test the connectivity between the data source and all types of resource groups on which your synchronization nodes will run.

    Important
    • A synchronization node can use only one type of resource group. By default, the resource group list displays only exclusive resource groups for Data Integration. To ensure the stability and performance of data synchronization, we recommend that you use exclusive resource groups for Data Integration.

    • If a resource group is not created, click Create Exclusive Data Resource Group for Data Integration to create one. For more information, see Create and use an exclusive resource group for Data Integration.

    In the dialog box that appears, select Data Integration for Resource Group connectivity. Find the desired resource group for Data Integration and click Test connectivity in the Actions column. If the connectivity status is Connected, a network connection is established between the resource group and data source.

  7. If the data source passes the network connectivity test, click Complete.

    The newly created data source is displayed in the data source list.

Step 2: Add a MaxCompute data source

The procedure is similar to that in Step 1. In the Add data source dialog box, click MaxCompute to select it as the data source type.

In this example, a MaxCompute data source named OTS2ODPS is used.

Important
  • A synchronization task can use only one type of resource group. By default, the resource group list displays only exclusive resource groups for Data Integration. To ensure the stability and performance of data synchronization, we recommend that you use exclusive resource groups for Data Integration.

  • If no resource group is available, click Create Exclusive Resource Group for Data Integration to create a resource group. For more information, see Create and use an exclusive resource group for Data Integration.


Step 3: Create a batch synchronization node

  1. Go to the DataStudio console.

    1. Log on to the DataWorks console as the project administrator.

    2. Select a region. In the left-side navigation pane, click Workspaces.

    3. On the Workspaces page, find the workspace that you want to manage and click DataStudio in the Actions column.

  2. On the Scheduled Workflow page of the DataStudio console, click Business Flow and select a business flow.

    For information about how to create a business flow, see Create a workflow.

  3. Right-click the Data Integration node and choose Create Node > Offline synchronization.

  4. In the Create Node dialog box, select a path and enter a node name.

  5. Click Confirm.

    The newly created offline synchronization node is displayed under the Data Integration node.

Step 4: Configure and start a batch synchronization task

To create and configure a task to synchronize data from Tablestore to MaxCompute, perform the following steps:

  1. Double-click the created batch synchronization node in the Data Integration folder.

  2. Establish network connections between the resource group and the data source.

    1. Select the source and destination data sources for the data synchronization task and the resource group that is used to run the data synchronization task. Establish network connections between the resource group and data sources and test the connectivity.

      Important

      Data synchronization tasks are run by using resource groups. Select a resource group and make sure that network connections between the resource group and data sources are established.

    2. In the Configure Network Connections and Resource Group step, set the Source parameter to OTS and the Data Source Name parameter to the name of the Tablestore data source that you added in Step 1: Add a Tablestore data source.

    3. Select a resource group.

      After you select a resource group, the system displays the region and specifications of the resource group and automatically tests the connectivity between the resource group and the source data source.

      Important

      Make sure that the resource group is the same as that you selected when you added the data source.

    4. Set the Destination parameter to MaxCompute(ODPS) and the Data Source Name parameter to the name of the destination data source that you added in Step 2: Add a MaxCompute data source.

      The system automatically tests the connectivity between the resource group and the destination data source.

    5. Click Next.

  3. Configure the task.

    You can use the codeless user interface (UI) or the code editor to configure the task.

    Use the codeless UI to configure the task

    1. In the Configure Source and Destination section of the Configure tasks step, configure the data source and destination based on the actual case.

      Configure the data source

      Parameter

      Description

      Table

      The name of the data table in Tablestore.

      Range of Primary Key(begin)

      Range of Primary Key(end)

      The start and end primary keys that are used to specify the range of the data to be read. The value is a JSON array.

      The start and end primary keys must be valid primary keys or virtual points that consist of INF_MIN type values and INF_MAX type values. The number of columns in the virtual points must be the same as that in the primary key.

      INF_MIN indicates an infinitely small value. All values of other types are greater than the INF_MIN type value. INF_MAX indicates an infinitely great value. All values of other types are smaller than the INF_MAX type value.

      The rows in the data table are sorted in ascending order by primary key. The reading range is a left-closed and right-open interval. All rows that are greater than or equal to the start primary key and less than the end primary key are returned.

      For example, a table contains the pk1 and pk2 primary key columns. The pk1 column is of the STRING type. The pk2 column is of the INTEGER type.

      To export full data from the table, specify the following parameters:

      • Example of the Range of Primary Key(begin) parameter

        [
          {
            "type": "INF_MIN"
          },
          {
            "type": "INF_MIN"
          }
        ]
      • Example of the Range of Primary Key(end) parameter

        [
          {
            "type": "INF_MAX"
          },
          {
            "type": "INF_MAX"
          }
        ]

      To export data from the rows in which the value of the pk1 column is tablestore, specify the following parameters:

      • Example of the Range of Primary Key(begin) parameter

        [
          {
            "type": "STRING",
            "value": "tablestore"
          },
          {
            "type": "INF_MIN"
          }
        ]
      • Example of the Range of Primary Key(end) parameter

        [  
          {
            "type": "STRING",
            "value": "tablestore"
          },
          {
            "type": "INF_MAX"
          }
        ]


      Split configuration information

      The custom rule that is used to split data. In most scenarios, we recommend that you do not specify this parameter. If data is unevenly distributed in a Tablestore table, you can customize a rule to split the data. You can configure split keys within the range between the start and end primary keys, and you do not need to specify all primary key columns. The value is a JSON array.
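
      For example, if the first primary key column is of the STRING type, a split configuration in the same style as the code editor example later in this topic might look like the following sketch. The values splitPoint1 through splitPoint3 are hypothetical split points that you would replace with values that divide your data evenly.

        [
          {
            "type": "INF_MIN"
          },
          {
            "type": "STRING",
            "value": "splitPoint1"
          },
          {
            "type": "STRING",
            "value": "splitPoint2"
          },
          {
            "type": "STRING",
            "value": "splitPoint3"
          },
          {
            "type": "INF_MAX"
          }
        ]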

      Configure the data destination

      Parameter

      Description

      Table

      The name of the table in MaxCompute.

      Write Mode

      The mode in which data is written. Valid values:

      • Keep existing data before writing (Insert Into): The data is directly imported into the table or the static partition.

      • Clean up existing data before writing (Insert Overwrite): Existing data in the destination is cleared before the data is imported into the table or the static partition. In the code editor, this setting corresponds to the truncate parameter; see the sketch after this table.

      Write by Converting Empty Strings into Null

      Specifies whether to convert empty strings in the source data to null.

      Visible After Synchronization

      This parameter is displayed only after you click Advanced configuration.

      Specifies whether the data that is synchronized to MaxCompute can be queried before the synchronization is complete.
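
      A minimal sketch of the code editor equivalent of the Write Mode setting, assuming that the Write Mode options map to the truncate parameter of MaxCompute Writer as suggested by the sample script in the next section:

        "truncate": true  # Clean up existing data before writing (Insert Overwrite). Set to false to keep existing data (Insert Into).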

    2. In the Field Mapping section, click the edit icon next to the Source list and the Target list to edit the source and target fields.

      Important

      Make sure that the number and type of the source and destination fields match.

      The source fields must be specified in the JSON format. Example: {"name":"id","type":"STRING"}. You can specify the destination fields by field name. One row represents one field, and a row in the Source list is mapped to the same row in the Target list.
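
      For example, a hypothetical three-field mapping might look like the following sketch, in which all field names and types are assumptions:

        Source list:
          {"name":"uid","type":"STRING"}
          {"name":"name","type":"STRING"}
          {"name":"age","type":"INT"}

        Target list:
          uid
          name
          age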

    3. In the Channel Control section, configure the parameters for task execution, such as the Synchronization rate and Dirty Data Records Allowed parameters. For more information about the parameters, see Configure a batch synchronization node by using the codeless UI.

    4. Click the save icon to save the configuration.

      Note

      If you do not save the configuration, a message appears when you perform subsequent operations. In this case, click OK in the message to save the configuration.

    Use the code editor to configure the task

    To synchronize full data, you must use Tablestore Reader and MaxCompute Writer. For more information, see Tablestore data source and MaxCompute data source.

    Important

    You cannot switch between the codeless UI and the code editor. Proceed with caution.

    1. In the Configure tasks step, click the icon for switching to the code editor. In the dialog box that appears, click OK.

    2. In the code editor, specify the parameters based on the following sample code.

      Important
      • In most cases, a task that exports full data is run only once. Therefore, you do not need to configure scheduling parameters for the task.

      • If the script configurations contain variables such as ${date}, set each variable to a specific value when you run the task to synchronize data. A sketch is provided after the begin and end examples below.

      • Comments are provided in the sample code to help you understand the configurations. Delete all comments when you use the sample code.

      {
          "type": "job",
          "version": "2.0",
          "steps": [
              {
                  "stepType": "ots", # The type of the reader. You cannot change the value of this parameter. 
                  "parameter": {
                      "datasource": "",  # The name of the Tablestore data source. Specify this parameter based on the actual case. 
                      "column": [ # The names of the columns to be exported from the Tablestore data source. 
                          {
                              "name": "column1"
                          },
                          {
                              "name": "column2"
                          },
                          {
                              "name": "column3"
                          },
                          {
                              "name": "column4"
                          },
                          {
                              "name": "column5"
                          }
                      ],
                      "range": {
                          "split": [ # The partition range for the table in Tablestore, which can accelerate the export. In most cases, you do not need to specify this parameter. 
                              {
                                  "type": "INF_MIN"
                              },
                              {
                                  "type": "STRING",
                                  "value": "splitPoint1"
                              },
                              {
                                  "type": "STRING",
                                  "value": "splitPoint2"
                              },
                              {
                                  "type": "STRING",
                                  "value": "splitPoint3"
                              },
                              {
                                  "type": "INF_MAX"
                              }
                          ],
                          "end": [ 
                              {
                                  "type": "INF_MAX" # The end position of the first primary key column in Tablestore. To export full data, set this parameter to INF_MAX. To export only part of data, specify this parameter based on your business requirements. If the table contains multiple primary key columns, configure the information about the primary key columns for the end parameter. 
                              },
                              {
                                  "type": "INF_MAX"
                              },
                              {
                                  "type": "STRING",
                                  "value": "end1"
                              },
                              {
                                  "type": "INT",
                                  "value": "100"
                              }
                          ],
                          "begin": [ 
                              {
                                  "type": "INF_MIN" # The start position of the first primary key column in Tablestore. To export full data, set this parameter to INF_MIN. To export only part of data, specify this parameter based on your business requirements. If the table contains multiple primary key columns, configure the information about the primary key columns for the begin parameter. 
                              },
                              {
                                  "type": "INF_MIN"
                              },
                              {
                                  "type": "STRING",
                                  "value": "begin1"
                              },
                              {
                                  "type": "INT",
                                  "value": "0"
                              }
                          ]
                      },
                      "table": "" # The name of the table in Tablestore. 
                  },
                  "name": "Reader",
                  "category": "reader"
              },
              {
                  "stepType": "odps", # The type of the writer. You cannot change the value of this parameter. 
                  "parameter": {
                      "partition": "",  # This parameter is required if the MaxCompute table is partitioned. Do not specify this parameter if the MaxCompute table is not partitioned. The partition to which data is written. The last-level partition must be specified. 
                      "truncate": true, # Specifies whether to delete all historical data. 
                      "datasource": "", # The name of the MaxCompute data source. Specify this parameter based on the actual case. 
                      "column": [ # The names of the columns in MaxCompute. Specify the column names in the same order as those in Tablestore. 
                          "*"
                      ],
                      "table": "" # The name of the table in MaxCompute. The table must be created in advance. Otherwise, the task may fail. 
                  },
                  "name": "Writer",
                  "category": "writer"
              },
              {
                  "name": "Processor",
                  "stepType": null,
                  "category": "processor",
                  "parameter": {}
              }
          ],
          "setting": {
              "executeMode": null,
              "errorLimit": {
                  "record": "0" # If the number of errors exceeds the number of records, the task fails. 
              },
              "speed": {
                  "throttle":true, # Specifies whether to enable throttling. A value of false indicates that throttling is disabled, and a value of true indicates that throttling is enabled. The mbps parameter takes effect only if the throttle parameter is set to true. 
                  "concurrent":1 # The maximum number of concurrent tasks. 
                  "mbps":"12" # The rate to which the traffic is limited. 
              }
          },
          "order": {
              "hops": [
                  {
                      "from": "Reader",
                      "to": "Writer"
                  }
              ]
          }
      }

      You can use the begin and end parameters to configure the range of data to be exported. For example, a table contains the pk1 and pk2 primary key columns. The pk1 column is of the STRING type. The pk2 column is of the INTEGER type.

      • To export full data from the table, specify the following parameters:

        "begin": [ # The start position of data to be exported. 
          {
            "type": "INF_MIN"
          },
          {
            "type": "INF_MIN"
          }
        ],
        "end": [  # The end position of data to be exported. 
          {
            "type": "INF_MAX"
          },
          {
            "type": "INF_MAX"
          }
        ],
      • To export data from the rows in which the value of the pk1 column is tablestore, specify the following parameters:

        "begin": [ # The start position of data to be exported. 
          {
            "type": "STRING",
            "value": "tablestore"
          },
          {
            "type": "INF_MIN"
          }
        ],
        "end": [  # The end position of data to be exported. 
          {
            "type": "STRING",
            "value": "tablestore"
          },
          {
            "type": "INF_MAX"
          }
        ],
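
      For example, if the script specifies the partition by using a variable, which is a hypothetical configuration, assign the variable a specific value, such as date=20231017, in the Parameters dialog box when you run the task:

        "partition": "pt=${date}"  # Hypothetical example. Assign date a specific value, such as 20231017, at run time.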
    3. Click the image.png icon to save the configuration.

      Note

      If you do not save the script, a message appears when you perform subsequent operations. In this case, click OK in the message to save the script.

  4. Run the synchronization task

    Important

    In most cases, you need to synchronize full data only once and do not need to configure scheduling properties.

    1. Click the run icon.

    2. In the Parameters dialog box, select the name of the resource group from the drop-down list.

    3. Click Run.

      After the script is run, click the link next to Detail log url on the Runtime Log tab. On the Detailed Runtime Logs page, check Current task status.

      If the value of Current task status is FINISH, the task is complete.

Step 5: View the data imported to MaxCompute

To view the data imported to MaxCompute in the DataMap console, perform the following steps:

  1. Go to the DataMap console.

    1. Log on to the DataWorks console as the project administrator.

    2. In the left-side navigation pane, click Workspaces and select a region.

    3. On the Workspaces page, find the workspace that you want to manage and click Data Map in the Actions column.

  2. In the top navigation bar of the DataMap console, choose My Data > Managed by Me.

  3. Click the table name of the imported data on the MaxCompute tab.

  4. On the table details page, click the Data Preview tab.

  5. View the data imported into MaxCompute on the Data Preview tab.