This topic describes how to use the DataWorks console to synchronize incremental data from Tablestore to Object Storage Service (OSS).

Step 1: Add a Tablestore data source

If a Tablestore data source has already been added, skip this step.

For more information about how to add Tablestore data sources, see Step 1: Add a Tablestore data source.

Step 2: Add an OSS data source

If an OSS data source has already been added, skip this step.

For more information about how to add an OSS data source, see Step 2: Add an OSS data source.

Step 3: Configure a scheduled synchronization task

To create and configure a task to synchronize incremental data from Tablestore to OSS, perform the following steps.

  1. Go to Data Analytics.
    1. Log on to the DataWorks console as a project administrator.
      Note Only the project administrator role can be used to add data sources. Members who are assigned other roles can only view data sources.
    2. Select a region. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, click Data Analytics in the Actions column that corresponds to the workspace.
  2. On the Data Analytics page of the DataStudio console, click Business Flow and select a business flow.

    For more information about how to create a business flow, see Create a workflow.

  3. Create a synchronization task node.
    You must create a node for each synchronization task.
    1. Right-click Data Integration and then choose Create > Batch synchronization.
      You can also move the pointer over the Create icon, and then choose Data Integration > Batch synchronization to create a node.
    2. In the Create Node dialog box, configure Node Name and Location.
    3. Click Commit.
  4. Configure the Tablestore data source.
    1. In the hierarchy tree, click Data Integration. Double-click the name of the node for the data synchronization task.
    2. On the edit page of the synchronization task node, configure Source and Target in the Connections section.
      • Configure Source.

        Set Connection to OTS Stream for Source. Select a data source and table. Configure the start time and end time of the task, the name of the status table, and the maximum number of retry attempts.

      • Configure Target.

        Set Connection to OSS for Target. Select a data source. Configure the object name prefix, the text type, and the column delimiter.

    3. Click the script icon to configure the script.

      When you use the script to configure the task, you must configure the OTSStream Reader and OSS Writer plug-ins. For more information, see Tablestore Reader and OSS Writer.

      On the configuration page of the script, configure the parameters based on the following example:
      {
        "type": "job",
        "version": "1.0",
        "configuration": {
          "setting": {
            "errorLimit": {
              "record": "0"  # The maximum number of errors that are allowed. The synchronization task fails when the number of errors exceeds this value. 
            },
            "speed": {
              "mbps": "1",  # The maximum bandwidth of each synchronization task. 
              "concurrent": "1"  # The maximum number of concurrent threads for each synchronization task. 
            }
          },
          "reader": {
            "plugin": "otsstream",  # The name of the Reader plug-in. 
            "parameter": {
              "datasource": "",  # The name of the Tablestore data source. If you specify datasource, you can leave the endpoint, accessId, accessKey, and instanceName parameters empty. 
              "dataTable": "",  # The name of the data table in Tablestore. 
              "statusTable": "TablestoreStreamReaderStatusTable",  # The table that stores the status of Tablestore Stream. In most cases, you do not need to change the value of this parameter. 
              "startTimestampMillis": "",  # The start time of data export, in milliseconds. Because the task exports incremental data in loops and the start time of each loop is different, set this parameter to a variable such as ${start_time}. 
              "endTimestampMillis": "",  # The end time of data export, in milliseconds. Set this parameter to a variable such as ${end_time}. 
              "date": "yyyyMMdd",  # The date of the data that you want to export. This parameter specifies the same export range as the startTimestampMillis and endTimestampMillis parameters. If you configure startTimestampMillis and endTimestampMillis, delete the date parameter. 
              "mode": "single_version_and_update_only",  # The mode in which Tablestore Stream exports data. Set this parameter to single_version_and_update_only. If the configuration template does not contain this parameter, add it. 
              "column": [  # The columns that you want to export from the data table to OSS. If the configuration template does not contain this parameter, add it. You can specify as many columns as you need. 
                {
                  "name": "uid"  # The name of a primary key column in the data table of Tablestore. 
                },
                {
                  "name": "name"  # The name of an attribute column in the data table of Tablestore. 
                }
              ],
              "isExportSequenceInfo": false,  # Specifies whether to export time series information. If you set the mode parameter to single_version_and_update_only, this parameter can be set only to false. 
              "maxRetries": 30  # The maximum number of retry attempts. 
            }
          },
          "writer": {
            "plugin": "oss",  # The name of the Writer plug-in. 
            "parameter": {
              "datasource": "",  # The name of the OSS data source. 
              "object": "",  # The prefix of the names of the objects that you want to synchronize to OSS. We recommend that you use the "Tablestore instance name/table name/date" format. Example: "instance/table/{date}". 
              "writeMode": "truncate",  # The operation that the system performs when objects of the same name exist. Valid values: truncate, append, and nonConflict. truncate: objects of the same name are deleted. append: data is appended to objects of the same name. nonConflict: an error is reported if objects of the same name exist. 
              "fileFormat": "csv",  # The format of the objects. Valid values: csv, txt, and parquet. 
              "encoding": "UTF-8",  # The encoding type. 
              "nullFormat": "null",  # The string that represents a null value. The value can be an empty string. 
              "dateFormat": "yyyy-MM-dd HH:mm:ss",  # The time format. 
              "fieldDelimiter": ","  # The delimiter that separates columns. 
            }
          }
        }
      }
    4. Click the save icon to save the data source configurations.
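  The startTimestampMillis and endTimestampMillis parameters expect Unix timestamps in milliseconds, so the ${start_time} and ${end_time} scheduling variables must resolve to values in that unit. The following minimal Python sketch illustrates how one such export window could be computed; the 10-minute cycle length is a hypothetical example, not a value prescribed by DataWorks, and in practice the scheduling system substitutes the variables for you.

  ```python
  from datetime import datetime, timedelta

  def incremental_window(cycle_end: datetime, cycle_minutes: int = 10):
      """Compute example ${start_time}/${end_time} values (Unix timestamps
      in milliseconds) for one scheduling cycle. cycle_minutes is a
      hypothetical cycle length; use the interval that you configure
      in Properties."""
      end_ms = int(cycle_end.timestamp() * 1000)
      start_ms = int((cycle_end - timedelta(minutes=cycle_minutes)).timestamp() * 1000)
      return start_ms, end_ms

  start_ms, end_ms = incremental_window(datetime(2024, 1, 1, 0, 10))
  print(end_ms - start_ms)  # 600000, a 10-minute window in milliseconds
  ```

  Each cycle exports only the data written between the two timestamps, which is why the task must run in loops rather than once.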
  5. Run the synchronization task.
    1. Click the start icon.
    2. In the Arguments dialog box, select the resource group for scheduling.
    3. Click OK to run the task.
      After the task is completed, you can check whether the task is successful and view the number of exported rows on the Runtime Log tab.

      Incremental data is automatically synchronized from Tablestore to OSS with a latency of 5 to 10 minutes.

  6. Configure the scheduling parameters.
    You can configure the running time, rerun properties, and scheduling dependencies of the synchronization task in Properties.
    1. In the hierarchy tree, click Data Integration. Double-click the name of the synchronization task node.
    2. On the right side of the edit page of the synchronization task node, click Properties to configure the scheduling parameters. For more information, see Configure recurrence and dependencies for a node.
  7. Submit the synchronization task.
    After the synchronization task is submitted to the scheduling system, the scheduling system runs the synchronization task at the scheduled time based on the configured scheduling parameters.
    1. On the edit page of the synchronization task node, click the submit icon.
    2. In the Commit Node dialog box, enter your comments in the Change description field.
    3. Click OK.

Step 4: View the synchronization task

  1. Go to Operation Center.
    Note You can also click Operation Center in the upper-right corner of the DataStudio console to go to Operation Center.
    1. Log on to the DataWorks console as a project administrator.
    2. Select a region. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, click Operation Center in the Actions column that corresponds to the required workspace.
  2. In the left-side navigation pane of the Operation Center console, choose Cycle Task Maintenance > Cycle Task.
  3. On the Cycle Task page, view the details about the submitted synchronization task.
    • In the left-side navigation pane, choose Cycle Task Maintenance > Cycle Instance to view the task that is scheduled to run on the current date. Click the instance name to view the task running details.
    • You can view logs while a task is running or after the task is completed.

Step 5: View the data exported to OSS

  1. Log on to the OSS console.
  2. Find the bucket and the object to which the data was synchronized. Download the object and check whether it contains the expected content.
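
After you download an exported object, you can sanity-check its rows against the Writer settings. The following sketch assumes the fieldDelimiter ("," ) and nullFormat ("null") values from the sample script above; the sample content and column values are hypothetical.

```python
import csv
import io

def parse_exported_rows(text: str, delimiter: str = ",", null_format: str = "null"):
    """Parse CSV content exported to OSS, mapping the configured
    nullFormat string back to Python None."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=delimiter):
        rows.append([None if field == null_format else field for field in record])
    return rows

sample = "uid1,Alice\nuid2,null\n"  # hypothetical exported content
print(parse_exported_rows(sample))
# [['uid1', 'Alice'], ['uid2', None]]
```

If you configured a different fieldDelimiter or nullFormat in the script, pass those values instead.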