This topic describes how to create a synchronization node by using the code editor.

Development procedure

To create a synchronization node by using the code editor, perform the following steps:
  1. Add a data source.
  2. Create a batch synchronization node.
  3. Apply a template.
  4. Configure a reader for the synchronization node.
  5. Configure a writer for the synchronization node.
  6. Configure field mappings.
  7. Configure channel control policies, such as the maximum transmission rate and the maximum number of dirty data records allowed.
  8. Configure scheduling properties for the synchronization node.

Add a data source

A synchronization node can synchronize data between various homogeneous or heterogeneous data sources. On the DataStudio page of the DataWorks console, click the Workspace Manage icon in the upper-right corner. On the page that appears, click Data Source in the left-side navigation pane. On the Data Source page, add a data source. For more information, see Add a data source.

After you add a data source, you can select it when you configure a synchronization node on the DataStudio page. For more information about the types of data sources that are supported by Data Integration, see Supported data sources, readers, and writers.
Note
  • Data Integration does not support connectivity testing for some data source types. For more information, see Select a network connectivity solution.
  • If an on-premises data source does not have a public IP address or cannot be reached over the network, the connectivity test fails when you configure the data source. You can use a custom resource group to resolve the connection failure. For more information, see Create a custom resource group for Data Integration.

    If a data source cannot be directly connected over a network, Data Integration cannot obtain the table schema. In this case, you can create a synchronization node for this data source only by using the code editor.

Create a workflow

  1. Log on to the DataWorks console.
  2. In the left-side navigation pane, click Workspaces.
  3. Select the region in which the workspace that you want to manage resides, find the workspace, and click Data Analytics in the Actions column.
  4. On the DataStudio page, move the pointer over the Create icon and select Workflow.
  5. In the Create Workflow dialog box, specify Workflow Name and Description.
    Notice The workflow name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  6. Click Create.

Create a batch synchronization node

  1. Click the newly created workflow and right-click Data Integration.
  2. Choose Create > Batch Synchronization.
  3. In the Create Node dialog box, configure the Node Name and Location parameters.
    Notice The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  4. Click Commit.

Apply a template

  1. On the node configuration tab that appears, click the Switch to Code Editor icon in the top toolbar.
  2. In the Confirm dialog box, click OK.
    Note The code editor supports more features than the codeless user interface (UI). For example, you can configure synchronization nodes in the code editor even when the connectivity test fails.
  3. Click the Apply a template icon in the top toolbar.
  4. In the Apply Template dialog box, configure Source Connection Type and Connection for the source, and Target Connection Type and Connection for the destination.
  5. Click OK.

Configure a reader for the synchronization node

After the template is applied, the basic settings of the reader are generated. You can configure the source and source table based on your business requirements.
{
    "type": "job",
    "version": "2.0",
    "steps": [   // Do not modify the preceding lines. They indicate the header code of the synchronization node. 
        {
            "stepType": "mysql",
            "parameter": {
                "datasource": "MySQL",
                "column": [
                    "id",
                    "value",
                    "table"
                ],
                "socketTimeout": 3600000,
                "connection": [
                    {
                        "datasource": "MySQL",
                        "table": [
                            "`case`"
                        ]
                    }
                ],
                "where": "",
                "splitPk": "",
                "encoding": "UTF-8"
            },
            "name": "Reader",
            "category": "reader"    // Specifies that these settings are related to the reader. 
        },   
Parameter description:
  • type: the type of the synchronization node. You must set the value to job.
  • version: the version number of the synchronization node. You can set the value to 1.0 or 2.0.
Note
  • For more information about how to configure the source, see Configure MaxCompute Reader.
  • Some synchronization nodes may need to synchronize incremental data. In this case, you can use the scheduling parameters of DataWorks to specify the date and time for incremental data synchronization. For more information, see Configure scheduling parameters.
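
The incremental case in the preceding note can be sketched as a reader whose where parameter references a scheduling parameter. This is an illustrative fragment, not a definitive configuration: the column name ds and the variable name bizdate are assumptions, and the variable must be assigned in the scheduling properties of the node before it can be resolved at run time.

```json
{
    "stepType": "mysql",
    "parameter": {
        "datasource": "MySQL",
        "column": [
            "id",
            "value"
        ],
        "connection": [
            {
                "datasource": "MySQL",
                "table": [
                    "`case`"
                ]
            }
        ],
        // Assumption: ds is a hypothetical column that stores the data date,
        // and bizdate is a scheduling parameter defined for this node.
        "where": "ds = '${bizdate}'",
        "splitPk": "",
        "encoding": "UTF-8"
    },
    "name": "Reader",
    "category": "reader"
}
```

With a filter of this kind, each scheduled run reads only the rows for the date that the scheduling parameter resolves to, instead of the full table.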

Configure a writer for the synchronization node

After the reader is configured, you can configure the destination and destination table based on your business requirements.
        {
            "stepType": "odps",
            "parameter": {
                "postSql": [],   // The SQL statement that you want to execute after the synchronization node is run. 
                "partition": "",
                "truncate": true,
                "compress": false,
                "datasource": "odps_first",
                "column": [
                    "*"
                ],
                "emptyAsNull": false,
                "table": "",
                "preSql": [
                    "delete from XXX;"   // The SQL statement that you want to execute before the synchronization node is run. Separate multiple statements with semicolons (;). 
                ]
            },
            "name": "Writer",
            "category": "writer"   // Specifies that these settings are related to the writer. 
        }
    ],
Note
  • For more information about how to configure the destination, see Configure MaxCompute Writer.
  • You can select the writing method for most nodes. For example, the writing method can be overwriting or appending. Supported writing methods vary based on the data source type.
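
As a sketch of a filled-in writer, the following fragment writes to a partitioned MaxCompute table and overwrites the target partition on each run. The table name case_copy, the partition column pt, and the variable bizdate are assumptions for illustration only. Check the writer documentation for the parameters that your destination type supports.

```json
{
    "stepType": "odps",
    "parameter": {
        "datasource": "odps_first",
        // Assumption: case_copy is a hypothetical destination table partitioned by pt.
        "table": "case_copy",
        // Assumption: bizdate is a scheduling parameter defined for this node.
        "partition": "pt=${bizdate}",
        // true corresponds to the overwriting method; false corresponds to appending.
        "truncate": true,
        "compress": false,
        "emptyAsNull": false,
        "column": [
            "id",
            "value"
        ]
    },
    "name": "Writer",
    "category": "writer"
}
```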

Map the fields in the source and destination tables

The code editor supports only same-row mappings: each field in the reader's column list is mapped to the field at the same position in the writer's column list.
Note Make sure that the data type of each source field is the same as that of the mapped destination field, or that the source data type can be converted to the destination data type.
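
A minimal sketch of how the mapping reads in code, assuming hypothetical destination column names: the two column lists are paired by position, so the first source field goes to the first destination field, and so on. Other reader and writer parameters are left out to keep the focus on the column lists.

```json
{
    "steps": [
        {
            "parameter": {
                // Source fields, in the order in which they are read.
                "column": ["id", "value", "table"]
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "parameter": {
                // Assumption: hypothetical destination fields.
                // id -> case_id, value -> case_value, table -> source_table
                "column": ["case_id", "case_value", "source_table"]
            },
            "name": "Writer",
            "category": "writer"
        }
    ]
}
```

Because the pairing depends only on order, both lists should contain the same number of fields, and reordering one list without the other changes which fields are mapped to each other.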

Configure channel control policies

After you configure the reader, the writer, and the field mappings, you can configure the channel control policies for the synchronization node. The setting parameter controls how the node transfers data, including the number of parallel threads, bandwidth throttling, dirty data policy, and resource group.
    "setting": {
        "errorLimit": {
            "record": "1024"   // The maximum number of dirty data records allowed. 
        },
        "speed": {
            "throttle": false,   // Specifies whether to enable bandwidth throttling. 
            "concurrent": 1      // The maximum number of parallel threads. 
        }
    },
Parameter description:
  • Expected Maximum Concurrency (concurrent): the maximum number of parallel threads that the synchronization node uses to read data from the source or write data to the destination. You can also configure the parallelism on the codeless UI. For example, if you set the concurrent parameter to 8 and the node reads the same table from two instances, the node uses a maximum of eight threads in total to read data from or write data to the two instances in parallel. The eight threads are randomly allocated to the two instances.
  • Bandwidth Throttling (throttle): specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.
  • Dirty Data Records Allowed (errorLimit.record): the maximum number of dirty data records allowed.
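
The following sketch shows what the settings might look like with bandwidth throttling enabled. The mbps key and all values shown are assumptions for illustration. Confirm the key name and a suitable rate against the Data Integration reference for your data source before you use them.

```json
{
    "setting": {
        "errorLimit": {
            // Illustrative value: stop tolerating dirty data after 10 records.
            "record": "10"
        },
        "speed": {
            // Enable bandwidth throttling.
            "throttle": true,
            // Illustrative value: up to 2 parallel threads.
            "concurrent": 2,
            // Assumption: maximum transmission rate in MB/s when throttle is true.
            "mbps": "12"
        }
    }
}
```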

Configure scheduling properties of the synchronization node

In most cases, synchronization nodes use scheduling parameters to filter data. This section describes how to configure scheduling parameters for a synchronization node.

On the DataStudio page, double-click the batch synchronization node in the related workflow. On the node configuration tab, click Properties in the right-side navigation pane to open the Properties panel, in which you configure scheduling properties for the node.

In the Properties panel, you can configure the scheduling properties of the synchronization node, such as the recurrence, the time at which the node is run, and dependencies. In most cases, batch synchronization nodes do not have ancestor nodes because they are run before extract, transform, and load (ETL) nodes. We recommend that you specify the root node of the workspace as their ancestor node.

After the synchronization node is configured, save and commit the node. For more information about the node scheduling properties, see Basic properties.
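
For reference before you save and commit, the reader, writer, and channel control fragments from the preceding sections fit together roughly as follows. This is a sketch that reuses the sample values from this topic. The closing order block is an assumption about how the template links the two steps and does not appear in the fragments above, so compare it with the code that the applied template generates.

```json
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "mysql",
            "parameter": {
                "datasource": "MySQL",
                "column": ["id", "value", "table"],
                "socketTimeout": 3600000,
                "connection": [
                    {
                        "datasource": "MySQL",
                        "table": ["`case`"]
                    }
                ],
                "where": "",
                "splitPk": "",
                "encoding": "UTF-8"
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "odps",
            "parameter": {
                "postSql": [],
                "partition": "",
                "truncate": true,
                "compress": false,
                "datasource": "odps_first",
                "column": ["*"],
                "emptyAsNull": false,
                "table": "",
                "preSql": []
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "1024"
        },
        "speed": {
            "throttle": false,
            "concurrent": 1
        }
    },
    // Assumption: the order block declares that data flows from the step named Reader to the step named Writer.
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
```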