This topic describes the data types and parameters that MaxCompute Writer supports and how to configure it by using the codeless user interface (UI) and code editor.

Prerequisites

A MaxCompute connection is configured. For more information, see Configure a MaxCompute connection.

Background information

MaxCompute Writer is designed for developers to insert data into or update data in MaxCompute. MaxCompute Writer is suitable for importing gigabytes or terabytes of data to MaxCompute. For more information, see What is MaxCompute?.

Based on the information that you specify, such as the source project, table, partition, and field, MaxCompute Writer writes data to MaxCompute by using Tunnel. For more information about common Tunnel commands, see Tunnel commands.

For a table with a strict schema, such as a table in a MySQL database or MaxCompute project, Data Integration reads data from the table and stores the data in memory. Then, Data Integration converts the data to the format that is supported by the destination data store and writes the data to the destination data store.

If the data conversion fails or the data fails to be written to the destination data store, the data is regarded as dirty data. You can specify a maximum number of dirty data records allowed.
Note If the data in a data store contains a null value, MaxCompute Writer cannot convert the data to the VARCHAR type.
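
The dirty data rule described above can be illustrated with a minimal Python sketch. It is not the Data Integration implementation: the function name write_with_error_limit, the exception class, and the choice to fail as soon as the dirty record count exceeds the limit are assumptions made for illustration.

```python
from typing import Any, Callable, Iterable, List, Tuple


class DirtyDataLimitExceeded(Exception):
    """Raised when more records fail than the configured limit allows."""


def write_with_error_limit(
    records: Iterable[Any],
    convert: Callable[[Any], Any],
    error_limit: int,
) -> Tuple[List[Any], List[Any]]:
    """Convert records and collect the ones that fail as dirty data.

    Returns (converted, dirty). Raises DirtyDataLimitExceeded as soon as
    the number of dirty records is greater than error_limit (assumed rule).
    """
    converted: List[Any] = []
    dirty: List[Any] = []
    for record in records:
        try:
            converted.append(convert(record))
        except (TypeError, ValueError):
            # A failed conversion marks the record as dirty data.
            dirty.append(record)
            if len(dirty) > error_limit:
                raise DirtyDataLimitExceeded(
                    f"{len(dirty)} dirty records exceed the limit of {error_limit}"
                )
    return converted, dirty


rows = ["1", "2", "oops", "4"]
good, bad = write_with_error_limit(rows, int, error_limit=1)
print(good, bad)  # [1, 2, 4] ['oops']
```

With error_limit=0, the same input raises DirtyDataLimitExceeded on the first record that cannot be converted.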

Parameters

Parameter Description Required Default value
datasource The connection name. It must be the same as the name of the added connection. You can add connections in the code editor. Yes N/A
table The name of the destination table. The name is not case-sensitive. You can specify only one table as the destination table. Yes N/A
partition The partition to which data is written. The last-level partition must be specified. For example, if you want to write data to a three-level partitioned table, set the partition parameter to a value that contains the third-level partition information, for example, pt=20150101, type=1, biz=2.
  • To write data to a non-partitioned table, do not set this parameter. The data is directly written to the destination table.
  • MaxCompute Writer cannot route data to partitions based on the value of a field. To write data to a partitioned table, make sure that data is written to the last-level partition.
Required only for writing data to a partitioned table N/A
column The columns in the destination table to which data is written. To write data to all the columns in the destination table, set the value to an asterisk (*), for example, "column":["*"]. To write data to only some of the columns in the destination table, list those columns and separate them with commas (,), for example, "column":["id","name"].
  • MaxCompute Writer can filter columns and change the order of columns. For example, a MaxCompute table has three columns: a, b, and c. If you want to write data only to column c and column b, you can set the column parameter in the format of "column":["c","b"]. During data synchronization, column a is automatically set to null.
  • The column parameter must explicitly specify a set of columns to which data is written. The parameter cannot be left empty.
Yes N/A
truncate To ensure the idempotence of write operations, set the truncate parameter in the format of "truncate":"true". When a sync node is rerun due to a write failure, MaxCompute Writer deletes the data that has been written before it imports the source data again. This ensures that the same data is written for each rerun.

MaxCompute Writer uses MaxCompute SQL to delete data. MaxCompute SQL does not guarantee atomicity. Therefore, the truncation operation is not an atomic operation. Conflicts may occur when concurrent nodes delete data from the same table or partition.

To avoid this issue, we recommend that you do not run concurrent DDL nodes to write data to the same partition. You can create different partitions for nodes that need to run concurrently.

Yes N/A

Configure MaxCompute Writer by using the codeless UI

  1. Configure the connections.
    Configure the connections to the source and destination data stores for the sync node.
    Parameter Description
    Data source The datasource parameter in the preceding parameter description. Select a connection type and select the name of a connection that you have configured in DataWorks.
    Table The table parameter in the preceding parameter description.
    Partition information The partition to which data is written. For example, if the partition key column of a MaxCompute table is pt, you can set the value to pt=${bdp.system.bizdate}. If the column is marked as unidentified, ignore it. To write data to a specific partition, specify the corresponding date.
    Cleanup rules The write rule. Valid values:
    • Clean up existing data before writing (Insert Overwrite): All data in the destination table or partition is deleted before data import. This rule is equivalent to the INSERT OVERWRITE statement.
    • Retain existing data before writing (Insert Into): No data is deleted from the destination table or partition before data import. New data is always appended to the destination table or partition upon each run. This rule is equivalent to the INSERT INTO statement.
    Note
    • MaxCompute Reader reads data by using Tunnel. Sync nodes cannot filter data. Each sync node reads all the data from a table or partition.
    • MaxCompute Writer writes data by using Tunnel instead of the INSERT INTO statement. You can view the complete data in the destination table only after a sync node is properly run. Pay attention to the node dependencies.
    Empty string as null Specifies whether to convert empty strings to null.
  2. Configure field mapping. It is equivalent to setting the column parameter in the preceding parameter description. Fields in the source table on the left have a one-to-one mapping with fields in the destination table on the right.
    GUI element Description
    The same name mapping Click The same name mapping to establish a mapping between fields with the same name. Note that the data types of the fields must match.
    Peer mapping Click Peer mapping to establish a mapping between fields in the same row. Note that the data types of the fields must match.
    Unmap Click Unmap to remove mappings that have been established.
    Automatic typesetting Click Automatic typesetting to sort the fields based on specified rules.
  3. Configure channel control policies.
    Parameter Description
    Maximum number of concurrent tasks expected The maximum number of concurrent threads that the sync node uses to read data from or write data to data stores. You can configure the concurrency for the node on the codeless UI.
    Synchronization rate Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and set a maximum transmission rate to avoid heavy read workload of the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to a proper value.
    The number of error records exceeds The maximum number of dirty data records allowed.
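
The channel control settings above correspond to the setting block in the code editor example later in this topic. As a rough sketch, the following Python snippet builds that block from the three values. The function name build_setting is hypothetical, and the sketch covers only the keys that appear in this topic (errorLimit.record, speed.throttle, and speed.concurrent).

```python
import json


def build_setting(concurrent: int, throttle: bool, error_limit: int) -> dict:
    """Build the "setting" block of a sync node configuration.

    concurrent:  maximum number of concurrent threads
    throttle:    whether bandwidth throttling is enabled
    error_limit: maximum number of dirty data records allowed
    """
    return {
        # The example in this topic stores the record limit as a string.
        "errorLimit": {"record": str(error_limit)},
        "speed": {"throttle": throttle, "concurrent": concurrent},
    }


setting = build_setting(concurrent=1, throttle=False, error_limit=0)
print(json.dumps(setting, indent=4))
```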

Configure MaxCompute Writer by using the code editor

You can configure MaxCompute Writer by using the code editor. For more information, see Create a sync node by using the code editor.

The following example shows how to configure a sync node to write data to a MaxCompute table. For more information about the parameters, see the preceding parameter description.
{
    "type":"job",
    "version":"2.0", // The version number.
    "steps":[
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"odps",// The writer type.
            "parameter":{
                "partition":"",// The partition information.
                "truncate":true,// The write rule.
                "compress":false,// Specifies whether to enable compression.
                "datasource":"odps_first",// The connection name.
                "column":[// The columns to which data is written.
                    "id",
                    "name",
                    "age",
                    "sex",
                    "salary",
                    "interest"
                ],
                "emptyAsNull":false,// Specifies whether to convert empty strings to null.
                "table":""// The name of the destination table.
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed.
        },
        "speed":{
            "throttle":false,// Specifies whether to enable bandwidth throttling. A value of false indicates that the bandwidth is not throttled. A value of true indicates that the bandwidth is throttled. The maximum transmission rate takes effect only if you set this parameter to true.
            "concurrent":1 // The maximum number of concurrent threads.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
If you want to specify the Tunnel endpoint of MaxCompute, you can configure the connection in the code editor. To do so, replace the "datasource" entry in the preceding code with the detailed parameters of the connection. Example:
"accessId":"<yourAccessKeyId>",
"accessKey":"<yourAccessKeySecret>",
"endpoint":"http://service.eu-central-1.maxcompute.aliyun-inc.com/api",
"odpsServer":"http://service.eu-central-1.maxcompute.aliyun-inc.com/api",
"tunnelServer":"http://dt.eu-central-1.maxcompute.aliyun.com",
"project":"**********",

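The replacement described above can also be sketched in Python for cases where the configuration is generated by a script. This is an illustration only: the function name use_explicit_connection is hypothetical, the table and partition values are sample values, and the connection values are the placeholders from the preceding example.

```python
def use_explicit_connection(parameter: dict, connection: dict) -> dict:
    """Return a copy of the writer parameters in which the "datasource"
    entry is replaced with explicit connection parameters."""
    updated = {k: v for k, v in parameter.items() if k != "datasource"}
    updated.update(connection)
    return updated


writer_parameter = {
    "datasource": "odps_first",
    "table": "example_table",  # sample value
    "partition": "pt=20150101",  # sample value
    "column": ["id", "name"],
}

connection = {
    "accessId": "<yourAccessKeyId>",
    "accessKey": "<yourAccessKeySecret>",
    "endpoint": "http://service.eu-central-1.maxcompute.aliyun-inc.com/api",
    "odpsServer": "http://service.eu-central-1.maxcompute.aliyun-inc.com/api",
    "tunnelServer": "http://dt.eu-central-1.maxcompute.aliyun.com",
    "project": "**********",
}

explicit = use_explicit_connection(writer_parameter, connection)
print(sorted(explicit))
```

The original writer_parameter dictionary is left unchanged, so the same base configuration can be reused with a different connection.
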
Additional instructions

  • Column filter

    By configuring MaxCompute Writer, you can perform operations that MaxCompute does not support, for example, filter columns, reorder columns, and set empty fields to null. To write data to all the columns in the destination table, set the column parameter in the format of "column":["*"].

    For example, a MaxCompute table has three columns: a, b, and c. If you want to write data only to column c and column b, you can set the column parameter in the format of "column":["c","b"]. The first column and the second column of the source data are written to column c and column b in the MaxCompute table respectively. During data synchronization, column a is automatically set to null.

  • Column configuration error handling

    To avoid losing the data of redundant columns and ensure high data reliability, MaxCompute Writer returns an error message if the number of columns to be written is greater than the number of columns in the destination table. For example, if a MaxCompute table contains columns a, b, and c, MaxCompute Writer returns an error message if more than three columns are to be written to the table.
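
The column filter and the column count check described above can be sketched in Python as follows. This only models the documented behavior and is not the code of MaxCompute Writer. The function names resolve_columns and map_row are hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence


def resolve_columns(configured: List[str], table_columns: List[str]) -> List[str]:
    """Expand "*" and reject column lists that cannot be written."""
    if not configured:
        raise ValueError("The column parameter cannot be left empty.")
    if configured == ["*"]:
        return list(table_columns)
    if len(configured) > len(table_columns):
        raise ValueError(
            f"{len(configured)} columns are configured, but the destination "
            f"table has only {len(table_columns)} columns."
        )
    return list(configured)


def map_row(
    source_row: Sequence[Any],
    configured: List[str],
    table_columns: List[str],
) -> Dict[str, Optional[Any]]:
    """Assign source fields to the configured columns by position.

    Columns of the destination table that are not configured are set
    to None, which stands for null here.
    """
    targets = resolve_columns(configured, table_columns)
    by_name = dict(zip(targets, source_row))
    return {name: by_name.get(name) for name in table_columns}


row = map_row(["x", "y"], ["c", "b"], ["a", "b", "c"])
print(row)  # {'a': None, 'b': 'y', 'c': 'x'}
```

The first source field is written to column c, the second to column b, and column a is left as null, which matches the example in the column filter description.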

  • Partition configuration

    MaxCompute Writer can write data only to the last-level partition, and cannot route data to a partition based on the value of a field. To write data to a partitioned table, specify the last-level partition. For example, if you want to write data to a three-level partitioned table, set the partition parameter to a value that contains the third-level partition information, for example, pt=20150101, type=1, biz=2. The data cannot be written if you set the partition parameter to pt=20150101 or pt=20150101, type=1.
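
A pre-check for the partition rule above could look like the following Python sketch. The helper names parse_partition and is_last_level_partition are hypothetical, and the sketch assumes that the partition levels must be given in the order in which the table declares them. The validation that MaxCompute Writer itself performs is not described in this topic.

```python
from typing import Dict, List


def parse_partition(spec: str) -> Dict[str, str]:
    """Parse a value such as "pt=20150101, type=1, biz=2" into a dict."""
    parts: Dict[str, str] = {}
    for item in spec.split(","):
        key, sep, value = item.strip().partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ValueError(f"Invalid partition item: {item!r}")
        parts[key] = value
    return parts


def is_last_level_partition(spec: str, partition_keys: List[str]) -> bool:
    """Return True if spec names every partition level of the table.

    partition_keys lists the partition key columns of the destination
    table from the first level to the last level.
    """
    return list(parse_partition(spec)) == list(partition_keys)


keys = ["pt", "type", "biz"]
print(is_last_level_partition("pt=20150101, type=1, biz=2", keys))  # True
print(is_last_level_partition("pt=20150101, type=1", keys))         # False
```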

  • Node rerunning

    To ensure the idempotence of write operations, set the truncate parameter to true. When a sync node is rerun due to a write failure, MaxCompute Writer deletes the data that has been written before it imports the source data again. This ensures that the same data is written for each rerun. If a sync node is interrupted due to other exceptions, the data cannot be rolled back and the node cannot be automatically rerun. In this case, setting the truncate parameter to true still ensures idempotence and data integrity when the node is rerun.
    Note If the truncate parameter is set to true, all data of the specified partition or table is deleted before a rerun. Exercise caution when you set this parameter to true.
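
The effect of the truncate parameter on reruns can be shown with a small Python simulation. It uses an in-memory list in place of a MaxCompute table or partition, and the function name run_sync is hypothetical.

```python
from typing import List


def run_sync(destination: List[int], source: List[int], truncate: bool) -> None:
    """Simulate one run of a sync node against an in-memory destination."""
    if truncate:
        # Delete everything that earlier runs wrote to the destination.
        destination.clear()
    destination.extend(source)


source = [1, 2, 3]

with_truncate: List[int] = []
run_sync(with_truncate, source, truncate=True)
run_sync(with_truncate, source, truncate=True)  # rerun

without_truncate: List[int] = []
run_sync(without_truncate, source, truncate=False)
run_sync(without_truncate, source, truncate=False)  # rerun

print(with_truncate)     # [1, 2, 3]
print(without_truncate)  # [1, 2, 3, 1, 2, 3]
```

With truncate enabled, the rerun produces the same result as a single run. Without it, the rerun appends the source data a second time.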