This topic describes the parameters that are supported by MaxCompute Writer and how to configure MaxCompute Writer by using the codeless user interface (UI) and code editor.

Prerequisites

Before you configure MaxCompute Writer, you must configure a MaxCompute data source. For more information, see Add a MaxCompute data source.

Background information

MaxCompute Writer is designed for developers to insert data to or update data in MaxCompute. MaxCompute Writer can write gigabytes or terabytes of data to MaxCompute. For more information about MaxCompute, see What is MaxCompute?

MaxCompute Writer writes data to MaxCompute by using Tunnel commands based on the information you specified, such as the source project, table, partition, and field. For more information about common Tunnel commands, see Tunnel commands.

For a table with a strict schema, such as a table in a MySQL database or MaxCompute project, Data Integration reads data from the table and stores the data in the memory. Then, Data Integration converts the data to the format that is supported by the destination and writes the data to the destination.

If the data conversion fails or the data fails to be written to the destination, the data is regarded as dirty data. You can specify the maximum number of dirty data records allowed.
Note If the data in the source contains a null value, MaxCompute Writer cannot convert the data to the VARCHAR type.

Supported data types

The following data type editions are supported: MaxCompute V1.0 data type edition, MaxCompute V2.0 data type edition, and Hive-compatible data type edition. This section provides the support status of data types of each edition.

MaxCompute V1.0 data type edition

Data type MaxCompute Reader for batch data read MaxCompute Writer for batch data write MaxCompute Writer for real-time data write
BIGINT Supported Supported Supported
DOUBLE Supported Supported Supported
DECIMAL Supported Supported Supported
STRING Supported Supported Supported
DATETIME Supported Supported Supported
BOOLEAN Supported Supported Supported
ARRAY Supported Supported Supported
MAP Supported Supported Supported
STRUCT Supported Supported Supported

MaxCompute V2.0 data type edition and Hive-compatible data type edition

Data type MaxCompute Reader for batch data read MaxCompute Writer for batch data write MaxCompute Writer for real-time data write
TINYINT Supported Supported Supported
SMALLINT Supported Supported Supported
INT Supported Supported Supported
BIGINT Supported Supported Supported
BINARY Supported Supported Supported
FLOAT Supported Supported Supported
DOUBLE Supported Supported Supported
DECIMAL(pecision,scale) Supported Supported Supported
VARCHAR(n) Supported Supported Supported
CHAR(n) Not supported Supported Supported
STRING Supported Supported Supported
DATE Supported Supported Supported
DATETIME Supported Supported Supported
TIMESTAMP Supported Supported Supported
BOOLEAN Supported Supported Supported
ARRAY Supported Supported Supported
MAP Supported Supported Supported
STRUCT Supported Supported Supported

Parameters

Parameter Description Required Default value
datasource The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. Yes No default value
table The name of the table to which you want to write data. The name is not case-sensitive. You can specify only one table. Yes No default value
partition The partition to which data is written. The last-level partition must be specified. For example, if you want to write data to a table with three-level partitions, set the partition parameter to a value that contains the third-level partition information, such as pt=20150101, type=1, biz=2.
  • To write data to a non-partitioned table, do not configure this parameter. The data is directly written to the destination table.
  • MaxCompute Writer does not support data write operations based on the partition route. To write data to a partitioned table, make sure that the data is written to the lowest-level partition.
Required only for partitioned tables No default value
column The names of the columns to which you want to write data. If you want to write data to all the columns in the destination table, set this parameter to an asterisk (*), such as "column": ["*"]. If you want to write data only to specific columns in the destination table, set this parameter to the names of the columns. Separate the names with commas (,), such as "column":["id","name"].
  • MaxCompute Writer can filter columns and change the order of columns. For example, a MaxCompute table has three columns: a, b, and c. If you want to write data only to column c and column b, you can enter "column": ["c","b"]. During data synchronization, column a is automatically set to null.
  • The column parameter must explicitly specify all the columns to which you want to write data. This parameter cannot be left empty.
Yes No default value
truncate To ensure the idempotence of write operations, set the truncate parameter to true. If a failed synchronization node is rerun due to a write failure, MaxCompute Writer deletes the data that has been written to the destination table and writes the source data again. This ensures that the same data is written for each rerun.

MaxCompute Writer uses MaxCompute SQL to delete data. MaxCompute SQL cannot ensure data atomicity. Therefore, the TRUNCATE operation is not an atomic operation. Conflicts may occur when multiple nodes delete data from the same table or partition in parallel.

To prevent this issue, we recommend that you do not execute multiple DDL statements to write data to the same partition at the same time. You can create different partitions for nodes that need to run in parallel.

Yes No default value

Configure MaxCompute Writer by using the codeless UI

  1. Configure data sources.
    Configure Source and Target for the synchronization node. Write data to MaxCompute
    Parameter Description
    Connection The name of the data source to which you want to write data. This parameter is equivalent to the datasource parameter that is described in the preceding section.
    Table The name of the table to which you want to write data. This parameter is equivalent to the table parameter that is described in the preceding section. If the table is a partitioned table, you must specify the partition to which you want to write data. You can use scheduling parameters when you specify the partition. For more information about scheduling parameters, see Overview of scheduling parameters.
    Writing Rule The write rule. Valid values:
    • Write with Original Data Deleted (Insert Overwrite): All data in the table or partition is deleted before MaxCompute Writer writes data. This rule is equivalent to the INSERT OVERWRITE statement.
    • Write with Original Data Retained (Insert Into): No data is deleted before MaxCompute Writer writes data. New data is appended upon each run. This rule is equivalent to the INSERT INTO statement.
    Note
    • MaxCompute Reader reads data by using Tunnel commands. Synchronization nodes cannot filter data. Each synchronization node reads all the data from a table or partition.
    • MaxCompute Writer writes data by using Tunnel commands instead of the INSERT INTO statement. You can view complete data in the destination table only after the synchronization node is successfully run. Pay attention to the node dependencies.
    Convert Empty Strings to Null Specifies whether to convert empty strings to null.
    Allow Query After Synchronization is Complete Data write is a process that lasts for a period of time. This parameter specifies whether to allow users to query synchronized data only after the data synchronization node finishes running.
    Note The setting of this parameter does not take effect for nodes for which the Distributed Execution feature is enabled.
    • If you select Yes, you can query the data that is synchronized to MaxCompute by the data synchronization node only after the node finishes running.
      Note A data synchronization node can synchronize a maximum of 1 TB of data to a destination. If severe data skew occurs, the maximum of amount of data that is written to the destination may be less than this threshold.
    • If you select No, you can query the data that is synchronized to MaxCompute by the data synchronization node when the node is running.
    Note Before source data is written to MaxCompute, a partition may be automatically created in the destination MaxCompute table or existing data in the destination MaxCompute table or destination partition may be deleted. Therefore, when the data synchronization node is running, you may view the partition that is automatically created or the table or partition from which existing data is deleted regardless of whether you set Allow Query After Synchronization is Complete to Yes or No. We recommend that you do not determine whether data synchronization is complete based on whether the automatically created partition exists in the destination or no more data records are written to the destination. If you want to determine whether data synchronization is complete, we recommend that you create an ODPS SQL node and configure the ODPS SQL node as the descendant node of your data synchronization node. The system creates a MaxCompute table or a partition for the ODPS SQL node. You can determine whether data synchronization is complete based on whether the MaxCompute table or partition exists in the ODPS SQL node.
  2. Configure field mappings. This operation is equivalent to setting the column parameter when you use the code editor. Fields in the source on the left have a one-to-one mapping with fields in the destination on the right. Configure field mappings
    Parameter Description
    Map Fields with the Same Name Click Map Fields with the Same Name to establish mappings between fields with the same name. The data types of the fields must match.
    Map Fields in the Same Line Click Map Fields in the Same Line to establish mappings between fields in the same row. The data types of the fields must match.
    Delete All Mappings Click Delete All Mappings to remove the mappings that are established.
    Auto Layout Click Auto Layout. Then, the system automatically sorts the fields based on specific rules.
  3. Configure channel control policies. Channel control
    Parameter Description
    Expected Maximum Concurrency The maximum number of parallel threads that the synchronization node uses to read data from the source or write data to the destination. You can configure the parallelism for the synchronization node on the codeless UI.
    Bandwidth Throttling Specifies whether to enable throttling. You can enable throttling and specify a maximum transmission rate to prevent heavy read workloads on the source. We recommend that you enable throttling and set the maximum transmission rate to an appropriate value based on the configurations of the source.
    Dirty Data Records Allowed The maximum number of dirty data records allowed.
    Distributed Execution

    The distributed execution mode that allows you to split your node into pieces and distribute them to multiple Elastic Compute Service (ECS) instances for parallel execution. This speeds up synchronization. If you use a large number of parallel threads to run your synchronization node in distributed execution mode, excessive access requests are sent to the data sources. Therefore, before you use the distributed execution mode, you must evaluate the access load on the data sources. You can enable this mode only if you use an exclusive resource group for Data Integration. For more information about exclusive resource groups for Data Integration, see Exclusive resource groups for Data Integration and Create and use an exclusive resource group for Data Integration.

Configure MaxCompute Writer by using the code editor

For more information about how to configure a synchronization node by using the code editor, see Configure a batch synchronization node by using the code editor.

In the following code, a synchronization node is configured to write data to MaxCompute. For more information about the parameters, see the preceding parameter description.
{
    "type":"job",
    "version":"2.0",// The version number. 
    "steps":[
        {
            "stepType":"stream",
            "parameter":{},
            "name":"Reader",
            "category":"reader"
        },
        {
            "stepType":"odps",// The writer type. 
            "parameter":{
                "partition":"",// The name of the partitions to which you want to write data. 
                "truncate":true,// The write rule. 
                "compress":false,// Specifies whether to enable compression. 
                "datasource":"odps_first",// The name of the data source. 
            "column": [// The names of the columns to which you want to write data. 
                "id",
                "name",
                "age",
                "sex",
                "salary",
                "interest"
                ],
                "emptyAsNull":false,// Specifies whether to convert empty strings to null. 
                "table":""// The name of the table to which you want to write data. 
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed. 
        },
        "speed":{
            "throttle":true,// Specifies whether to enable bandwidth throttling. The value false indicates that bandwidth throttling is disabled, and the value true indicates that bandwidth throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
            "concurrent":1, // The maximum number of parallel threads. 
            "mbps":"12"// The maximum transmission rate.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
If you want to specify the Tunnel endpoint, you can configure the data source in the code editor. To configure the data source, replace "datasource":"", in the preceding code with detailed parameters of the data source. Example:
"accessId":"<yourAccessKeyId>",
 "accessKey":"<yourAccessKeySecret>",
 "endpoint":"http://service.eu-central-1.maxcompute.aliyun-inc.com/api",
 "odpsServer":"http://service.eu-central-1.maxcompute.aliyun-inc.com/api", 
"tunnelServer":"http://dt.eu-central-1.maxcompute.aliyun.com", 
"project":"**********", 

Additional information

  • Column filter

    MaxCompute Writer allows you to perform operations that MaxCompute does not support, such as filtering columns, reordering columns, and setting empty fields to null. If you want to write data to all the columns in the destination table, set the column parameter to ["*"].

    For example, a MaxCompute table has three columns: a, b, and c. If you want to write data only to column c and column b, you can set the column parameter to ["c","b"]. The first column and the second column in the source table are written to column c and column b in the MaxCompute table. During data synchronization, column a is automatically set to null.

  • Handling column configuration errors

    To prevent data loss caused by redundant columns and ensure high data reliability, MaxCompute Writer returns an error message if the number of columns that are to be written is more than that in the destination table. For example, if a MaxCompute table contains columns a, b, and c, MaxCompute Writer returns an error message if more than three columns are to be written to the table.

  • Partition configuration

    MaxCompute Writer can write data to the lowest-level partition but cannot write data to a specified partition based on a field. To write data to a partitioned table, specify the lowest-level partition. For example, if you want to write data to a table with three-level partitions, set the partition parameter to a value that contains the third-level partition information, such as pt=20150101, type=1, biz=2. The data cannot be written if you set the partition parameter to pt=20150101, type=1 or pt=20150101.

  • Node rerunning
    To ensure the idempotence of write operations, set the truncate parameter to true. If a failed synchronization node is rerun due to a write failure, MaxCompute Writer deletes the data that has been written to the destination table and writes the source data again. This ensures that the same data is written for each rerun. If a synchronization node is interrupted due to other exceptions, the idempotence of data cannot be ensured, the data cannot be rolled back, and the node cannot be automatically rerun. You can ensure the idempotence of write operations and the data integrity by setting the truncate parameter to true.
    Note If the truncate parameter is set to true, all data in the specified partition or table is deleted before a rerun. Exercise caution when you configure this parameter.