This topic describes the data types and parameters supported by MaxCompute Reader and how to configure it by using the codeless user interface (UI) and code editor.

MaxCompute Reader allows you to read data from MaxCompute. For more information about MaxCompute, see MaxCompute overview.

Based on the specified information such as the source project, table, partition, and field, MaxCompute Reader reads data from MaxCompute through Tunnel. For more information about common Tunnel commands, see Tunnel commands.

MaxCompute Reader cannot read views. It can read only partitioned tables and non-partitioned tables. To read a partitioned table, you must specify the partition information, for example, pt=1 and ds=hangzhou for the t0 table. Partition information is not required for non-partitioned tables. You can also select some or all of the table fields, change the order of the fields, and add constant fields and partition key columns. Note that partition key columns are not table fields.
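The field selection described above can be pictured with a small Python sketch. It assumes a hypothetical helper, project_row, and made-up sample values for the t0 table; neither comes from MaxCompute Reader itself. Constants follow the single-quotation-mark convention that the column parameter description defines in the Parameters section.

```python
from typing import Any, Dict, List


def project_row(row: Dict[str, Any], partition: Dict[str, str], columns: List[str]) -> List[Any]:
    """Build one output record from a column spec.

    Hypothetical helper for illustration only; not part of MaxCompute Reader.
    """
    record = []
    for spec in columns:
        if len(spec) >= 2 and spec.startswith("'") and spec.endswith("'"):
            # Constant field: the single quotation marks only mark it as a constant.
            record.append(spec[1:-1])
        elif spec in row:
            # Regular table field.
            record.append(row[spec])
        elif spec in partition:
            # Partition key column: not a table field, so it comes from the partition spec.
            record.append(partition[spec])
        else:
            raise KeyError(f"unknown column: {spec}")
    return record


# Assumed sample data for the t0 table with partition pt=1, ds=hangzhou.
row = {"id": 7, "name": "alice", "age": 30}
partition = {"pt": "1", "ds": "hangzhou"}

# Reorder fields, add a constant, and append a partition key column.
record = project_row(row, partition, ["age", "name", "'1988-08-08 08:08:08'", "id", "pt"])
print(record)
```

The sketch only illustrates the ordering and constant rules. The real reader fetches the data through Tunnel rather than from in-memory dictionaries.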

Data types

The following table lists the data types supported by MaxCompute Reader.

| Category | Data Integration data type | MaxCompute data type |
| --- | --- | --- |
| Integer | LONG | BIGINT, INT, TINYINT, and SMALLINT |
| Boolean | BOOLEAN | BOOLEAN |
| Date and time | DATE | DATETIME and TIMESTAMP |
| Floating point | DOUBLE | FLOAT, DOUBLE, and DECIMAL |
| Binary | BYTES | BINARY |
| Complex | STRING | ARRAY, MAP, and STRUCT |
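As a quick way to check how a source column will be typed, the table can be written as a lookup. This is a minimal sketch: the dictionary copies only the rows listed above, and data_integration_type is a hypothetical helper name, not a DataWorks API.

```python
# MaxCompute data type -> Data Integration data type, copied from the table above.
MAXCOMPUTE_TO_DATA_INTEGRATION = {
    "BIGINT": "LONG", "INT": "LONG", "TINYINT": "LONG", "SMALLINT": "LONG",
    "BOOLEAN": "BOOLEAN",
    "DATETIME": "DATE", "TIMESTAMP": "DATE",
    "FLOAT": "DOUBLE", "DOUBLE": "DOUBLE", "DECIMAL": "DOUBLE",
    "BINARY": "BYTES",
    "ARRAY": "STRING", "MAP": "STRING", "STRUCT": "STRING",
}


def data_integration_type(maxcompute_type: str) -> str:
    """Return the Data Integration type for a MaxCompute type name (hypothetical helper)."""
    # Drop type arguments such as DECIMAL(10,2) or ARRAY<INT> before the lookup.
    base = maxcompute_type.strip().upper().split("(")[0].split("<")[0]
    if base not in MAXCOMPUTE_TO_DATA_INTEGRATION:
        raise ValueError(f"{maxcompute_type} is not listed in the table")
    return MAXCOMPUTE_TO_DATA_INTEGRATION[base]


print(data_integration_type("decimal(10,2)"))
print(data_integration_type("ARRAY<INT>"))
```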

Parameters

Parameter Description Required Default value
datasource The connection name. It must be identical to the name of the added connection. You can add connections in the code editor. Yes None
table The name of the source table. The name is case-insensitive. Yes None
partition The partitions that MaxCompute Reader reads. Linux shell wildcards are supported. An asterisk (*) matches zero or more characters, and a question mark (?) indicates that the preceding character is optional. Assume that a partitioned table named test has four partitions: pt=1 and ds=hangzhou, pt=1 and ds=shanghai, pt=2 and ds=hangzhou, and pt=2 and ds=beijing.
  • If you want to read data from the partition with pt=1 and ds=shanghai, enter "partition":"pt=1/ds=shanghai".
  • If you want to read data from all the partitions with pt=1, enter "partition":"pt=1/ds=*".
  • If you want to read data from all the partitions in the test table, enter "partition":"pt=*/ds=*".
Required only for partitioned tables None
column The columns in the source table that MaxCompute Reader reads. Assume that the fields of a table named test are id, name, and age.
  • To read the fields in order, enter "column":["id","name","age"] or "column":["*"].
    Note We recommend that you do not set "column":["*"], because data synchronization may fail if the column order, data types, or number of columns in the source table changes.
  • To read the name and id fields in that order, enter "column":["name","id"].
  • You can add a constant field to the extracted data so that source table columns map properly to destination table columns. Each constant must be enclosed in single quotation marks (' '). For example, if you set "column":["age","name","'1988-08-08 08:08:08'","id"], the extracted data contains, in order, an age column, a name column, the constant 1988-08-08 08:08:08, and an id column.

    The single quotation marks (' ') are used to identify constant columns. The constant column values exclude the single quotation marks (' ').

    Note
    • MaxCompute Reader does not use SELECT statements to read data. Therefore, you cannot specify function fields.
    • The column parameter must explicitly specify a set of columns to be synchronized. The parameter cannot be left empty.
Yes None
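The partition wildcard rules in the table can be sketched in Python as follows. The sketch follows the wording of this topic, where a question mark makes the preceding character optional. That differs from standard shell globbing, in which ? matches exactly one character, so confirm the behavior against your own tables. The functions partition_pattern_to_regex and match_partitions are hypothetical names for illustration, not the reader's implementation.

```python
import re
from typing import List


def partition_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a partition pattern into a regex, per the rules described in this topic."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # zero or more characters
        elif ch == "?":
            parts.append("?")    # the preceding character is optional
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def match_partitions(pattern: str, partitions: List[str]) -> List[str]:
    """Return the partitions that the whole pattern matches."""
    regex = partition_pattern_to_regex(pattern)
    return [p for p in partitions if regex.fullmatch(p)]


# The four partitions of the test table from the partition parameter description.
TEST_PARTITIONS = [
    "pt=1/ds=hangzhou",
    "pt=1/ds=shanghai",
    "pt=2/ds=hangzhou",
    "pt=2/ds=beijing",
]

print(match_partitions("pt=1/ds=shanghai", TEST_PARTITIONS))
print(match_partitions("pt=1/ds=*", TEST_PARTITIONS))
print(match_partitions("pt=*/ds=*", TEST_PARTITIONS))
```

A pattern that begins with ? has no preceding character and raises re.error in this sketch.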

Configure MaxCompute Reader by using the codeless UI

  1. Configure the connections.
    Configure the source and destination connections for the sync node.
    Parameter Description
    Connection The datasource parameter in the preceding parameter description. Select a connection type, and enter the name of a connection that has been configured in DataWorks.
    Table The table parameter in the preceding parameter description.
    Partition Information The partitions to be synchronized.
    Compression Specifies whether to enable compression. Valid values: Enable and Disable.
    Convert Empty Strings to Null Specifies whether to convert empty strings to null.
    Note To synchronize all columns in the source table, enter "column": ["*"]. The partition parameter supports wildcards and can match one or more partitions.
    • "partition":"pt=20140501/ds=*" specifies that all ds partitions with pt=20140501 are to be synchronized.
    • "partition":"pt=top?" specifies that the partitions with pt=top and pt=to are to be synchronized.
    You can also synchronize partition key columns. For example, if a MaxCompute table is partitioned by pt=${bdp.system.bizdate}, you can add pt to the columns to be synchronized. If the column is marked as Unidentified, ignore the mark.
    • To synchronize all partitions, enter pt=*.
    • To synchronize some of the partitions, specify the corresponding dates.
  2. Configure field mappings, that is, the column parameter in the preceding parameter description.
    Fields in the source table on the left have a one-to-one mapping with fields in the destination table on the right. You can click Add to add a field, or move the pointer over a field and click the Delete icon to delete the field.
    Parameter Description
    Map Fields with the Same Name Click Map Fields with the Same Name to establish a mapping between fields with the same name. Note that the data types of the fields must match.
    Map Fields in the Same Line Click Map Fields in the Same Line to establish a mapping for fields in the same row. Note that the data types of the fields must match.
    Delete All Mappings Click Delete All Mappings to remove mappings that have been established.
    Auto Layout Click Auto Layout. The fields are automatically sorted based on specified rules.
    Change Fields Click the Change Fields icon. In the Change Fields dialog box that appears, you can manually edit fields in the source table. Each field occupies a row. The first and the last blank rows are included, whereas other blank rows are ignored.
    Add
    • Click Add to add a field. You can enter constants. Each constant must be enclosed in single quotation marks (' '), such as 'abc' and '123'.
    • You can use scheduling parameters, such as ${bizdate}.
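    • The codeless UI accepts functions supported by relational databases, such as now() and count(1). As noted in the column parameter description, MaxCompute Reader does not use SELECT statements and cannot read function fields, so do not enter functions when MaxCompute is the source.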
    • Fields that cannot be parsed are indicated by Unidentified.
  3. Configure channel control policies.
    Parameter Description
    Expected Maximum Concurrency The maximum number of concurrent threads that the sync node uses to read data from the source and write data to the destination. You can configure the concurrency for a node on the codeless UI.
    Bandwidth Throttling Specifies whether to enable bandwidth throttling. You can enable bandwidth throttling and set a maximum transmission rate to avoid heavy read workload of the source. We recommend that you enable bandwidth throttling and set the maximum transmission rate to a proper value.
    Dirty Data Records Allowed The maximum number of dirty data records allowed.
    Resource Group The resource group used for running the sync node. If a large number of nodes including this sync node are deployed on the default resource group, the sync node may need to wait for resources. We recommend that you purchase an exclusive resource group for data integration or add a custom resource group. For more information, see DataWorks exclusive resources and Add a custom resource group.

Configure MaxCompute Reader by using the code editor

In the following code, a node is configured to read data from MaxCompute. For more information about the parameters, see the preceding parameter description.

{
    "type":"job",
    "version":"2.0",
    "steps":[
        {
            "stepType":"odps",// The reader type.
            "parameter":{
                "partition":[], // The partitions that MaxCompute Reader reads.
                "isCompress":false, // Specifies whether to enable compression.
                "datasource":"",// The connection name.
                "column":[ // The columns to be synchronized.
                    "id"
                ],
                "emptyAsNull":true, // Specifies whether to convert empty strings to null.
                "table":""// The table name.
            },
            "name":"Reader",
            "category":"reader"
        },
        { // The following template is used to configure Stream Writer. For more information, see the corresponding topic.
            "stepType":"stream",
            "parameter":{
            },
            "name":"Writer",
            "category":"writer"
        }
    ],
    "setting":{
        "errorLimit":{
            "record":"0"// The maximum number of dirty data records allowed.
        },
        "speed":{
            "throttle":false,// Specifies whether to enable bandwidth throttling. A value of false indicates that the bandwidth is not throttled. A value of true indicates that the bandwidth is throttled. The maximum transmission rate takes effect only if you set this parameter to true.
            "concurrent":1// The maximum number of concurrent threads.
        }
    },
    "order":{
        "hops":[
            {
                "from":"Reader",
                "to":"Writer"
            }
        ]
    }
}
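Because the sample above contains comments, it is not strict JSON. One way to produce a comment-free version is to build the same structure in a script and serialize it. The following sketch assumes a hypothetical helper named build_job and a placeholder connection name, my_maxcompute_source. The key names and default values are copied from the sample, and the partition is passed in the string form shown in the parameter description.

```python
import json


def build_job(datasource, table, columns, partition=None):
    """Return a sync job dict that mirrors the code editor sample (hypothetical helper)."""
    if not columns:
        # The column parameter must explicitly list columns and cannot be left empty.
        raise ValueError("column must list at least one column")
    reader_parameter = {
        "datasource": datasource,
        "table": table,
        "column": list(columns),
        "isCompress": False,
        "emptyAsNull": True,
    }
    if partition is not None:
        # Partition information is required only for partitioned tables.
        reader_parameter["partition"] = partition
    return {
        "type": "job",
        "version": "2.0",
        "steps": [
            {"stepType": "odps", "parameter": reader_parameter,
             "name": "Reader", "category": "reader"},
            {"stepType": "stream", "parameter": {},
             "name": "Writer", "category": "writer"},
        ],
        "setting": {
            "errorLimit": {"record": "0"},
            "speed": {"throttle": False, "concurrent": 1},
        },
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


# Placeholder connection name; replace it with a connection configured in DataWorks.
job = build_job("my_maxcompute_source", "test", ["id", "name", "age"],
                partition="pt=1/ds=shanghai")
job_text = json.dumps(job, indent=4)
print(job_text)
```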

If you want to specify the Tunnel endpoint of MaxCompute, you can configure the connection in the code editor. To configure the connection, replace "datasource":"", in the preceding code with detailed parameters of the connection. Example:

"accessId":"*******************",
"accessKey":"*******************",
"endpoint":"http://service.eu-central-1.maxcompute.aliyun-inc.com/api",
"odpsServer":"http://service.eu-central-1.maxcompute.aliyun-inc.com/api", 
"tunnelServer":"http://dt.eu-central-1.maxcompute.aliyun.com", 
"project":"*****",