All Products
Search
Document Center

DataHub:Synchronize data to MaxCompute

Last Updated:Jul 24, 2023

Synchronize data to MaxCompute

Preparations

1.Create a MaxCompute table

DataHub allows you to synchronize data to MaxCompute tables. Both partitioned and non-partitioned tables are supported. To facilitate data processing in MaxCompute, we recommend that you synchronize data to partitioned tables.

DataHub allows you to synchronize data from the topics of the TUPLE or BLOB type to MaxCompute tables.

(1) If you want to synchronize data from topics of the TUPLE type, data types in MaxCompute tables must match data types in the DataHub topics. The following table describes the data type mappings.

MaxCompute

DataHub

BIGINT

BIGINT

STRING

STRING

BOOLEAN

BOOLEAN

DOUBLE

DOUBLE

DATETIME

TIMESTAMP

DECIMAL

DECIMAL

TINYINT

TINIINT

SMALLINT

SMALLINT

INT

INTEGER

FLOAT

FLOAT

MAP

Not supported

ARRAY

Not supported

DataHub does not support all data types in MaxCompute. We recommend that you create a MaxCompute schema based on data types supported by DataHub.

(2) If you want to synchronize data from topics of the BLOB type, the MaxCompute schema needs to contain only one field of the STRING type. By default, DataHub synchronizes data to this field.

DataHub

MaxCompute

BLOB

STRING

(3) To facilitate data tracking and troubleshooting, we recommend that you add the __rowkey__ field of the SRTING type when you create a MaxCompute schema. DataHub automatically synchronizes the trace information of the synchronized data in DataHub to this field to facilitate subsequent data troubleshooting.

2.Prepare an account for data synchronization that is granted required permissions

(1) When you create a DataConnector to synchronize data to MaxCompute, you must manually enter the account information that is used to access the MaxCompute table. Make sure that the entered account information is valid. In most cases, you can use a RAM user for MaxCompute.

(2) You must grant the account the required permissions on the MaxCompute table. The permissions include the CreateInstance, Describe, Alter, and Update permissions.

You can manage the permissions on MaxCompute tables in the DataWorks console. For more information, see Configure MaxCompute. You can also use the command-line tool of MaxCompute for authorization.

3.Determine the value of the TimestampUnit parameter.

(1) The TimestampUnit parameter specifies the unit in which the data of the TIMESTAMP type is converted and written to the fields of the DATETIME type in MaxCompute.

(2) If you want the fields of the TIMESTAMP type to be written to MaxCompute in the unit of seconds, select SECOND from the TimestampUnit drop-down list when you create a DataConnector. If you want the fields of the TIMESTAMP type to be written to MaxCompute in the unit of milliseconds, select MILLISECOND. If you want the fields of the TIMESTAMP type to be written to MaxCompute in the unit of microseconds, select MICROSECOND.

Usage notes

  1. More partitions cause slower data synchronization from DataHub due to the current writing standards of MaxCompute. When you create a DataConnector to synchronize data to MaxCompute, we recommend that you limit the number of partitions, especially in USER_DEFINE mode.

    • Make sure that the timestamps of the records in a shard and the shards are as continuous as possible.

    • Do not create excessive partitions when you create partitions in all partition modes.

Create a DataConnector

  1. In the left-side navigation pane of the DataHub console, click Project Manager. On the Project List page, find a project and click View in the Actions column. On the details page of the project, find a topic and click View in the Actions column.

  2. Click the synchronization button +Connector in the upper right corner to create a synchronization task

    8
  3. In the Create Connector panel, click MaxCompute, as shown in the following figure.

    odps_1

    (2) Synchronize data from a topic of the BLOB type

odps_2

Description of a few parameters:

The following part describes a few parameters that are used to create a DataConnector in the DataHub console. For more information about synchronization configurations, see the descriptions of DataHub SDKs.

  1. Import Fields

    DataHub can synchronize the values of specific fields to MaxCompute tables based on your settings.

  2. Partition Mode

    The partition mode determines to which partition in MaxCompute data is written. The following table describes the partition modes supported by DataHub.

Partition mode

Partition basis

Supported data type of a topic

Description

USER_DEFINE

Value of the partition field that has the same name as the partition field in MaxCompute in the record

TUPLE

(1). The DataHub schema must contain the partition field in MaxCompute. (2) The value of this field must be a UTF-8 string. The value of this field can be empty, which indicates that the data is not partitioned.

SYSTEM_TIME

Time when the record is written to DataHub

TUPLE or BLOB

(1). Set the time format for MaxCompute partitions in the partition configurations. (2) Specify the time zone.

EVENT_TIME

Value of the event_time(TIMESTAMP) field in the record

TUPLE

(1). Set the time format for MaxCompute partitions in the partition configurations. (2) Specify the time zone.

META_TIME

Value of the __dh_meta_time__ field in the record

TUPLE / BLOB

(1). Set the time format for MaxCompute partitions in the partition configurations. (2) Specify the time zone.

SYSTEM_TIME, EVENT_TIME, and META_TIME are the partition modes in which data to be synchronized is partitioned based on the timestamp and time zone configurations. The default unit is microseconds.

  1. Partition Config: the timestamp configurations based on which data to be synchronized is partitioned. DataHub supports the fixed time formats in MaxCompute. The following table describes the fixed time formats and corresponding partitions.

Partition

Time format

Description

ds

%Y%m%d

day

hh

%H

hour

mm

%M

minute

  1. Time Range: the interval of a partition when the data is partitioned based on the timestamps. Valid values: 15 to 1440. Unit: minutes. A value of 1440 indicates one day. The step size of the parameter is 15 minutes.

  2. Timezone: the time zone of the timestamps based on which data is partitioned and written to MaxCompute.

  3. Split Key: When you create a DataConnector to synchronize data of the BLOB type, you can specify a hexadecimal delimiter to split the data of the BLOB type and synchronize the data to MaxCompute. For example, 0A indicates a line feed (\n)

  4. Base64 Encode: By default, data of the BLOB type in DataHub is binary data, and the mapped field in MaxCompute is of the STRING type. Therefore, when you create a DataConnector in the DataHub console, Base64 encoding is performed before synchronization by default. You can use DataHub SDKs to meet more customization requirements.

View a DataConnector

odps_3

Exa

mple

1. USER_DEFINE mode

  1. Create a DataHub topic.

    Note: The schema of the topic must contain the partition fields in MaxCompute, and the fields are of the STRING type, as shown in the following figure.

    mc_01
  2. Write data to the DataHub topic. You can use a DataHub SDK to write the data.

    During the test, use a DataHub SDK to write multiple records. The values of [ds,hh,mm] are [20210304,01,15] and [20210304,02,15]. The following figure shows the records.

mc_02

3. Create a DataConnector.

In USER_DEFINE mode, you can configure the partition fields when you create a DataConnector. If no available MaxCompute table exists, a table is automatically created.

mc_03

In this example, the fields to be imported are the f1 and f2 fields. The f3 field is not synchronized.

4. Verify the synchronized data.

odps_4

Query data in the MaxCompute table. The following figure shows the query result.odps+5

In USER_DEFINE mode, DataHub synchronizes data from DataHub to the corresponding partition based on the value of the partition field in MaxCompute.

2. SYSTEM_TIME mode

  1. Create a DataHub topic.

    Note: The partition to which data is written is determined based on the time when data is written to DataHub. Therefore, the topic schema needs to contain only data fields and does not need to contain partition fields, as shown in the following figure.

    mc_04
  2. Write data to the DataHub topic. You can use a DataHub SDK to write the data.

    During the test, use a DataHub SDK to write multiple records. The time when the data is written to DataHub is 2021-03-04 14:02:45. The following figure shows the records.

    mc_06
  3. Create a DataConnector.

    mc_07

    4. Verify the synchronized data.

    You can view the synchronization information of the DataConnector in the DataHub console, such as DoneTime.

    mc_08

    Query data in the MaxCompute table. The following figure shows the query result.

    mc_09

    In SYSTEM_TIME mode, DataHub synchronizes data from DataHub to the corresponding partition based on the time when data was written to DataHub.

    FAQ

    • Issue: The timestamp for data synchronization to MaxCompute changes to 1970-01-19.

      Cause: The default unit of the timestamp for data synchronization from DataHub to MaxCompute is microseconds, and the unit of the timestamp when data was written to DataHub is milliseconds. Solution: Use microseconds for the unit of the timestamp when data is written to DataHub.