Write and synchronize data to MaxCompute - DataHub

Step 1: Activate DataHub

Log on to the DataHub console.
Activate the service as prompted on the page.

Step 2: Create a project and a topic

1. Log on to the DataHub console.
2. Click the Create Project button and enter the required information to create a project.

Parameter	Description
Project	A project is the basic organizational unit for DataHub data and contains multiple topics. DataHub projects and MaxCompute projects are independent of each other. Projects created in MaxCompute cannot be reused in DataHub. You must create them separately.
Description	The description of the project.

3. On the project details page, click the Create Topic button to create a topic.

Parameter	Description
Creation	The method used to create the topic.
Name	The name of the topic.
Type	The topic category. `TUPLE` represents structured data, and `BLOB` represents unstructured data.
Schema details	Selecting the TUPLE type displays the schema details. Create fields as needed. If you allow NULL values, the field is automatically set to NULL when the upstream data is missing a value. Otherwise, the system performs a strict check, and a write error occurs if the field type does not match.
Number of shards	A shard is a concurrent channel for data transmission within a topic. Each shard has a unique ID. A shard can have multiple states, such as `Opening` (starting) and `Active` (started and ready to serve). Each enabled shard consumes server-side resources. Request the number of shards as needed.
Lifecycle	The maximum amount of time, in days, that data written to the topic can be stored. The minimum value is 1 and the maximum value is 7. To modify the lifecycle, use the Java SDK.
Description	The description of the topic.
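
The NULL rule in the Schema details row can be illustrated with a minimal sketch. This is only a model of the described behavior, not DataHub's implementation: the `Field` class, the `normalize` function, and the use of Python types as stand-ins for DataHub field types are all hypothetical. It also assumes that a missing value in a field that does not allow NULL is rejected, which the table does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    py_type: type            # hypothetical stand-in for a DataHub field type
    allow_null: bool = True


def normalize(record: dict, schema: list) -> dict:
    """Apply the NULL rule to one upstream record (illustrative only)."""
    out = {}
    for field in schema:
        value = record.get(field.name)
        if value is None:
            if not field.allow_null:
                # Assumption: a required field with no value is a write error.
                raise ValueError(f"{field.name}: value is required")
            out[field.name] = None          # missing value becomes NULL
        elif not isinstance(value, field.py_type):
            # Strict check: a type mismatch is a write error.
            raise TypeError(f"{field.name}: expected {field.py_type.__name__}")
        else:
            out[field.name] = value
    return out


schema = [Field("id", int, allow_null=False), Field("note", str)]
print(normalize({"id": 1}, schema))   # {'id': 1, 'note': None}
```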

Step 3: Write data

DataHub supports multiple methods for writing data. You can use plugins such as Flume for logs, or DTS and Canal for databases. You can also write data using an SDK. This example shows how to use the console tool to write data by uploading a file.

Download and decompress the console tool package. Configure the AccessKey and endpoint information. For more information, see Console command tool.

Run the uf command to upload the file.

uf -f /temp/test.csv -p test_topic -t test_topic -m "," -n 1000
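
If you do not have a file to upload yet, a short script can produce one. The sketch below writes a comma-delimited file that matches the `-m ","` delimiter in the command above. The three-column layout and the `write_sample_csv` helper are hypothetical; the columns must match your own topic schema. No header row is written, because this document does not say whether the tool expects one.

```python
import csv
import os
import tempfile


def write_sample_csv(path, rows, delimiter=","):
    """Write rows to a delimited text file and return its path."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
        writer.writerows(rows)
    return path


# Hypothetical sample rows: (id, name, score).
rows = [(1, "alice", 20.5), (2, "bob", 31.0)]
path = write_sample_csv(os.path.join(tempfile.gettempdir(), "test.csv"), rows)

with open(path, encoding="utf-8") as handle:
    print(handle.read())
```

Pass the resulting path to the `-f` option of the `uf` command.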

In the web console, check whether the data was written successfully. You can view the data writing status, the latest data write time, and the total data volume.
To check the data quality, sample the data:
1. Select the shard and start time for sampling.
2. Click Sample to view the data.

Step 4: Synchronize data

The following example shows how to sync data to MaxCompute.

Navigate to the Project List > Project Details > Topic Details page.
In the upper-right corner, click the + Sync button to create a sync task.
Select the MaxCompute job type.

Parameter descriptions

This section describes some of the parameters for creating a sync task in the console. For more flexible operations, you can use the SDK.

Import Fields
You can sync data from specific columns in DataHub to a MaxCompute table based on your configuration.
Partition Mode
The partition mode determines the MaxCompute partition to which data is written. DataHub supports the following partition modes:

Partition mode	Partition basis	Supported topic types	Description
USER_DEFINE	The value of the partition key column in the record. The column name is the same as the MaxCompute partition field.	TUPLE	(1). The DataHub schema must include the MaxCompute partition field. (2). The value of this column must be a non-empty UTF-8 string.
SYSTEM_TIME	The time when the record is written to DataHub.	TUPLE / BLOB	(1). In the partition configuration, set the time transform format for the MaxCompute partition. (2). Set the time zone information.
EVENT_TIME	The value of the `event_time(TIMESTAMP)` column in the record.	TUPLE	(1). In the partition configuration, set the time transform format for the MaxCompute partition. (2). Set the time zone information.
META_TIME	The value of the `__dh_meta_time__` property field of the record.	TUPLE / BLOB	(1). In the partition configuration, set the time transform format for the MaxCompute partition. (2). Set the time zone information.

The SYSTEM_TIME, EVENT_TIME, and META_TIME modes use the timestamp and time zone configuration to determine the MaxCompute partition. The default unit for the timestamp is microseconds.

The partition configuration determines how timestamps are converted for MaxCompute partitions. By default, the console uses fixed MaxCompute partition formats. The partition configurations are as follows:

Partition	Time format	Description
ds	%Y%m%d	day
hh	%H	hour
mm	%M	minute
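
As a rough illustration of how the formats in the table apply, the following sketch converts a microsecond timestamp to `ds`, `hh`, and `mm` values. The `partition_values` function is hypothetical, and the time zone is modeled as a fixed UTC offset for simplicity. It shows the format strings in use and is not a description of the service's internal logic.

```python
from datetime import datetime, timedelta, timezone

# Formats from the table above.
PARTITION_FORMATS = {"ds": "%Y%m%d", "hh": "%H", "mm": "%M"}


def partition_values(timestamp_us: int, utc_offset_hours: int) -> dict:
    """Format a microsecond timestamp as ds/hh/mm partition values."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    moment = datetime.fromtimestamp(timestamp_us // 1_000_000, tz=tz)
    return {key: moment.strftime(fmt) for key, fmt in PARTITION_FORMATS.items()}


# 2023-11-14 22:13:20 UTC, expressed in microseconds.
ts = 1_700_000_000_000_000
print(partition_values(ts, 0))   # {'ds': '20231114', 'hh': '22', 'mm': '13'}
print(partition_values(ts, 8))   # {'ds': '20231115', 'hh': '06', 'mm': '13'}
```

The same record lands in a different `ds` partition depending on the configured time zone, which is why the time zone setting matters.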

The partition interval determines the time interval for converting timestamps for MaxCompute partitions. The time range is from 15 minutes to 1440 minutes (1 day), and the step interval is 15 minutes.
The time zone parameter specifies the time zone used for conversion when you partition MaxCompute based on timestamps.
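
The interval constraint can be sketched as follows. This document does not state how a timestamp is mapped onto an interval, so this example assumes that it is rounded down to the start of its interval in the configured time zone. The `floor_to_interval` function and its parameters are hypothetical.

```python
def floor_to_interval(timestamp_us: int, interval_minutes: int,
                      utc_offset_minutes: int = 0) -> int:
    """Round a microsecond timestamp down to the start of its interval.

    Assumption: intervals are aligned to local midnight in the given zone.
    """
    if not 15 <= interval_minutes <= 1440 or interval_minutes % 15 != 0:
        raise ValueError("interval must be 15 to 1440 minutes in steps of 15")
    interval_us = interval_minutes * 60 * 1_000_000
    offset_us = utc_offset_minutes * 60 * 1_000_000
    local_us = timestamp_us + offset_us            # shift to local wall time
    return local_us - local_us % interval_us - offset_us


ts = 1_700_000_000_000_000                         # 2023-11-14 22:13:20 UTC
print(floor_to_interval(ts, 15))                   # 1699999200000000 (22:00:00 UTC)
print(floor_to_interval(ts, 1440, 480))            # 1699977600000000 (00:00 at UTC+8)
```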
When you sync BLOB data, you can specify a hexadecimal separator to split the data before you sync it to MaxCompute. For example, 0A represents a line feed (\n).
By default, DataHub BLOB topics store binary data. The corresponding column in the MaxCompute sync task is the STRING type. Therefore, when you create a sync task in the console, the data is Base64-encoded by default before synchronization. For more customization, you can use the SDK.
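
The separator and Base64 behavior can be approximated with a short sketch. It assumes that the data is split first, that each piece is then encoded, and that empty pieces are dropped. None of these details is confirmed by this document, and the `split_and_encode` function is hypothetical.

```python
import base64


def split_and_encode(blob: bytes, separator_hex: str) -> list:
    """Split binary data on a hex-specified separator, then Base64-encode."""
    separator = bytes.fromhex(separator_hex)       # "0A" -> b"\n"
    pieces = [piece for piece in blob.split(separator) if piece]
    return [base64.b64encode(piece).decode("ascii") for piece in pieces]


print(split_and_encode(b"hello\nworld\n", "0A"))   # ['aGVsbG8=', 'd29ybGQ=']
```

A consumer that reads the STRING column in MaxCompute would need to Base64-decode each value to recover the original bytes.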

Step 5: View the sync task

On the details page of the corresponding connector, you can view the running status and checkpoint information of the sync task. This information includes the sync checkpoint, sync status, and operations such as restart and stop.

For more information, see Create a MaxCompute sync task.