DataHub quick start - synchronize data to MaxCompute - DataHub

Step 1: Activate the DataHub service

Log in to the DataHub console.
Follow the on-screen instructions to activate the service.

Step 2: Create a Project and a Topic

Log in to the DataHub console.
Click Create Project. In the dialog box, set the Name (must start with a letter, 3–32 characters, only letters, digits, and underscores) and Description (up to 1,024 characters), then click Create.

Parameter	Description
Project	A Project is the basic organizational unit in DataHub, containing one or more Topics. DataHub Projects are independent of MaxCompute projects — you must create a separate Project in DataHub.
Description	The description of the Project.

3 . On the Project Details page, click Create Topic. In the New Topic dialog box, for Creation Method, select Create Directly or Import MaxCompute table schema.

Parameter	Description
Creation Method	Create a Topic from scratch or import the schema from an existing MaxCompute table.
Name	The name of the Topic.
Type	The Topic type. `TUPLE` represents structured data, and `BLOB` represents unstructured data.
Schema Details	Appears when you select `TUPLE`. Define fields as needed. If a field allows `NULL`, missing upstream values default to `NULL`. If `NULL` is not allowed, DataHub validates strictly and reports an error on type mismatch.
Number of Shards	Concurrent channel for data transmission within a Topic. Each Shard has an ID and a state such as `Opening` or `Active`. Each active Shard consumes server resources, so allocate only what you need.
Lifecycle	Data retention period for the Topic, in days (1–7). To change this value, use the Java SDK.
Description	The description of the Topic.

Step 3: Write data

DataHub supports multiple data ingestion methods: Flume for logs, DTS or Canal for databases, or an SDK. This example uses the console tool to upload a file.

Download and decompress the console tool package, then configure the AccessKey pair and endpoint. console command-line tool.

Use the uf command to upload the file.

uf -f /temp/test.csv -p test_topic -t test_topic -m "," -n 1000

Verify that the data was written. Check the latest write time and total data volume on the Shard List tab of the Topic Details page.
Sample data to check data quality.
1. Select the Shard and start time for sampling.
2. Click Sample to view the data.

In the sampling dialog box, set the Sample Count (default: 20) and use Select Filter Fields to filter by specific fields.

Step 4: Synchronize data

This example demonstrates how to synchronize data to MaxCompute.

Navigate to the Project List/Project Details/Topic Details page.
In the upper-right corner, click + Sync to create a synchronization task.
Select the MaxCompute job type:

1) For TUPLE type synchronization, configure the following parameters in the New Connector dialog box: Project Name, Table Name, AccessKey ID, AccessKey Secret, Fields to Import, Partitioning Mode, Partition Configuration, Partition Interval, Time Zone, Start Time, and TimestampUnit. When finished, click Create.

Selected configuration notes:

Key configuration parameters for console-based synchronization tasks are described below. For advanced options, use the SDK.

Fields to Import

Synchronize only specific columns to a MaxCompute table.
Partitioning Mode

Determines which MaxCompute partition receives the data. Supported modes:

Partitioning mode	Partition basis	Supported types	Description
USER_DEFINE	Partition column value in the record. Column name must match the MaxCompute partition field.	TUPLE	(1) The DataHub schema must include the MaxCompute partition fields. (2) The value of this column must be a `non-empty UTF-8 string`.
SYSTEM_TIME	The time the record is written to DataHub.	TUPLE / BLOB	(1) In Partition Configuration, set the format for converting the timestamp to a MaxCompute partition. (2) Set the time zone.
EVENT_TIME	The value of the `event_time` (TIMESTAMP) column in the record.	TUPLE	(1) In Partition Configuration, set the format for converting the timestamp to a MaxCompute partition. (2) Set the time zone.
META_TIME	The value of the `__dh_meta_time__` attribute field in the record.	TUPLE / BLOB	(1) In Partition Configuration, set the format for converting the timestamp to a MaxCompute partition. (2) Set the time zone.

The SYSTEM_TIME, EVENT_TIME, and META_TIME modes use a timestamp and time zone to determine the MaxCompute partition. The default timestamp unit is microseconds.

The partition configuration converts a timestamp into a MaxCompute partition. The console uses a fixed partition format by default:

Partition	Time format	Description
ds	%Y%m%d	Day
hh	%H	Hour
mm	%M	Minute

Time interval for converting timestamps to MaxCompute partitions. Range: 15 minutes to 1,440 minutes (1 day), in increments of 15 minutes.
Time zone used to convert timestamps into MaxCompute partitions.
For BLOB data, specify a hexadecimal delimiter to split records before synchronizing to MaxCompute. For example, 0A represents the newline character (\n).
DataHub stores BLOB data as binary, but the MaxCompute column uses the STRING type. The console Base64-encodes BLOB data by default before synchronizing. For advanced options, use the SDK.

Step 5: View the synchronization task

The Connector details page shows the task status, checkpoint information, and monitoring metrics such as Sync Latency, DoneTime, and Dirty Data Count. You can restart or stop the task and manage Sync Task Fields. Updates take effect immediately.

Create a task to synchronize data to MaxCompute.