DataHub: Quick start (synchronization example)

Last Updated: Jun 08, 2026

Step 1: Activate the DataHub service

  1. Log in to the DataHub console.

  2. Follow the on-screen instructions to activate the service.

Step 2: Create a Project and a Topic

  1. Log in to the DataHub console.

  2. Click Create Project. In the dialog box, set the Name (must start with a letter, 3–32 characters, only letters, digits, and underscores) and Description (up to 1,024 characters), then click Create.

| Parameter | Description |
| --- | --- |
| Project | A Project is the basic organizational unit in DataHub, containing one or more Topics. DataHub Projects are independent of MaxCompute projects, so you must create a separate Project in DataHub. |
| Description | The description of the Project. |

  3. On the Project Details page, click Create Topic. In the New Topic dialog box, for Creation Method, select Create Directly or Import MaxCompute table schema.

| Parameter | Description |
| --- | --- |
| Creation Method | Create a Topic from scratch or import the schema from an existing MaxCompute table. |
| Name | The name of the Topic. |
| Type | The Topic type. TUPLE represents structured data, and BLOB represents unstructured data. |
| Schema Details | Appears when you select TUPLE. Define fields as needed. If a field allows NULL, missing upstream values default to NULL. If NULL is not allowed, DataHub validates strictly and reports an error on type mismatch. |
| Number of Shards | The number of Shards in the Topic. A Shard is a concurrent channel for data transmission within a Topic. Each Shard has an ID and a state such as Opening or Active. Each active Shard consumes server resources, so allocate only what you need. |
| Lifecycle | Data retention period for the Topic, in days (1–7). To change this value, use the Java SDK. |
| Description | The description of the Topic. |
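
The limits stated in this step can be checked before you open the console. The following Python sketch is only an illustration of those rules: the function names are hypothetical, and the console may enforce additional constraints that this page does not list.

```python
import re

# Project name rule from this step: starts with a letter, 3-32 characters,
# only letters, digits, and underscores.
_PROJECT_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,31}")


def is_valid_project_name(name: str) -> bool:
    """Return True if the name satisfies the Project naming rule."""
    return _PROJECT_NAME.fullmatch(name) is not None


def is_valid_project_description(text: str) -> bool:
    """Return True if the Project description is at most 1,024 characters."""
    return len(text) <= 1024


def is_valid_lifecycle(days: int) -> bool:
    """Return True if the Topic lifecycle is within the 1-7 day range."""
    return 1 <= days <= 7
```

For example, `is_valid_project_name("test_project")` passes, while a name that starts with a digit or has fewer than three characters does not.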

Step 3: Write data

DataHub supports multiple data ingestion methods: Flume for logs, DTS or Canal for databases, or an SDK. This example uses the console tool to upload a file.

  1. Download and decompress the console command-line tool package, then configure the AccessKey pair and endpoint.

  2. Use the uf command to upload the file.

    uf -f /temp/test.csv -p test_topic -t test_topic -m "," -n 1000

  3. Verify that the data was written. Check the latest write time and total data volume on the Shard List tab of the Topic Details page.

  4. Sample data to check data quality.

    1. Select the Shard and start time for sampling.

    2. Click Sample to view the data.

In the sampling dialog box, set the Sample Count (default: 20) and use Select Filter Fields to filter by specific fields.
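
If you do not have a file to upload yet, a small comma-delimited test file is enough for this step. The Python sketch below generates one. The three columns (an ID, a name, and a microsecond timestamp) are hypothetical and must be changed to match the schema of your Topic, and writing the file without a header row is an assumption, since this page does not say whether the uf command expects one.

```python
import csv


def write_sample_csv(path: str, rows: int) -> int:
    """Write `rows` comma-delimited records to `path` and return the row count.

    Columns are hypothetical: id, name, event time in microseconds.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=",")
        for i in range(rows):
            writer.writerow([i, f"user_{i}", 1700000000000000 + i])
    return rows
```

Calling `write_sample_csv("test.csv", 1000)` produces a file that you can then pass to the `-f` option of the uf command.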

Step 4: Synchronize data

This example demonstrates how to synchronize data to MaxCompute.

  1. Navigate to Project List > Project Details > Topic Details.

  2. In the upper-right corner, click + Sync to create a synchronization task.

  3. Select the MaxCompute job type:

    1) For TUPLE type synchronization, configure the following parameters in the New Connector dialog box: Project Name, Table Name, AccessKey ID, AccessKey Secret, Fields to Import, Partitioning Mode, Partition Configuration, Partition Interval, Time Zone, Start Time, and TimestampUnit. When finished, click Create.

Selected configuration notes:

Key configuration parameters for console-based synchronization tasks are described below. For advanced options, use the SDK.

  1. Fields to Import

    Synchronize only specific columns to a MaxCompute table.

  2. Partitioning Mode

    Determines which MaxCompute partition receives the data. Supported modes:

| Partitioning mode | Partition basis | Supported types | Description |
| --- | --- | --- | --- |
| USER_DEFINE | Partition column value in the record. Column name must match the MaxCompute partition field. | TUPLE | (1) The DataHub schema must include the MaxCompute partition fields. (2) The value of this column must be a non-empty UTF-8 string. |
| SYSTEM_TIME | The time the record is written to DataHub. | TUPLE / BLOB | (1) In Partition Configuration, set the format for converting the timestamp to a MaxCompute partition. (2) Set the time zone. |
| EVENT_TIME | The value of the event_time (TIMESTAMP) column in the record. | TUPLE | (1) In Partition Configuration, set the format for converting the timestamp to a MaxCompute partition. (2) Set the time zone. |
| META_TIME | The value of the __dh_meta_time__ attribute field in the record. | TUPLE / BLOB | (1) In Partition Configuration, set the format for converting the timestamp to a MaxCompute partition. (2) Set the time zone. |

The SYSTEM_TIME, EVENT_TIME, and META_TIME modes use a timestamp and time zone to determine the MaxCompute partition. The default timestamp unit is microseconds.

  3. Partition Configuration

    Converts a timestamp into a MaxCompute partition. The console uses a fixed partition format by default:

| Partition | Time format | Description |
| --- | --- | --- |
| ds | %Y%m%d | Day |
| hh | %H | Hour |
| mm | %M | Minute |
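
As a rough illustration of how a timestamp, a time zone, and the default format combine, the Python sketch below derives ds, hh, and mm values from a microsecond timestamp. The function name `partition_for` and the fixed UTC offset parameter are hypothetical. Rounding the timestamp down to the start of its Partition Interval is also an assumption, because this page does not describe how DataHub rounds within an interval.

```python
from datetime import datetime, timedelta, timezone


def partition_for(timestamp_us: int, utc_offset_hours: int = 8,
                  interval_minutes: int = 15) -> dict:
    """Map a microsecond timestamp to ds/hh/mm partition values.

    Assumption: the timestamp is floored to the start of its interval,
    measured in local time, before it is formatted.
    """
    if interval_minutes % 15 != 0 or not 15 <= interval_minutes <= 1440:
        raise ValueError("interval must be 15-1440 minutes in steps of 15")

    tz = timezone(timedelta(hours=utc_offset_hours))
    interval_us = interval_minutes * 60 * 1_000_000
    offset_us = utc_offset_hours * 3600 * 1_000_000

    # Floor in local time so that interval boundaries line up with local midnight.
    local_us = timestamp_us + offset_us
    floored_us = local_us - local_us % interval_us - offset_us

    # Integer timedelta arithmetic avoids floating-point rounding.
    moment = datetime.fromtimestamp(0, tz) + timedelta(microseconds=floored_us)
    return {
        "ds": moment.strftime("%Y%m%d"),
        "hh": moment.strftime("%H"),
        "mm": moment.strftime("%M"),
    }


# 1700000000000000 microseconds is 2023-11-14 22:13:20 UTC, or 06:13:20 at UTC+8.
print(partition_for(1700000000000000))
# prints {'ds': '20231115', 'hh': '06', 'mm': '00'}
```

The same timestamp with `utc_offset_hours=0` lands in ds=20231114, hh=22, which shows why the time zone setting matters for records written near midnight.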

  4. Partition Interval

    The time interval for converting timestamps to MaxCompute partitions. Range: 15 minutes to 1,440 minutes (1 day), in increments of 15 minutes.

  5. Time Zone

    The time zone used to convert timestamps into MaxCompute partitions.

  6. For BLOB data, specify a hexadecimal delimiter to split records before synchronizing to MaxCompute. For example, 0A represents the newline character (\n).

  7. DataHub stores BLOB data as binary, but the MaxCompute column uses the STRING type. The console Base64-encodes BLOB data by default before synchronizing. For advanced options, use the SDK.
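
The BLOB handling described in the last two items can be pictured with the short Python sketch below. It splits a binary payload on a hexadecimal delimiter and Base64-encodes each piece so that it fits a STRING column. The function name is hypothetical, and dropping empty pieces (for example, after a trailing delimiter) is an assumption; this is not DataHub's own implementation.

```python
import base64


def split_and_encode(blob: bytes, delimiter_hex: str = "0A") -> list[str]:
    """Split `blob` on the delimiter given in hex, then Base64-encode each part."""
    delimiter = bytes.fromhex(delimiter_hex)  # "0A" becomes b"\n"
    parts = blob.split(delimiter)
    # Assumption: empty parts are skipped rather than written as empty rows.
    return [base64.b64encode(part).decode("ascii") for part in parts if part]


print(split_and_encode(b"hello\nworld\n"))
# prints ['aGVsbG8=', 'd29ybGQ=']
```

A consumer that reads the MaxCompute table would need to Base64-decode the STRING value to recover the original bytes.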

Step 5: View the synchronization task

The Connector details page shows the task status, checkpoint information, and monitoring metrics such as Sync Latency, DoneTime, and Dirty Data Count. You can restart or stop the task and manage Sync Task Fields. Updates take effect immediately.
