Step 1: Activate the DataHub service
- Log in to the DataHub console.
- Follow the on-screen instructions to activate the service.
Step 2: Create a Project and a Topic
- Log in to the DataHub console.
- Click Create Project. In the dialog box, set the Name (must start with a letter; 3–32 characters; letters, digits, and underscores only) and Description (up to 1,024 characters), then click Create.
| Parameter | Description |
| Project | A Project is the basic organizational unit in DataHub, containing one or more Topics. DataHub Projects are independent of MaxCompute projects, so you must create a separate Project in DataHub. |
| Description | The description of the Project. |
- On the Project Details page, click Create Topic. In the New Topic dialog box, for Creation Method, select Create Directly or Import MaxCompute table schema.
| Parameter | Description |
| Creation Method | Create a Topic from scratch or import the schema from an existing MaxCompute table. |
| Name | The name of the Topic. |
| Type | The Topic type. |
| Schema Details | Appears when you select |
| Number of Shards | Concurrent channel for data transmission within a Topic. Each Shard has an ID and a state such as |
| Lifecycle | Data retention period for the Topic, in days (1–7). To change this value, use the Java SDK. |
| Description | The description of the Topic. |
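
The naming and lifecycle limits stated above can be checked before you open the console. The following is a minimal sketch, not part of DataHub: the function names are hypothetical, and it assumes that "letters" means ASCII letters.

```python
import re

# Assumption: "letters" means ASCII letters. The rule from the text is:
# starts with a letter, 3-32 characters, only letters, digits, underscores.
PROJECT_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,31}")


def validate_project(name, description=""):
    """Return a list of problems; an empty list means the inputs look valid."""
    errors = []
    if not PROJECT_NAME_RE.fullmatch(name):
        errors.append(
            "Name must start with a letter and contain 3-32 letters, "
            "digits, or underscores."
        )
    if len(description) > 1024:
        errors.append("Description must be at most 1,024 characters.")
    return errors


def validate_lifecycle(days):
    """Check that a Topic lifecycle is a whole number of days from 1 to 7."""
    return isinstance(days, int) and 1 <= days <= 7
```

For example, `validate_project("my_project")` returns an empty list, while `validate_project("1abc")` reports a name error because the name starts with a digit.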
Step 3: Write data
DataHub supports multiple data ingestion methods: Flume for logs, DTS or Canal for databases, or an SDK. This example uses the console tool to upload a file.
- Download and decompress the console command-line tool package, then configure the AccessKey pair and endpoint.
- Use the uf command to upload the file:
  uf -f /temp/test.csv -p test_topic -t test_topic -m "," -n 1000
- Verify that the data was written: check the latest write time and total data volume on the Shard List tab of the Topic Details page.
- Sample data to check data quality:
  - Select the Shard and start time for sampling.
  - Click Sample to view the data.
  - In the sampling dialog box, set the Sample Count (default: 20) and use Select Filter Fields to filter by specific fields.
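
Before running the upload, it can help to confirm that every row of the file has the same number of columns for the chosen delimiter. The sketch below is a hypothetical local helper, not part of the console tool. It also formats the uf command line shown above; the parameter names (file path, project, topic, delimiter, batch size) are inferred from the example command and are assumptions, not documented flag meanings.

```python
import csv
import io


def check_csv(text, delimiter=","):
    """Return (row_count, column_count); raise ValueError on ragged rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        raise ValueError("no records found")
    width = len(rows[0])
    for number, row in enumerate(rows, start=1):
        if len(row) != width:
            raise ValueError(
                f"row {number} has {len(row)} columns, expected {width}"
            )
    return len(rows), width


def format_uf_command(file_path, project, topic, delimiter=",", batch_size=1000):
    """Build the uf command string in the same shape as the example above.

    Assumption: -f, -p, -t, -m, and -n take the file path, project, topic,
    delimiter, and batch size respectively.
    """
    return (
        f'uf -f {file_path} -p {project} -t {topic} '
        f'-m "{delimiter}" -n {batch_size}'
    )
```

A typical use is to read the file contents, pass them to `check_csv`, and paste the string from `format_uf_command` into the console tool only if the check passes.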
Step 4: Synchronize data
This example demonstrates how to synchronize data to MaxCompute.
- Navigate to the Project List/Project Details/Topic Details page.
- In the upper-right corner, click + Sync to create a synchronization task.
- Select the MaxCompute job type:
  1) For TUPLE type synchronization, configure the following parameters in the New Connector dialog box: Project Name, Table Name, AccessKey ID, AccessKey Secret, Fields to Import, Partitioning Mode, Partition Configuration, Partition Interval, Time Zone, Start Time, and TimestampUnit. When finished, click Create.
Selected configuration notes:
Key configuration parameters for console-based synchronization tasks are described below. For advanced options, use the SDK.
- Fields to Import: Synchronize only specific columns to a MaxCompute table.
- Partitioning Mode: Determines which MaxCompute partition receives the data. Supported modes:
| Partitioning mode | Partition basis | Supported types | Description |
| USER_DEFINE | Partition column value in the record. Column name must match the MaxCompute partition field. | TUPLE | (1) The DataHub schema must include the MaxCompute partition fields. (2) The value of this column must be a |
| SYSTEM_TIME | The time the record is written to DataHub. | TUPLE / BLOB | (1) In Partition Configuration, set the format for converting the timestamp to a MaxCompute partition. (2) Set the time zone. |
| EVENT_TIME | The value of the | TUPLE | (1) In Partition Configuration, set the format for converting the timestamp to a MaxCompute partition. (2) Set the time zone. |
| META_TIME | The value of the | TUPLE / BLOB | (1) In Partition Configuration, set the format for converting the timestamp to a MaxCompute partition. (2) Set the time zone. |
The SYSTEM_TIME, EVENT_TIME, and META_TIME modes use a timestamp and time zone to determine the MaxCompute partition. The default timestamp unit is microseconds.
- Partition Configuration: Converts a timestamp into a MaxCompute partition. The console uses a fixed partition format by default:
| Partition | Time format | Description |
| ds | %Y%m%d | Day |
| hh | %H | Hour |
| mm | %M | Minute |
- Partition Interval: Time interval for converting timestamps to MaxCompute partitions. Range: 15 minutes to 1,440 minutes (1 day), in increments of 15 minutes.
- Time Zone: Time zone used to convert timestamps into MaxCompute partitions.
- For BLOB data, specify a hexadecimal delimiter to split records before synchronizing to MaxCompute. For example, 0A represents the newline character (\n).
- DataHub stores BLOB data as binary, but the MaxCompute column uses the STRING type. The console Base64-encodes BLOB data by default before synchronizing. For advanced options, use the SDK.
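
To make the time-based partitioning and BLOB handling described above more concrete, here is a minimal local sketch. It is not DataHub code, and the function names are hypothetical. It assumes that timestamps are in microseconds (the stated default), that partition intervals are aligned to local midnight and timestamps are floored to the start of their interval, and that BLOB data is split on the delimiter first with each non-empty record then Base64-encoded separately. The actual connector may differ on these points.

```python
import base64
from datetime import datetime, timedelta, timezone

# Default console partition formats, taken from the table above.
PARTITION_FORMATS = {"ds": "%Y%m%d", "hh": "%H", "mm": "%M"}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_partition(timestamp_us, tz, interval_minutes=15):
    """Map a microsecond timestamp to ds/hh/mm partition values.

    Assumption: intervals are aligned to local midnight and the timestamp
    is floored to the start of its interval.
    """
    if interval_minutes % 15 != 0 or not 15 <= interval_minutes <= 1440:
        raise ValueError("interval must be 15-1440 minutes in steps of 15")
    local = (EPOCH + timedelta(microseconds=timestamp_us)).astimezone(tz)
    minutes = local.hour * 60 + local.minute
    floored = minutes - minutes % interval_minutes
    midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
    start = midnight + timedelta(minutes=floored)
    return {name: start.strftime(fmt) for name, fmt in PARTITION_FORMATS.items()}


def split_and_encode(blob, hex_delimiter="0A"):
    """Split binary data on a hexadecimal delimiter, then Base64-encode.

    Assumption: empty records (for example, after a trailing delimiter)
    are dropped.
    """
    delimiter = bytes.fromhex(hex_delimiter)
    return [
        base64.b64encode(part).decode("ascii")
        for part in blob.split(delimiter)
        if part
    ]
```

With a fixed UTC+8 offset and a 15-minute interval, a record stamped 10:37 on 2024-01-01 local time maps to `{"ds": "20240101", "hh": "10", "mm": "30"}`, and `split_and_encode(b"a\nb\n")` returns `["YQ==", "Yg=="]`.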
Step 5: View the synchronization task
The Connector details page shows the task status, checkpoint information, and monitoring metrics such as Sync Latency, DoneTime, and Dirty Data Count. You can restart or stop the task and manage Sync Task Fields. Updates take effect immediately.