Step 1: Activate DataHub
Log on to the DataHub console.
Activate the service as prompted on the page.
Step 2: Create a project and a topic
Log on to the DataHub console.
Click the Create Project button and enter the required information to create a project.
| Parameter | Description |
| --- | --- |
| Project | A project is the basic organizational unit for DataHub data and contains multiple topics. DataHub projects and MaxCompute projects are independent of each other. Projects created in MaxCompute cannot be reused in DataHub; you must create them separately. |
| Description | The description of the project. |
On the project details page, click the Create Topic button to create a topic.
| Parameter | Description |
| --- | --- |
| Name | The name of the topic. |
| Type | The topic category. `TUPLE` represents structured data, and `BLOB` represents unstructured data. |
| Schema details | If you select the TUPLE type, the schema details are displayed. Create fields as needed. If you allow NULL values, a field is automatically set to NULL when the upstream data is missing that value. Otherwise, the system performs a strict check, and a write error occurs if the field type does not match. |
| Number of shards | A shard is a concurrent channel for data transmission within a topic. Each shard has a unique ID. A shard can be in one of several states, such as `Opening` (starting) and `Active` (started and ready to serve). Each enabled shard consumes server-side resources, so request only as many shards as you need. |
| Lifecycle | The maximum amount of time, in days, that data written to the topic can be stored. The minimum value is 1 and the maximum value is 7. To modify the lifecycle, use the Java SDK. |
| Description | The description of the topic. |
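The NULL-value and strict-check behavior described under Schema details can be sketched in plain Python. The field names and the `validate_record` helper below are illustrative only, not part of any DataHub SDK:

```python
# (name, python type, allow_null) triples; an illustrative stand-in
# for a DataHub TUPLE schema, not an actual schema object.
schema = [
    ("id", int, False),
    ("name", str, True),
]

def validate_record(record: dict, schema) -> dict:
    """Fill missing nullable fields with NULL (None); strictly reject mismatches."""
    out = {}
    for name, typ, allow_null in schema:
        value = record.get(name)
        if value is None:
            if not allow_null:
                raise ValueError(f"field '{name}' may not be NULL")
            out[name] = None  # missing upstream value -> NULL
        elif not isinstance(value, typ):
            # Strict check: a type mismatch causes a write error.
            raise TypeError(f"field '{name}' expects {typ.__name__}")
        else:
            out[name] = value
    return out
```
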
Step 3: Write data
DataHub supports multiple methods for writing data. You can use plugins such as Flume for logs, or DTS and Canal for databases. You can also write data using an SDK. This example shows how to use the console tool to write data by uploading a file.
Download and decompress the console tool package. Configure the AccessKey and endpoint information. For more information, see Console command tool.
Run the `uf` command to upload the file:

```
uf -f /temp/test.csv -p test_topic -t test_topic -m "," -n 1000
```

In the web console, check whether the data was written successfully. You can view the data writing status, the latest data write time, and the total data volume.
Sample the data to check its quality:
Select the shard and start time for sampling.
Click Sample to view the data.
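The file passed to `uf` is a plain delimiter-separated text file. A minimal sketch that generates such a file, assuming illustrative sample rows and a temp-directory path rather than the `/temp/test.csv` path shown above:

```python
import csv
import os
import tempfile

# Illustrative records for a two-column TUPLE topic (id, name).
rows = [("1", "alice"), ("2", "bob")]

# Write a comma-separated file; the delimiter matches the -m "," option.
path = os.path.join(tempfile.gettempdir(), "test.csv")
with open(path, "w", newline="") as f:
    csv.writer(f, delimiter=",").writerows(rows)
```
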
Step 4: Synchronize data
The following example shows how to synchronize data to MaxCompute.
Navigate to the Project List > Project Details > Topic Details page. In the upper-right corner, click the + Sync button to create a sync task. Select the MaxCompute job type.
Parameter descriptions
This section describes some of the parameters for creating a sync task in the console. For more flexible operations, you can use the SDK.
Import Fields
You can sync data from specific columns in DataHub to a MaxCompute table based on your configuration.
Partition Mode
The partition mode determines the MaxCompute partition to which data is written. DataHub supports the following partition modes:
| Partition mode | Partition basis | Supported topic types | Description |
| --- | --- | --- | --- |
| USER_DEFINE | The value of the partition key column in the record. The column name is the same as the MaxCompute partition field. | TUPLE | (1) The DataHub schema must include the MaxCompute partition fields. (2) The value of this column must be a valid MaxCompute partition value. |
| SYSTEM_TIME | The time when the record is written to DataHub. | TUPLE / BLOB | (1) In the partition configuration, set the time conversion format for the MaxCompute partition. (2) Set the time zone information. |
| EVENT_TIME | The value of the `event_time` (TIMESTAMP) field in the record. | TUPLE | (1) In the partition configuration, set the time conversion format for the MaxCompute partition. (2) Set the time zone information. |
| META_TIME | The value of the `event_time` attribute in the record's attributes. | TUPLE / BLOB | (1) In the partition configuration, set the time conversion format for the MaxCompute partition. (2) Set the time zone information. |
The SYSTEM_TIME, EVENT_TIME, and META_TIME modes use the timestamp and time zone configuration to determine the MaxCompute partition. The default unit for the timestamp is microseconds.
The partition configuration determines how timestamps are converted for MaxCompute partitions. By default, the console uses fixed MaxCompute partition formats. The partition configurations are as follows:
| Partition | Time format | Description |
| --- | --- | --- |
| ds | %Y%m%d | Day |
| hh | %H | Hour |
| mm | %M | Minute |
The partition interval determines the time granularity used when converting timestamps into MaxCompute partitions. The range is 15 minutes to 1440 minutes (1 day), in steps of 15 minutes. The time zone parameter specifies the time zone used for the conversion when you partition MaxCompute based on timestamps.
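Under these rules, converting a microsecond timestamp into the default `ds`/`hh`/`mm` partition values can be sketched as follows. This is a simplified illustration, not connector code: it floors the timestamp to the interval and only handles intervals of up to one hour that divide 60 evenly.

```python
from datetime import datetime, timezone, timedelta

# Default console partition formats from the table above.
PARTITION_FORMATS = {"ds": "%Y%m%d", "hh": "%H", "mm": "%M"}

def partition_values(ts_us: int, interval_min: int, tz_hours: int) -> dict:
    """Floor a microsecond timestamp to the partition interval and format it.

    Simplified sketch: supports only intervals that divide 60 minutes evenly.
    """
    tz = timezone(timedelta(hours=tz_hours))
    dt = datetime.fromtimestamp(ts_us / 1_000_000, tz)  # default unit: microseconds
    floored = (dt.minute // interval_min) * interval_min
    dt = dt.replace(minute=floored, second=0, microsecond=0)
    return {name: dt.strftime(fmt) for name, fmt in PARTITION_FORMATS.items()}
```
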
When you sync BLOB data, you can specify a hexadecimal separator to split the data before it is synced to MaxCompute. For example, `0A` represents a line feed (`\n`). By default, DataHub BLOB topics store binary data, and the corresponding column in the MaxCompute sync task is of the STRING type. Therefore, when you create a sync task in the console, the data is Base64-encoded by default before synchronization. For more customization, you can use the SDK.
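The separator splitting and default Base64 encoding described above can be sketched as follows. The `prepare_blob` helper is hypothetical, shown only to illustrate the transformation, not the actual connector code:

```python
import base64

def prepare_blob(data: bytes, sep_hex: str = "0A") -> list:
    """Split binary BLOB data on a hex separator ("0A" = line feed),
    then Base64-encode each part for the STRING column, as the
    console does by default."""
    sep = bytes.fromhex(sep_hex)
    return [base64.b64encode(part).decode("ascii") for part in data.split(sep)]
```
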
Step 5: View the sync task
On the details page of the corresponding connector, you can view the sync task's running status and checkpoint information, and perform operations such as restart and stop.
For more information, see Create a MaxCompute sync task.