DataHub: Get started with DataHub

Last Updated: Aug 16, 2021

Step 1: Activate DataHub

  1. Log on to the DataHub console.

  2. Activate DataHub as prompted.

Step 2: Create a project and a topic

  1. Log on to the DataHub console.

  2. On the Project List page, click Create Project in the upper-right corner and set the parameters as required to create a project.

| Parameter | Description |
| --- | --- |
| Name | The name of the project. A project is an organizational unit in DataHub and contains one or more topics. DataHub projects are independent from MaxCompute projects. Projects that you created in MaxCompute cannot be used in DataHub. |
| Description | The description of the project. |

  3. On the details page of a project, click Create Topic in the upper-right corner and set the parameters as required to create a topic.

| Parameter | Description |
| --- | --- |
| Creation Type | The method that is used to create the topic. |
| Name | The name of the topic. |
| Type | The type of the data in the topic. TUPLE indicates structured data. BLOB indicates unstructured data. |
| Schema Details | The details of the schema. The Schema Details parameter is displayed if you set the Type parameter to TUPLE. You can create fields based on your business requirements. If you select Allow Null for a field, the field is set to NULL if the field does not exist in the upstream. If you clear Allow Null for a field, the field configuration is strictly verified. An error is returned if the type specified for the field is invalid. |
| Number of Shards | The number of shards in the topic. Shards ensure the concurrent data transmission of a topic. Each shard has a unique ID. A shard may be in one of the following states: Opening (the shard is starting) or Active (the shard is started and available). Each available shard consumes resources on the server. We recommend that you create shards as needed. |
| Lifecycle | The maximum period during which data written to the topic can be stored in DataHub, in days. Minimum value: 1. Maximum value: 7. To modify the time-to-live (TTL) period of a topic, call the updateTopic method by using DataHub SDK for Java. For more information, see DataHub SDK for Java. |
| Description | The description of the topic. |
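
The Allow Null behavior described for Schema Details can be illustrated with a short Python sketch. This is not DataHub code: the `Field` type and the `apply_schema` function are hypothetical names, Python types stand in for DataHub field types, and the assumption that a type mismatch is rejected for every field is an interpretation of the description above.

```python
from typing import Any, Dict, List, NamedTuple, Optional


class Field(NamedTuple):
    """Hypothetical stand-in for one field of a TUPLE topic schema."""
    name: str
    type: type        # Python type used in place of a DataHub field type
    allow_null: bool  # mirrors the Allow Null check box


def apply_schema(record: Dict[str, Any], schema: List[Field]) -> Dict[str, Optional[Any]]:
    """Return a row that follows the schema, or raise ValueError."""
    row: Dict[str, Optional[Any]] = {}
    for field in schema:
        value = record.get(field.name)
        if value is None:
            # Field missing upstream: NULL if allowed, otherwise an error.
            if not field.allow_null:
                raise ValueError(f"field {field.name!r} is required")
            row[field.name] = None
            continue
        # Assumption: a value of the wrong type is always rejected.
        if not isinstance(value, field.type):
            raise ValueError(f"field {field.name!r} expects {field.type.__name__}")
        row[field.name] = value
    return row
```

With a schema of `[Field("id", int, False), Field("note", str, True)]`, the record `{"id": 1}` is accepted and `note` is set to `None`, whereas a record without `id` is rejected.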

Step 3: Write data to the created topic

DataHub provides multiple methods for you to write data. You can use plug-ins such as Apache Flume to write logs. If you want to write data stored in databases, you can use Data Transmission Service (DTS), Canal, or an SDK. In this example, the console command-line tool is used to write data by uploading a file.

  1. Download and decompress the installation package of the console command-line tool, and then specify an AccessKey pair and an endpoint as required. For more information, see Console command-line tool.

  2. Run the following command to upload a file:

    uf -f /temp/test.csv -p test_topic -t test_topic -m "," -n 1000

  3. Sample data to assess data quality.

    1. Select a shard, such as Shard 0. In the Sample: 0 panel, set the number of data entries to be sampled and the start time for sampling.

    2. Click Sample. The sampled data is displayed.
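
If you do not have a file to upload in step 2, the following Python sketch shows one way to produce a comma-delimited test file. The `write_sample_csv` function and the row contents are hypothetical. The columns must match the schema of your own topic, and whether the command-line tool expects a header row is not covered here.

```python
import csv
from pathlib import Path
from typing import Iterable, Sequence


def write_sample_csv(path: str, rows: Iterable[Sequence[object]], delimiter: str = ",") -> int:
    """Write rows to a delimited text file and return the number of rows written."""
    target = Path(path)
    # Create the parent directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

For example, `write_sample_csv("/temp/test.csv", [(1, "alice"), (2, "bob")])` writes two lines to the path used in the upload command and returns 2.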

Step 4: Synchronize data

Synchronize data to MaxCompute.

  1. In the left-side navigation pane of the DataHub console, click Project Manager. On the Project List page, find a project and click View in the Actions column. On the details page of the project, find a topic and click View in the Actions column.

  2. On the details page of the topic, click Connector in the upper-right corner. In the Create Connector panel, create a DataConnector as required.

  3. Click MaxCompute. The following parameters are displayed.

Description of some parameters:

This section describes some of the parameters that are used to create a DataConnector in the console. To create a DataConnector in a more flexible manner, use an SDK.

  1. Import Fields

    You can specify the columns to be synchronized to the destination MaxCompute table.

  2. Partition Mode

    The partition mode determines to which partition in MaxCompute data is written. The following table describes the partition modes supported by DataHub.

| Partition mode | Partition basis | Supported data type of a topic | Description |
| --- | --- | --- | --- |
| USER_DEFINE | Based on the values in the partition key column in the records. The name of the partition key column must be the same as that of the partition field in MaxCompute. | TUPLE | 1. The schema of the topic must contain the partition field in MaxCompute. 2. The column values must be strings encoded in UTF-8 and cannot be NULL. |
| SYSTEM_TIME | Based on the timestamps when the records are written to DataHub. | TUPLE and BLOB | 1. You must set the Partition Config parameter to specify one or more formats to which timestamps are converted for time-based partitioning in MaxCompute. 2. You must set the Timezone parameter to specify a time zone. |
| EVENT_TIME | Based on the values in the event_time(TIMESTAMP) column in the records. | TUPLE | 1. You must set the Partition Config parameter to specify one or more formats to which timestamps are converted for time-based partitioning in MaxCompute. 2. You must set the Timezone parameter to specify a time zone. |
| META_TIME | Based on the values in the __dh_meta_time__ property column in the records. | TUPLE and BLOB | 1. You must set the Partition Config parameter to specify one or more formats to which timestamps are converted for time-based partitioning in MaxCompute. 2. You must set the Timezone parameter to specify a time zone. |

In SYSTEM_TIME, EVENT_TIME, or META_TIME mode, data is synchronized to different partitions in the destination MaxCompute table based on the timestamps and the specified time zone. By default, the timestamps are in microseconds.
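
The way each partition mode picks its timestamp can be summarized in a short Python sketch. This is an illustration of the table above, not connector code: the `partition_timestamp_us` function is hypothetical, and modeling a record as a `columns` mapping plus a `properties` mapping is an assumption made for the example.

```python
from typing import Any, Mapping, Optional


def partition_timestamp_us(
    mode: str,
    columns: Mapping[str, Any],
    properties: Mapping[str, Any],
    write_time_us: int,
) -> Optional[int]:
    """Return the timestamp (in microseconds) that drives partitioning.

    Returns None for USER_DEFINE, where the partition comes from the
    partition key column rather than from a timestamp.
    """
    if mode == "USER_DEFINE":
        return None
    if mode == "SYSTEM_TIME":
        # Time at which the record was written to DataHub.
        return write_time_us
    if mode == "EVENT_TIME":
        # Value of the event_time column of the record.
        return int(columns["event_time"])
    if mode == "META_TIME":
        # Value of the __dh_meta_time__ property of the record.
        return int(properties["__dh_meta_time__"])
    raise ValueError(f"unknown partition mode: {mode}")
```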

  1. The Partition Config parameter specifies the configurations that are used to convert timestamps to implement time-based partitioning in the destination MaxCompute table. The following table describes the default MaxCompute time formats that are supported in the DataHub console.

| Partition type | Time format | Description |
| --- | --- | --- |
| ds | %Y%m%d | Day |
| hh | %H | Hour |
| mm | %M | Minute |
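
To see how a microsecond timestamp maps to partition values with these formats, consider the following Python sketch. It only illustrates the conversion. The `partition_values` function is hypothetical, and representing the time zone as a fixed UTC offset in hours is a simplification made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Sequence

# Default time formats from the table above.
PARTITION_FORMATS = {"ds": "%Y%m%d", "hh": "%H", "mm": "%M"}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_values(
    timestamp_us: int,
    utc_offset_hours: float = 0,
    keys: Sequence[str] = ("ds", "hh", "mm"),
) -> Dict[str, str]:
    """Convert a microsecond timestamp into partition values."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    # Integer arithmetic avoids float rounding on microsecond values.
    moment = (EPOCH + timedelta(microseconds=timestamp_us)).astimezone(zone)
    return {key: moment.strftime(PARTITION_FORMATS[key]) for key in keys}


if __name__ == "__main__":
    # 2021-08-16 00:30:00 UTC, expressed in microseconds.
    ts = int(datetime(2021, 8, 16, 0, 30, tzinfo=timezone.utc).timestamp()) * 1_000_000
    print(partition_values(ts, utc_offset_hours=8))
    # prints {'ds': '20210816', 'hh': '08', 'mm': '30'}
```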

  1. The Time Range parameter specifies the intervals at which partitions are generated in the destination MaxCompute table. Valid values: 15 to 1440, in minutes. The step size is 15.

  2. The Timezone parameter specifies the time zone used to implement time-based partitioning.

  3. If you synchronize data of the BLOB type to MaxCompute, you can use hexadecimal delimiters to split the data before synchronization. For example, you can set the Split Key parameter to 0A, which indicates line feeds (\n).

  4. By default, topics whose data type is BLOB store binary data. However, such data is mapped to columns of the STRING type in MaxCompute. Therefore, Base64 encoding is automatically enabled when you create a DataConnector in the DataHub console. If you want to customize your DataConnectors, use an SDK.
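
The effect of a hexadecimal Split Key followed by Base64 encoding can be sketched in Python as follows. The `split_blob` and `to_string_column` functions are hypothetical names, and dropping empty pieces after the split is an assumption of this sketch, not documented connector behavior.

```python
import base64
from typing import List


def split_blob(payload: bytes, split_key_hex: str) -> List[bytes]:
    """Split binary data on a delimiter given as a hexadecimal string such as '0A'."""
    delimiter = bytes.fromhex(split_key_hex)
    if not delimiter:
        raise ValueError("split key must not be empty")
    # Assumption: empty pieces (for example after a trailing delimiter) are dropped.
    return [part for part in payload.split(delimiter) if part]


def to_string_column(part: bytes) -> str:
    """Base64-encode binary data so that it fits a STRING column."""
    return base64.b64encode(part).decode("ascii")


if __name__ == "__main__":
    for piece in split_blob(b"a\nb\n", "0A"):
        print(to_string_column(piece))
    # prints YQ== and then Yg==
```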

Step 5: View the DataConnector

For more information, see Synchronize data to MaxCompute.