
DataHub: Synchronize data to OSS

Last Updated: Aug 25, 2021

Preparations

1. Create an Object Storage Service (OSS) bucket. DataHub allows you to synchronize data to OSS. Before you create a DataConnector to synchronize data to OSS, you must create an OSS bucket in the OSS console for receiving data.

2. Create the service-linked role for DataHub. When you create a DataConnector, you can authenticate access to an OSS bucket by using the AccessKey pair of an Alibaba Cloud account or a temporary access credential from Security Token Service (STS). If you use a temporary access credential from STS, the service-linked role for DataHub is automatically created. DataHub then uses the service-linked role to synchronize data to OSS.

3. Pay attention to the following points:

  1. You can synchronize data in topics of the TUPLE and BLOB types from DataHub to OSS.

  • TUPLE: Data in TUPLE topics is in the CSV format. Columns in each record are separated by commas (,). Records are separated by line feeds (\n).

  • BLOB: Data in BLOB topics is appended to existing data. If you want to split OSS data, you must use delimiters to separate the DataHub data to be synchronized.

  2. The name of an OSS file that stores synchronized data is generated from meaningful information such as the DataConnector ID. You cannot modify the names of such OSS files.

  3. A secondary directory is generated based on the time when data is written to DataHub. By default, the UTC+8 time zone is used. The sketch after these notes illustrates the CSV layout and the time-based directory name.

If you want to configure more DataConnector settings, use an SDK.
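
The following minimal Python sketch illustrates the points above: how TUPLE records are laid out as CSV in the synchronized file, and how a time-based secondary directory name is formed. The record values and the directory name format are hypothetical; DataHub generates the actual file and directory names, and you cannot modify them.

    # Hypothetical illustration only: DataHub generates the real file and
    # directory names; this only mirrors the documented layout.
    from datetime import datetime, timedelta, timezone

    # A TUPLE record becomes one CSV line: columns separated by commas (,),
    # records separated by line feeds (\n).
    records = [["1", "Alice", "2021-08-25 10:00:00"],
               ["2", "Bob", "2021-08-25 10:05:00"]]
    csv_payload = "\n".join(",".join(columns) for columns in records)

    # The secondary directory is derived from the time the data was written
    # to DataHub, in the UTC+8 time zone by default.
    write_time = datetime.now(timezone(timedelta(hours=8)))
    secondary_dir = write_time.strftime("%Y%m%d%H%M")  # e.g. 202108251000

    print(secondary_dir)
    print(csv_payload)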

Create a DataConnector

  1. In the left-side navigation pane of the DataHub console, click Project Manager. On the Project List page, find a project and click View in the Actions column. On the details page of the project, find a topic and click View in the Actions column.

  2. On the details page of the topic, click Connector in the upper-right corner. In the Create Connector panel, create a DataConnector as required.

oss_01

The following section describes some of the parameters that are used to create a DataConnector in the DataHub console. For more information about DataConnector configurations, see the SDK documentation. A minimal SDK sketch follows the parameter list.

  1. Endpoint: the endpoint of OSS. Use an HTTP-based classic network endpoint. HTTPS-based endpoints are not supported.

  2. Import Fields: the fields to be synchronized to the OSS bucket. You can synchronize all of the fields of the DataHub topic or only some of them, based on your business requirements.

  3. Directory Prefix: the name prefix of the top-level directory in the OSS bucket.

  4. Time Format: the format of the time when data is written to DataHub. The time is included in the name of the secondary directory that is generated in the top-level directory.

  5. Time Range: the interval at which data is synchronized to the secondary directory. Valid values: 15 to 1440, in minutes. The step size is 15.
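
For settings that the console does not expose, you can create the DataConnector with an SDK instead. The following is a minimal sketch based on the DataHub Python SDK (pydatahub). The project, topic, bucket, and credential values are placeholders, and the OssConnectorConfig parameter names shown here are assumptions that may differ between SDK versions, so check the SDK reference for your version.

    # Minimal sketch, assuming the pydatahub SDK; the OssConnectorConfig
    # parameters below are assumptions -- verify them against the SDK reference.
    from datahub import DataHub
    from datahub.models import ConnectorType, OssConnectorConfig

    dh = DataHub('<access_key_id>', '<access_key_secret>',
                 'https://dh-cn-hangzhou.aliyuncs.com')  # DataHub endpoint (example region)

    project_name = 'test_project'                  # placeholder project
    topic_name = 'test_topic'                      # placeholder TUPLE topic
    column_fields = ['id', 'name', 'gmt_create']   # fields to synchronize

    # OSS sink settings: HTTP-based classic-network endpoint, target bucket,
    # top-level directory prefix, time format for the secondary directory,
    # and the synchronization interval in minutes (15-1440, step 15).
    oss_config = OssConnectorConfig(
        endpoint='http://oss-cn-hangzhou-internal.aliyuncs.com',
        bucket='my-datahub-sink-bucket',
        prefix='datahub_data',
        time_format='%Y%m%d%H%M',
        time_range=15,
        auth_mode='ak',                            # AccessKey authentication
        access_id='<access_key_id>',
        access_key='<access_key_secret>')

    dh.create_connector(project_name, topic_name,
                        ConnectorType.SINK_OSS, column_fields, oss_config)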

Example

  1. In the OSS console, create an OSS bucket for receiving data. The following figure shows the details page of an OSS bucket.

oss_03
  2. Create a topic in the DataHub console. In this example, the created topic is of the TUPLE type. The following figure shows the Schema Details tab of the created topic.

    oss_04

  3. Create a DataConnector.

  4. Write data to the created TUPLE topic. The following figure shows the written data. A programmatic sketch of this step and the next is provided after this example.

  5. Check the name of the OSS file that stores the synchronized data, as shown in the following figure. The path of the OSS file contains the names of the OSS bucket, the top-level directory, and the secondary directory to which the file belongs, followed by the file name.

    oss_06

    Download the file and view the file content. The data that is synchronized from the TUPLE topic is in the CSV format, as shown in the following figure.
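
For steps 4 and 5, writing test data and checking the synchronized file can also be done programmatically. The sketch below assumes the pydatahub and oss2 SDKs, a topic whose fields are all of the STRING type, and placeholder project, topic, bucket, and prefix names; adapt it to your environment.

    # Sketch assuming the pydatahub and oss2 SDKs; project, topic, bucket,
    # and prefix names are placeholders, and the topic fields are assumed
    # to be of the STRING type.
    from datahub import DataHub
    from datahub.models import TupleRecord
    import oss2

    dh = DataHub('<access_key_id>', '<access_key_secret>',
                 'https://dh-cn-hangzhou.aliyuncs.com')
    project_name = 'test_project'
    topic_name = 'test_topic'

    # Write a few records to the TUPLE topic; the schema is read back from
    # the topic so that the values line up with the defined fields.
    topic = dh.get_topic(project_name, topic_name)
    records = []
    for i in range(3):
        record = TupleRecord(schema=topic.record_schema,
                             values=[str(i), 'name_%d' % i, '2021-08-25 10:00:00'])
        record.shard_id = '0'
        records.append(record)
    dh.put_records(project_name, topic_name, records)

    # After the synchronization interval has elapsed, list the objects under
    # the configured directory prefix and download one to inspect the CSV.
    auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
    bucket = oss2.Bucket(auth, 'http://oss-cn-hangzhou.aliyuncs.com',
                         'my-datahub-sink-bucket')
    for obj in oss2.ObjectIterator(bucket, prefix='datahub_data/'):
        print(obj.key)
        print(bucket.get_object(obj.key).read().decode('utf-8'))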