This topic uses an offline sync from a single OSS file to a MaxCompute table as an example. It describes best practices for data source configuration, network connectivity, and sync task configuration.
Background information
Alibaba Cloud Object Storage Service (OSS) is a cloud storage service that provides massive storage capacity, high security, low cost, and high reliability. It guarantees 99.9999999999% (twelve 9s) data durability and 99.995% data availability. OSS offers multiple storage classes to help you optimize storage costs. Data Integration lets you sync data from OSS to other destinations and from other sources to OSS. This topic uses an offline sync from OSS to MaxCompute as an example to describe the entire process.
Get OSS bucket information
Go to the OSS console. In the bucket list, find the OSS bucket that you want to use for data synchronization. In the overview section of the bucket information page, obtain the Access Over Internet endpoint and the Access from ECS over the VPC (internal network) endpoint. Which endpoint you configure depends on your scenario.
The public endpoint provides access over the Internet. Inbound traffic (writes) to OSS through the public endpoint is free, but outbound traffic (reads) is charged. For more information about OSS fees, see Billing items.
The internal endpoint provides access over the Alibaba Cloud internal network between products in the same region. For example, you can use a Data Integration resource group to access the OSS service in the same region. Both inbound and outbound traffic over the internal network are free. If you read data from or write data to an OSS bucket that is in the same region as the Data Integration resource group, configure the internal endpoint. Otherwise, configure the public endpoint.
For information about region and endpoint mappings, see OSS regions and endpoints.
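The endpoint rule above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and the host pattern `oss-<region>[-internal].aliyuncs.com` is an assumption based on the common endpoint format, so always confirm the actual values on the bucket overview page.

```python
def choose_oss_endpoint(bucket_region: str, resource_group_region: str) -> str:
    """Pick the OSS endpoint host for a data source (illustrative only).

    Assumes the common host pattern oss-<region>[-internal].aliyuncs.com.
    """
    if bucket_region == resource_group_region:
        # Same region: use the internal endpoint, where traffic is free
        # in both directions.
        return f"oss-{bucket_region}-internal.aliyuncs.com"
    # Different region: use the public endpoint (outbound reads are billed).
    return f"oss-{bucket_region}.aliyuncs.com"


print(choose_oss_endpoint("cn-hangzhou", "cn-hangzhou"))
print(choose_oss_endpoint("cn-hangzhou", "cn-shanghai"))
```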
Prerequisites
You have purchased a Serverless resource group.
You have created an OSS data source and a MaxCompute data source. For more information, see Data Source Configuration.
You have established a network connection between the resource group and the data sources. For more information, see Overview of network connection solutions.
Limits
Syncing source data to MaxCompute foreign tables is not supported.
Procedure
This topic uses the Data Studio (New) interface as an example to demonstrate how to configure an offline sync task.
1. Create a node and configure the task
For the general steps to create a node and use the codeless UI, see the Codeless UI configuration guide.
2. Configure the data source and destination
Configure the data source (OSS)
In this example, the data source is an OSS file. The following table describes the key configuration items.
Configuration item | Configuration details |
Text Type | Select the type of file that you want to sync. The codeless UI supports reading files in |
File Path | Enter the path of the file that you want to sync. |
Column Delimiter | Specify the column delimiter used in the source file. |
Encoding | Set the encoding format used to read the source file. |
Null Value | |
Compression Format | The compression format of the source file. Supported formats are |
Skip Header | CSV-like files may have a header row that acts as a title. You can choose whether to skip it. By default, it is not skipped. Note: Skipping the header is not supported for compressed files. |
Table Data Structure | After you configure the parameters for the data source, click Confirm Table Schema to check if the data format meets your expectations. |
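To get a feel for how the reader settings interact before you click Confirm Table Schema, you can preview a sample of the file locally. The following is a minimal sketch, not the Data Integration implementation. The function name and the sample data are hypothetical, and it assumes that Null Value means "the string in the file that should be read as null".

```python
import csv
import io


def read_rows(raw: bytes, column_delimiter: str = ",", encoding: str = "utf-8",
              null_value: str = "", skip_header: bool = False):
    """Parse delimited text locally using settings similar to the table above."""
    text = raw.decode(encoding)  # Encoding
    reader = csv.reader(io.StringIO(text), delimiter=column_delimiter)
    rows = list(reader)
    if skip_header and rows:  # Skip Header
        rows = rows[1:]
    # Null Value (assumed meaning): cells equal to the marker become None.
    return [[None if cell == null_value else cell for cell in row]
            for row in rows]


# Hypothetical sample: pipe-delimited, with \N as the null marker.
sample = "id|name|city\n1|Ann|\\N\n2|Bob|Hangzhou\n".encode("utf-8")
rows = read_rows(sample, column_delimiter="|", null_value="\\N",
                 skip_header=True)
print(rows)  # [['1', 'Ann', None], ['2', 'Bob', 'Hangzhou']]
```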
Configure the data destination (MaxCompute)
In this example, the data destination is a MaxCompute table. The following table describes the key configuration items.
You can use the default values for parameters that are not described in the following table.
Configuration item | Configuration details |
Tunnel Resource Group | The MaxCompute Data Transmission Service resource, Tunnel Quota. The default selection is "Public transmission resources", which is the free quota for MaxCompute. If your exclusive Tunnel Quota becomes unavailable due to overdue payments or expiration, the running task will automatically switch to "Public transmission resources". |
Table | Select the MaxCompute table to which you want to sync data. If you are using a standard DataWorks workspace, make sure that a MaxCompute table with the same name and schema exists in both the development and production environments of MaxCompute. You can also click Generate Destination Table Schema. The system will automatically create a table to receive the data. You can manually adjust the table creation statement. Note If:
|
Partition | If the table is a partitioned table, you can enter a value for the partition key column. |
Write Mode | When writing to the destination table, you can choose to clear existing data or keep it. |
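The practical difference between the two write modes shows up when the same task instance is rerun. The following sketch simulates a partition as an in-memory list. It is an assumption-laden illustration of the behavior described above, not how MaxCompute stores data, and all names in it are hypothetical.

```python
def write_partition(table: dict, partition: str, new_rows: list,
                    clear_existing: bool) -> None:
    """Write rows into a simulated partition, clearing or keeping old data."""
    kept = [] if clear_existing else table.get(partition, [])
    table[partition] = kept + list(new_rows)


batch = [("1", "Ann"), ("2", "Bob")]
overwrite_table = {}
append_table = {}

for _ in range(2):  # simulate the same task instance running twice
    write_partition(overwrite_table, "pt=20240101", batch, clear_existing=True)
    write_partition(append_table, "pt=20240101", batch, clear_existing=False)

print(len(overwrite_table["pt=20240101"]))  # 2: a rerun gives the same result
print(len(append_table["pt=20240101"]))     # 4: a rerun duplicates the rows
```

Under these assumptions, clearing existing data makes reruns idempotent, while keeping existing data suits cases where each run contributes new rows.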
3. Configure field mapping
After you select the data source and destination, you must configure the field mapping between the source and the destination. You can click Map Fields with Same Name, Map Fields in Same Line, Clear Mappings, or Manually Edit Mapping.
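The two automatic mapping options differ in what they match on. The sketch below assumes that Map Fields with Same Name pairs columns whose names are identical and that Map Fields in Same Line pairs columns by position. The function names and column lists are hypothetical.

```python
def map_same_name(source_cols, dest_cols):
    """Pair each source column with the destination column of the same name."""
    dest = set(dest_cols)
    return [(col, col) for col in source_cols if col in dest]


def map_same_line(source_cols, dest_cols):
    """Pair columns by position; unmatched trailing columns are left out."""
    return list(zip(source_cols, dest_cols))


source = ["id", "name", "city"]
destination = ["id", "city", "age"]

print(map_same_name(source, destination))  # [('id', 'id'), ('city', 'city')]
print(map_same_line(source, destination))
# [('id', 'id'), ('name', 'city'), ('city', 'age')]
```

The second result shows why positional mapping should be reviewed manually when the column order differs between the file and the table.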
4. Configure channel control
Offline sync tasks support settings such as Maximum Expected Concurrency and Policy for Dirty Data Records. In this example, Policy for Dirty Data Records is set to Do not tolerate dirty data, and the other settings use their default values. For more information, see Codeless UI configuration.
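A dirty data policy can be thought of as a threshold on the number of records that fail to convert or write. The following sketch assumes that "Do not tolerate dirty data" corresponds to a threshold of zero, so the task fails on the first bad record. The class and function names are hypothetical, and the actual behavior of Data Integration may differ in detail.

```python
class DirtyDataError(Exception):
    """Raised when dirty records exceed the allowed threshold."""


def sync_rows(rows, convert, max_dirty_records=0):
    """Convert rows one by one and stop once too many records are dirty."""
    written, dirty = [], 0
    for row in rows:
        try:
            written.append(convert(row))
        except (ValueError, TypeError):
            dirty += 1
            if dirty > max_dirty_records:
                raise DirtyDataError(
                    f"{dirty} dirty record(s), limit is {max_dirty_records}")
    return written, dirty


def to_typed(row):
    # The destination expects an integer id and a string name.
    return int(row[0]), row[1]


rows = [["1", "Ann"], ["x", "Bob"], ["3", "Cy"]]
print(sync_rows(rows, to_typed, max_dirty_records=1))
# ([(1, 'Ann'), (3, 'Cy')], 1)
```

With `max_dirty_records=0`, the same input raises `DirtyDataError` at the second row, which mirrors the setting used in this example.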
5. Configure and run a debug task
Click Debugging Configurations on the right side of the offline sync node's edit page. Set the Resource Group and Script Parameters for the debug run. Then, click Run in the top toolbar to test if the sync channel runs successfully.
In the navigation pane on the left, click the icon. Then, click the icon to the right of Personal Directory and create a file with the .sql extension. Execute the following SQL query to check whether the data in the destination table is as expected. Note: This query method requires you to bind the destination MaxCompute project as a computing resource for DataWorks.
On the .sql file's edit page, click Debugging Configurations on the right. Specify the Type, Computing Resource, and Resource Group. Then, click Run in the top toolbar.
SELECT * FROM <destination_table_name_in_MaxCompute> WHERE pt=<specified_partition> LIMIT 20;
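If you run this check often, you can generate the query text from the table name and partition value. The helper below is a hypothetical convenience, not part of DataWorks. It assumes that the partition column is named `pt`, as in the query above, and that its value is a string. The table name `ods_oss_demo` is made up for illustration.

```python
import re


def build_check_query(table: str, partition_value: str, limit: int = 20) -> str:
    """Build the verification query for one partition of the destination table."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if "'" in partition_value or "\\" in partition_value:
        # Reject values that would need escaping instead of guessing the rules.
        raise ValueError(f"unexpected partition value: {partition_value!r}")
    return f"SELECT * FROM {table} WHERE pt='{partition_value}' LIMIT {int(limit)};"


print(build_check_query("ods_oss_demo", "20240101"))
# SELECT * FROM ods_oss_demo WHERE pt='20240101' LIMIT 20;
```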
6. Configure scheduling and publish the task
Click Scheduling on the right side of the offline sync task. Configure the scheduling parameters for periodic runs. Then, click Publish in the top toolbar to open the publishing panel. Follow the on-screen instructions to publish the task.