This topic describes how to load offline data from a single Object Storage Service (OSS) table into MaxCompute, providing best practices for data source configuration, network connectivity, and synchronization task configuration.
OSS overview
Alibaba Cloud Object Storage Service (OSS) is a secure, cost-effective, and highly reliable cloud storage service that offers massive storage capacity, 99.9999999999% (twelve 9s) data durability, and 99.995% service availability. OSS offers various storage classes to help you optimize costs. Data Integration allows you to sync data from OSS to other destinations and from other sources into OSS.
Get OSS bucket information
Navigate to the OSS console. In the Bucket list, find the bucket you want to use for data synchronization. On the bucket's overview page, get its Public endpoint and Internal endpoint. You can choose which endpoint to use based on your scenario.
- The Public endpoint is used for access over the internet. When you access OSS over the internet, inbound traffic (writes) is free, but outbound traffic (reads) is charged. For more information about OSS fees, see OSS Pricing and Billing Items.
- The internal network is the private communication network between Alibaba Cloud products in the same region. For example, a Data Integration resource group can access OSS in the same region over the internal network. Both inbound and outbound traffic over the internal network are free. If the OSS bucket you read from or write to is in the same region as your Data Integration resource group, use the internal endpoint; otherwise, use the public endpoint.
- For a list of regions and their endpoints, see Regions and Endpoints.
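The endpoint choice described above reduces to a simple rule. The following Python sketch is purely illustrative (the function name and sample endpoint strings are assumptions, not part of any SDK):

```python
def choose_oss_endpoint(bucket_region, resource_group_region,
                        internal_endpoint, public_endpoint):
    """Prefer the internal endpoint when regions match: internal traffic is free both ways."""
    if bucket_region == resource_group_region:
        return internal_endpoint
    return public_endpoint

# Bucket and resource group both in cn-hangzhou -> use the internal endpoint
print(choose_oss_endpoint(
    "cn-hangzhou", "cn-hangzhou",
    "oss-cn-hangzhou-internal.aliyuncs.com",
    "oss-cn-hangzhou.aliyuncs.com",
))
```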
Prerequisites
- You have purchased a Serverless resource group.
- You have created an OSS data source and a MaxCompute data source. For more information, see Configure a data source.
- You have established network connectivity between the resource group and the data sources. For more information, see Overview of network connectivity solutions.
Limitations
Syncing source data to MaxCompute external tables is not supported.
Procedure
This topic uses the UI of Data Studio (New) to demonstrate how to configure an offline synchronization task.
Step 1: Create a node and task
For general steps on how to create a node and use the Codeless UI, see Codeless UI configuration.
Step 2: Configure the source and destination
Configure the source (OSS)
In this scenario, the data source is an OSS file. Key configuration items are described below.
| Parameter | Description |
| --- | --- |
| File Type | Select the file type to sync. The Codeless UI supports reading files in |
| File Path | Enter the path to the source file. |
| Field Delimiter | Specify the column delimiter used in the file. |
| Encoding | Set the character encoding used to read the source file. |
| Null String | |
| Compression Format | The compression format of the source file. Supported formats are |
| Skip Header | For CSV-like files, you can choose whether to skip the header row. By default, the header is included. **Note**: Skipping the header is not supported for compressed files. |
| Table Data Structure | After you configure the data source parameters, click Confirm Data Structure to verify the data format. |
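To illustrate how the Field Delimiter, Skip Header, and Null String settings interact when reading a CSV-like source file, here is a minimal Python sketch. The sample data and the `\N` null marker are illustrative assumptions; your file may use different values:

```python
import csv
import io

# Illustrative source file content; "\N" stands in for NULL in the city column
raw = "id,name,city\n1,Alice,\\N\n2,Bob,Hangzhou\n"

reader = csv.reader(io.StringIO(raw), delimiter=",")   # Field Delimiter
rows = list(reader)
header, data = rows[0], rows[1:]                       # Skip Header: drop the first row

NULL_STRING = "\\N"                                    # Null String: treat this token as NULL
records = [[None if cell == NULL_STRING else cell for cell in row] for row in data]

print(records)  # [['1', 'Alice', None], ['2', 'Bob', 'Hangzhou']]
```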
Configure the destination (MaxCompute)
In this scenario, the destination for the offline data sync from OSS is a MaxCompute table. The key configuration items are described below.
You can keep the default values for any parameters not mentioned in the table below.
| Parameter | Description |
| --- | --- |
| Tunnel Resource Group | Specifies the MaxCompute data transfer resource (Tunnel Quota). By default, Public transmission resources are used, which correspond to the free MaxCompute quota. If your exclusive Tunnel Quota becomes unavailable due to overdue payments or expiration, the task automatically reverts to Public transmission resources. |
| Table | Select the MaxCompute table for data synchronization. If you are using a standard DataWorks workspace, ensure that a MaxCompute table with the same name and schema exists in both the development environment and the production environment. Alternatively, click Generate Destination Table Schema to automatically create a destination table, and then manually adjust the creation statement if needed. |
| Partition Information | If the destination is a partitioned table, specify the values of the partition columns. |
| Write Method | Choose whether to clear existing data in the destination table before writing or to append new data. |
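For a partitioned destination table, the partition value is typically derived from the scheduled business date (for example, the DataWorks `$bizdate` scheduling parameter). The following Python sketch shows the common `yyyymmdd` formatting convention; the function name is illustrative and your partition scheme may differ:

```python
from datetime import date

def partition_value(biz_date):
    # Common convention: pt=yyyymmdd, matching the $bizdate scheduling parameter
    return biz_date.strftime("%Y%m%d")

print(partition_value(date(2024, 5, 1)))  # 20240501
```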
Step 3: Configure field mapping
After configuring the source and destination, map the columns between them. You can choose to Map Fields with the Same Name, Map Fields in the Same Line, Delete All Mappings, or Edit Field Mappings.
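The two automatic mapping modes can be illustrated with a small Python sketch. The column names below are made up for illustration; the actual mapping is configured in the UI:

```python
source_cols = ["id", "name", "city"]
dest_cols = ["id", "city", "remark"]

# Map Fields with the Same Name: pair columns whose names match
by_name = [(c, c) for c in source_cols if c in dest_cols]

# Map Fields in the Same Line: pair columns by position, regardless of name
by_line = list(zip(source_cols, dest_cols))

print(by_name)  # [('id', 'id'), ('city', 'city')]
print(by_line)  # [('id', 'id'), ('name', 'city'), ('city', 'remark')]
```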
Step 4: Configure advanced settings
You can configure advanced settings for the task, such as Expected Maximum Concurrency and Policy for Dirty Data Records. For this tutorial, set the Policy for Dirty Data Records to Disallow Dirty Data Records and use the default values for all other settings. For more information, see Codeless UI configuration.
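The semantics of the dirty-data policy can be sketched as follows. This is an illustrative model of the behavior, not Data Integration's actual implementation; a dirty-record limit of 0 corresponds to Disallow Dirty Data Records:

```python
def write_rows(rows, convert, dirty_limit=0):
    """Illustrative dirty-data policy: fail once dirty records exceed the limit."""
    dirty = 0
    written = []
    for row in rows:
        try:
            written.append(convert(row))     # a dirty record fails conversion
        except ValueError:
            dirty += 1
            if dirty > dirty_limit:
                raise RuntimeError("dirty data limit exceeded, task fails")
    return written

# With dirty_limit=0 (Disallow Dirty Data Records), one bad row fails the task;
# with a higher limit, bad rows are skipped up to that limit.
print(write_rows(["1", "2", "3"], int))  # [1, 2, 3]
```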
Step 5: Configure and run the debug task
- In the top-right corner of the editor, click Run Configuration, set the Resource Group and Script Parameters for the debug run, and then click Run to test the task.
- In the left-side navigation pane, click the icon, then click the new icon next to Personal Directory to create a SQL file. Run the following SQL statement to query the destination table and verify that the data meets your expectations.
  Note
  - To run this query, you must first bind the destination MaxCompute project as a compute engine for DataWorks.
  - In the .sql file editor, click Run Configuration on the right side. Specify the data source Type, Computing Resources, and Resource Group, and then click Run in the top toolbar.

  SELECT * FROM <your_maxcompute_table_name> WHERE pt=<your_partition> LIMIT 20;
Step 6: Configure scheduling and publish the task
In the right-side pane, click Scheduling Settings to set scheduling parameters for periodic runs. Then, click Publish and follow the prompts to publish the task.