DataWorks: Offline sync of a single table from OSS to MaxCompute

Last Updated: Nov 27, 2025

This topic uses an offline sync from a single OSS table to a MaxCompute table as an example. It describes best practices for data source configuration, network connectivity, and sync task configuration.

Background information

Alibaba Cloud Object Storage Service (OSS) is a cloud storage service that provides massive storage capacity, high security, low cost, and high reliability. It guarantees 99.9999999999% (twelve 9s) data durability and 99.995% data availability. OSS offers multiple storage classes to help you optimize storage costs. Data Integration lets you sync data from OSS to other destinations and from other sources to OSS. This topic uses an offline sync from OSS to MaxCompute as an example to describe the entire process.

Get OSS bucket information

Go to the OSS console. In the bucket list, find the OSS bucket that you want to use for data synchronization. In the overview section of the bucket information page, note the endpoints listed under Access Over Internet (the public endpoint) and Access from ECS over the VPC (the internal endpoint). Choose the endpoint that fits your scenario:

  • The public endpoint provides access over the Internet. Inbound traffic (writes) to OSS through the public endpoint is free, but outbound traffic (reads) is charged. For more information about OSS fees, see Billing items.

  • The internal endpoint provides access over the Alibaba Cloud internal network between products in the same region. For example, you can use a Data Integration resource group to access the OSS service in the same region. Both inbound and outbound traffic over the internal network are free. If you read data from or write data to an OSS bucket that is in the same region as the Data Integration resource group, configure the internal endpoint. Otherwise, configure the public endpoint.

  • For information about region and endpoint mappings, see OSS regions and endpoints.
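
The endpoint rule above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and the host name pattern (`oss-<region>.aliyuncs.com` and `oss-<region>-internal.aliyuncs.com`) is assumed from the common OSS naming convention, so confirm the exact values for your bucket in OSS regions and endpoints.

```python
def choose_oss_endpoint(bucket_region: str, resource_group_region: str) -> str:
    """Pick the internal endpoint when both sides are in the same region.

    The host name pattern is an assumption for illustration; always confirm
    the real endpoint on the bucket overview page.
    """
    if bucket_region == resource_group_region:
        # Same region: internal network, no traffic fees.
        return f"oss-{bucket_region}-internal.aliyuncs.com"
    # Different regions: fall back to the public endpoint.
    return f"oss-{bucket_region}.aliyuncs.com"


print(choose_oss_endpoint("cn-hangzhou", "cn-hangzhou"))
print(choose_oss_endpoint("cn-hangzhou", "cn-shanghai"))
```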

Prerequisites

Limits

Syncing source data to MaxCompute foreign tables is not supported.

Procedure

Note

This topic uses the Data Studio (New) interface as an example to demonstrate how to configure an offline sync task.

1. Create a node and configure the task

For the general steps to create a node and use the codeless UI, see the Codeless UI configuration guide.

2. Configure the data source and destination

Configure the data source (OSS)

In this example, the data source is an OSS file. The following table describes the key configuration items.

Configuration item

Configuration details

Text Type

Select the type of file that you want to sync. The codeless UI supports reading files in csv, text, orc, and parquet formats.

File Path

Enter the path of the file that you want to sync.

  • When you specify a single OSS object, OSS Reader can only use a single thread to extract data.

  • When you specify multiple OSS objects, OSS Reader can use multiple threads to extract data. You can configure the number of concurrent threads as needed.

  • When you specify a wildcard character, OSS Reader traverses the objects that match the pattern. For example, abc*[0-9] matches abc0, abc1, abc2, abc3, and so on, and abc?.txt matches objects whose names start with abc, end with .txt, and have exactly one arbitrary character in between.

Column Delimiter

Specify the column delimiter used in the source file.

Encoding

Set the encoding format used to read the source file.

Null Value

  • If you select "Do not process", the values read from the source remain unchanged.

  • If you select "Visible characters", enter the string that represents a null value. If you leave it empty, it represents an empty string.

  • If you select "Invisible characters", enter a Unicode code, such as \u001b or \u007c, or an escape character, such as \t. You cannot leave it empty.

Compression Format

The compression format of the source file. Supported formats are Gzip, Bzip2, Zip, and no compression.

Skip Header

Files in CSV-like formats may start with a header row. Specify whether to skip it. By default, the header row is not skipped.

Note

Skipping the header is not supported for compressed files.

Table Data Structure

After you configure the parameters for the data source, click Confirm Table Schema to check if the data format meets your expectations.
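
To preview locally how the File Path wildcards, Column Delimiter, Null Value, and Skip Header settings described above affect the data, a rough sketch follows. It uses Python's `fnmatch` and `csv` modules as stand-ins, so it may not match OSS Reader's behavior in every edge case. The function names, the sample object names, and the `\N` null token are assumptions made for illustration.

```python
import csv
import io
from fnmatch import fnmatchcase


def match_objects(keys, pattern):
    """Return the object names that match a shell-style wildcard pattern."""
    return [key for key in keys if fnmatchcase(key, pattern)]


def parse_rows(text, delimiter=",", skip_header=False, null_token=None):
    """Parse delimited text and map fields equal to null_token to None."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if skip_header and rows:
        rows = rows[1:]  # drop the header row
    return [
        [None if null_token is not None and field == null_token else field
         for field in row]
        for row in rows
    ]


# Hypothetical object names in a bucket.
keys = ["abc0", "abc12", "abcd", "abc1.txt", "abc12.txt"]
print(match_objects(keys, "abc*[0-9]"))  # names that start with abc and end in a digit
print(match_objects(keys, "abc?.txt"))   # exactly one character between abc and .txt

# Hypothetical pipe-delimited file content with a header row.
sample = "id|name\n1|alice\n2|\\N\n"
print(parse_rows(sample, delimiter="|", skip_header=True, null_token="\\N"))
```

Running a check like this on a small sample before you click Confirm Table Schema can make unexpected delimiters or null markers easier to spot.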

Configure the data destination (MaxCompute)

In this example, the data destination is a MaxCompute table. The following table describes the key configuration items.

Note

You can use the default values for parameters that are not described in the following table.

Configuration item

Configuration details

Tunnel Resource Group

The Tunnel quota, which is the MaxCompute Data Transmission Service resource that the task uses. The default selection is "Public transmission resources", the free quota that MaxCompute provides. If your exclusive Tunnel quota becomes unavailable because of an overdue payment or expiration, running tasks automatically switch to "Public transmission resources".

Table

Select the MaxCompute table to which you want to sync data. If you are using a standard DataWorks workspace, make sure that a MaxCompute table with the same name and schema exists in both the development and production environments of MaxCompute.

You can also click Generate Destination Table Schema. The system will automatically create a table to receive the data. You can manually adjust the table creation statement.

Note

Be aware of the following cases:

  • If the MaxCompute table does not exist in the development environment, you cannot find it in the destination table drop-down list when configuring the offline sync node.

  • If the MaxCompute table does not exist in the production environment, the sync task will fail after it is submitted and published because it cannot find the destination table at runtime.

  • If the table schemas in the development and production environments are inconsistent, the actual column mapping at runtime may differ from the mapping configured for the offline sync node. This can lead to incorrect data being written.

Partition

If the table is a partitioned table, you can enter a value for the partition key column.

  • The value can be a static value, such as ds=20220101.

  • The value can be a scheduling system parameter, such as ds=${bizdate}. The system parameter is automatically replaced with its value when the task runs.

Write Mode

When writing to the destination table, you can choose to clear existing data or keep it.
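
As an illustration of the Partition setting above, the following sketch shows how a value such as ds=${bizdate} could resolve at run time. It assumes the default meaning of ${bizdate}, which is the day before the scheduled run date in yyyymmdd format. The function is hypothetical and only mimics the substitution that the scheduling system performs.

```python
from datetime import date, timedelta


def resolve_partition(template: str, run_date: date) -> str:
    """Replace ${bizdate} with the day before run_date, formatted as yyyymmdd.

    Assumption: bizdate is the business date, one day before the scheduled
    run date. Static values without the placeholder are returned unchanged.
    """
    bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")
    return template.replace("${bizdate}", bizdate)


print(resolve_partition("ds=${bizdate}", date(2022, 1, 2)))
print(resolve_partition("ds=20220101", date(2022, 6, 1)))
```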

3. Configure field mapping

After you select the data source and destination, you must configure the field mapping between the source and the destination. You can click Map Fields with Same Name, Map Fields in Same Line, Clear Mappings, or Manually Edit Mapping.
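
The difference between mapping by name and mapping by position can be sketched as follows. The function names and sample columns are hypothetical, and the codeless UI may apply additional rules (for example, around letter case) that this sketch does not model.

```python
def map_by_name(source_columns, destination_columns):
    """Pair each source column with the destination column of the same name."""
    destination = set(destination_columns)
    return [(col, col) for col in source_columns if col in destination]


def map_by_position(source_columns, destination_columns):
    """Pair columns that sit on the same line, ignoring their names."""
    return list(zip(source_columns, destination_columns))


source = ["id", "name", "age"]
destination = ["name", "id", "city"]
print(map_by_name(source, destination))
print(map_by_position(source, destination))
```

With these sample columns, mapping by position pairs id with name, which shows why the choice of mapping method matters when the column order differs between source and destination.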

4. Configure channel control

Offline sync tasks support settings such as Maximum Expected Concurrency and Policy for Dirty Data Records. In this example, Policy for Dirty Data Records is set to Do not tolerate dirty data, and the other settings use their default values. For more information, see Codeless UI configuration.
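
To make the effect of a dirty data policy concrete, here is a simplified model. It assumes that a dirty record is one that cannot be converted to the destination type and that the policy is a maximum number of tolerated dirty records, with 0 standing for "Do not tolerate dirty data". The names are hypothetical and this is not how Data Integration is implemented internally.

```python
class DirtyDataError(Exception):
    """Raised when the number of dirty records exceeds the allowed limit."""


def write_with_policy(records, convert, max_dirty=0):
    """Convert records one by one and fail once dirty records exceed max_dirty."""
    written, dirty = [], 0
    for record in records:
        try:
            written.append(convert(record))
        except (ValueError, TypeError):
            dirty += 1
            if dirty > max_dirty:
                raise DirtyDataError(
                    f"{dirty} dirty record(s) exceed the limit of {max_dirty}"
                )
    return written, dirty


# Tolerating one dirty record lets the run finish and reports the count.
print(write_with_policy(["1", "x", "3"], int, max_dirty=1))

# With zero tolerance, the first dirty record stops the run.
try:
    write_with_policy(["1", "x", "3"], int, max_dirty=0)
except DirtyDataError as err:
    print("task failed:", err)
```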

5. Configure and run a debug task

  1. Click Debugging Configurations on the right side of the offline sync node's edit page. Set the Resource Group and Script Parameters for the debug run. Then, click Run in the top toolbar to test if the sync channel runs successfully.

  2. In the navigation pane on the left, click the image icon. Then, click the image icon to the right of Personal Directory and create a file with the .sql extension. Execute the following SQL query to check whether the data in the destination table is as expected.

    SELECT * FROM <destination_table_name_in_MaxCompute> WHERE pt=<specified_partition> LIMIT 20;

6. Configure scheduling and publish the task

Click Scheduling on the right side of the offline sync task. Configure the scheduling parameters for periodic runs. Then, click Publish in the top toolbar to open the publishing panel. Follow the on-screen instructions to publish the task.