
DataWorks:Offline synchronization of a single table from OSS to MaxCompute

Last Updated:Mar 01, 2026

This topic describes how to load offline data from a single Object Storage Service (OSS) table into MaxCompute, providing best practices for data source configuration, network connectivity, and synchronization task configuration.

OSS overview

Alibaba Cloud Object Storage Service (OSS) is a secure, cost-effective, and highly reliable cloud storage service that offers massive storage capacity, 99.9999999999% (twelve 9s) data durability, and 99.995% service availability. OSS offers various storage classes to help you optimize costs. Data Integration allows you to sync data from OSS to other destinations and from other sources into OSS.

Get OSS bucket information

Navigate to the OSS console. In the Bucket list, find the bucket you want to use for data synchronization. On the bucket's overview page, get its Public endpoint and Internal endpoint. You can choose which endpoint to use based on your scenario.

  • The Public endpoint is used for access over the internet. When you access OSS over the internet, inbound traffic (writes) is free, but outbound traffic (reads) is charged. For more information about OSS fees, see OSS Pricing and Billing Items.

  • The internal network is the private communication network between Alibaba Cloud products in the same region. For example, you can use a Data Integration resource group to access OSS in the same region. Both inbound and outbound traffic over the internal network are free. If you are reading from or writing to an OSS bucket that is in the same region as your Data Integration resource group, use the internal endpoint. Otherwise, use the public endpoint.

  • For a list of regions and their endpoints, see Regions and Endpoints.
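The internal endpoint for a region differs from the public endpoint only by an -internal suffix. A minimal sketch of how the two are composed, using cn-hangzhou purely as an example region (confirm the actual endpoints for your bucket on its overview page or in Regions and Endpoints):

```python
# Sketch: OSS endpoints follow the pattern oss-<region>[-internal].aliyuncs.com.
# The region below is an example; look up your bucket's real endpoints
# on its overview page in the OSS console.
region = "cn-hangzhou"

public_endpoint = f"oss-{region}.aliyuncs.com"             # access over the internet
internal_endpoint = f"oss-{region}-internal.aliyuncs.com"  # access within the same region

print(public_endpoint)    # oss-cn-hangzhou.aliyuncs.com
print(internal_endpoint)  # oss-cn-hangzhou-internal.aliyuncs.com
```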

Limitations

Syncing source data to MaxCompute external tables is not supported.

Procedure

Note

This topic uses the UI of Data Studio (New) to demonstrate how to configure an offline synchronization task.

Step 1: Create a node and task

For general steps on how to create a node and use the Codeless UI, see Codeless UI configuration.

Step 2: Configure the source and destination

Configure the source (OSS)

In this scenario, the data source is an OSS file. Key configuration items are described below.

File Type

Select the file type to sync. The Codeless UI supports reading files in csv, text, orc, and parquet formats.

File Path

Enter the path to the source file.

  • When you specify a single OSS object, OSS Reader uses a single thread for data extraction.

  • When you specify multiple OSS objects, OSS Reader uses multiple threads for data extraction. You can configure the concurrency based on your requirements.

  • When you use wildcards, OSS Reader matches multiple objects. For example, the path abc*[0-9] matches objects such as abc0, abc1, abc2, and abc3, and the path abc?.txt matches objects that start with abc, end with .txt, and have exactly one character in between.
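The two wildcard examples above can be reproduced with Python's fnmatch, shown purely as an illustration (the object names are hypothetical; OSS Reader's own matching behavior is defined by the service):

```python
from fnmatch import fnmatch

# Hypothetical object names used only to demonstrate the two patterns above.
objects = ["abc0", "abc1", "abc2", "abc3", "abcd", "abc1.txt", "abcde.txt"]

# abc*[0-9]: starts with abc and ends with a digit.
print([o for o in objects if fnmatch(o, "abc*[0-9]")])  # ['abc0', 'abc1', 'abc2', 'abc3']
# abc?.txt: abc, exactly one character, then .txt.
print([o for o in objects if fnmatch(o, "abc?.txt")])   # ['abc1.txt']
```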

Field Delimiter

Specify the column delimiter used in the file.

Encoding

Set the character encoding used to read the source file.

Null String

  • If you select Do not process, values read from the source remain unchanged.

  • If you select Visible characters, enter the string that represents a null value. If you leave this field empty, it is treated as an empty string.

  • If you select Invisible characters, enter a Unicode code point, such as \u001b or \u007c, or an escape sequence such as \t. This field cannot be empty.
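Conceptually, these options tell the reader which literal string in the source should be replaced with NULL. A sketch of that substitution, assuming a pipe delimiter and \u001b as the null marker (both are illustration choices, not the reader's actual implementation):

```python
import csv
import io

NULL_MARKER = "\u001b"  # assumed "invisible character" null marker

# Hypothetical pipe-delimited source data; the second record has a null name.
raw = "1|alice\n2|" + NULL_MARKER + "\n"

rows = [
    [None if field == NULL_MARKER else field for field in record]
    for record in csv.reader(io.StringIO(raw), delimiter="|")
]
print(rows)  # [['1', 'alice'], ['2', None]]
```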

Compression Format

The compression format of the source file. Supported formats are Gzip, Bzip2, Zip, and uncompressed.

Skip Header

For CSV-like files, you can choose whether to skip the header row. By default, the header row is not skipped and is read as data.

Note

Skipping the header is not supported for compressed files.
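Conceptually, enabling Skip Header drops the first row before any records are produced. A minimal sketch with Python's csv module (the sample data is hypothetical):

```python
import csv
import io

# Hypothetical CSV file content with a header row.
data = "id,name\n1,alice\n2,bob\n"

reader = csv.reader(io.StringIO(data))
next(reader)         # skip the header row, as the Skip Header option would
rows = list(reader)
print(rows)          # [['1', 'alice'], ['2', 'bob']]
```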

Table Data Structure

After you configure the data source parameters, click Confirm Data Structure to verify the data format.

Configure the destination (MaxCompute)

In this scenario, the destination for the offline data sync from OSS is a MaxCompute table. The key configuration items are described below.

Note

You can keep the default values for any parameters not mentioned in the table below.

Tunnel Resource Group

This specifies the MaxCompute data transfer resource (Tunnel Quota). By default, Public transmission resources are used, which corresponds to the free quota for MaxCompute. If your exclusive Tunnel Quota becomes unavailable due to overdue payments or expiration, the task automatically reverts to using Public transmission resources.

Table

Select the MaxCompute table for data synchronization. If you are using a standard DataWorks workspace, ensure that a MaxCompute table with the same name and schema exists in both the development environment and the production environment.

Alternatively, click Generate Destination Table Schema to automatically create a destination table. You can then manually adjust the creation statement.

Note

Consider the following:

  • If the target MaxCompute table does not exist in the development environment, it will not appear in the destination table list.

  • If the target MaxCompute table does not exist in the production environment, the published synchronization task will fail because it cannot find the target table.

  • If the table schemas in the development and production environments are inconsistent, the column mapping used at runtime may differ from the mapping you configured, leading to incorrect data writes.

Partition Information

If the destination is a partitioned table, you can specify the values for the partition columns.

  • You can use a fixed value, such as ds=20220101.

  • You can use scheduling parameters, such as ds=${bizdate}. The system automatically replaces these parameters with actual values at runtime.
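The substitution of scheduling parameters such as ${bizdate} can be sketched as follows. This is a simplified illustration (the render helper is hypothetical); in DataWorks, ${bizdate} resolves to the task instance's data timestamp in yyyymmdd format, typically the day before the scheduled run:

```python
import re
from datetime import date

def render(partition_expr: str, biz_date: date) -> str:
    # Hypothetical helper: replace ${name} placeholders with concrete values.
    values = {"bizdate": biz_date.strftime("%Y%m%d")}
    return re.sub(r"\$\{(\w+)\}", lambda m: values[m.group(1)], partition_expr)

# A task instance scheduled for 2022-01-02 has a bizdate of 2022-01-01.
print(render("ds=${bizdate}", date(2022, 1, 1)))  # ds=20220101
```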

Write Method

Choose whether to clear existing data in the target table or append new data.

Step 3: Configure field mapping

After configuring the source and destination, map the columns between them. You can choose to Map Fields with the Same Name, Map Fields in the Same Line, Delete All Mappings, or Edit Field Mappings.

Step 4: Configure advanced settings

You can configure advanced settings for the task, such as Expected Maximum Concurrency and Policy for Dirty Data Records. For this tutorial, set the Policy for Dirty Data Records to Disallow Dirty Data Records and use the default values for all other settings. For more information, see Codeless UI configuration.

Step 5: Configure and run the debug task

  1. In the top-right corner of the editor, click Run Configuration, set the Resource Group and Script Parameters for the debug run, and then click Run to test the task.

  2. In the left-side navigation pane, click the new icon next to Personal Directory to create an SQL file. Run the following SQL statement to query the destination table and verify that the data meets your expectations.

    SELECT * FROM <your_maxcompute_table_name> WHERE pt=<your_partition> LIMIT 20;

Step 6: Configure scheduling and publish the task

In the right-side pane, click Scheduling Settings to set scheduling parameters for periodic runs. Then, click Publish and follow the prompts to publish the task.