Build an Offline Sync Task with Qwen NLP Integration - DataWorks

AI-assisted processing integrates large language model (LLM) inference directly into the DataWorks Data Integration pipeline. Instead of moving data as-is, the pipeline calls AI models in real time to analyze, enrich, or transform each record as it travels from the source to the destination—unlocking value from unstructured text without writing any AI invocation code.

This feature is designed for enterprise users who need to perform advanced analysis and processing on data during data synchronization. It is especially useful for companies that want to use AI to improve data quality and extract value from data.

Supported NLP tasks

Task	What it does
Sentiment analysis	Classifies the sentiment of text
Summary generation	Condenses long documents into key points
Keyword extraction	Pulls out the most relevant terms and phrases
Text translation	Converts text from one language to another

Use cases

Industry	How AI processing helps
Customer service / E-commerce	Analyze sentiment in user comments and support tickets; extract core issues and key feedback points
Compliance / Legal / Scientific research	Generate summaries and extract key information from policy documents, legal contracts, and research papers during sync
Manufacturing / Supply chain / Healthcare	Analyze device logs, supply chain feedback, and doctor-patient records to surface threat alerts and service quality signals
Cross-language collaboration	Translate social media comments, news articles, or business documents into a single language for centralized analysis

Prerequisites

Before you begin, ensure that you have:

A workspace using Data Studio (new version)
A large model service configured from one of the supported providers:
- Alibaba Cloud DataWorks model service — Deploy a model and start the model service in Model Service Management
- Alibaba Cloud Model Studio — Activate Alibaba Cloud Model Studio and obtain an API key
- PAI Model Gallery — Activate Platform for AI (PAI) and obtain a token for the model service
Data source information configured manually or as existing data sources for the offline sync task
A resource group attached to the workspace with connectivity to the data source

Billing

In addition to DataWorks subscription fees and resource group fees, AI-assisted processing incurs model inference fees.

Example: translate customer feedback during a Hologres-to-Hologres sync

This example walks through configuring AI-assisted processing in an offline sync task that copies data from one Hologres table to another. The task translates each value in the feedback_info column into English and writes the result to the destination table.

What the AI processing does in this example:

Source field	Processing description	Output field
`feedback_info`	`Translate '#{feedback_info}' into English`	`feedback_processed`

Step 1: Prepare source data

Create the source table and insert sample records:

CREATE TABLE customer_feedback (
    id BIGINT PRIMARY KEY,
    device STRING,
    feedback_info STRING,
    pt INT
)
PARTITIONED BY (pt)
DISTRIBUTED BY HASH(id)
WITH (table_type='Duplication');

INSERT INTO customer_feedback (id, device, feedback_info, pt)
VALUES
(8, 'Huawei MateBook D14', 'Affordable, suitable for students, performance is adequate', 2020),
(1, 'iphone', 'This product is okay, I have used it for 1 year', 2013),
(10, 'Bose QuietComfort 35 II', 'A classic among noise-canceling headphones, maximum comfort', 2021);

Step 2: Create an offline sync task

Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a region. Find the target workspace and choose Shortcuts > Data Studio in the Actions column.
In the left navigation pane, click to open Data Studio. To the right of Workspace Directories, click and choose Create Node > Data Integration > Batch Synchronization. The New Node dialog box opens.
Set the Path, Data Source and Destination, and Name for the node, then click OK.

Step 3: Configure the sync task

After the node is created, the task configuration page opens. Configure each section in order.

Data source

Type — Set during node creation and cannot be changed. To use a different data source type, create a new node.
Data Source — Select an existing data source from the drop-down list, or click Add Data Source to create one.

Runtime resource

Select a Resource Group for the sync task. For a serverless resource group, specify the number of CUs in the Resource Usage(CU) field.
After selecting a resource group, Data Integration checks connectivity to the source and destination automatically. Click Connectivity Check to run the check manually.

Source

Configure the source table settings: Schema, Table, Partition, and Data Filtering conditions. Click Data Preview to preview the records that will be synced.

Data processing

In the data processing section, toggle Enable to turn on data processing. This requires additional computing resources.
Click Add Node and select AI Process.

Configure the AI Process node. The key parameters are:

Parameter	Description
Model Provider	Select the provider: DataWorks Model Service, Aliyun Bailian, or PAI Model Gallery
Model Endpoint	Applies when Model Provider is PAI Model Gallery. Enter the model invocation endpoint. To get the endpoint, see Test service invocation
Model Name	The model used for intelligent data processing. Select one from the list
API Key	The credential to access the model. For Alibaba Cloud Model Studio, see Obtain a Model Studio API key. For PAI Model Gallery, go to the deployed EAS task, start online debugging, and use the access token as the API key
Processing Description	A natural language instruction describing how to process the source field. Reference field names using the `#{column_name}` format. For this example, enter `Translate '#{feedback_info}' into English`
Output Field	The field where the processed result is stored. If the field does not exist, it is created automatically. For this example, enter `feedback_processed`

Click Data Output Preview in the upper-right corner of the data processing section to preview the processed output.
(Optional) Add more processing nodes. Multiple data processing nodes execute sequentially.

Destination

Configure the destination table: Schema, Table, and Partition.
- Click Generate Target Table Schema to generate the schema automatically.
- Select an existing table if one already exists.
Set the Write Mode and Write Conflict Strategy.
Configure whether to delete existing data in the Hologres table before the sync.
(Optional) Set Maximum Connections.
Maximum Connections applies only when Write Mode is SQL(INSERT INTO). A single task can use up to nine connections. Make sure the Hologres instance has enough idle connections before starting the task.

Destination field mapping

After configuring the source, data processing, and destination, the field mapping between the source and destination tables is shown. Fields are mapped by name and position by default.

For this example, in addition to the default mapping for id, device, feedback_info, and pt, manually map feedback_processed (the AI output field from the source) to translate_feedback in the destination table.

Step 4: Test the task

In the right panel, click Run Configuration. Set the Resource Group and any Script Parameters for this test run.
In the toolbar, click Save and then Run. After the task finishes, verify the result is successful and check the destination table to confirm the data is correct.

Step 5: Configure scheduling

To run the sync task on a schedule, set the Scheduling Policies in the Scheduling section on the right side of the page and configure the node scheduling properties.

Step 6: Publish the node

Click the Publish icon in the toolbar to start the publishing flow. This publishes the task to the production environment. Periodic scheduling takes effect only after the node is published.

What's next

After the node is published, you can backfill historical data or monitor and manage the task in Operation Center.

Backfill Data — Backfills data for the current node. For more advanced backfill workflows, use Operation Center. See Run a data backfill task and view the data backfill instance (New).
Perform O&M — After publishing, the task is automatically managed by Operation Center. View the running status or manually trigger execution. See Operation Center.

References

Manage data backfill instances