What is data preparation

Building a retrieval-augmented generation (RAG) application requires moving data from source databases into vector stores—a pipeline that typically involves custom extraction code, file staging, chunking logic, and embedding calls. AI data preparation is a feature of Data Transmission Service (DTS) that handles this entire pipeline, connecting directly to your source database and delivering vectorized data to a vector database or data lakehouse without intermediate storage or manual exports.

Use AI data preparation to power enterprise knowledge bases, AI-assisted content creation, and intelligent customer service systems.

How it works

AI data preparation runs a four-stage ingestion pipeline from your source database to the destination vector store:

Connect: DTS connects directly to the source database and pulls both full and incremental data. No file exports or manual uploads are required.
Parse: Raw data—whether unstructured documents or structured relational tables—is parsed into a processable format.
Chunk: Each document is split into segments sized for the embedding model.
Embed: Each chunk is converted into a vector embedding and written to the destination vector database.

As the source data changes, DTS keeps the destination in sync through incremental updates.
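
The stages above can be pictured with a small, self-contained sketch. It is not DTS code and does not call any DTS API. The function names (`parse`, `chunk`, `embed`, `sync`), the row format, the word-based chunk size, and the hash-based toy embedding are all assumptions made for illustration. A real pipeline would call an embedding model and write to a vector database, and how DTS applies incremental changes internally is not described here.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical in-memory "vector store": (document id, chunk index) -> vector.
VectorStore = Dict[Tuple[str, int], List[float]]


def parse(row: Dict[str, str]) -> str:
    """Parse: flatten a source row into plain text (assumed row format)."""
    return " ".join(f"{key}: {value}" for key, value in sorted(row.items()) if key != "id")


def chunk(text: str, max_words: int = 8) -> List[str]:
    """Chunk: split text into segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(piece: str, dim: int = 4) -> List[float]:
    """Embed: toy stand-in for an embedding model, derived from a SHA-256 digest."""
    digest = hashlib.sha256(piece.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def sync(rows: List[Dict[str, str]], store: VectorStore) -> None:
    """Apply full or incremental rows: re-chunk and re-embed each changed row."""
    for row in rows:
        doc_id = row["id"]
        # Drop chunks left over from an earlier version of this row.
        for key in [k for k in store if k[0] == doc_id]:
            del store[key]
        for index, piece in enumerate(chunk(parse(row))):
            store[(doc_id, index)] = embed(piece)


if __name__ == "__main__":
    store: VectorStore = {}
    # Initial (full) load of one row.
    sync([{"id": "doc1", "body": "one two three four five six seven eight nine ten"}], store)
    print(sorted(store))
    # Incremental change to the same row.
    sync([{"id": "doc1", "body": "short text"}], store)
    print(sorted(store))
```

Keying each vector by document ID and chunk index is one simple way to keep the destination consistent when a source row shrinks or grows between updates.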

Use cases

Enterprise knowledge bases: Continuously ingest and update company documents, wikis, and structured records so your RAG application always retrieves current information.
AI-assisted content creation: Feed a mixed corpus of unstructured documents and structured data into a unified retrieval index.
Intelligent customer service: Connect support documentation and CRM records directly to a RAG pipeline without building custom ingestion code.

Supported data flows

Data preparation tasks

Source	Destination
MySQL	AnalyticDB for PostgreSQL

RAGFlow knowledge bases

Vector database	Configuration guide	Tutorials
AnalyticDB for PostgreSQL	Build and use a DTS RAGFlow knowledge base	Use DTS RAGFlow to register an external DMS Dify knowledge base; Connect Lark to a DTS RAGFlow knowledge base; Connect OSS to a DTS RAGFlow knowledge base
Lindorm	Build and use a DTS RAGFlow knowledge base

Supported regions

Data preparation tasks: See List of supported regions.
RAGFlow knowledge bases: China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), and China (Hong Kong).

Limitations

Data preparation tasks

Cross-region tasks are not supported.
Create the required table schemas in the destination database before starting a task.
Overwriting existing data in the destination database is not supported.

RAGFlow knowledge bases

Only the virtual private cloud (VPC) network type is supported.
The VPC, vector database, and Object Storage Service (OSS) bucket must be in the same region.
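
The RAGFlow knowledge base constraints can be checked before any resources are provisioned. The sketch below is a hypothetical preflight check, not part of DTS. The `KnowledgeBasePlan` class, its field names, and the `preflight` function are assumptions made for illustration, and regions are written as the display names used in the Supported regions section rather than as region IDs.

```python
from dataclasses import dataclass
from typing import List

# Regions listed for RAGFlow knowledge bases in this document.
RAGFLOW_REGIONS = {
    "China (Hangzhou)",
    "China (Shanghai)",
    "China (Beijing)",
    "China (Shenzhen)",
    "China (Hong Kong)",
}


@dataclass
class KnowledgeBasePlan:
    """Hypothetical description of a planned RAGFlow knowledge base."""
    network_type: str
    vpc_region: str
    vector_db_region: str
    oss_bucket_region: str


def preflight(plan: KnowledgeBasePlan) -> List[str]:
    """Return a list of constraint violations; an empty list means the plan passes."""
    problems: List[str] = []
    if plan.network_type.upper() != "VPC":
        problems.append("network type must be VPC")
    regions = {plan.vpc_region, plan.vector_db_region, plan.oss_bucket_region}
    if len(regions) > 1:
        problems.append("VPC, vector database, and OSS bucket must share one region")
    for region in sorted(regions - RAGFLOW_REGIONS):
        problems.append(f"region not supported for RAGFlow knowledge bases: {region}")
    return problems


if __name__ == "__main__":
    plan = KnowledgeBasePlan("VPC", "China (Hangzhou)", "China (Hangzhou)", "China (Shanghai)")
    for problem in preflight(plan):
        print(problem)
```

Returning every violation at once, rather than stopping at the first, lets a plan be corrected in a single pass.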

Billing

For pricing details, see AI data preparation billing methods.