Synchronize data from DLF to Elasticsearch

If you need to run vector retrieval or build Retrieval-Augmented Generation (RAG) applications on multimodal data, such as images and text stored in Data Lake Formation (DLF), use the multimodal data processing and synchronization feature described in this topic. The feature extracts text and image data from DLF, calls AI models to vectorize it, and synchronizes the structured results to your Alibaba Cloud Elasticsearch instance. This simplifies the development of AI applications such as multimodal search and RAG.

Prerequisites

Instance version: This feature is supported on Alibaba Cloud Elasticsearch 8.15 and later.
Region: The Data Lake Formation (DLF) service and the Alibaba Cloud Elasticsearch instance must be in the same region.

Synchronization mechanism

The synchronization task uses a full and incremental mechanism. When a task starts, it automatically reads data in Paimon format from the DLF source table, applies data processing rules to vectorize text and images, and converts the data into the Elasticsearch index format. This process uses Elasticsearch to accelerate data retrieval from the data lake.
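As a purely conceptual illustration of a full and incremental mechanism, the following Python sketch models the target index as a dictionary keyed by the document ID. It is not the service's implementation: the function names `full_load` and `apply_increment` are hypothetical, the sketch assumes the full load runs before any incremental changes are applied, and it covers only inserts and updates because this topic does not describe how deletions in the source table are handled.

```python
def full_load(source_rows, key):
    """Build the initial index contents from every row in the source table."""
    return {str(row[key]): dict(row) for row in source_rows}


def apply_increment(index, changed_rows, key):
    """Upsert rows that were added or updated after the full load."""
    for row in changed_rows:
        # A row with an existing key overwrites the old document;
        # a row with a new key becomes a new document.
        index[str(row[key])] = dict(row)
    return index


if __name__ == "__main__":
    index = full_load(
        [{"id": 1, "name": "cat.jpg"}, {"id": 2, "name": "dog.jpg"}], key="id"
    )
    apply_increment(
        index,
        [{"id": 2, "name": "dog-v2.jpg"}, {"id": 3, "name": "note.txt"}],
        key="id",
    )
    print(sorted(index))  # prints ['1', '2', '3']
```

Keying the documents by a single ID is what makes repeated incremental runs idempotent in this toy model, which is also why the field mapping step asks you to choose a primary key.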

Billing

Data synchronization tasks are free of charge. However, you are charged for node storage space after data is synchronized to your Elasticsearch instance. You can view your billing details in the Billing and Costs center.
AI model invocation fees: Data processing services, such as text and image vectorization, are provided by the AI Search Open Platform. You are billed based on the actual number of model calls. You will not be charged if you do not use these services.

Create and configure a synchronization task

Follow these steps to create and start a synchronization task.

Go to the task creation page
1. Log on to the Elasticsearch console. In the top menu bar, switch to the target region.
2. Find the target instance and click its Cluster ID to go to the Basic Information page.
3. In the left-side navigation pane, choose AI Service Center > Offline Data Processing.
4. In the Multimodal Data Processing Service area, select a Workspace Name of Model Service and click Initialize Model. After the initialization is complete, click Get Started.
  Workspace Name of Model Service: An existing workspace in the AI Search Open Platform. The workspace named default is the default workspace; any other workspaces listed are ones that you created.
  Initialize Model: Initializes the model in the selected workspace to make it available.
5. On the synchronization task list page, click Create.

Configure basic information

Set a name for the task and connect to the DLF data source. Follow the on-screen prompts to complete the Basic Information Configuration step.

Parameter	Description
Task ID/Name	An easy-to-identify name for the synchronization task.
API Key	The API key used to call the AI Search Open Platform. You must create it in the AI Search Open Platform in advance.
Data Source	Select Data Lake Formation (DLF).
Table Type	Currently, only data tables in Paimon format are supported.
Data Catalog	Enter the details for your environment.
Database	Enter the details for your environment.
Data Table	Enter the details for your environment.
RAM Role	Authorize Elasticsearch to use the AI Search Open Platform Default Role. When you grant this authorization, a service-linked role is automatically created with the necessary permissions. Role name: AliyunServiceRoleForSearchPlat. Role policy: AliyunServiceRolePolicyForSearchPlat. Description: Allows the AI Search Open Platform service to access your resources.

After completing the configuration, click Next.

Configure data processing

- In the Original Field section, select the fields to synchronize to Elasticsearch.
- In the Data Processing Configurations section, select the fields to process and configure a processing rule for each. For example, if you assign vectorization models to text and image fields, the system automatically generates vector data for those fields and synchronizes it to Elasticsearch.

Configure field mapping

This step maps the source fields to the destination fields in the target Elasticsearch index.

Configure Elasticsearch connection information: Enter the Username, Password, and Target Index.

Important
The system automatically loads all indexes from the current Elasticsearch instance. If your target index does not appear in the Target Index drop-down list, create the index manually by using a command similar to the following example.
```
PUT /your_index_name
{
  "settings": {
    "index": {
      "number_of_shards": 1,         // Number of primary shards
      "number_of_replicas": 1,       // Number of replicas for each primary shard
      "refresh_interval": "30s",     // Data refresh interval (delay before writes are searchable)
      "analysis": {                  // Example analyzer configuration
        "analyzer": {
          "default": {
            "type": "standard"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "url": {
        "type": "keyword"
      },
      "source": {
        "type": "keyword"
      },
      "embedding": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  }
}
```
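If you create the index from a script rather than by pasting the request into a console, note that the `//` comments in the example above are not valid JSON and must be removed before a standard JSON parser will accept the body. The following Python sketch builds an equivalent body using only the standard library. The function name `build_index_body` and its parameters are illustrative and not part of any Alibaba Cloud SDK. The sketch assumes that `vector_dims` is set to the output dimension of the vectorization model you selected, because a `dense_vector` field only accepts vectors whose length equals `dims`.

```python
import json


def build_index_body(vector_dims=768, shards=1, replicas=1, refresh_interval="30s"):
    """Return the create-index body from the example above as a plain dict."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,          # primary shards
                "number_of_replicas": replicas,      # replicas per primary shard
                "refresh_interval": refresh_interval,  # delay before writes are searchable
                "analysis": {"analyzer": {"default": {"type": "standard"}}},
            }
        },
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "name": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                },
                "url": {"type": "keyword"},
                "source": {"type": "keyword"},
                # Must match the embedding model's output dimension (assumption: 768).
                "embedding": {"type": "dense_vector", "dims": vector_dims},
            }
        },
    }


if __name__ == "__main__":
    # Comment-free JSON that can be sent as the body of PUT /your_index_name.
    print(json.dumps(build_index_body(vector_dims=768), indent=2))
```

The field names (`id`, `name`, `url`, `source`, `embedding`) are the ones from the example and should be replaced with the fields you selected for synchronization.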
On the Field Mapping Configuration tab, complete the following configuration:
1. Configure field mappings
  The system automatically matches fields with the same name, but you can also adjust the mappings manually.
  - Fields to Sync: The source fields to synchronize from the data source.
  - Target Index Field: The target fields in the Elasticsearch index.
  - Primary Key: Select a field to serve as the unique identifier (_id) for documents in Elasticsearch.
  Currently, only a single primary key is supported. If your source table uses a composite primary key, we recommend that you create a new column in the source data, concatenate the composite key values into a unique string ID, and use this new column as the primary key for synchronization.
2. After confirming that all field mappings are correct, click Next.
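For the composite primary key recommendation in step 1, the concatenation logic could look like the following Python sketch. The helper name `make_composite_id`, the `|` separator, and the sample column names are assumptions for illustration; the new column itself must be written to the source table by whatever job produces your Paimon data. The separator is escaped inside each value so that different key combinations cannot collapse into the same string, and the length check reflects the 512-byte limit that Elasticsearch places on `_id`.

```python
def make_composite_id(row, key_columns, sep="|"):
    """Join the values of key_columns into one string usable as a document _id."""
    if not key_columns:
        raise ValueError("key_columns must not be empty")
    parts = []
    for column in key_columns:
        value = row[column]
        if value is None:
            raise ValueError(f"primary key column {column!r} is null")
        # Escape backslashes first, then the separator, so that
        # ("a|b", "c") and ("a", "b|c") cannot produce the same ID.
        text = str(value).replace("\\", "\\\\").replace(sep, "\\" + sep)
        parts.append(text)
    doc_id = sep.join(parts)
    # Elasticsearch rejects _id values longer than 512 bytes.
    if len(doc_id.encode("utf-8")) > 512:
        raise ValueError("composite ID exceeds the 512-byte _id limit")
    return doc_id


if __name__ == "__main__":
    row = {"tenant_id": 42, "doc_id": "img-001", "title": "sample"}
    print(make_composite_id(row, ["tenant_id", "doc_id"]))  # prints 42|img-001
```

Keep the column order fixed once you choose it, because changing the order changes every generated ID and would cause rows to be written as new documents.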

Configure and start the synchronization task

On the Data Synchronization Configuration tab, verify that your settings are correct, and then click Complete to start the task.

Manage and monitor synchronization tasks

After a task is created, you can manage and monitor it on the synchronization task list page.

View task status: Task statuses include Scheduled, Running, and Run Failed.
Management operations:
- Create Copy: Copies an existing task's configuration so that you can quickly create similar tasks.
- Delete: Removes a task that is no longer needed. A deleted task cannot be recovered, so proceed with caution.

Verify the results

You can log on to Kibana to verify that the data has been synchronized successfully.
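Two checks are usually enough for a first verification: a document count on the target index, and a k-nearest-neighbor search on the vector field. The following Python sketch prints both requests in a form that can be pasted into the Kibana console. The helper names, the index name `your_index_name`, and the field name `embedding` are assumptions taken from the index example in this topic. The query vector shown is only a placeholder: a meaningful query vector must be produced by the same vectorization model that the task uses and must have the same number of dimensions as the field.

```python
import json


def count_request(index):
    """Console request that returns the number of documents in the index."""
    return f"GET /{index}/_count"


def knn_request(index, field, query_vector, k=5, num_candidates=50):
    """Console request for a kNN search on a dense_vector field."""
    if k > num_candidates:
        raise ValueError("num_candidates must be greater than or equal to k")
    body = {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        # Leave the stored vector out of the hits to keep the response readable.
        "_source": {"excludes": [field]},
    }
    return f"GET /{index}/_search\n" + json.dumps(body, indent=2)


if __name__ == "__main__":
    print(count_request("your_index_name"))
    # Placeholder vector with 768 dimensions; replace with a real embedding.
    print(knn_request("your_index_name", "embedding", [0.1] * 768))
```

If the count matches the number of rows in the source table and the kNN search returns hits, the synchronized data and vectors are in place.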