How do I rebuild, edit, or delete a MaxCompute data source?

Use a MaxCompute table as a data source in OpenSearch Retrieval Engine Edition to build full-text search indexes over your data warehouse data. When you enable Automatic Reindexing and configure a done table, OpenSearch can automatically rebuild the index each time it detects a new semaphore in the done table.

Prerequisites

Before you begin, make sure that:

You are familiar with MaxCompute (formerly known as ODPS). For background, see What is MaxCompute?
The MaxCompute table is a partitioned internal table. External tables are not supported.
The table fields use only the following data types: STRING, BOOLEAN, DOUBLE, BIGINT, and DATETIME.
The account you use to log on to the OpenSearch console has the DESCRIBE, SELECT, and DOWNLOAD permissions on the MaxCompute table, and the LABEL permission on the table fields.

To grant the required permissions, run the following statements in MaxCompute:

-- Add the account.
add user ****@aliyun.com;

-- Grant table-level permissions.
GRANT describe,select,download ON TABLE table_xxx TO USER ****@aliyun.com;
GRANT describe,select,download ON TABLE table_xxx_done TO USER ****@aliyun.com;

-- Grant LABEL permissions.
-- Option 1: Grant permissions on all fields in the project.
SET LABEL 3 to USER ****@aliyun.com;

-- Option 2: Grant permissions on specific fields in a table.
GRANT LABEL 3 ON TABLE table_xxx(col1, col2) TO USER ****@aliyun.com;

Important

If field permission verification is enabled on your MaxCompute table, you must grant LABEL permissions on all fields in the table. Otherwise, OpenSearch cannot pull data and index creation fails.
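The grant statements above follow a fixed pattern, so they can be generated for any table and account. The following is a minimal Python sketch, not an official tool: the helper name `grant_statements` and its parameters are hypothetical, and it only builds statement strings for you to review and run in MaxCompute yourself. It assumes the done table follows the `{data_table_name}_done` naming rule.

```python
def grant_statements(user, data_table, columns=None, label=3):
    """Build the MaxCompute statements that give an account read access.

    Hypothetical helper: it only returns strings and does not connect
    to MaxCompute. `user`, `data_table`, and `columns` are placeholders.
    """
    done_table = f"{data_table}_done"  # assumed naming rule for the done table
    statements = [f"ADD USER {user};"]

    # Table-level permissions on both the data table and its done table.
    for table in (data_table, done_table):
        statements.append(
            f"GRANT describe,select,download ON TABLE {table} TO USER {user};"
        )

    if columns:
        # Option 2: LABEL permission on specific fields of the data table.
        column_list = ", ".join(columns)
        statements.append(
            f"GRANT LABEL {label} ON TABLE {data_table}({column_list}) TO USER {user};"
        )
    else:
        # Option 1: LABEL permission on all fields in the project.
        statements.append(f"SET LABEL {label} TO USER {user};")
    return statements


for statement in grant_statements("user@example.com", "mytable", ["col1", "col2"]):
    print(statement)
```

If field permission verification is enabled, pass every field of the table in `columns`, or use the project-level option.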

For the CREATE TABLE statement used to build an index from a MaxCompute data source, see CREATE TABLE statement for creating a table in a MaxCompute data source.

How data sync works

MaxCompute data sources support two sync modes, which are typically used together:

Mode	How it works	When to use
Full indexing	OpenSearch reads the entire MaxCompute table and rebuilds the index	Initial setup; periodic full refreshes triggered by the done table
Incremental sync	Real-time updates via an API data source	After full indexing, to keep the index current with row-level changes

This topic covers full indexing. To set up incremental sync, use an API data source alongside your MaxCompute data source.

Add a MaxCompute data source

Log on to the OpenSearch console. In the upper-left corner, select OpenSearch Retrieval Engine Edition.
On the Instances page, find your instance and click Manage in the Actions column.
In the left-side navigation pane, choose Configuration Center > Data Source, then click Add Data Source.

In the panel that appears, select MaxCompute as the data source type and configure the parameters.

Parameter	Description	Example
Data Source Name	Name of the data source. Format: `InstanceName_CustomName`. Cannot be changed after creation.	`myinstance_orders`
Project	The MaxCompute project that contains your table.	`my_project`
AccessKey	The AccessKey ID of the account.	`LTAI5tXxx`
AccessKey Secret	The AccessKey secret of the account.	—
Table	The MaxCompute table to use as the data source. Must be a partitioned internal table.	`order_records`
Partition Key	The partition key of the table. For hourly partitions, use partition values in the `yyyymmddhh` format (for example, `2022011314`) so that multiple full indexing tasks can be triggered per day.	`ds`
Automatic Reindexing	When enabled, OpenSearch automatically rebuilds indexes each time it detects a new semaphore in the done table. Requires a done table; see Configure automatic reindexing.	—

Click Verify. After the configuration passes verification, click OK.
Configure an index schema to create an index table for this data source. For details, see the Add an index table section of the index schema topic.
Update configurations and trigger reindexing to make the data source available to online clusters. For details, see Update configurations.
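The `yyyymmddhh` partition value format mentioned for the Partition Key parameter can be produced and checked with the standard library. This is an illustrative sketch only; the function names are hypothetical, and which time zone the hour belongs to is up to your data pipeline.

```python
from datetime import datetime


def hourly_partition_value(moment: datetime) -> str:
    """Format a datetime as a yyyymmddhh partition value."""
    return moment.strftime("%Y%m%d%H")


def is_hourly_partition_value(value: str) -> bool:
    """Check that a string is exactly ten digits and a valid date and hour."""
    if len(value) != 10 or not value.isdigit():
        return False
    try:
        datetime.strptime(value, "%Y%m%d%H")
    except ValueError:
        return False
    return True


print(hourly_partition_value(datetime(2022, 1, 13, 14)))  # 2022011314
```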

Configure automatic reindexing

When automatic reindexing is enabled, OpenSearch watches a done table in MaxCompute and rebuilds the index each time a new semaphore record appears in that table. The done table acts as a signal: you insert a record into it to tell OpenSearch that new data is ready.

Scenario: Your MaxCompute table mytable is partitioned by ds and receives a new daily partition containing the full dataset. Each day, after the new partition is ready, you want OpenSearch to automatically pick it up and rebuild the index.

Done table requirements:

Name: {data_table_name}_done (for example, if the data table is mytable, the done table is mytable_done)
Partition key: must match the partition key of the data table (for example, ds)
Schema: exactly one field named attribute of type STRING
The partition you add to the done table must already exist in the data table
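The requirements above can be expressed as a small pre-flight check. The sketch below is an assumption-laden illustration: `check_done_table` is a hypothetical helper, and it expects you to supply the table metadata yourself (for example, copied from the output of a describe command) rather than reading it from MaxCompute.

```python
def done_table_name(data_table):
    """Derive the done table name from the data table name."""
    return f"{data_table}_done"


def check_done_table(data_table, data_partition_key, data_partitions,
                     done_table, done_partition_key, done_fields,
                     new_partition):
    """Return a list of problems; an empty list means all checks pass.

    done_fields maps non-partition field names to their types.
    data_partitions is the collection of partition values that already
    exist in the data table.
    """
    problems = []
    expected_name = done_table_name(data_table)
    if done_table != expected_name:
        problems.append(f"done table should be named {expected_name}")
    if done_partition_key != data_partition_key:
        problems.append(f"partition key should be {data_partition_key}")
    normalized = {name: ftype.upper() for name, ftype in done_fields.items()}
    if normalized != {"attribute": "STRING"}:
        problems.append("schema should be exactly one STRING field named attribute")
    if new_partition not in data_partitions:
        problems.append(f"partition {new_partition} does not exist in {data_table}")
    return problems


print(check_done_table("mytable", "ds", {"20220114"},
                       "mytable_done", "ds", {"attribute": "string"},
                       "20220114"))  # []
```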

Follow these steps to set up the done table and trigger automatic reindexing:

Step 1: Enable automatic reindexing when adding the data source — see Add a MaxCompute data source.

Step 2: Create the done table in MaxCompute:

create table mytable_done (attribute string) partitioned by (ds string);

After creation, both tables are visible in MaxCompute:

odps:sql:xxx> show tables;
ALIYUN$****@aliyun.com:mytable          -- The data table
ALIYUN$****@aliyun.com:mytable_done     -- The done table

Step 3: Signal OpenSearch to reindex after each new partition is ready. When partition ds=20220114 is generated in mytable, run:

-- Add the partition to the done table.
alter table mytable_done add if not exists partition (ds="20220114");

-- Insert the semaphore to trigger automatic full data synchronization.
insert into table mytable_done partition (ds="20220114") select '{"swift_start_timestamp":1642003200}';

The swift_start_timestamp value is a Unix timestamp, given in seconds in this example, that specifies the start offset for real-time incremental synchronization.

After the insert, the done table contains:

odps:sql:xxx> select * from mytable_done where ds=20220114 limit 1;
+--------------------------------------+----------+
| attribute                            | ds       |
+--------------------------------------+----------+
| {"swift_start_timestamp":1642003200} | 20220114 |
+--------------------------------------+----------+

OpenSearch scans the done table, detects the new semaphore, and automatically starts a reindexing task.
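If the signal in Step 3 is sent from a scheduled job, the two statements can be generated from the partition value and a start time. The following Python sketch is one possible way to do that; `semaphore_statements` is a hypothetical helper that only builds strings, and it assumes the timestamp is expressed in seconds, as the ten-digit example value suggests.

```python
import json
from datetime import datetime, timezone


def semaphore_statements(data_table, partition_key, partition_value, start):
    """Build the ALTER and INSERT statements for the done table.

    `start` must be a timezone-aware datetime; it becomes the
    swift_start_timestamp (assumed to be in seconds).
    """
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    timestamp = int(start.timestamp())
    # Compact separators reproduce the exact JSON layout used in this topic.
    payload = json.dumps({"swift_start_timestamp": timestamp},
                         separators=(",", ":"))
    done_table = f"{data_table}_done"
    partition = f'{partition_key}="{partition_value}"'
    return [
        f"alter table {done_table} add if not exists partition ({partition});",
        f"insert into table {done_table} partition ({partition}) "
        f"select '{payload}';",
    ]


start = datetime(2022, 1, 12, 16, 0, tzinfo=timezone.utc)
for statement in semaphore_statements("mytable", "ds", "20220114", start):
    print(statement)
```

With the start time shown, the generated statements match the ones in Step 3.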

Important

The attribute field value must be a JSON string in the format {"swift_start_timestamp":<unix_timestamp>}.
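Before inserting a semaphore, it can help to check the attribute value locally. The sketch below is a deliberately strict illustration: `parse_semaphore` is a hypothetical function, and rejecting extra keys or non-integer timestamps is an assumption made here, not a documented OpenSearch rule.

```python
import json


def parse_semaphore(attribute):
    """Return the swift_start_timestamp from an attribute value.

    Raises ValueError if the value is not a JSON object that contains
    only an integer swift_start_timestamp (a strict check assumed here).
    """
    try:
        data = json.loads(attribute)
    except json.JSONDecodeError as exc:
        raise ValueError("attribute is not valid JSON") from exc
    if not isinstance(data, dict) or set(data) != {"swift_start_timestamp"}:
        raise ValueError("attribute must contain only swift_start_timestamp")
    timestamp = data["swift_start_timestamp"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(timestamp, bool) or not isinstance(timestamp, int) or timestamp < 0:
        raise ValueError("swift_start_timestamp must be a non-negative integer")
    return timestamp


print(parse_semaphore('{"swift_start_timestamp":1642003200}'))  # 1642003200
```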

Modify a MaxCompute data source

On the Data Source page, find the data source and click Modify in the Actions column.
In the Modify Data Source panel, update the parameters you want to change: Project, AccessKey, AccessKey Secret, Table, or Partition Key.
The data source name cannot be changed.
Click Verify. After the modified configuration passes verification, click OK.
Update configurations and trigger reindexing to apply the changes to online clusters. For details, see Update configurations.

Delete a MaxCompute data source

On the Data Source page, find the data source and click Delete in the Actions column.
The system checks whether the data source is referenced by an index table:
- Not referenced: Click OK to delete. Then update configurations and rebuild indexes.
- Referenced: The system returns an error. Delete the referencing index table first, then delete the data source. For details, see the Delete an index table section of the index schema topic.

Limitations

The MaxCompute table must be a partitioned internal table. External tables are not supported.
Data source names cannot be changed after creation.
Supported field data types: STRING, BOOLEAN, DOUBLE, BIGINT, and DATETIME.
Full indexing uses MaxCompute table data directly. For real-time incremental updates, use an API data source alongside the MaxCompute data source.
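The data type limitation above can be checked before you add a table as a data source. This is a minimal sketch with a hypothetical helper, `unsupported_fields`, which expects the schema as a plain mapping of field names to type names that you provide yourself.

```python
SUPPORTED_TYPES = {"STRING", "BOOLEAN", "DOUBLE", "BIGINT", "DATETIME"}


def unsupported_fields(schema):
    """Return sorted (field, type) pairs whose type is not supported.

    `schema` maps field names to MaxCompute type names; the comparison
    ignores case and surrounding whitespace.
    """
    return sorted(
        (name, ftype)
        for name, ftype in schema.items()
        if ftype.strip().upper() not in SUPPORTED_TYPES
    )


schema = {"id": "bigint", "title": "string", "price": "decimal"}
print(unsupported_fields(schema))  # [('price', 'decimal')]
```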

Next steps

Index schema: Configure an index table for the data source.
Update configurations: Push configuration changes to online clusters and trigger reindexing.