Use a MaxCompute table as a data source in OpenSearch Retrieval Engine Edition to build full-text search indexes over your data warehouse data. When you enable Automatic Reindexing and configure a done table, OpenSearch can automatically rebuild the index each time it detects a new semaphore in the done table.
Prerequisites
Before you begin, make sure that:
You are familiar with MaxCompute (formerly known as ODPS). For background, see What is MaxCompute?
The MaxCompute table is a partitioned internal table. External tables are not supported.
The table fields use only the following data types: STRING, BOOLEAN, DOUBLE, BIGINT, and DATETIME.
The account that you use to log on to the OpenSearch console has the DESCRIBE, SELECT, and DOWNLOAD permissions on the MaxCompute table and the LABEL permission on the table fields.
To grant the required permissions, run the following statements in MaxCompute:
-- Add the account.
add user ****@aliyun.com;
-- Grant table-level permissions.
GRANT describe,select,download ON TABLE table_xxx TO USER ****@aliyun.com;
GRANT describe,select,download ON TABLE table_xxx_done TO USER ****@aliyun.com;
-- Grant LABEL permissions.
-- Option 1: Grant permissions on all fields in the project.
SET LABEL 3 to USER ****@aliyun.com;
-- Option 2: Grant permissions on specific fields in a table.
GRANT LABEL 3 ON TABLE table_xxx(col1, col2) TO USER ****@aliyun.com;
If field permission verification is enabled on your MaxCompute table, you must grant LABEL permissions on all fields in the table. Otherwise, OpenSearch cannot pull data and index creation fails.
For the CREATE TABLE statement used to build an index from a MaxCompute data source, see CREATE TABLE statement for creating a table in a MaxCompute data source.
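The data type prerequisite can be checked before you add the data source. The following Python sketch is one way to do that, under the assumption that you have already copied the column names and types of your table into a plain dictionary (for example, from the table description in MaxCompute). The function name `find_unsupported_columns` and the sample schema are hypothetical and are not part of any OpenSearch or MaxCompute API.

```python
# Supported field types, taken from the prerequisites in this topic.
SUPPORTED_TYPES = {"STRING", "BOOLEAN", "DOUBLE", "BIGINT", "DATETIME"}


def find_unsupported_columns(schema):
    """Return {column: type} for every column whose type is not supported.

    `schema` is a hand-prepared mapping of column name to MaxCompute type
    name. This sketch does not connect to MaxCompute.
    """
    return {
        name: type_name
        for name, type_name in schema.items()
        if type_name.strip().upper() not in SUPPORTED_TYPES
    }


# Hypothetical example schema: the DECIMAL column is reported as unsupported.
example_schema = {
    "order_id": "bigint",
    "title": "string",
    "price": "decimal(10,2)",
}
print(find_unsupported_columns(example_schema))
```

An empty result means that every listed column uses one of the supported types.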
How data sync works
MaxCompute data sources support two sync modes, which are typically used together:
| Mode | How it works | When to use |
|---|---|---|
| Full indexing | OpenSearch reads the entire MaxCompute table and rebuilds the index | Initial setup; periodic full refreshes triggered by the done table |
| Incremental sync | Real-time updates via an API data source | After full indexing, to keep the index current with row-level changes |
This topic covers full indexing. To set up incremental sync, use an API data source alongside your MaxCompute data source.
Add a MaxCompute data source
Log on to the OpenSearch console. In the upper-left corner, select OpenSearch Retrieval Engine Edition.
On the Instances page, find your instance and click Manage in the Actions column.
In the left-side navigation pane, choose Configuration Center > Data Source, then click Add Data Source.
In the panel that appears, select MaxCompute as the data source type and configure the parameters.
| Parameter | Description | Example |
|---|---|---|
| Data Source Name | Name of the data source. Format: InstanceName_CustomName. Cannot be changed after creation. | myinstance_orders |
| Project | The MaxCompute project that contains your table. | my_project |
| AccessKey | The AccessKey ID of the account. | LTAI5tXxx |
| AccessKey Secret | The AccessKey secret of the account. | N/A |
| Table | The MaxCompute table to use as the data source. Must be a partitioned internal table. | order_records |
| Partition Key | The partition key of the table. Use the yyyymmddhh format (for example, 2022011314) for hourly partitions to trigger multiple full indexing tasks per day. | ds |
| Automatic Reindexing | When enabled, OpenSearch automatically rebuilds indexes each time a change is detected in the data source. Requires a done table. See Configure automatic reindexing. | N/A |
Click Verify. After the configuration passes verification, click OK.
Configure an index schema to create an index table for this data source. For details, see the Add an index table section of the index schema topic.
Update configurations and trigger reindexing to make the data source available to online clusters. For details, see Update configurations.
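The Partition Key description mentions hourly partition values in the yyyymmddhh format. As a minimal sketch, assuming that your scheduling job derives the partition value from a timestamp, Python's standard `datetime` module can produce that format. The function name `hourly_partition_value` is hypothetical.

```python
from datetime import datetime


def hourly_partition_value(moment):
    """Format a datetime as a yyyymmddhh partition value."""
    return moment.strftime("%Y%m%d%H")


# 14:00 on January 13, 2022 yields the example value 2022011314.
print(hourly_partition_value(datetime(2022, 1, 13, 14)))
```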
Configure automatic reindexing
When automatic reindexing is enabled, OpenSearch watches a done table in MaxCompute and rebuilds the index each time a new partition appears in that table. The done table acts as a signal: you insert a record into it to tell OpenSearch that new data is ready.
Scenario: Your MaxCompute table mytable is partitioned by ds and receives a new daily partition containing the full dataset. Each day, after the new partition is ready, you want OpenSearch to automatically pick it up and rebuild the index.
Done table requirements:
Name: {data_table_name}_done (for example, if the data table is mytable, the done table is mytable_done)
Partition key: must match the partition key of the data table (for example, ds)
Schema: exactly one field named attribute of type STRING
The partition you add to the done table must already exist in the data table
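The naming and schema requirements above can be captured in a small helper. This Python sketch only builds the statement text. It assumes the partition key is of type string, as in the example in this topic, and the function name `done_table_ddl` is hypothetical.

```python
def done_table_ddl(data_table, partition_key):
    """Build the CREATE TABLE statement for the done table of `data_table`.

    Assumption: the partition key is a string column, as in this topic's example.
    """
    done_table = f"{data_table}_done"  # required name: {data_table_name}_done
    return (
        f"create table {done_table} (attribute string) "
        f"partitioned by ({partition_key} string);"
    )


print(done_table_ddl("mytable", "ds"))
```

You would still run the generated statement in MaxCompute yourself. The helper does not execute anything.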
Follow these steps to set up the done table and trigger automatic reindexing:
Step 1: Enable automatic reindexing when you add the data source. See Add a MaxCompute data source.
Step 2: Create the done table in MaxCompute:
create table mytable_done (attribute string) partitioned by (ds string);
After creation, both tables are visible in MaxCompute:
odps:sql:xxx> show tables;
ALIYUN$****@aliyun.com:mytable -- The data table
ALIYUN$****@aliyun.com:mytable_done -- The done table
Step 3: Signal OpenSearch to reindex after each new partition is ready. When partition ds=20220114 is generated in mytable, run:
-- Add the partition to the done table.
alter table mytable_done add if not exists partition (ds="20220114");
-- Insert the semaphore to trigger automatic full data synchronization.
insert into table mytable_done partition (ds="20220114") select '{"swift_start_timestamp":1642003200}';
The swift_start_timestamp value is a Unix timestamp that specifies the start offset for real-time incremental synchronization.
After the insert, the done table contains:
odps:sql:xxx> select * from mytable_done where ds=20220114 limit 1;
+--------------------------------------+----------+
| attribute                            | ds       |
+--------------------------------------+----------+
| {"swift_start_timestamp":1642003200} | 20220114 |
+--------------------------------------+----------+
OpenSearch scans the done table, detects the new semaphore, and automatically starts a reindexing task.
The attribute field value must be a JSON string in the format {"swift_start_timestamp":<unix_timestamp>}.
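Because the attribute value must follow this exact JSON format, it can help to generate it rather than type it by hand. The following Python sketch builds the semaphore string and the two done table statements from Step 3, and includes a basic format check. All function names are hypothetical. The check covers only the format described in this topic and may not match every validation that OpenSearch performs.

```python
import json


def build_semaphore(start_timestamp):
    """Return the attribute value as compact JSON with no spaces."""
    return json.dumps(
        {"swift_start_timestamp": int(start_timestamp)}, separators=(",", ":")
    )


def is_valid_semaphore(attribute):
    """Check that `attribute` is a JSON object with an integer swift_start_timestamp."""
    try:
        value = json.loads(attribute)
    except (TypeError, ValueError):
        return False
    if not isinstance(value, dict):
        return False
    timestamp = value.get("swift_start_timestamp")
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(timestamp, int) and not isinstance(timestamp, bool)


def done_table_statements(data_table, partition_key, partition_value, start_timestamp):
    """Build the ALTER TABLE and INSERT statements shown in Step 3."""
    done_table = f"{data_table}_done"
    spec = f'{partition_key}="{partition_value}"'
    semaphore = build_semaphore(start_timestamp)
    return [
        f"alter table {done_table} add if not exists partition ({spec});",
        f"insert into table {done_table} partition ({spec}) select '{semaphore}';",
    ]


for statement in done_table_statements("mytable", "ds", "20220114", 1642003200):
    print(statement)
```

With the arguments shown, the printed statements match the ones in Step 3. Running them in MaxCompute remains a separate, manual step in this sketch.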
Modify a MaxCompute data source
On the Data Source page, find the data source and click Modify in the Actions column.
In the Modify Data Source panel, update the parameters you want to change: Project, AccessKey, AccessKey Secret, Table, or Partition Key.
The data source name cannot be changed.
Click Verify. After the modified configuration passes verification, click OK.
Update configurations and trigger reindexing to apply the changes to online clusters. For details, see Update configurations.
Delete a MaxCompute data source
On the Data Source page, find the data source and click Delete in the Actions column.
The system checks whether the data source is referenced by an index table:
Not referenced: Click OK to delete. Then update configurations and rebuild indexes.
Referenced: The system returns an error. Delete the referencing index table first, then delete the data source. For details, see the Delete an index table section of the index schema topic.
Limitations
The MaxCompute table must be a partitioned internal table. External tables are not supported.
Data source names cannot be changed after creation.
Supported field data types: STRING, BOOLEAN, DOUBLE, BIGINT, and DATETIME.
Full indexing uses MaxCompute table data directly. For real-time incremental updates, use an API data source alongside the MaxCompute data source.
Next steps
Index schema: Configure an index table for the data source.
Update configurations: Push configuration changes to online clusters and trigger reindexing.