Configure a MaxCompute data source

MaxCompute is an open computing platform. If you want to import data generated by MaxCompute to OpenSearch Industry Algorithm Edition, you can connect a MaxCompute data source to your application in OpenSearch Industry Algorithm Edition. After reindexing is triggered in the application, OpenSearch automatically obtains full data from tables in the MaxCompute data source. To obtain incremental data from the MaxCompute data source, you must use the APIs or SDKs of OpenSearch.

Configure the AccessKey pair for your Alibaba Cloud account

After you configure a MaxCompute data source in OpenSearch Industry Algorithm Edition, OpenSearch Industry Algorithm Edition downloads data from MaxCompute tables based on the AccessKey ID and AccessKey secret that you enter. Therefore, before you configure a MaxCompute data source, you must specify the AccessKey ID and AccessKey secret of the account.

Note

Make sure that the MaxCompute project is created within the Alibaba Cloud account that you use to log on to the OpenSearch console.

You can use the AccessKey pair of your Alibaba Cloud account to access tables in MaxCompute projects that are created by using your Alibaba Cloud account.
To mitigate security risks, you can also use the AccessKey pair of a RAM user. To create a RAM user and grant permissions to the RAM user, perform the following steps:

Create a RAM user under your Alibaba Cloud account. For more information, see Create a RAM user.
Log on to the MaxCompute console and add the RAM user as a member. For more information, see Add a workspace member and configure roles.
After you add the member, you can run the list users; command in the MaxCompute query editor to view the name of the RAM user. For more information, see Query editor in the MaxCompute console.

Copy the name of the RAM user and run the following commands to grant permissions to the RAM user. xxx indicates the copied name.

-- 1. Grant the CreateInstance and List permissions on the project.
grant CreateInstance,List on project zy_ts_test to user xxx;

-- 2. Grant the SELECT, DESCRIBE, and DOWNLOAD permissions on MaxCompute tables.
GRANT select,describe,download ON TABLE people_info TO USER xxx;

-- 3. (Optional) Grant label-based permissions on MaxCompute tables.
set label 2 to USER xxx;

-- Query the permissions of a specified user and information about the role that is assigned to the user.
show grants for xxx;
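
If several RAM users need the same access, the statements above can be generated instead of typed by hand. The following Python sketch only builds the statement strings for you to paste into the query editor. The function name build_grant_statements is hypothetical and is not part of any MaxCompute SDK. The project name, table name, and label are the example values used above.

```python
from typing import List, Optional


def build_grant_statements(project: str, table: str, user: str,
                           label: Optional[int] = None) -> List[str]:
    """Build the authorization statements shown above for one RAM user."""
    statements = [
        # 1. Project-level permissions.
        f"grant CreateInstance,List on project {project} to user {user};",
        # 2. Table-level permissions.
        f"grant select,describe,download on table {table} to user {user};",
    ]
    # 3. Optional label-based permission.
    if label is not None:
        statements.append(f"set label {label} to user {user};")
    # Final check of what the user has been granted.
    statements.append(f"show grants for {user};")
    return statements


if __name__ == "__main__":
    for statement in build_grant_statements("zy_ts_test", "people_info", "xxx", label=2):
        print(statement)
```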

After you create a RAM user and grant permissions to the RAM user, you can configure a MaxCompute data source in the OpenSearch console.

On the Configure Application page, click Use Data Source in the Application Schema Creation Method section.

Under Select Data Source, select MaxCompute.

Click Connect to Database and configure the Project Name, AccessKey ID, and AccessKey Secret parameters.

Click Connect. Then, select a data table.

The system automatically maps corresponding fields. You can fine-tune the fields based on your business requirements. Click Next.

Important

When you configure the application schema, you must create a primary table and specify a unique primary key field for each table.

Configure the index schema. You can select an appropriate analyzer based on your search requirements. For more information, see Index schema. Then, click Next.

Configure a data source. In this step, you can configure field mappings, partition information, and concurrency control for data synchronization.

5.1. Configure field mappings: Click Edit in the Actions column. OpenSearch Industry Algorithm Edition provides multiple data processing plug-ins for MaxCompute data. To use a plug-in, click the plus sign (+) in the Content Conversion column when you configure a field mapping. The source field is then converted before it is synchronized to OpenSearch Industry Algorithm Edition. If the plug-in does not work due to errors such as configuration errors or connection failures, the source field is synchronized to the destination field without conversion.

Configure the plug-in:

Important

The following types of MaxCompute data are supported: BIGINT, DOUBLE, BOOLEAN, DATETIME, STRING, and DECIMAL.
The system automatically converts data of the DATETIME type in MaxCompute tables to values in milliseconds. You must set the data type of the corresponding OpenSearch fields to INT.
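
To illustrate the DATETIME conversion described above, the following Python sketch converts a datetime value to an integer number of milliseconds. It assumes that the converted value is a Unix epoch timestamp and that values without a time zone are in UTC. The source states neither, so check the result against your own data. The helper name datetime_to_millis is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def datetime_to_millis(value: datetime) -> int:
    """Convert a datetime to milliseconds since the Unix epoch.

    Assumption: naive datetimes are interpreted as UTC.
    """
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (value - EPOCH) // timedelta(milliseconds=1)


if __name__ == "__main__":
    print(datetime_to_millis(datetime(2015, 5, 10)))
```

A value of this size does not fit in a 32-bit integer, so it is worth confirming that the INT field type in your application can hold it.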

5.2. Configure the partition information: OpenSearch Industry Algorithm Edition allows you to specify the partitions whose data you want to import. Regular expressions are supported. For example, the regular expression in the following figure specifies that you want to import the data of the previous day. You can click Reindex on the Instance Details page to create a scheduled reindexing task. This way, the data of a new partition can be imported every day.

Regular expression: Equal signs (=), commas (,), semicolons (;), and double vertical bars (||) are reserved characters of the system and cannot be contained in the names or values of partition fields. For example, ds=%Y%m%d || -1 days instructs the system to automatically import the full data of the previous day from the specified partition.

Note

ds specifies the name of the partition field. No invisible characters such as spaces are allowed on either side of the equal sign (=).

The following section describes how to configure partition conditions of MaxCompute:

1: You can separate multiple partition filter rules with semicolons (;). For example, pt=1;pt=2 matches all partitions that meet the partition filter rule pt=1 or pt=2.
2: You can separate multiple partition fields with commas (,) in a partition filter rule. For example, pt1=1,pt2=2,pt3=3 matches all partitions that meet all of the conditions pt1=1, pt2=2, and pt3=3. Time functions such as %Y%m%d || -1 days can be used only in a filter that contains a single partition field.

Example: The pt partitions in a MaxCompute table contain ds child partitions, as shown in the preceding figure.

Specify multiple partitions: pt=1;pt=2 instructs the system to synchronize all data in pt=1 and pt=2 partitions.
Set multiple partition fields: pt=1,ds=1 instructs the system to synchronize the data in the ds=1 child partition of the pt=1 partition.
Unsupported filters: pt=1,ds=%Y%m%d || -1 days and pt=1;pt=%Y%m%d || -1 days are not supported.
3: The value of a partition field can be an asterisk (*), which indicates that the partition field can have any value. In this case, this field is optional in the filter rule.
4: The value of a partition field can contain a regular expression. For example, pt=[0-9]* matches all partitions whose pt value is a number.
5: The value of a partition field supports time matching. The filter rule is in the following format: pt=Partition field value that contains formatted time || Expression that indicates a time interval. For example, ds=%Y%m%d || -1 days indicates that the partition field is ds, the formatted time is in the yyyymmdd format such as 20150510, and the data of the previous day needs to be imported.
5.1 Formatted time parameters use standard time format specifiers, such as %Y, %m, and %d.
5.2 The time interval expression can be in the following format: +/- n week|weeks|day|days|hour|hours|minute|minutes|second|seconds|microsecond|microseconds. The plus sign (+) indicates n weeks, days, hours, minutes, seconds, or microseconds after the time at which a scheduled reindexing task is created. The minus sign (-) indicates n weeks, days, hours, minutes, seconds, or microseconds before the time at which a scheduled reindexing task is created.
5.3 By default, the system converts time parameters in all filter rules by using the +0 days condition, even if a rule does not contain a time interval expression. Therefore, a field value that is used for filtering cannot contain a time parameter such as %a or %Y as a literal string. For example, for tasks that are created on Wednesday, pt=%abc matches the partitions whose pt value is Wedbc instead of %abc, because %a is converted to the abbreviated name of the weekday.
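
The matching semantics of rules 1 to 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser that OpenSearch uses: it assumes that a regular expression must match the entire partition value, it represents a partition as a dictionary of field names to values, and it does not handle time expressions that contain ||. The function names parse_filter and partition_matches are hypothetical.

```python
import re
from typing import Dict, List, Optional


def parse_filter(expr: str) -> List[Dict[str, str]]:
    """Split 'pt=1,ds=1;pt=2' into [{'pt': '1', 'ds': '1'}, {'pt': '2'}]."""
    rules = []
    for rule in expr.split(";"):          # semicolons separate rules (OR)
        conditions = {}
        for condition in rule.split(","):  # commas separate fields (AND)
            name, _, pattern = condition.partition("=")
            conditions[name] = pattern
        rules.append(conditions)
    return rules


def _value_matches(value: Optional[str], pattern: str) -> bool:
    if pattern == "*":                     # asterisk: any value is accepted
        return True
    if value is None:                      # field missing from the partition
        return False
    # Assumption: the pattern must match the whole value.
    return re.fullmatch(pattern, value) is not None


def partition_matches(partition: Dict[str, str], expr: str) -> bool:
    """Return True if the partition satisfies at least one rule in expr."""
    return any(
        all(_value_matches(partition.get(name), pattern)
            for name, pattern in rule.items())
        for rule in parse_filter(expr)
    )


if __name__ == "__main__":
    print(partition_matches({"pt": "1", "ds": "1"}, "pt=1;pt=2"))
    print(partition_matches({"pt": "1", "ds": "2"}, "pt=1,ds=1"))
```

Because commas and semicolons are reserved characters, the sketch splits on them before it evaluates any regular expression.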

The following list describes the time parameters that can be contained in partition filter rules:

%d: the sequence number of the day in the month. Valid values: [01, 31].
%H: the hour in a 24-hour system. Valid values: [00, 23].
%m: the sequence number of the month in the year. Valid values: [01, 12].
%M: the minute. Valid values: [00, 59].
%S: the second. Valid values: [00, 61].
%y: the year represented by two digits. Valid values: [00, 99].
%Y: the year represented by four digits.
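
To preview which partition a time rule resolves to, you can reproduce the conversion locally. The following Python sketch relies on the fact that the parameters above and the interval units coincide with Python's strftime specifiers and timedelta arguments. Whether OpenSearch evaluates rules in exactly this way is an assumption, and the function name resolve_time_rule is hypothetical. The sketch handles a single partition field only, in line with the restriction on time functions.

```python
import re
from datetime import datetime, timedelta

# Units accepted by the interval expression, in timedelta keyword form.
_ALLOWED_UNITS = {"weeks", "days", "hours", "minutes", "seconds", "microseconds"}
_INTERVAL = re.compile(r"\s*([+-])\s*(\d+)\s+([a-z]+)\s*")


def resolve_time_rule(rule: str, task_time: datetime) -> str:
    """Resolve a rule such as 'ds=%Y%m%d || -1 days' to 'ds=20150510'."""
    field, _, rest = rule.partition("=")
    fmt, separator, interval = rest.partition("||")
    offset = timedelta(0)                  # default condition: +0 days
    if separator:
        match = _INTERVAL.fullmatch(interval)
        if match is None:
            raise ValueError(f"invalid time interval: {interval!r}")
        sign, amount, unit = match.groups()
        if not unit.endswith("s"):         # accept 'day' as well as 'days'
            unit += "s"
        if unit not in _ALLOWED_UNITS:
            raise ValueError(f"unsupported unit: {unit!r}")
        offset = timedelta(**{unit: int(amount)})
        if sign == "-":
            offset = -offset
    return f"{field}={(task_time + offset).strftime(fmt.strip())}"


if __name__ == "__main__":
    created = datetime(2015, 5, 11, 10, 30)
    print(resolve_time_rule("ds=%Y%m%d || -1 days", created))
```

The same function shows why literal percent sequences are unsafe: a rule without an interval is still passed through the time formatter.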

5.3. Configure concurrency control for data synchronization:

If you select Use DONE File, you can upload a DONE file to control when OpenSearch pulls full data. This ensures data integrity. Before OpenSearch pulls full data from MaxCompute, OpenSearch checks whether the DONE file of the current day exists. If the file does not exist, OpenSearch waits for the DONE file to appear. The default timeout period is 1 hour.

Download the installation package of the MaxCompute client from the official MaxCompute website. The file name of the package is odps_clt_release_64.tar.gz.
Make sure that you have the CreateResource permission on the required MaxCompute project.
After you install the MaxCompute client, run the following command on the client to upload the DONE file. The DONE file is named in the $prefix_%Y-%m-%d format. By default, $prefix is the table name. %Y-%m-%d specifies the date of a scheduled reindexing task. The minimum interval between scheduled reindexing tasks is one day.
```
odpscmd -u accessid -p accesskey --project=<prj_name> -e "add file <done file>;"
```
For more information about how to use the MaxCompute client, see MaxCompute client (odpscmd).
The content of a DONE file is in the JSON format. A DONE file needs to contain only one field: the timestamp, in milliseconds, of the current full data.
The timestamp indicates the point in time from which incremental data is pulled. If you do not specify the timestamp, incremental data is pulled from the start time of the scheduled reindexing task. OpenSearch retains incremental data only for the most recent three days. Therefore, the point in time that is specified by the timestamp must be within the last three days.
For example, full data is generated at 09:00 on the current day, MaxCompute processes the full data at 10:00, and the scheduled reindexing task in OpenSearch starts at 10:30. After MaxCompute processes the full data, the incremental data after 09:00 on the current day is appended. You must specify the timestamp that corresponds to 09:00 on the current day in milliseconds in the DONE file to ensure data integrity. Otherwise, the system appends only the incremental data that is generated after 10:30, the default start time of the scheduled reindexing task. The incremental data that is generated from 09:00 to 10:30 is lost. Proceed with caution. If no incremental data is generated, you do not need to specify the timestamp.
The following sample code shows an example of the content of a DONE file for an advanced application. The timestamp in the DONE file is used to append incremental data. You can use a similar method to specify the timestamp in DONE files for standard applications.

{
"timestamp":"1234567890000"
}
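
The file name and content described above can be produced with a short script. The following Python sketch is a minimal illustration: it assumes that the timestamp is a Unix epoch value in milliseconds written as a string, as in the sample, and that you pass timezone-aware datetimes. The function names done_file_name and done_file_content are hypothetical, and the three-day check only mirrors the retention rule stated above.

```python
import json
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
RETENTION = timedelta(days=3)  # incremental data is kept for three days


def done_file_name(prefix: str, task_date: date) -> str:
    """Build a name in the $prefix_%Y-%m-%d format."""
    return f"{prefix}_{task_date.strftime('%Y-%m-%d')}"


def done_file_content(full_data_time: datetime, now: datetime) -> str:
    """Return the JSON body for the time at which the full data was generated."""
    if full_data_time.tzinfo is None or now.tzinfo is None:
        raise ValueError("pass timezone-aware datetimes")
    if not (now - RETENTION <= full_data_time <= now):
        raise ValueError("timestamp must fall within the last three days")
    millis = (full_data_time - EPOCH) // timedelta(milliseconds=1)
    return json.dumps({"timestamp": str(millis)})


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    generated_at = now - timedelta(hours=1, minutes=30)  # e.g. 09:00 for a 10:30 task
    print(done_file_name("people_info", now.date()))
    print(done_file_content(generated_at, now))
```

Write the returned string to a file with the generated name, and then upload the file by using the add file command shown earlier.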

Priorities of a DONE file and the data time:

The data time of MaxCompute data sources is required and takes precedence over a DONE file.
If you create only one version for an application, you need to specify only the data time. In this case, you cannot use a DONE file alone.
If you need to use a scheduled reindexing task, you must specify both the data time and a DONE file. The data time takes precedence over the DONE file for the first version. The DONE file takes precedence over the data time for subsequent versions.

Additional considerations

Important

MaxCompute data sources support only full synchronization and do not support incremental synchronization.