Configure a MaxCompute data source

Last Updated: Sep 09, 2021

MaxCompute is an open computing platform. If the data that you want to import to OpenSearch is generated by MaxCompute, you can connect a MaxCompute data source to an application. After reindexing is triggered in the application, OpenSearch automatically obtains full data from the tables in the MaxCompute data source. MaxCompute data sources support only full synchronization. To import incremental data, you must use the API or SDKs of OpenSearch.

Step 1: Configure a MaxCompute data source

OpenSearch allows you to pull data from MaxCompute projects that belong to the current account or that the current account is authorized to access. Set the Select Data Source parameter to MaxCompute. Select an authorized MaxCompute project and enter the required information about the project to test the connectivity. After a MaxCompute project is connected, OpenSearch caches the project. You can then click the name of the project to use it as the data source without reconnecting.

If the error message "You are not authorized to access MaxCompute" is reported, ask the owner of the MaxCompute project to use the MaxCompute client to grant you the List permission on the project and the All permission on the tables in the project. The project owner must complete the authorization by using an Alibaba Cloud account for which the OpenSearch service is activated.

Note: RAM users cannot be authorized to access MaxCompute. A RAM user that is authorized to access a MaxCompute project of its Alibaba Cloud account may still fail to access other Alibaba Cloud resources that belong to that account. As a result, the RAM user may fail to reference the project as the data source to connect to OpenSearch. We recommend that you use an Alibaba Cloud account to connect to MaxCompute, and then manage the application as a RAM user.

If the connectivity test fails, check whether the authorization is complete or has been changed. An error is also reported if permissions on fields in the MaxCompute tables are not granted or are invalid.

Configure field mappings: OpenSearch provides multiple data source plug-ins for MaxCompute data. If you need to use a plug-in, click the plus sign (+) in the Content Conversion column when you configure a field mapping. This way, the source field is converted before it is synchronized to OpenSearch. If the plug-in does not work due to errors such as configuration errors or connection failures, the source field is synchronized to the destination field without conversion.

Note:

  • The following types of MaxCompute data are supported: BIGINT, DOUBLE, BOOLEAN, DATETIME, STRING, DECIMAL, MAP, and ARRAY.

  • The system automatically converts data of the DATETIME type in MaxCompute tables to timestamps in milliseconds. You must set the type of the corresponding OpenSearch fields to INT.

Step 2: Configure partition information

2.1 OpenSearch allows you to specify the partitions whose data you want to import based on the characteristics of MaxCompute data. Regular expressions are supported. For example, the rule pt=%Y%m%d || -1 days imports the data of the previous day. You can click Reindex on the Application Details page to create a scheduled reindexing task so that new partition data is imported every day.

2.2 Reserved characters. Equal signs (=), commas (,), semicolons (;), and double vertical bars (||) are reserved by the system and cannot be contained in the names or values of partition fields. For example, pt=%Y%m%d || -1 days automatically imports the full data of the partition for the previous day.

The following part describes how to configure partition conditions of MaxCompute:

  • 1. You can specify multiple partition filter rules by separating them with semicolons (;). For example, pt=1;pt=2 matches all partitions that meet the partition filter rule pt=1 or pt=2.

  • 2. You can set multiple partition fields for a partition filter rule by separating them with commas (,). For example, pt1=1,pt2=2,pt3=3 matches all partitions that meet all of the conditions pt1=1, pt2=2, and pt3=3. Time expressions such as %Y%m%d || -1 days can be used only in a rule that contains a single partition field.

Example: The pt partitions in a MaxCompute table contain ds child partitions.

  • Specify multiple partitions: pt=1;pt=2 specifies to synchronize all data in pt=1 and pt=2 partitions.

  • Set multiple partition fields: pt=1,ds=1 specifies to synchronize the data in the ds=1 child partition of the pt=1 partition.

  • Time expressions cannot be combined with other partition fields or rules: neither pt=1,ds=%Y%m%d || -1 days nor pt=1;pt=%Y%m%d || -1 days is supported.

  • 3. The value of a partition field can be an asterisk (*), which indicates that the partition field can have any value. In this case, the field is optional in the filter rule.

  • 4. The value of a partition field can contain a regular expression. For example, pt=[0-9]* matches all partitions whose pt value is a number.

  • 5. The value of a partition field supports time matching. The filter rule is in the following format: pt=Partition field value that contains formatted time || Time interval expression. For example, ds=%Y%m%d || -1 days indicates that the partition field is ds, the formatted time is in the same format as 20150510, and the data of the previous day is imported.

  • 5.1 Formatted time parameters can be standard time format parameters.

  • 5.2 The time interval expression is in the following format: +/- n week|weeks|day|days|hour|hours|minute|minutes|second|seconds|microsecond|microseconds. The plus sign (+) indicates n weeks, days, hours, minutes, seconds, or microseconds after a scheduled reindexing task is created. The minus sign (-) indicates n weeks, days, hours, minutes, seconds, or microseconds before a scheduled reindexing task is created.

  • 5.3 By default, the system converts time parameters in all filter rules as if +0 days were specified. Therefore, the field values that are used for filtering cannot contain time format parameters as literal strings. For example, for a task that is created on a Wednesday, pt=%abc matches the partitions whose pt value is Wedbc, not the partitions whose pt value is %abc.
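
The rules for semicolons, commas, asterisks, and regular expressions described above can be illustrated with a small Python sketch. This is not the OpenSearch implementation. The function names are hypothetical, and the sketch assumes that a regular expression must match the whole partition value, which the text above does not state explicitly.

```python
import re


def parse_filter(expr):
    """Split a filter expression into rules; each rule maps field -> pattern."""
    rules = []
    for rule_text in expr.split(";"):      # semicolons separate alternative rules
        rule = {}
        for pair in rule_text.split(","):  # commas separate fields within one rule
            field, _, pattern = pair.partition("=")
            rule[field.strip()] = pattern.strip()
        rules.append(rule)
    return rules


def rule_matches(rule, partition):
    """Return True if every field condition of the rule holds for the partition."""
    for field, pattern in rule.items():
        if pattern == "*":                 # asterisk: any value is accepted
            continue
        value = partition.get(field)
        # Assumption: the pattern must match the entire value.
        if value is None or re.fullmatch(pattern, value) is None:
            return False
    return True


def partition_matches(expr, partition):
    """Return True if the partition meets at least one rule of the expression."""
    return any(rule_matches(rule, partition) for rule in parse_filter(expr))
```

For instance, partition_matches("pt=1;pt=2", {"pt": "2", "ds": "1"}) is true because the second rule matches, whereas partition_matches("pt=1,ds=1", {"pt": "1", "ds": "2"}) is false because both conditions of a comma-separated rule must hold.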

The following list describes all time format parameters that can be used in the values of partition fields:

%a: the shortened name of the day in the week, such as Wed for Wednesday.  
%A: the full name of the day in the week, such as Wednesday.  
%b: the shortened name of the month, such as Apr for April.  
%B: the full name of the month, such as April.   
%c:  the string that indicates the datetime, such as 04/07/10 10:43:39.  
%d:  the sequence number of the day in the month.  
%f:  the microsecond. Valid values: [0, 999999].  
%H:  the hour in 24-hour clock time. Valid values: [0, 23].  
%I:  the hour in 12-hour clock time. Valid values: [01, 12].  
%j:  the sequence number of the day in the year. Valid values: [001, 366].  
%m: the sequence number of the month in the year. Valid values: [01, 12].  
%M:  the minute. Valid values: [00, 59].  
%p:  AM or PM.  
%S:  the second. Valid values: [00, 61]. For more information, see the Python manual.  
%U:  the sequence number of the week in the year, where Sunday is the first day of the week.  
%w:  the sequence number of the day (today) in the week. Valid values: [0, 6]. A value of 0 represents Sunday.  
%W:  the sequence number of the week in the year, where Monday is the first day of the week.  
%x:  the string that indicates the date, such as 04/07/10.  
%X:  the string that indicates the time, such as 10:43:39.  
%y:  the year that is represented by two digits.  
%Y:  the year that is represented by four digits.  
%z:  the time difference from the UTC + 0 time zone. If the server time in the UTC + 0 time zone is used, an empty string is returned.
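
As an illustration of how a time matching rule is resolved, the following Python sketch expands the formatted time and applies the time interval expression. It is a sketch under stated assumptions rather than the logic OpenSearch actually runs. The function name resolve_time_rule is hypothetical, and the sketch assumes that the parameters above behave like the format codes of Python's datetime.strftime.

```python
import re
from datetime import datetime, timedelta

# Matches interval expressions such as "-1 days" or "+ 2 hours".
_INTERVAL = re.compile(
    r"^([+-])\s*(\d+)\s*(week|day|hour|minute|second|microsecond)s?$"
)


def resolve_time_rule(rule, now):
    """Resolve 'field=<formatted time> || <interval>' to 'field=<value>'."""
    left, sep, interval = rule.partition("||")
    field, _, fmt = left.strip().partition("=")
    offset = timedelta(0)  # a rule without an interval behaves like "+0 days"
    if sep:
        match = _INTERVAL.match(interval.strip())
        if match is None:
            raise ValueError(f"unsupported interval: {interval!r}")
        sign, number, unit = match.groups()
        amount = int(number) if sign == "+" else -int(number)
        # timedelta accepts weeks, days, hours, minutes, seconds, microseconds.
        offset = timedelta(**{unit + "s": amount})
    return f"{field}={(now + offset).strftime(fmt)}"


# A task created on 2015-05-11 imports the partition of the previous day.
print(resolve_time_rule("ds=%Y%m%d || -1 days", datetime(2015, 5, 11, 10, 30)))
# prints ds=20150510
```

The same function shows why literal percent sequences are unsafe in partition values: under the default C locale, resolving pt=%abc for a Wednesday yields pt=Wedbc, because %a is expanded even though no interval is given.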

Step 3: Configure concurrency control for data synchronization

If you select Use DONE File, you can upload a DONE file to control the timing for OpenSearch to pull full data. This ensures the integrity of the full data. Before OpenSearch pulls full data from MaxCompute, OpenSearch checks whether the DONE file of the current day exists. If the file does not exist, OpenSearch waits for the DONE file. The default timeout period is 1 hour.

  • You must download the installation package of the MaxCompute client from the official website of MaxCompute.

  • You must have the CreateResource permission on the required project.

  • After you install the MaxCompute client, run the following command on your MaxCompute instance. The DONE file is named in the $prefix_%Y-%m-%d format. By default, $prefix is the table name. %Y-%m-%d specifies the date of a scheduled reindexing task. The minimum interval for scheduled reindexing tasks is one day.

    odpscmd -u accessid -p accesskey --project=<prj_name> -e "add file <done file>;"

  • The content of DONE files is in the JSON format. A DONE file needs to contain only the timestamp, in milliseconds, of the current full data. The system retains only the incremental data of the last three days. Therefore, the point in time that is specified by the timestamp must be within the last three days.

  • The timestamp in a DONE file indicates the point in time from which incremental data is pulled. If you do not specify the timestamp, incremental data is appended from the start time of the scheduled reindexing task.

  • For example, full data is generated at 09:00 on the current day, MaxCompute processes the full data at 10:00, and the scheduled reindexing task in OpenSearch starts at 10:30. After the full data is imported, the incremental data generated after 09:00 on the current day must be appended. To ensure data integrity, you must specify the timestamp that corresponds to 09:00 on the current day, in milliseconds, in the DONE file. Otherwise, only the incremental data generated after 10:30, the default start time of the scheduled reindexing task, is appended, and the incremental data from 09:00 to 10:30 is lost. Proceed with caution. If no incremental data is generated, you do not need to specify the timestamp.

  • The following sample code shows an example of the content of a DONE file for an advanced application. The timestamp in the DONE file is used to append incremental data. You can use a similar way to specify the timestamp in DONE files for standard applications.

{
"timestamp":"1234567890000"
}
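
The naming and content rules for DONE files can be sketched in Python as follows. The helper names are hypothetical. The sketch assumes time zone-aware datetime values and writes the timestamp as a string, as in the sample above. It only prepares the file name and content; uploading the file is still done with the MaxCompute client.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def done_file_name(prefix, task_date):
    """Build the DONE file name in the $prefix_%Y-%m-%d format."""
    return f"{prefix}_{task_date.strftime('%Y-%m-%d')}"


def done_file_content(data_time, now):
    """Return the JSON body that holds the full-data timestamp in milliseconds."""
    # Incremental data is retained for three days only, so reject older times.
    if not timedelta(0) <= now - data_time <= timedelta(days=3):
        raise ValueError("timestamp must fall within the last three days")
    millis = (data_time - EPOCH) // timedelta(milliseconds=1)
    return json.dumps({"timestamp": str(millis)})


# Full data generated at 09:00 UTC, reindexing task scheduled at 10:30 UTC.
generated = datetime(2021, 9, 9, 9, 0, tzinfo=timezone.utc)
task_start = datetime(2021, 9, 9, 10, 30, tzinfo=timezone.utc)
print(done_file_name("my_table", task_start))    # my_table is a placeholder
print(done_file_content(generated, task_start))
```

Passing the generation time of the full data (09:00 in this example) rather than the task start time is what keeps the incremental data between 09:00 and 10:30 from being lost.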

Priority of a DONE file and data time:

  1. The data time of MaxCompute data sources is required and takes precedence over a DONE file.

  2. If you create only one version for an application, you need to specify only the data time. In this case, you cannot use a DONE file alone.

  3. If you need to use a scheduled reindexing task, you must specify both the data time and a DONE file. The data time takes precedence over the DONE file for the first version. The DONE file takes precedence over the data time for subsequent versions.