OpenSearch:OSS + API data source

Last Updated: Dec 20, 2024

This topic describes how to use Object Storage Service (OSS) + API as the data source when you add a table.

To add an OSS + API data source, perform the following two steps:

  1. Activate OSS in the region where your instance is located.

  2. Follow the instructions in Add an OSS data source for configuration.

Activate OSS

  1. Activate OSS.

    Note
    • Make sure that the OSS service is activated in the same region as the OpenSearch service you have purchased.

    • During OSS activation, you need to add the opensearch:opensearch tag.

    • Retrieval Engine Edition does not support OSS buckets that lack regional attributes.

    • When you add an OSS data source, the system automatically creates a service-linked role named AliyunServiceRoleForSearchEngine if the role does not exist. OpenSearch assumes this role to access your resources in other services to implement related features.

  2. Create a bucket in the OSS console.

  3. Upload objects to the bucket.

  4. Add the opensearch:opensearch tag to the bucket.

Add an OSS data source

  1. Choose Instance Details > Table Management. On the page that appears, click Add Table.

  2. On the Table Management page, configure basic information, such as the table name, number of shards, and number of data update resources.

    Parameters:

    • Table Name: Specify a custom name.

    • Number of Data Shards: Enter a positive integer that is less than or equal to 256. The integer should be no more than three times the number of instance data nodes.

    • Number of Data Update Resources: the number of resources allocated for data updates. By default, two free update resources with 4 vCPUs and 8 GiB of memory are provided for each index. You are charged for data update resources that exceed the free quota. For more information, see the billing documentation for Retrieval Engine Edition on the international site.

  3. Configure a data source for data synchronization. After the data source passes verification, click Next.

    • Full Data Source: Choose Object Storage Service (OSS) + API.

    • OSS Path: the path to the OSS objects. The path must start with a forward slash (/) and cannot contain question marks (?), equal signs (=), or ampersands (&). Objects cannot be placed in the root directory of the bucket. Place them in a folder and specify the path of that folder.

    • OSS Bucket: the name of the OSS bucket.

    • Data Format: Select either the HA3 format or JSON format.

    • Data Source Verification: Proceed to the next step after verification is passed.

Note
  • The directory name must include opensearch, or it must have the opensearch:opensearch tag. Otherwise, data cannot be read. The name cannot contain special characters such as question marks (?), equal signs (=), or ampersands (&).

  • OSS Path Source: After accessing the created bucket, create a new directory and select its path. In this example, the path /opensearch_index_data/ is used.

  • Bucket: the name of the OSS bucket. The name is the same as that displayed on the Buckets page of the OSS console.
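The limits stated for the shard count and the OSS path can be checked locally before you fill in the console form. The following is a minimal sketch, not part of OpenSearch: the function names, the `has_opensearch_tag` flag, and the wording of the messages are hypothetical, and the sketch treats the "three times the number of data nodes" guideline as a hard limit.

```python
from typing import List

MAX_SHARDS = 256         # upper bound on the number of data shards
FORBIDDEN_CHARS = "?=&"  # characters that the OSS path must not contain


def check_shard_count(shards: int, data_nodes: int) -> List[str]:
    """Return the problems found with the requested shard count."""
    problems = []
    if not 1 <= shards <= MAX_SHARDS:
        problems.append(f"shard count must be between 1 and {MAX_SHARDS}")
    if shards > 3 * data_nodes:
        problems.append("shard count exceeds three times the number of data nodes")
    return problems


def check_oss_path(path: str, has_opensearch_tag: bool = False) -> List[str]:
    """Return the problems found with an OSS path.

    has_opensearch_tag is a hypothetical flag: the caller sets it to True
    when the opensearch:opensearch tag has already been added in OSS.
    """
    problems = []
    if not path.startswith("/"):
        problems.append("path must start with a forward slash (/)")
    found = sorted(set(path) & set(FORBIDDEN_CHARS))
    if found:
        problems.append("path contains forbidden characters: " + " ".join(found))
    if not path.strip("/"):
        problems.append("path must point to a folder, not the root directory")
    if "opensearch" not in path and not has_opensearch_tag:
        problems.append("directory name lacks 'opensearch' and no tag is set")
    return problems


if __name__ == "__main__":
    print(check_shard_count(6, data_nodes=2))
    print(check_oss_path("/opensearch_index_data/"))
```

Both calls in the usage example return an empty list, because six shards on two data nodes and the path /opensearch_index_data/ satisfy every rule listed above.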

  4. Configure the fields. After the configuration is complete, click Next.

    In this example, two fields, pk and namespace, are configured. For more information about the sample data, see oss_test.txt.

CMD=add
pk=999000
namespace=0.00.0039257140.0098142860.0039257140.00
pk=999000
namespace=0.00.0039257140

For more information about the content of an object in OSS, see the Object format section.

  5. Configure the index schema. After the configuration is complete, click Next.

  6. Click Confirm Creation. The system then automatically generates the configured table.

  7. View the table creation progress in the change history.

  8. After the table becomes available, run query tests on the query test page.

Object format

An object serves as a data source for index creation. The object must be encoded in UTF-8 format. Currently, HA3 and JSON formats are supported.

HA3 format

  • The following code shows the content of a complete data file named standard_sample.data:

CMD=add^_
PK=12345321^_
url=http://www.aliyun.com/index.html^_
title=Alibaba Cloud Computing Co., Ltd.^_
body=xxxxxx xxx^_
time=3123423421^_
multi_value_field=1234^]324^]342^_
bidwords=mp3^\price=35.8^Ptime=13867236221^]mp4^\price=32.8^Ptime=13867236221^_
^^
CMD=delete^_
PK=12345321^_

The data file above includes two commands: add and delete. Each command is composed of multiple lines, with each line representing a key-value pair. Commands are separated by '^^\n', key-value pairs are separated by '^_\n', and multiple values are separated by '^]'. The following table describes the delimiters.

  • Delimiters

| C++ encoding | ASCII (hexadecimal) | Description             | Display form in emacs/vi     | Input method in emacs | Input method in vi |
|--------------|---------------------|-------------------------|------------------------------|-----------------------|--------------------|
| "\x1F\n"     | 1F0A                | Key-value delimiter     | ^_ (followed by a line feed) | C-q C-7               | C-v C-7            |
| "\x1E\n"     | 1E0A                | Command delimiter       | ^^ (followed by a line feed) | C-q C-6               | C-v C-6            |
| "\x1D"       | 1D                  | Multi-value delimiter   | ^]                           | C-q C-5               | C-v C-5            |
| "\x1C"       | 1C                  | Section weight flag     | ^\                           | C-q C-4               | C-v C-4            |
| "\x1D"       | 1D                  | Section delimiter       | ^]                           | C-q C-5               | C-v C-5            |
| "\x03"       | 03                  | Sub-doc field delimiter | ^C                           | C-q C-c               | C-v C-c            |
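To make the delimiter rules concrete, the following sketch splits HA3 text into commands and key-value pairs. It is an illustration based only on the delimiters described above, not the parser that OpenSearch uses. The function name `parse_ha3` is hypothetical, and the sketch does not interpret section weights or sub-doc fields.

```python
from typing import Dict, List, Union

KV_SEP = "\x1f\n"   # key-value delimiter, shown as ^_ plus a line feed
CMD_SEP = "\x1e\n"  # command delimiter, shown as ^^ plus a line feed
MULTI_SEP = "\x1d"  # multi-value delimiter, shown as ^]

Value = Union[str, List[str]]


def parse_ha3(text: str) -> List[Dict[str, Value]]:
    """Split HA3 text into one dict per command."""
    commands = []
    for block in text.split(CMD_SEP):
        fields: Dict[str, Value] = {}
        for pair in block.split(KV_SEP):
            if not pair:
                continue  # skip the empty piece after the last delimiter
            # Split on the first "=" only, so values such as URLs stay intact.
            key, _, value = pair.partition("=")
            fields[key] = value.split(MULTI_SEP) if MULTI_SEP in value else value
        if fields:
            commands.append(fields)
    return commands


if __name__ == "__main__":
    sample = (
        "CMD=add\x1f\nPK=12345321\x1f\n"
        "multi_value_field=1234\x1d324\x1d342\x1f\n\x1e\n"
        "CMD=delete\x1f\nPK=12345321\x1f\n"
    )
    for command in parse_ha3(sample):
        print(command)
```

Reading a file with `open(path, encoding="utf-8", newline="")` before calling the function keeps the line feeds unchanged on every platform, which matters because the line feed is part of two of the delimiters.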

  • Command format

  • Format of the add command: The add command is used to insert new content into the index. The first line of the add command must be CMD=add, followed by fields. The order of fields should match the schema, and all fields present must be declared in the schema.

CMD=add^_
PK=12345321^_
url=http://www.aliyun.com/index.html^_
title=Alibaba Cloud Computing Co., Ltd.^_
body=xxxxxx xxx^_
time=3123423421^_
multi_value_field=1234^]324^]342^_
bidwords=mp3^\price=35.8^Ptime=13867236221^]mp4^\price=32.8^Ptime=13867236221^_
^^
  • Format of the delete command: The delete command is used for removing specified content from the index. The first line of the delete command must be CMD=delete, followed by the primary key field as defined in the index schema and the field used for partition hashing. If these two fields are identical, only one needs to be listed.

CMD=delete^_
PK=12345321^_
^^
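The add and delete commands above can also be generated programmatically. The following sketch assembles a command from a field mapping by using the delimiters from the Delimiters table. The helper names `build_command` and `write_ha3` are hypothetical, and the sketch assumes that the caller passes the fields in schema order and that no value contains a delimiter character.

```python
from typing import Iterable, Mapping, Tuple

KV_SEP = "\x1f\n"   # key-value delimiter (^_ plus a line feed)
CMD_SEP = "\x1e\n"  # command delimiter (^^ plus a line feed)
MULTI_SEP = "\x1d"  # multi-value delimiter (^])


def build_command(cmd: str, fields: Mapping[str, object]) -> str:
    """Serialize one command; CMD always comes first."""
    lines = ["CMD=" + cmd]
    for name, value in fields.items():
        if isinstance(value, (list, tuple)):
            # Join the values of a multi-value field with ^].
            value = MULTI_SEP.join(str(item) for item in value)
        lines.append(f"{name}={value}")
    return "".join(line + KV_SEP for line in lines) + CMD_SEP


def write_ha3(path: str, commands: Iterable[Tuple[str, Mapping[str, object]]]) -> None:
    """Write commands to a UTF-8 file without translating line feeds."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for cmd, fields in commands:
            handle.write(build_command(cmd, fields))


if __name__ == "__main__":
    add = build_command("add", {"PK": "12345321", "multi_value_field": [1234, 324, 342]})
    delete = build_command("delete", {"PK": "12345321"})
    print(repr(add + delete))
```

For a delete command, pass only the primary key field and, if it differs, the field used for partition hashing.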

JSON format

The following example contains multiple records. Each record occupies exactly one line: records are separated by a line feed ('\n'), and a single record must not contain any line feeds. Sample code:

{"field_double": ["100.0", "221.123", "500.3333333"], "field_int32": ["100", "200", "300"], "title": "Huawei Mate 9 Kirin 960 chip Leica dual lens", "color": "Red", "empty_int32": "", "price": "3599", "CMD": "add", "nid": "1", "gather_cn_str": "", "desc": ["str1", "str2", "str3"], "brand": "Huawei", "size": "5.9","__subdocs__":[{"sub_pk":"100","sub_field1":"200","sub_field2":["100","200","300"]},{"sub_pk":"200","sub_field1":"200","sub_field2":["100","200","300"]}]}
{"field_double": ["100.0", "221.123", "500.3333333", "100.0", "221.123", "500.3333333"], "field_int32": ["100", "200", "300", "100", "200", "300"], "title": "Huawei/Huawei P10 Plus all-network phone", "color": "Blue", "empty_int32": "", "price": "4388", "CMD": "add", "nid": "2", "gather_cn_str": "color Blue", "desc": ["str1", "str2", "str3", "str1", "str2", "str3"], "brand": "Huawei", "size": "5.5","__subdocs__":[{"sub_pk":"100","sub_field1":"200","sub_field2":["100","200","300"]},{"sub_pk":"200","sub_field1":"200","sub_field2":["100","200","300"]}]}
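Records in this layout can be produced with a standard JSON library. The following sketch serializes one record per line and mirrors the sample above, in which every value, including numbers, is written as a string. Whether numeric JSON values are also accepted is not stated here, so the string conversion is an assumption, and the helper names are hypothetical.

```python
import json
from typing import Any, Mapping


def _stringify(value: Any) -> Any:
    """Convert scalars to strings and recurse into lists and dicts."""
    if isinstance(value, Mapping):
        return {key: _stringify(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [_stringify(item) for item in value]
    if value is None:
        return ""  # the sample writes empty fields as ""
    return str(value)


def to_record_line(cmd: str, fields: Mapping[str, Any]) -> str:
    """Serialize one record as a single line of JSON."""
    record = {"CMD": cmd}
    record.update(_stringify(fields))
    # json.dumps without indent emits no raw line feeds; a line feed
    # inside a string value is escaped as the two characters \n.
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    rows = [
        {"nid": 1, "price": 3599, "desc": ["str1", "str2"],
         "__subdocs__": [{"sub_pk": 100, "sub_field2": [100, 200]}]},
        {"nid": 2, "price": 4388, "empty_int32": None},
    ]
    print("\n".join(to_record_line("add", row) for row in rows))
```

Joining the lines with '\n' and writing the result to a UTF-8 encoded object yields a file in which each line is one complete record.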