OpenSearch: Document chunking

Last Updated: Oct 09, 2025

The AI search development workbench provides the document split service API, which you can integrate into your business workflow to improve retrieval and processing efficiency.

Service name: Document split service-001

Service ID: ops-document-split-001

Service description: This service offers a general text slicing strategy capable of splitting structured data in HTML, Markdown, and TXT formats. It leverages document paragraph formatting, text semantics, and predefined rules, and can extract code, images, and tables from rich text.

QPS limit for API calls (Alibaba Cloud account and RAM users): 2

Note

To apply for higher QPS, submit a ticket.

In Retrieval-Augmented Generation (RAG), it is common practice to convert the text to be retrieved into vectors and store them in a vector database for subsequent retrieval. The split service divides long documents into smaller chunks so that each segment fits within the input length limit of the text embedding model, which makes it possible to vectorize lengthy documents.

Basic usage

The input is a string of plain text plus additional configuration, and the output is the text split into segments, possibly including rich text. The API returns four lists: chunks, nodes, rich_texts, and sentences. To use the document split results for embedding, simply extract the content fields from the chunks and rich_texts lists, excluding images. Refer to the code template in the scenario center. The Python code is as follows:

# Extract the chunk results; only ["chunks"] and ["rich_texts"] (excluding images) are used here
doc_list = []
for chunk in document_split_result.body.result.chunks:
    doc_list.append({"id": chunk.meta.get("id"), "content": chunk.content})

for rich_text in document_split_result.body.result.rich_texts:
    if rich_text.meta.get("type") != "image":
        doc_list.append({"id": rich_text.meta.get("id"), "content": rich_text.content})

Advanced usage

The document split service can segment complex document content into chunks of a specified token length, forming a tree structure of multiple nodes. This structure can be used during the retrieval phase of RAG to enrich the context of recalled chunks and improve answer accuracy.

The service logic aims to split the text along the coarsest structures possible. If the resulting chunks do not meet the length requirement, the service recursively continues to split each chunk until all lengths are satisfactory. This recursive process generates a chunk tree, with each leaf node corresponding to an actual chunk result (a final node).
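
The sketch below illustrates this recursive logic in simplified form. It is not the service's actual implementation: it counts characters rather than tokens and skips the bookkeeping for node IDs and rich text, but it shows how splitting on progressively finer separators (the paragraph, newline, sentence, and clause separators listed later in this topic) yields leaf chunks that satisfy the length limit.

# Illustrative only: a simplified recursive splitter, not the service's implementation.
SEPARATORS = ["\n\n", "\n", "。", ","]  # paragraph, newline, sentence, subsentence

def split_recursive(text, max_len, depth=0):
    """Return leaf chunks no longer than max_len by splitting on ever-finer separators."""
    # Leaf node: the text is short enough, or no finer separator is left to try.
    if len(text) <= max_len or depth >= len(SEPARATORS):
        return [text]
    chunks = []
    for part in text.split(SEPARATORS[depth]):
        if part:
            chunks.extend(split_recursive(part, max_len, depth + 1))
    return chunks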

During vector recall, you can use the chunk tree information to supplement the context of a recalled chunk. For instance, you can include other chunks from the same level as the recalled chunk, within the model's token limit, to preserve information integrity.
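
As a sketch of such sibling expansion, the code below groups final chunks by their parent node (using the parent_id field of each chunk's meta, as in the response example) and pads a recalled chunk with its siblings until a hypothetical token budget is exhausted. It assumes the dict-style result from the HTTP sketch above; the budget value is an assumed model limit, not part of the API.

from collections import defaultdict

def expand_with_siblings(recalled, chunks, token_budget=1024):
    """Pad a recalled chunk with same-parent chunks while respecting a token budget."""
    # Index the final chunks (result["chunks"]) by their parent node ID.
    by_parent = defaultdict(list)
    for chunk in chunks:
        by_parent[chunk["meta"]["parent_id"]].append(chunk)

    picked = [recalled]
    used = int(recalled["meta"].get("token", 0))
    for sibling in by_parent[recalled["meta"]["parent_id"]]:
        if sibling["meta"]["id"] == recalled["meta"]["id"]:
            continue  # skip the recalled chunk itself
        cost = int(sibling["meta"].get("token", 0))
        if used + cost > token_budget:
            break
        picked.append(sibling)
        used += cost
    return "".join(c["content"] for c in picked)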

For example, given the following text segment:

After successfully opening the AI search development workbench service for the first time, the system will automatically create a default workspace: Default.
Click to create a space. Enter a custom workspace name and click confirm. After clicking to create a new API-KEY, the system will generate an API-KEY. Here, the customer can click the copy button to copy and save the content of the API-KEY.

A possible chunk tree is as follows:

root (6b15)
  |
  +-- paragraph_node (557b)
       |
       +-- newline_node (ef4d)[After successfully opening the AI search development workbench...Default.]
       |
       +-- newline_node (c618)
            |
            +-- sentence_node (98ce)[Click to create a space...click confirm.]
            |
            +-- sentence_node (922a)[After clicking to create a new API-KEY...click the copy button to copy and save the content of the API-KEY.]

Given a maximum chunk length, the complete chunk tree contains two types of nodes: final nodes (with chunk content) and intermediate nodes (logical nodes without content). The entire tree is returned as a list of all nodes (nodes), and the final nodes are also returned in a separate list (chunks). Below are some possible node types:

  • root: The root node

  • paragraph_node: A paragraph node, representing a split at the "\n\n" separator and marking the paragraph position (as there is no \n\n in the example, there is only one such intermediate node)

  • newline_node: A newline node, representing a split at the "\n" separator (the newline_node (ef4d) in the example meets the chunk length requirement and is a final node. The newline_node (c618) needs further splitting and is an intermediate node)

  • sentence_node: A sentence node, representing a split at the "。" separator

  • subsentence_node: A clause node, representing a split at the "," separator (not shown in the example)

For content in Markdown and HTML formats, the split service also outputs rich text (rich_texts) separately, for instance the <img>, <table>, and <code> tags in HTML. These rich texts are replaced in the original text with placeholders such as [image_0], <table>table_0</table>, and <code>code_0</code>. For example, an image URL such as "![image](www.example.com)" in the input content is replaced with the placeholder "[img_69646]", which corresponds to the rich text chunk with id=img_69646-0 in rich_texts (note the id's naming suffix). Meanwhile, each rich text block is returned in the rich_texts field. This design allows rich text blocks to be recalled separately and reintegrated into the original text as needed. Each rich text block belongs to exactly one final-node chunk.
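
A minimal sketch of this reintegration, assuming dict-style chunks and rich texts as in the response example: the placeholder for each rich text block is derived from its id by stripping the -N suffix, following the placeholder formats described above ([...] for images, <table>...</table> for tables, <code>...</code> for code). Tables split into several pieces may need their pieces concatenated first; this sketch handles the simple one-piece case.

def restore_rich_texts(chunk, rich_texts):
    """Substitute rich-text placeholders in a chunk with the original rich text content."""
    content = chunk["content"]
    for rich in rich_texts:
        if rich["meta"]["belonged_chunk_id"] != chunk["meta"]["id"]:
            continue  # the rich text belongs to a different chunk
        # "img_69646-0" -> "img_69646"; "table_2-1" -> "table_2"
        base = rich["meta"]["id"].rsplit("-", 1)[0]
        kind = rich["meta"]["type"]
        placeholder = f"[{base}]" if kind == "image" else f"<{kind}>{base}</{kind}>"
        content = content.replace(placeholder, rich["content"])
    return content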

To improve the recall rate for short queries, you can set strategy.need_sentence=true. In that case, the original text is also split by sentence and returned in the sentences list for independent recall. To support sentence expansion, each sentence block belongs to exactly one final-node chunk. (Note that this sentences list is unrelated to the sentence_node type described earlier.)
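
If sentence-level recall is enabled, the sentences list can be indexed alongside the chunks from the basic-usage snippet, following the same pattern:

# Requires strategy.need_sentence=true in the request; each sentence becomes its
# own recall unit, with belonged_chunk_id linking it back to its owning chunk.
for sentence in document_split_result.body.result.sentences:
    doc_list.append({"id": sentence.meta.get("id"), "content": sentence.content})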

The chunks, nodes, rich_texts, and sentences lists above are all of the fields returned by the API. Detailed usage can be found in the parameter description below. For simplicity, each chunk output uses a simplified version of HTML syntax.

Prerequisites

  • The authentication information is obtained.

    When you call an AI Search Open Platform service by using an API, you need to authenticate the caller's identity.

  • The service access address is obtained.

You can call a service over the Internet or a virtual private cloud (VPC). For more information, see Obtain service access addresses.

Request description

General description

  • The request body must not exceed 8 MB.

Request method

POST

URL

{host}/v3/openapi/workspaces/{workspace_name}/document-split/{service_id} 
  • host: The endpoint of the service, accessible over the Internet or through a VPC. For more information, see Obtain service access addresses.

  • workspace_name: The name of the workspace, such as default.

  • service_id: The built-in service ID, such as ops-document-split-001.

Request parameters

Header parameters

API-KEY authentication

  • Content-Type (String, required): The request content type: application/json. Example: application/json

  • Authorization (String, required): The API key. Example: Bearer OS-d1**2a

Body parameters

  • document.content (String, required): The plain text content to be split. According to JSON standards, the following characters must be escaped in string fields: \\, \", \/, \b, \f, \n, \r, \t. Common JSON libraries escape them automatically when generating JSON strings. Example: "Title\nFirst line\nSecond line"

  • document.content_encoding (String, optional): The encoding type of the content. Valid value: utf8 (default). Example: utf8

  • document.content_type (String, optional): The format of the content. Valid values: html, markdown, text (default; compatible with plain text). Example: html

  • strategy.type (String, optional): The paragraph slicing strategy. Valid value: default (splits according to the document's paragraph format). Example: default

  • strategy.max_chunk_size (Int, optional): The maximum length of a chunk. Default: 300. Example: 300

  • strategy.compute_type (String, optional): The method used to calculate length. Valid value: token (default; calculated with the tokenizer of the ops-text-embedding-001 vector model). Example: token

  • strategy.need_sentence (Boolean, optional): Whether to additionally return sentence-level chunks to optimize recall for short queries. Default: false. If set to true, token usage doubles. Example: false

Additional notes:

  • The strategy.need_sentence parameter enables sentence-level slicing, which is independent of paragraph slicing. Essentially, it returns each sentence as an individual chunk. Activating this strategy allows for the simultaneous recall of short and long chunks, enhancing the overall recall rate.

Response parameters

  • request_id (String): The unique identifier assigned by the system to the API call. Example: B4AB89C8-B135-****-A6F8-2BAB801A2CE4

  • latency (Float/Int): The duration of the request, in milliseconds. Example: 10

  • usage (Object): Billing information associated with this call. Example: {"token_count": 3072}

  • usage.token_count (Int): The number of tokens used. Example: 3072

  • result.chunks (List(Chunk)): A list of chunk results (final nodes), including the content and identification information of each chunk. Example: [{"content": "xxx", "meta": {"parent_id": "x", "id": "x", "type": "text"}}]

  • result.chunks[].content (String): The content of each chunk in the result list. Example: "xxx"

  • result.chunks[].meta (Map): Identification information for each chunk in the result list; all fields are strings.
    - parent_id: The ID of the chunk's parent node
    - id: The ID of the chunk node
    - type: The content type of the chunk; currently always text
    - token: The number of tokens in the current chunk
    Example: {"parent_id": "3b94a18555c44b67b193c6ab4f****", "id": "c9edcb38fdf34add90d62f6bf5c6****", "type": "text", "token": 10}

  • result.rich_texts (List(RichText)): The output form for rich text. When document.content_type is set to markdown or html, elements such as images, code, and tables within the chunk content are replaced with rich text placeholders. For example, an image URL such as "![image](www.example.com)" in the input content is replaced with the placeholder "[img_69646]", which corresponds to the rich text chunk with id=img_69646-0 in rich_texts (note the id's naming suffix).
    Note: This field is not supported when document.content_type is set to text.
    Example: [{"content": "xxx", "meta": {"belonged_chunk_id": "x", "id": "x", "type": "table"}}]

  • result.rich_texts[].content (String): The content of each rich text chunk. Image content is a URL and is not split, so it may exceed max_chunk_size. Tables are split into header and row content. Code is split in the same way as text. Example: "<table><tr>\n<th>Action</th>\n<th>Description</th>\n</tr><tr>\n<td>Hide component</td>\n<td>Hide component, no parameters required.</td>\n</tr></table>"

  • result.rich_texts[].meta (Map): Identification information for each rich text chunk; all fields are strings.
    - belonged_chunk_id: The ID of the chunk node to which the rich text belongs (each rich text is associated with exactly one chunk node)
    - id: The ID of the rich text
    - type: code/image/table
    - token: The number of tokens in the current chunk (fixed at -1 for images)
    Example: {"type": "table", "belonged_chunk_id": "f0254cb7a5144a1fb3e5e024a3****b", "id": "table_2-1", "token": 10}

  • result.nodes (List(Node)): A list of all nodes in the chunk tree. Example: [{"parent_id": "x", "id": "x", "type": "text"}]

  • result.nodes[] (Map): Information for each node in the chunk tree; all fields are strings.
    - id: The node ID, which equals the chunk ID if the node is also a chunk
    - type: One of paragraph_node, newline_node, sentence_node, and subsentence_node; for HTML or Markdown, it may also include <h1> to <h6>, representing different separators
    - parent_id: The parent node ID
    Example: {"id": "f0254cb7a5144a1fb3e5e024a3****b", "type": "paragraph_node", "parent_id": "f0254cb7a5144a1fb3e5e024a3****b"}

  • result.sentences (List(Sentence), optional): Returned when strategy.need_sentence is set to true in the request; a list of the sentences in each chunk. Example: [{"content": "xxx", "meta": {"belonged_chunk_id": "x", "id": "x", "type": "sentence"}}]

  • result.sentences[].content (String, optional): The content of each sentence. Example: "123"

  • result.sentences[].meta (Map, optional): Information for each sentence:
    - belonged_chunk_id: The ID of the chunk node to which the sentence belongs
    - id: The ID of the sentence
    - type: sentence, a static field
    - token: The number of tokens in the sentence
    Example: {"id": "f0254cb7a5144a1fb3e5e024a3****b1-1", "type": "sentence", "belonged_chunk_id": "f0254cb7a5144a1fb3e5e024a3****b", "token": 10}

Curl request example

curl -XPOST -H "Content-Type: application/json" \
  "http://***-hangzhou.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-split/ops-document-split-001" \
  -H "Authorization: Bearer Your API-KEY" \
  -d "{
    \"document\":{
          \"content\":\"Product benefits\\nIndustry algorithm edition\\nIntelligent\\nBuilt-in rich customizable algorithm models, combined with the search characteristics of different industries, launch industry recall and sorting algorithms to ensure better search results.\\n\\nFlexible and customizable\\nDevelopers can customize corresponding algorithm models, application schemas, data processing, query analysis, sorting, and other configurations based on their own business characteristics and data to meet personalized search needs, improve click-through rate of search results, achieve rapid business iteration, and greatly shorten the cycle of demand going online.\\n\\nSafe and stable\\nProvides 7×24 hours of operation and maintenance, and provides technical support through online work orders and telephone fault reporting. It has a complete set of fault monitoring, automatic alert, quick positioning, and other fault emergency response mechanisms. Based on Alibaba Cloud's AccessKeyId and AccessKeySecret security encryption, access control and isolation are performed from the access interface to ensure user-level data isolation and user data security. Data redundancy backup ensures that data will not be lost.\\n\\nAuto Scaling\\nHas elastic expansion capabilities, users can expand or reduce the resources they use as needed.\\n\\nRich peripheral functions\\nSupports a series of search peripheral functions such as top search, hint, drop-down suggestion, and statistical reports, making it convenient for users to display and analyze.\\n\\nOut of the box\\nNo need to maintain and deploy clusters, quickly access search services in one stop\\n\\nHigh-performance Search Edition\\nHigh throughput\\nSingle table supports tens of thousands of write TPS, second-level updates.\\n\\nSafe and stable\\nProvides 7×24 hours of operation and maintenance, and provides technical support through online work orders and telephone fault reporting. It has a complete set of fault monitoring, automatic alert, quick positioning, and other fault emergency response mechanisms. Based on Alibaba Cloud's AccessKeyId and AccessKeySecret security encryption, access control and isolation are performed from the access interface to ensure user-level data isolation and user data security. Data redundancy backup ensures that data will not be lost.\\n\\nAuto Scaling\\nHas elastic expansion capabilities,users can expand or reduce the resources they use as needed.\\n\\nOut of the box\\nNo need to maintain and deploy clusters, quickly access search services in one stop\\n\\nVector Search Edition\\nStable\\nThe underlying implementation uses C++, which has supported multiple core businesses after more than ten years of development. 
It is very stable and is very suitable for core search scenarios with high stability requirements.\\n\\nEfficient\\nDistributed search engine, which can efficiently support the retrieval of massive data, and also supports real-time data updates (effective in seconds), very suitable for search scenarios that are sensitive to query time and have high timeliness requirements.\\n\\nLow cost\\nSupports multiple index compression strategies, and also supports multi-value index loading tests, which can meet users' query needs at a lower cost.\\n\\nVector algorithm\\nSupports vector retrieval of various unstructured data (such as voice, images, videos, text, behavior, etc.).\\n\\nSQL query\\nSupports SQL query syntax, supports multi-table online join, provides rich built-in UDF functions and UDF function customization mechanisms to meet the retrieval needs of different users. In the operation and maintenance system, we have integrated SQL studio to facilitate users to develop and test SQL.\\n\\nRecall Engine Edition\\nStable\\nThe underlying implementation uses C++, which has supported multiple core businesses after more than ten years of development. It is very stable and is very suitable for core search scenarios with high stability requirements.\\n\\nEfficient\\nHavenask is a distributed search engine that can efficiently support the retrieval of massive data, and also supports real-time data updates (effective in seconds), very suitable for search scenarios that are sensitive to query time and have high timeliness requirements.\\n\\nLow cost\\nHavenask supports multiple index compression strategies, and also supports multi-value index loading tests, which can meet users' query needs at a lower cost.\\n\\nRich features\\nHavenask supports various types of analyzers, multiple index types, and powerful query syntax, which can well meet users' retrieval needs. We also provide a plugin mechanism to facilitate users to customize their own business processing logic.\\n\\nSQL query\\nHavenask supports SQL query syntax, supports multi-table online join, provides rich built-in UDF functions and UDF function customization mechanisms to meet the retrieval needs of different users. In the operation and maintenance system, we will soon integrate SQL studio to facilitate users to develop and test SQL.\",
          \"content_encoding\":\"utf8\",\"content_type\":\"text\"
    },
    \"strategy\":{
          \"type\":\"default\",
          \"max_chunk_size\":300,
          \"compute_type\":\"token\",
          \"need_sentence\":false
    }
}"

Response example

Normal response example

{
	"request_id": "47EA146B-****-448C-A1D5-50B89D7EA434",
	"latency": 161,
	"usage": {
		"token_count": 800
	},
	"result": {
		"chunks": [
			{
				"content": "Product benefits\\nIndustry algorithm edition\\nIntelligent\\nBuilt-in rich customizable algorithm models, combined with the search characteristics of different industries, launch industry recall and sorting algorithms to ensure better search results.\\n\\nFlexible and customizable\\nDevelopers can customize corresponding algorithm models, application schemas, data processing, query analysis, sorting, and other configurations based on their own business characteristics and data to meet personalized search needs, improve click-through rate of search results, achieve rapid business iteration, and greatly shorten the cycle of demand going online.\\n\\nSafe and stable\\nProvides 7×24 hours of operation and maintenance, and provides technical support through online work orders and telephone fault reporting. It has a complete set of fault monitoring, automatic alert, quick positioning, and other fault emergency response mechanisms. Based on Alibaba Cloud's AccessKeyId and AccessKeySecret security encryption, access control and isolation are performed from the access interface to ensure user-level data isolation and user data security. Data redundancy backup ensures that data will not be lost.\\n\\nAuto Scaling\\nHas elastic expansion capabilities, users can expand or reduce the resources they use as needed.\\n\\nRich peripheral functions\\nSupports a series of search peripheral functions such as top search, hint, drop-down suggestion, and statistical reports, making it convenient for users to display and analyze.\\n\\nOut of the box\\nNo need to maintain and deploy clusters, quickly access search services in one stop\\n\\nHigh-performance Search Edition\\nHigh throughput\\nSingle table supports tens of thousands of write TPS, second-level updates",
				"meta": {
					"parent_id": "dee776dda3ff4b078bccf989a6bd****",
					"id": "27eea7c6b2874cb7a5bf6c71afbf****",
					"type": "text"
				}
			},
			{
				"content": ".\\n\\nSafe and stable\\nProvides 7×24 hours of operation and maintenance, and provides technical support through online work orders and telephone fault reporting. It has a complete set of fault monitoring, automatic alert, quick positioning, and other fault emergency response mechanisms. Based on Alibaba Cloud's AccessKeyId and AccessKeySecret security encryption, access control and isolation are performed from the access interface to ensure user-level data isolation and user data security. Data redundancy backup ensures that data will not be lost.\\n\\nAuto Scaling\\nHas elastic expansion capabilities, users can expand or reduce the resources they use as needed.\\n\\nOut of the box\\nNo need to maintain and deploy clusters, quickly access search services in one stop\\n\\nVector Search Edition\\nStable\\nThe underlying implementation uses C++, which has supported multiple core businesses after more than ten years of development. It is very stable and is very suitable for core search scenarios with high stability requirements.\\n\\nEfficient\\nDistributed search engine, which can efficiently support the retrieval of massive data, and also supports real-time data updates (effective in seconds), very suitable for search scenarios that are sensitive to query time and have high timeliness requirements.\\n\\nLow cost\\nSupports multiple index compression strategies, and also supports multi-value index loading tests, which can meet users' query needs at a lower cost.\\n\\nVector algorithm\\nSupports vector retrieval of various unstructured data (such as voice, images, videos, text, behavior, etc.).\\n\\nSQL query\\nSupports SQL query syntax, supports multi-table online join, provides rich built-in UDF functions and UDF function customization mechanisms to meet the retrieval needs of different users",
				"meta": {
					"parent_id": "dee776dda3ff4b078bccf989a6bd****",
					"id": "bf9fcfb47fcf410aa05216e268df****",
					"type": "text"
				}
			},
			{
				"content": ". In the operation and maintenance system, we have integrated SQL studio to facilitate users to develop and test SQL.\\n\\nRecall Engine Edition\\nStable\\nThe underlying implementation uses C++, which has supported multiple core businesses after more than ten years of development. It is very stable and is very suitable for core search scenarios with high stability requirements.\\n\\nEfficient\\nHavenask is a distributed search engine that can efficiently support the retrieval of massive data, and also supports real-time data updates (effective in seconds), very suitable for search scenarios that are sensitive to query time and have high timeliness requirements.\\n\\nLow cost\\nHavenask supports multiple index compression strategies, and also supports multi-value index loading tests, which can meet users' query needs at a lower cost.\\n\\nRich features\\nHavenask supports various types of analyzers, multiple index types, and powerful query syntax, which can well meet users' retrieval needs. We also provide a plugin mechanism to facilitate users to customize their own business processing logic.\\n\\nSQL query\\nHavenask supports SQL query syntax, supports multi-table online join, provides rich built-in UDF functions and UDF function customization mechanisms to meet the retrieval needs of different users. In the operation and maintenance system, we will soon integrate SQL studio to facilitate users to develop and test SQL.",
				"meta": {
					"parent_id": "dee776dda3ff4b078bccf989a6bd****",
					"id": "26ab0e4f7665487bb0a82c5a226a****",
					"type": "text"
				}
			}
		],
		"nodes": [
			{
				"id": "dee776dda3ff4b078bccf989a6bd****",
				"type": "root",
				"parent_id": "dee776dda3ff4b078bccf989a6bd****"
			},
			{
				"id": "27eea7c6b2874cb7a5bf6c71afbf****",
				"type": "sentence",
				"parent_id": "dee776dda3ff4b078bccf989a6bd****"
			},
			{
				"id": "bf9fcfb47fcf410aa05216e268df****",
				"type": "sentence",
				"parent_id": "dee776dda3ff4b078bccf989a6bd****"
			},
			{
				"id": "26ab0e4f7665487bb0a82c5a226a****",
				"type": "sentence",
				"parent_id": "dee776dda3ff4b078bccf989a6bd****"
			}
		],
		"rich_texts": []
	}
}

Exception response example

In the event of an error during an access request, the output will specify the reason for the error through the code and message fields.

{
    "request_id": "817964CD-1B84-4AE1-9B63-4FB99734****",
    "latency": 0,
    "code": "InvalidParameter",
    "message": "JSON parse error: Invalid UTF-8 start byte 0xbc; nested exception is com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xbc\n at line: 2, column: 19]"
}

Status codes

For more information about the status codes, see Status codes.