The AI search development workbench supports the Document Split Service API, allowing you to integrate the service into your business workflow to enhance retrieval or processing efficiency.
Service name | Service ID | Service description | QPS limit for API calls (Alibaba Cloud account and RAM users) |
Document split service-001 | ops-document-split-001 | This service offers a general text slicing strategy capable of splitting structured data in HTML, Markdown, and TXT formats. It leverages document paragraph formatting, text semantics, and predefined rules, and can extract code, images, and tables from rich text. | 2. To apply for a higher QPS limit, submit a ticket. |
In Retrieval-Augmented Generation (RAG), it is common practice to convert text into vectors and store them in a vector database for subsequent retrieval. The document split service divides long documents into smaller chunks so that each chunk meets the length limit of the text embedding model, which makes it possible to vectorize lengthy documents.
Basic usage
The input is a string of plain text plus optional configurations; the output is the text split into segments, possibly including rich text. The API returns four lists: chunks, nodes, rich_texts, and sentences. To use the document split results for embedding, simply extract the content fields from the chunks and rich_texts lists, excluding images. Refer to the code template in the scenario center. The Python code is as follows:
# Extract chunk results. Note that only ["chunks"] and ["rich_texts"] (excluding images) are used here.
doc_list = []
for chunk in document_split_result.body.result.chunks:
    doc_list.append({"id": chunk.meta.get("id"), "content": chunk.content})
for rich_text in document_split_result.body.result.rich_texts:
    if rich_text.meta.get("type") != "image":
        doc_list.append({"id": rich_text.meta.get("id"), "content": rich_text.content})
Advanced usage
The document split service can segment complex document content into chunks of a specified token length, organized as a tree of nodes. This structure can be used during the retrieval phase of RAG to enrich the context of recalled chunks and improve answer accuracy.
The service tries to split the text along the most macro structures possible. If the resulting chunks do not meet the length requirement, it recursively splits each chunk until all chunks satisfy the length limit. This recursive process produces a chunk tree in which each leaf node corresponds to an actual chunk result (a final node).
When a chunk is recalled by vector search, you can use the chunk tree information to complete its context. For instance, you can include other chunks from the same level as the recalled chunk, within the model's token limit, to preserve information integrity, as in the following sketch.
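A minimal sketch, assuming the parsed JSON response (dicts), a hypothetical token budget, and that chunks are returned in document order:

def expand_with_siblings(recalled_chunk, chunks, token_budget=1000):
    # Final nodes that share a parent are adjacent pieces of the same passage.
    parent_id = recalled_chunk["meta"]["parent_id"]
    siblings = [c for c in chunks if c["meta"]["parent_id"] == parent_id]
    context, used = [], 0
    for chunk in siblings:
        tokens = int(chunk["meta"].get("token", 0))
        if used + tokens > token_budget:
            break
        context.append(chunk["content"])
        used += tokens
    return "\n".join(context)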
For example, given the following text segment:
After successfully opening the AI search development workbench service for the first time, the system will automatically create a default workspace: Default.
Click to create a space. Enter a custom workspace name and click confirm. After clicking to create a new API-KEY, the system will generate an API-KEY. Here, the customer can click the copy button to copy and save the content of the API-KEY.
A possible chunk tree is as follows:
root (6b15)
|
+-- paragraph_node (557b)
    |
    +-- newline_node (ef4d) [After successfully opening the AI search development workbench...Default.]
    |
    +-- newline_node (c618)
        |
        +-- sentence_node (98ce) [Click to create a space...click confirm.]
        |
        +-- sentence_node (922a) [After clicking to create a new API-KEY...click the copy button to copy and save the content of the API-KEY.]

Given a maximum chunk length, the complete chunk tree contains two types of nodes: final nodes (with chunk content) and intermediate nodes (logical nodes without content). The entire tree is returned as a list of all nodes (nodes), and the final nodes are also returned in a separate list (chunks). Below are some possible node types; a sketch that rebuilds this tree from the nodes list follows the list.
root: The root node
paragraph_node: A paragraph node, representing a split at the "\n\n" separator and marking the paragraph position (as there is no \n\n in the example, there is only one such intermediate node)
newline_node: A newline node, representing a split at the "\n" separator (the newline_node (ef4d) in the example meets the chunk length requirement and is a final node. The newline_node (c618) needs further splitting and is an intermediate node)
sentence_node: A sentence node, representing a split at the "。" separator
subsentence_node: A clause node, representing a split at the "," separator (not shown in the example)
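To make the node semantics concrete, here is a minimal sketch, assuming the parsed JSON response, that rebuilds and prints the chunk tree from the nodes list. In the example response below, the root node's parent_id equals its own id, which is how the root is detected here:

from collections import defaultdict

def print_tree(nodes):
    # Index children by parent_id; a node whose parent_id is its own id is a root.
    children = defaultdict(list)
    roots = []
    for node in nodes:
        if node["parent_id"] == node["id"]:
            roots.append(node)
        else:
            children[node["parent_id"]].append(node)

    def walk(node, depth):
        print("    " * depth + f"+-- {node['type']} ({node['id'][:4]})")
        for child in children[node["id"]]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)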
For content in Markdown and HTML formats, the split service also outputs rich text (rich_texts) separately, for instance the <img>, <table>, and <code> tags in HTML. In the original text, these rich texts are replaced with placeholders such as [image_0], <table>table_0</table>, and <code>code_0</code>. For example, an image URL like "" in the input content is replaced with the placeholder "[img_69646]", which corresponds to the rich text chunk with id=img_69646-0 in rich_texts (note the naming suffix of the id). Each rich text block itself is returned in the rich_texts field. This design allows rich text blocks to be recalled separately and reintegrated into the original text as needed; each rich text block belongs to exactly one chunk (a final node).
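A minimal sketch of the reintegration step, assuming the parsed JSON response and the placeholder formats shown above ([image_0], <table>table_0</table>, <code>code_0</code>); the id suffix ("table_2-1" has base "table_2") groups the pieces of one block:

from collections import defaultdict

def restore_rich_texts(chunks, rich_texts):
    grouped = defaultdict(lambda: defaultdict(list))  # chunk id -> base id -> parts
    types = {}
    for rt in rich_texts:
        base_id = rt["meta"]["id"].rsplit("-", 1)[0]
        grouped[rt["meta"]["belonged_chunk_id"]][base_id].append(rt["content"])
        types[base_id] = rt["meta"]["type"]

    restored = []
    for chunk in chunks:
        text = chunk["content"]
        for base_id, parts in grouped[chunk["meta"]["id"]].items():
            tag = types[base_id]
            placeholder = f"[{base_id}]" if tag == "image" else f"<{tag}>{base_id}</{tag}>"
            # Assumes the pieces of a split table/code block arrive in order.
            text = text.replace(placeholder, "".join(parts))
        restored.append(text)
    return restored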
To improve the recall rate for short queries, you can set strategy.need_sentence=true. In this case, the original text is additionally split by sentence and returned in the sentences list for independent recall. To support sentence expansion, each sentence belongs to exactly one chunk (a final node). Note that this sentences list is unrelated to the sentence_node type mentioned earlier.
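A minimal sketch of sentence expansion, assuming the parsed JSON response (the result variable from the sketch under Basic usage): after a sentence is recalled, fetch the full chunk it came from via belonged_chunk_id.

chunk_by_id = {c["meta"]["id"]: c for c in result["chunks"]}

def expand_sentence(sentence):
    # Each sentence links back to exactly one final-node chunk.
    return chunk_by_id[sentence["meta"]["belonged_chunk_id"]]["content"]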
The chunks, nodes, rich_texts, and sentences lists above are all of the fields returned by the API. Detailed usage can be found in the parameter description below. For simplicity, each chunk output uses a simplified version of HTML syntax.
Prerequisites
The authentication information is obtained.
When you call an AI Search Open Platform service by using an API, you need to authenticate the caller's identity.
The service access address is obtained.
You can call a service over the Internet or a virtual private cloud (VPC). For more information, see Obtain service access addresses.
Request description
General description
The request body must not exceed 8 MB.
Request method
POST
URL
{host}/v3/openapi/workspaces/{workspace_name}/document-split/{service_id}
host: The endpoint of the service, accessible over the Internet or through a VPC. For more information, see Obtain service access addresses.
workspace_name: The name of the workspace, such as default.
service_id: The built-in service ID, such as ops-document-split-001.
Request parameters
Header parameters
API-KEY authentication
Parameter | Type | Required | Description | Example value |
Content-Type | String | Yes | Request type: application/json | application/json |
Authorization | String | Yes | API-Key | Bearer OS-d1**2a |
Body parameters
Parameter | Type | Required | Description | Example value |
document.content | String | Yes | The plain text content to be split. According to JSON standards, escape the following characters in string fields: "\\, \", \/, \b, \f, \n, \r, \t". Common JSON libraries will automatically escape these characters in generated JSON strings. | "Title\nFirst line\nSecond line" |
document.content_encoding | String | No | The encoding type of the content. | utf8 |
document.content_type | String | No | The format of the content. | html |
strategy.type | String | No | The paragraph slicing strategy. | default |
strategy.max_chunk_size | Int | No | The maximum length of a chunk. Default: 300. | 300 |
strategy.compute_type | String | No | The method used to calculate chunk length. | token |
strategy.need_sentence | Boolean | No | Indicates whether to return sentence-level chunks to optimize short-query recall. | false |
Additional notes:
The strategy.need_sentence parameter enables sentence-level slicing, which is independent of paragraph slicing. Essentially, it returns each sentence as an individual chunk. Activating this strategy allows for the simultaneous recall of short and long chunks, enhancing the overall recall rate.
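Because common JSON libraries perform the escaping described for document.content automatically, you can build the request body programmatically. A minimal Python sketch:

import json

# json.dumps escapes \n, \", \\ and the other characters listed above, so the
# content string does not need manual escaping.
body = json.dumps({
    "document": {
        "content": "Title\nFirst line\nSecond line",
        "content_encoding": "utf8",
        "content_type": "text",
    },
    "strategy": {
        "type": "default",
        "max_chunk_size": 300,
        "compute_type": "token",
        "need_sentence": False,
    },
})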
Response parameters
Parameter | Type | Description | Example value |
request_id | String | The unique identifier assigned by the system for an API call. | B4AB89C8-B135-****-A6F8-2BAB801A2CE4 |
latency | Float/Int | The duration of the request in milliseconds. | 10 |
usage | Object | Billing information associated with this call. | "usage": { "token_count": 3072 } |
usage.token_count | Int | The number of tokens used. | 3072 |
result.chunks | List(Chunk) | A list of chunk results (final nodes), including content and identification information for each chunk. | [{ "content" : "xxx", "meta":{'parent_id':x, 'id': x, 'type': 'text'} }] |
result.chunks[].content | String | The content of each chunk within the result list. | "xxx" |
result.chunks[].meta | Map | Identification information for each chunk in the result list, with all fields being string type. | { 'parent_id': '3b94a18555c44b67b193c6ab4f****', 'id': 'c9edcb38fdf34add90d62f6bf5c6****', 'type': 'text', 'token': 10 } |
result.rich_texts | List(RichText) | The output form for rich text. When document.content_type is set to markdown or html, elements such as images, code, and tables within the chunk content are replaced with rich text placeholders. For example, an image URL like "" in the input content will be replaced with a placeholder "[img_69646]", corresponding to the rich text chunk with id=img_69646-0 in rich_texts (note the naming suffix of the id). Note: This form is not supported when document.content_type is set to text. | [{ "content" : "xxx", "meta":{'belonged_chunk_id':x, 'id': x, 'type': 'table'} }] |
result.rich_texts[].content | String | The content for each rich text chunk. Image content is a URL and will not be split, potentially exceeding max_chunk_size. Tables are split into headers and row content. Code is split similarly to text. | "<table><tr>\n<th>Action</th>\n<th>Description</th>\n</tr><tr>\n<td>Hide component</td>\n<td>Hide component, no parameters required.</td>\n</tr></table>" |
result.rich_texts[].meta | Map | Identification information for each rich text chunk, with all fields being string type. | { 'type': 'table', 'belonged_chunk_id': 'f0254cb7a5144a1fb3e5e024a3****b', 'id': 'table_2-1', 'token': 10 } |
result.nodes | List(Node) | A list of nodes from the chunk tree. | [{'parent_id':x, 'id': x, 'type': 'text'}] |
result.nodes[] | Map | Information for each node in the chunk tree, with all fields being string type. | { 'id': 'f0254cb7a5144a1fb3e5e024a3****b', 'type': 'paragraph_node', 'parent_id': 'f0254cb7a5144a1fb3e5e024a3****b' } |
result.sentences (optional) | List(sentence) | When strategy.need_sentence is set to true in the request, this returns a list of sentences from each chunk. | [{ "content" : "xxx", "meta":{'belonged_chunk_id':x, 'id': x, 'type': 'sentence'} }] |
result.sentences[].content (optional) | String | The content of each sentence. | "123" |
result.sentences[].meta (optional) | Map | Information for each sentence. | { 'id': 'f0254cb7a5144a1fb3e5e024a3****b1-1', 'type': 'sentence', 'belonged_chunk_id': 'f0254cb7a5144a1fb3e5e024a3****b', 'token': 10 } |
Curl request example
curl -X POST -H "Content-Type: application/json" \
"http://***-hangzhou.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-split/ops-document-split-001" \
-H "Authorization: Bearer <Your API-KEY>" \
-d "{
\"document\":{
\"content\":\"Product benefits\\nIndustry algorithm edition\\nIntelligent\\nBuilt-in rich customizable algorithm models, combined with the search characteristics of different industries, launch industry recall and sorting algorithms to ensure better search results.\\n\\nFlexible and customizable\\nDevelopers can customize corresponding algorithm models, application schemas, data processing, query analysis, sorting, and other configurations based on their own business characteristics and data to meet personalized search needs, improve click-through rate of search results, achieve rapid business iteration, and greatly shorten the cycle of demand going online.\\n\\nSafe and stable\\nProvides 7×24 hours of operation and maintenance, and provides technical support through online work orders and telephone fault reporting. It has a complete set of fault monitoring, automatic alert, quick positioning, and other fault emergency response mechanisms. Based on Alibaba Cloud's AccessKeyId and AccessKeySecret security encryption, access control and isolation are performed from the access interface to ensure user-level data isolation and user data security. Data redundancy backup ensures that data will not be lost.\\n\\nAuto Scaling\\nHas elastic expansion capabilities, users can expand or reduce the resources they use as needed.\\n\\nRich peripheral functions\\nSupports a series of search peripheral functions such as top search, hint, drop-down suggestion, and statistical reports, making it convenient for users to display and analyze.\\n\\nOut of the box\\nNo need to maintain and deploy clusters, quickly access search services in one stop\\n\\nHigh-performance Search Edition\\nHigh throughput\\nSingle table supports tens of thousands of write TPS, second-level updates.\\n\\nSafe and stable\\nProvides 7×24 hours of operation and maintenance, and provides technical support through online work orders and telephone fault reporting. It has a complete set of fault monitoring, automatic alert, quick positioning, and other fault emergency response mechanisms. Based on Alibaba Cloud's AccessKeyId and AccessKeySecret security encryption, access control and isolation are performed from the access interface to ensure user-level data isolation and user data security. Data redundancy backup ensures that data will not be lost.\\n\\nAuto Scaling\\nHas elastic expansion capabilities,users can expand or reduce the resources they use as needed.\\n\\nOut of the box\\nNo need to maintain and deploy clusters, quickly access search services in one stop\\n\\nVector Search Edition\\nStable\\nThe underlying implementation uses C++, which has supported multiple core businesses after more than ten years of development. 
It is very stable and is very suitable for core search scenarios with high stability requirements.\\n\\nEfficient\\nDistributed search engine, which can efficiently support the retrieval of massive data, and also supports real-time data updates (effective in seconds), very suitable for search scenarios that are sensitive to query time and have high timeliness requirements.\\n\\nLow cost\\nSupports multiple index compression strategies, and also supports multi-value index loading tests, which can meet users' query needs at a lower cost.\\n\\nVector algorithm\\nSupports vector retrieval of various unstructured data (such as voice, images, videos, text, behavior, etc.).\\n\\nSQL query\\nSupports SQL query syntax, supports multi-table online join, provides rich built-in UDF functions and UDF function customization mechanisms to meet the retrieval needs of different users. In the operation and maintenance system, we have integrated SQL studio to facilitate users to develop and test SQL.\\n\\nRecall Engine Edition\\nStable\\nThe underlying implementation uses C++, which has supported multiple core businesses after more than ten years of development. It is very stable and is very suitable for core search scenarios with high stability requirements.\\n\\nEfficient\\nHavenask is a distributed search engine that can efficiently support the retrieval of massive data, and also supports real-time data updates (effective in seconds), very suitable for search scenarios that are sensitive to query time and have high timeliness requirements.\\n\\nLow cost\\nHavenask supports multiple index compression strategies, and also supports multi-value index loading tests, which can meet users' query needs at a lower cost.\\n\\nRich features\\nHavenask supports various types of analyzers, multiple index types, and powerful query syntax, which can well meet users' retrieval needs. We also provide a plugin mechanism to facilitate users to customize their own business processing logic.\\n\\nSQL query\\nHavenask supports SQL query syntax, supports multi-table online join, provides rich built-in UDF functions and UDF function customization mechanisms to meet the retrieval needs of different users. In the operation and maintenance system, we will soon integrate SQL studio to facilitate users to develop and test SQL.\",
\"content_encoding\":\"utf8\",\"content_type\":\"text\"
},
\"strategy\":{
\"type\":\"default\",
\"max_chunk_size\":300,
\"compute_type\":\"token\",
\"need_sentence\":false
}
}"Response example
Normal response example
{
"request_id": "47EA146B-****-448C-A1D5-50B89D7EA434",
"latency": 161,
"usage": {
"token_count": 800
},
"result": {
"chunks": [
{
"content": "Product benefits\\nIndustry algorithm edition\\nIntelligent\\nBuilt-in rich customizable algorithm models, combined with the search characteristics of different industries, launch industry recall and sorting algorithms to ensure better search results.\\n\\nFlexible and customizable\\nDevelopers can customize corresponding algorithm models, application schemas, data processing, query analysis, sorting, and other configurations based on their own business characteristics and data to meet personalized search needs, improve click-through rate of search results, achieve rapid business iteration, and greatly shorten the cycle of demand going online.\\n\\nSafe and stable\\nProvides 7×24 hours of operation and maintenance, and provides technical support through online work orders and telephone fault reporting. It has a complete set of fault monitoring, automatic alert, quick positioning, and other fault emergency response mechanisms. Based on Alibaba Cloud's AccessKeyId and AccessKeySecret security encryption, access control and isolation are performed from the access interface to ensure user-level data isolation and user data security. Data redundancy backup ensures that data will not be lost.\\n\\nAuto Scaling\\nHas elastic expansion capabilities, users can expand or reduce the resources they use as needed.\\n\\nRich peripheral functions\\nSupports a series of search peripheral functions such as top search, hint, drop-down suggestion, and statistical reports, making it convenient for users to display and analyze.\\n\\nOut of the box\\nNo need to maintain and deploy clusters, quickly access search services in one stop\\n\\nHigh-performance Search Edition\\nHigh throughput\\nSingle table supports tens of thousands of write TPS, second-level updates",
"meta": {
"parent_id": "dee776dda3ff4b078bccf989a6bd****",
"id": "27eea7c6b2874cb7a5bf6c71afbf****",
"type": "text"
}
},
{
"content": ".\\n\\nSafe and stable\\nProvides 7×24 hours of operation and maintenance, and provides technical support through online work orders and telephone fault reporting. It has a complete set of fault monitoring, automatic alert, quick positioning, and other fault emergency response mechanisms. Based on Alibaba Cloud's AccessKeyId and AccessKeySecret security encryption, access control and isolation are performed from the access interface to ensure user-level data isolation and user data security. Data redundancy backup ensures that data will not be lost.\\n\\nAuto Scaling\\nHas elastic expansion capabilities, users can expand or reduce the resources they use as needed.\\n\\nOut of the box\\nNo need to maintain and deploy clusters, quickly access search services in one stop\\n\\nVector Search Edition\\nStable\\nThe underlying implementation uses C++, which has supported multiple core businesses after more than ten years of development. It is very stable and is very suitable for core search scenarios with high stability requirements.\\n\\nEfficient\\nDistributed search engine, which can efficiently support the retrieval of massive data, and also supports real-time data updates (effective in seconds), very suitable for search scenarios that are sensitive to query time and have high timeliness requirements.\\n\\nLow cost\\nSupports multiple index compression strategies, and also supports multi-value index loading tests, which can meet users' query needs at a lower cost.\\n\\nVector algorithm\\nSupports vector retrieval of various unstructured data (such as voice, images, videos, text, behavior, etc.).\\n\\nSQL query\\nSupports SQL query syntax, supports multi-table online join, provides rich built-in UDF functions and UDF function customization mechanisms to meet the retrieval needs of different users",
"meta": {
"parent_id": "dee776dda3ff4b078bccf989a6bd****",
"id": "bf9fcfb47fcf410aa05216e268df****",
"type": "text"
}
},
{
"content": ". In the operation and maintenance system, we have integrated SQL studio to facilitate users to develop and test SQL.\\n\\nRecall Engine Edition\\nStable\\nThe underlying implementation uses C++, which has supported multiple core businesses after more than ten years of development. It is very stable and is very suitable for core search scenarios with high stability requirements.\\n\\nEfficient\\nHavenask is a distributed search engine that can efficiently support the retrieval of massive data, and also supports real-time data updates (effective in seconds), very suitable for search scenarios that are sensitive to query time and have high timeliness requirements.\\n\\nLow cost\\nHavenask supports multiple index compression strategies, and also supports multi-value index loading tests, which can meet users' query needs at a lower cost.\\n\\nRich features\\nHavenask supports various types of analyzers, multiple index types, and powerful query syntax, which can well meet users' retrieval needs. We also provide a plugin mechanism to facilitate users to customize their own business processing logic.\\n\\nSQL query\\nHavenask supports SQL query syntax, supports multi-table online join, provides rich built-in UDF functions and UDF function customization mechanisms to meet the retrieval needs of different users. In the operation and maintenance system, we will soon integrate SQL studio to facilitate users to develop and test SQL.",
"meta": {
"parent_id": "dee776dda3ff4b078bccf989a6bd****",
"id": "26ab0e4f7665487bb0a82c5a226a****",
"type": "text"
}
}
],
"nodes": [
{
"id": "dee776dda3ff4b078bccf989a6bd****",
"type": "root",
"parent_id": "dee776dda3ff4b078bccf989a6bd****"
},
{
"id": "27eea7c6b2874cb7a5bf6c71afbf****",
"type": "sentence",
"parent_id": "dee776dda3ff4b078bccf989a6bd****"
},
{
"id": "bf9fcfb47fcf410aa05216e268df****",
"type": "sentence",
"parent_id": "dee776dda3ff4b078bccf989a6bd****"
},
{
"id": "26ab0e4f7665487bb0a82c5a226a****",
"type": "sentence",
"parent_id": "dee776dda3ff4b078bccf989a6bd****"
}
],
"rich_texts": []
}
}
Exception response example
If an error occurs during a request, the response specifies the cause of the error through the code and message fields.
{
"request_id": "817964CD-1B84-4AE1-9B63-4FB99734****",
"latency": 0,
"code": "InvalidParameter",
"message": "JSON parse error: Invalid UTF-8 start byte 0xbc; nested exception is com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xbc\n at line: 2, column: 19]"
}
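A minimal sketch of handling such a response, reusing the response object from the requests-based sketch under Basic usage:

# An error response carries "code" and "message" instead of "result".
data = response.json()
if "code" in data:
    raise RuntimeError(
        f"{data['code']}: {data['message']} (request_id={data['request_id']})"
    )
result = data["result"]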
Status codes
For more information about the status codes, see Status codes.