OpenSearch: Perform text segmentation and vectorization

Last Updated: Mar 31, 2026

Splits a document into semantic chunks and optionally generates vector embeddings for each chunk. Use this API to prepare text for ingestion into an OpenSearch vector index as part of a retrieval-augmented generation (RAG) pipeline.

Endpoint

POST /v3/openapi/apps/{app_group_identity}/actions/knowledge-split

app_group_identity is the name of your OpenSearch instance.

Request parameters

The request body is a SplitDoc object.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| content | String | Yes | | The text to process. |
| title | String | No | | The document title. |
| use_embedding | Boolean | No | false | Specifies whether to generate vector embeddings for each chunk. Set to true to enable vectorization. |
| model | String | No | | The vectorization model to be used. |
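
As a rough illustration of the parameters above, a SplitDoc body could be assembled in Python as follows. The helper name `build_split_doc` is hypothetical and not part of any SDK; it only enforces that `content` is present and leaves out optional fields that were not set.

```python
import json


def build_split_doc(content, title=None, use_embedding=False, model=None):
    """Assemble a SplitDoc request body (hypothetical helper, not an SDK call)."""
    if not content:
        # content is the only required parameter.
        raise ValueError("content is required")
    body = {"content": content, "use_embedding": use_embedding}
    # Optional fields are included only when the caller supplies them.
    if title is not None:
        body["title"] = title
    if model is not None:
        body["model"] = model
    return body


payload = build_split_doc(
    "Object Storage Service (OSS) is a cloud storage service provided by Alibaba Cloud.",
    title="Getting started with OSS",
    use_embedding=True,
    model="<embedding-model-name>",  # placeholder, as in the request example
)
print(json.dumps(payload, indent=2))
```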

Example request

curl -X POST "https://<endpoint>/v3/openapi/apps/<app_group_identity>/actions/knowledge-split" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Getting started with OSS",
    "content": "Object Storage Service (OSS) is a cloud storage service provided by Alibaba Cloud. It allows you to store, access, and manage unstructured data such as images, videos, and documents.",
    "use_embedding": true,
    "model": "<embedding-model-name>"
  }'

Replace the following placeholders with actual values:

| Placeholder | Description |
| --- | --- |
| <endpoint> | The OpenSearch service endpoint for your region |
| <app_group_identity> | The name of your OpenSearch instance |
| <embedding-model-name> | The embedding model configured for your instance |
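
For callers who prefer Python over curl, the same request could be prepared with the standard library along these lines. This is a sketch only: `build_request`, the host `opensearch.example.com`, and the instance name `my_instance` are made-up illustrations, and any authentication your instance requires is outside the scope of this example.

```python
import json
import urllib.request


def build_request(endpoint, app_group_identity, body):
    """Prepare (but do not send) a POST request to the knowledge-split action."""
    url = (
        f"https://{endpoint}/v3/openapi/apps/"
        f"{app_group_identity}/actions/knowledge-split"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and instance name, used only for illustration.
req = build_request(
    "opensearch.example.com",
    "my_instance",
    {"content": "Text to split.", "use_embedding": False},
)
print(req.get_method(), req.full_url)
```

To actually send the request, pass `req` to `urllib.request.urlopen` and decode the JSON body of the reply.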

Response parameters

Top-level fields

| Parameter | Type | Description |
| --- | --- | --- |
| request_id | String | The request ID. |
| status | String | The request status. OK indicates success. |
| errors | Array | A list of errors, if any. Empty on success. |
| result | List\<ChunkContext\> | The list of chunks produced by the segmentation. |
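
A caller might check the top-level fields before touching the chunks, roughly as sketched below. The function name `extract_chunks` is hypothetical, and because the structure of items in `errors` is not documented here, the sketch treats them as opaque values.

```python
def extract_chunks(response):
    """Return the chunk list, raising if the response does not report success."""
    status = response.get("status")
    errors = response.get("errors") or []
    # Success means status is OK and the errors list is empty.
    if status != "OK" or errors:
        raise RuntimeError(
            f"knowledge-split failed (request_id={response.get('request_id')}, "
            f"status={status}): {errors}"
        )
    return response.get("result") or []


ok_response = {
    "request_id": "111111111",
    "status": "OK",
    "errors": [],
    "result": [{"chunk_id": "1", "chunk": "Chunk 1", "type": "text"}],
}
chunks = extract_chunks(ok_response)
print(len(chunks))
```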

ChunkContext fields

Each item in result is a ChunkContext object.

| Parameter | Type | Description |
| --- | --- | --- |
| chunk_id | String | The chunk ID. |
| chunk | String | The chunk text. |
| embedding | String | The embedding vector of the chunk, as a comma-separated list of floating-point values. |
| type | String | The content type of the chunk. Valid values: text, image. |
| img_url | String | The image URL. Returned only when type is image. |
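
Because `embedding` arrives as a comma-separated string rather than a JSON array, it typically needs to be converted before it can be written to a vector index. A minimal sketch, assuming a chunk may lack the field entirely (the helper name `parse_embedding` is hypothetical):

```python
def parse_embedding(raw):
    """Convert a comma-separated embedding string into a list of floats.

    Returns None when the chunk carries no embedding.
    """
    if not raw:
        return None
    return [float(value) for value in raw.split(",")]


chunk = {
    "chunk_id": "1",
    "chunk": "Chunk 1",
    "embedding": "-0.010441,-0.002826,0.000847",  # shortened sample values
    "type": "text",
}
vector = parse_embedding(chunk.get("embedding"))
print(len(vector))
```

An elision marker such as `...` in a documentation sample is not a valid value, so a real embedding string should be passed in full.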

Example response

{
  "request_id": "111111111",
  "status": "OK",
  "errors": [],
  "result": [
    {
      "chunk_id": "1",
      "chunk": "Chunk 1",
      "embedding": "-0.010441,-0.002826,-0.022911,0.000847,0.025610,0.019213,-0.019912,0.008210,0.011974,-0.010120,-0.003866,-0.008091,-0.006889,-0.034774,...,-0.012572,0.009668,0.010963,-0.005273,-0.005072,-0.002190,-0.001554,-0.000058",
      "type": "text"
    },
    {
      "chunk_id": "2",
      "chunk": "Chunk 2",
      "embedding": "-0.010441,-0.002826,-0.022911,0.000847,0.025610,0.019213,-0.019912,0.008210,0.011974,-0.010120,-0.003866,-0.008091,-0.006889,-0.034774,...,-0.012572,0.009668,0.010963,-0.005273,-0.005072,-0.002190,-0.001554,-0.000058",
      "type": "image",
      "img_url": "http://127.0.0.1"
    },
    {
      "chunk_id": "3",
      "chunk": "Chunk 3",
      "type": "text"
    }
  ]
}