OpenSearch: Parse and embed multimodal data

Last Updated: Feb 28, 2026

This topic describes how to build a complete pipeline in AI Search Open Platform that parses documents and images, chunks text, and generates dense and sparse vector embeddings.

Prerequisites

Before you begin, ensure the following requirements are met:

AI Search Open Platform supports service calls over the Internet or a virtual private cloud (VPC), including cross-region calls through VPC. Users in the Germany (Frankfurt) region can access services in the AI Search Open Platform console by using VPC endpoints.

Pipeline overview

Multimodal data processing follows a sequential pipeline that transforms raw documents and images into vector representations for search and retrieval:

Raw Document / Image
        |
        v
  +---------------------+
  | 1. Document Parsing  |  Extract structured content from unstructured documents
  |    Image Parsing     |  Extract text from images (VLM or OCR)
  +---------------------+
        |
        v
  +---------------------+
  | 2. Document Chunking |  Segment parsed content into smaller pieces by paragraph,
  |                      |  semantics, or custom rules
  +---------------------+
        |
        v
  +---------------------+
  | 3. Text Embedding    |  Convert chunks into dense vectors (text embedding)
  |    Sparse Embedding  |  Convert chunks into sparse vectors (text sparse embedding)
  +---------------------+
        |
        v
  Vector Store / Search Engine

Each stage is powered by a dedicated API service. All services require AI Search Open Platform API access and are billed based on actual usage.

Available services

Each pipeline stage uses one or more services identified by a unique service ID that you specify in API calls.

Document parsing

| Service name | Service ID | Description |
| --- | --- | --- |
| Document Content Parsing Service | ops-document-analyze-001 | General-purpose document parsing service. Extracts logical structures such as titles and paragraphs from unstructured documents (including text, tables, and images) to generate structured data. |

Image parsing

Choose between VLM-based and OCR-based parsing depending on your accuracy and cost requirements.

| Service name | Service ID | Description |
| --- | --- | --- |
| Image Content Recognition Service 001 | ops-image-analyze-vlm-001 | Parses and understands image content and identifies text based on multimodal LLMs. The parsed text can be used for image retrieval and conversational search scenarios. |
| Image Text Recognition Service 001 | ops-image-analyze-ocr-001 | Uses optical character recognition (OCR) capabilities for image text recognition. The parsed text can be used for image retrieval and conversational search scenarios. |

Document chunking

| Service name | Service ID | Description |
| --- | --- | --- |
| Common Document Slicing Service | ops-document-split-001 | General-purpose text chunking service. Segments structured data in HTML, Markdown, and TXT formats based on paragraphs, semantics, and specific rules. Can also extract code, images, and tables from rich text. |
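
To make the idea of paragraph-based chunking concrete, the following is a rough local illustration only. It is not the algorithm used by ops-document-split-001, which this topic does not describe. The function name `chunk_by_paragraph` and the `max_chars` limit are assumptions made for this sketch.

```python
from typing import List


def chunk_by_paragraph(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Each chunk holds as many whole paragraphs as fit in max_chars.
    A single paragraph longer than max_chars is kept whole, so a chunk
    can exceed the limit in that case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or the chunk is empty and must take it.
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

The hosted service additionally supports semantic and rule-based segmentation and rich text extraction, none of which this sketch attempts.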

Text embedding

Choose an embedding model based on your language support needs, maximum input length, and desired vector dimensions.

| Service name | Service ID | Languages | Max tokens | Dimensions |
| --- | --- | --- | --- | --- |
| OpenSearch text embedding service -001 | ops-text-embedding-001 | 40+ languages | 300 | 1,536 |
| OpenSearch Universal Text Vectorization Service -002 | ops-text-embedding-002 | 100+ languages | 8,192 | 1,024 |
| OpenSearch text vectorization service-Chinese -001 | ops-text-embedding-zh-001 | Chinese | 1,024 | 768 |
| OpenSearch text vectorization service-English -001 | ops-text-embedding-en-001 | English | 512 | 768 |
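
The selection criteria above can be expressed as a small lookup. The following sketch copies the limits from the table and picks the lowest-dimension service that covers a given language and chunk length. The names `EmbeddingService` and `pick_embedding_service`, the language labels (`"multi"`, `"zh"`, `"en"`), and the "prefer fewer dimensions" tie-break are assumptions made for illustration, not platform behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmbeddingService:
    service_id: str
    languages: str   # "multi", "zh", or "en" (labels chosen for this sketch)
    max_tokens: int
    dimensions: int


# Values copied from the text embedding table in this topic.
SERVICES = [
    EmbeddingService("ops-text-embedding-001", "multi", 300, 1536),
    EmbeddingService("ops-text-embedding-002", "multi", 8192, 1024),
    EmbeddingService("ops-text-embedding-zh-001", "zh", 1024, 768),
    EmbeddingService("ops-text-embedding-en-001", "en", 512, 768),
]


def pick_embedding_service(
    language: str, longest_chunk_tokens: int
) -> Optional[EmbeddingService]:
    """Return the lowest-dimension service that fits, or None if none does."""
    candidates = [
        s for s in SERVICES
        if s.languages in ("multi", language)
        and s.max_tokens >= longest_chunk_tokens
    ]
    return min(candidates, key=lambda s: s.dimensions, default=None)
```

For example, `pick_embedding_service("en", 400)` selects ops-text-embedding-en-001, while a 2,000-token Chinese chunk only fits ops-text-embedding-002. The sketch treats both multilingual services as covering any language, so check the actual language lists before relying on it.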

Text sparse embedding

Text sparse embedding converts text into sparse vectors, which occupy less storage space than dense vectors. Sparse vectors represent keywords and information about frequently used terms. Combine sparse and dense vectors in a hybrid search to improve retrieval quality.

| Service name | Service ID | Languages | Max tokens |
| --- | --- | --- | --- |
| OpenSearch text sparse vectorization service-generic | ops-text-sparse-embedding-001 | 100+ languages | 8,192 |
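
One common way to combine the two vector types in a hybrid search is a weighted sum of a dense similarity and a sparse dot product. The sketch below shows that idea under stated assumptions: the weighting scheme, the `alpha` parameter, and the representation of a sparse vector as a `{token_id: weight}` dictionary are illustrative choices, not the scoring used by any particular search engine.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product over the token IDs that both sparse vectors share."""
    if len(a) > len(b):
        a, b = b, a  # Iterate over the smaller dictionary.
    return sum(weight * b[token_id] for token_id, weight in a.items() if token_id in b)


def hybrid_score(
    query_dense: List[float],
    doc_dense: List[float],
    query_sparse: Dict[int, float],
    doc_sparse: Dict[int, float],
    alpha: float = 0.5,
) -> float:
    """Blend dense and sparse relevance; alpha=1.0 means dense only."""
    dense = cosine_similarity(query_dense, doc_dense)
    sparse = sparse_dot(query_sparse, doc_sparse)
    return alpha * dense + (1.0 - alpha) * sparse
```

In practice the two scores live on different scales, so the value of `alpha` usually needs tuning against your own relevance data.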

Development frameworks

Note

For ease of use, AI Search Open Platform provides four types of development frameworks:

  • Java SDK.

  • Python SDK.

  • If your business already uses the LangChain development framework, select LangChain.

  • If your business already uses the LlamaIndex development framework, select LlamaIndex.

Step 1: Select services and download the code

This example uses SDK for Python as the development framework to build a multimodal data processing solution.

  1. Log on to the AI Search Open Platform console.

  2. In the top navigation bar, select the Germany (Frankfurt) region.

  3. In the left-side navigation pane, select Scene Center. On the Scene Center page, click Enter in the Multimodal Data Processing Scenario - Data Parsing and Vectorization section.

  4. On the Basic configuration tab of the Scene Development tab, select the services that you want to use from the drop-down lists in the Service Name column. On the Service Details tab, you can view the service details.

Note
  • If you want to use an algorithm service in the RAG-based solution by calling an API operation, you must specify the service ID by using the service_id parameter. For example, the ID of the document content parsing service is ops-document-analyze-001.

  • After you select a service, the service_id parameter in the generated code is modified accordingly. After you download the code to your local environment, you can modify the service_id parameter in the code to call other services.

After you select the services, click the button labeled "After the configuration is completed, enter the code query" to view and download the code based on the execution flow.
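
As the notes in step 4 say, you can change the service_id values after you download the code. A minimal sketch of switching one stage, for example from OCR-based to VLM-based image parsing, might look like the following. The helper `with_service` and the `KNOWN_SERVICE_IDS` mapping are hypothetical names for this illustration. The service IDs are the ones listed in this topic, and the stage keys match the `service_id_config` dictionary in the sample code.

```python
from typing import Dict

# Service IDs listed in the "Available services" section of this topic.
KNOWN_SERVICE_IDS = {
    "document_analyze": {"ops-document-analyze-001"},
    "image_analyze": {"ops-image-analyze-vlm-001", "ops-image-analyze-ocr-001"},
    "split": {"ops-document-split-001"},
    "text_embedding": {
        "ops-text-embedding-001",
        "ops-text-embedding-002",
        "ops-text-embedding-zh-001",
        "ops-text-embedding-en-001",
    },
    "text_sparse_embedding": {"ops-text-sparse-embedding-001"},
}


def with_service(config: Dict[str, str], stage: str, service_id: str) -> Dict[str, str]:
    """Return a copy of config with one stage switched to another service ID."""
    if stage not in KNOWN_SERVICE_IDS:
        raise KeyError(f"unknown pipeline stage: {stage}")
    if service_id not in KNOWN_SERVICE_IDS[stage]:
        raise ValueError(f"{service_id} is not listed for stage {stage}")
    updated = dict(config)  # Leave the caller's dictionary untouched.
    updated[stage] = service_id
    return updated
```

Validating against a fixed set catches typos early, at the cost of needing an update whenever the platform adds a service.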

Pipeline execution flow

The pipeline code executes the following stages in sequence. Call the main function document_pipeline_execute to run the entire pipeline. You can specify the document to process by using a document URL or Base64-encoded file.

  1. Parse the document or image. For more information, see Document content parsing and Image content extraction.

    • Call the asynchronous operation for document parsing to extract content from a document URL or Base64-encoded file.

    • Call the asynchronous operation for image parsing to extract content from an image URL or Base64-encoded file.

  2. Chunk the document. For more information, see Document chunking.

    • Call the document chunking operation to segment the parsed document based on a specific policy.

    • Call the document_split function to segment the document. This process includes text segmentation and rich text parsing.

  3. Generate text embeddings. For more information, see Text embedding and Sparse text embedding.

    • Call the text embedding operation to convert the chunked data into dense vectors.

    • Call the text sparse embedding operation to convert the chunked data into sparse vectors. If you need to perform content retrieval later, you can write the vectors to a search engine.
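
Because the parsing operations in stage 1 are asynchronous, the caller has to poll the task status until it finishes. The sketch below shows that pattern with a timeout added so that a stuck task does not block forever. The status strings "PENDING" and "SUCCESS" are taken from the sample code in this topic. The `poll_until_done` helper, the `fetch_status` callback, and the timeout behavior are assumptions made for illustration and are not part of the SDK.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[str]],
    interval: float = 5.0,
    timeout: float = 600.0,
) -> str:
    """Poll fetch_status until it reports SUCCESS, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = await fetch_status()
        if status == "SUCCESS":
            return status
        if status != "PENDING":
            # Any other status is treated as a failure, as in the sample code.
            raise RuntimeError(f"task ended with status {status!r}")
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"task still pending after {timeout} seconds")
        await asyncio.sleep(interval)
```

In a real pipeline, `fetch_status` would wrap the status call for the relevant task and return the status field of the response.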

On the Code Query tab, click Text Parsing and Vectorization. In the code editor, click Copy Code or Download File to download the code to your device.

Step 2: Test the code in your local environment

After you download the code files, configure the required parameters described in the following table.

| Section | Parameter | Description |
| --- | --- | --- |
| AI Search Open Platform | api_key | The API key. For more information about how to obtain the API key, see Manage API keys. |
| AI Search Open Platform | aisearch_endpoint | The API endpoint. For more information about how to obtain the API endpoint, see Query service endpoint. Note: You must remove the http:// prefix. You can call API operations over the Internet or a VPC. |
| AI Search Open Platform | workspace_name | The AI Search Open Platform workspace name. |
| AI Search Open Platform | service_id | The service ID. To facilitate code development, you can configure services and specify service IDs separately in the offline.py and online.py files by using the service_id_config parameter. |
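
Because the endpoint must be supplied without the http:// prefix, it can help to normalize the value before you pass it to the client. The following is a minimal sketch. The helper name `normalize_endpoint` is hypothetical, and stripping an https:// prefix or a trailing slash goes beyond what this topic requires and is an assumption of the sketch.

```python
def normalize_endpoint(raw: str) -> str:
    """Strip a leading http:// or https:// scheme and any trailing slash."""
    endpoint = raw.strip()
    for prefix in ("http://", "https://"):
        if endpoint.lower().startswith(prefix):
            endpoint = endpoint[len(prefix):]
            break
    return endpoint.rstrip("/")
```

For example, `normalize_endpoint("http://xxx.platform-cn-shanghai.opensearch.aliyuncs.com/")` yields the bare host name used for `aisearch_endpoint` in the sample code.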

After you configure the parameters, run the code in Python 3.8.1 or later to verify the results.

If you preprocess AI Search Open Platform data in the code, the running results are as follows.

[Image: screenshot of the sample run output]

Sample code: Document parsing and embedding

# Multimodal data processing solution

# Requirements:
# Python 3.8.1 or later
# pip install alibabacloud_searchplat20240529

# AI Search Open Platform configuration
aisearch_endpoint = "xxx.platform-cn-shanghai.opensearch.aliyuncs.com"
api_key = "OS-xxx"
workspace_name = "default"
service_id_config = {
    "document_analyze": "ops-document-analyze-001",
    "split": "ops-document-split-001",
    "text_embedding": "ops-text-embedding-001",
    "text_sparse_embedding": "ops-text-sparse-embedding-001",
    "image_analyze": "ops-image-analyze-ocr-001"
}

# Specify the document URL.
# In this example, the OpenSearch product description document is used.
document_url = "https://www.alibabacloud.com/help/zh/open-search/search-platform/product-overview/introduction-to-search-platform?spm=a2c4g.11186623.0.0.7ab93526WDzQ8z"

import asyncio
from typing import List
from alibabacloud_tea_openapi.models import Config
from alibabacloud_searchplat20240529.client import Client
from alibabacloud_searchplat20240529.models import (
    GetDocumentSplitRequest,
    CreateDocumentAnalyzeTaskRequest,
    CreateDocumentAnalyzeTaskRequestDocument,
    GetDocumentAnalyzeTaskStatusRequest,
    GetDocumentSplitRequestDocument,
    GetTextEmbeddingRequest,
    GetTextEmbeddingResponseBodyResultEmbeddings,
    GetTextSparseEmbeddingRequest,
    GetTextSparseEmbeddingResponseBodyResultSparseEmbeddings,
    GetImageAnalyzeTaskStatusResponse,
    CreateImageAnalyzeTaskRequest,
    GetImageAnalyzeTaskStatusRequest,
    CreateImageAnalyzeTaskRequestDocument,
    CreateImageAnalyzeTaskResponse,
)


async def poll_doc_analyze_task_result(ops_client, task_id, service_id, interval=5):
    """Poll until the document analysis task completes or fails."""
    while True:
        request = GetDocumentAnalyzeTaskStatusRequest(task_id=task_id)
        response = await ops_client.get_document_analyze_task_status_async(
            workspace_name, service_id, request
        )
        status = response.body.result.status
        if status == "PENDING":
            await asyncio.sleep(interval)
        elif status == "SUCCESS":
            return response
        else:
            # The error field may be empty, so format it instead of concatenating.
            error = response.body.result.error
            print(f"error: {error}")
            raise Exception(f"document analyze task failed: {error}")


def is_analyzable_url(url: str):
    """Check whether the URL points to a supported image format."""
    if not url:
        return False
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.tiff'}
    return url.lower().endswith(tuple(image_extensions))


async def image_analyze(ops_client, url):
    """Parse an image URL and return the extracted text content."""
    try:
        print("image analyze: " + url)
        if url.startswith("//"):
            url = "https:" + url
        if not is_analyzable_url(url):
            print(url + " is unanalyzable.")
            return url
        image_analyze_service_id = service_id_config["image_analyze"]
        document = CreateImageAnalyzeTaskRequestDocument(url=url)
        request = CreateImageAnalyzeTaskRequest(document=document)
        response: CreateImageAnalyzeTaskResponse = (
            ops_client.create_image_analyze_task(
                workspace_name, image_analyze_service_id, request
            )
        )
        task_id = response.body.result.task_id
        while True:
            request = GetImageAnalyzeTaskStatusRequest(task_id=task_id)
            response: GetImageAnalyzeTaskStatusResponse = (
                ops_client.get_image_analyze_task_status(
                    workspace_name, image_analyze_service_id, request
                )
            )
            status = response.body.result.status
            if status == "PENDING":
                await asyncio.sleep(5)
            elif status == "SUCCESS":
                return url + response.body.result.data.content
            else:
                print("image analyze error: " + response.body.result.error)
                return url
    except Exception as e:
        # Fall back to the raw URL so the caller never receives None.
        print(f"image analyze Exception: {e}")
        return url


def chunk_list(lst, chunk_size):
    """Split a list into batches of the specified size."""
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]


async def document_pipeline_execute(
    document_url: str = None,
    document_base64: str = None,
    file_name: str = None
):
    # Initialize the AI Search Open Platform client.
    config = Config(
        bearer_token=api_key,
        endpoint=aisearch_endpoint,
        protocol="http"
    )
    ops_client = Client(config=config)

    # Stage 1: Parse the document or image.
    document_analyze_request = CreateDocumentAnalyzeTaskRequest(
        document=CreateDocumentAnalyzeTaskRequestDocument(
            url=document_url,
            content=document_base64,
            file_name=file_name,
            file_type='html'  # Matches the sample URL; change it for other formats.
        )
    )
    document_analyze_response = await ops_client.create_document_analyze_task_async(
        workspace_name=workspace_name,
        service_id=service_id_config["document_analyze"],
        request=document_analyze_request
    )
    print("document_analyze task_id: " + document_analyze_response.body.result.task_id)
    extraction_result = await poll_doc_analyze_task_result(
        ops_client,
        document_analyze_response.body.result.task_id,
        service_id_config["document_analyze"]
    )
    print("document_analyze done")
    document_content = extraction_result.body.result.data.content
    content_type = extraction_result.body.result.data.content_type

    # Stage 2: Chunk the document.
    document_split_request = GetDocumentSplitRequest(
        GetDocumentSplitRequestDocument(
            content=document_content,
            content_type=content_type
        )
    )
    document_split_result = await ops_client.get_document_split_async(
        workspace_name,
        service_id_config["split"],
        document_split_request
    )
    print(
        "document-split done, chunks count: "
        + str(len(document_split_result.body.result.chunks))
        + " rich text count: "
        + str(len(document_split_result.body.result.rich_texts))
    )

    # Stage 3: Generate text embeddings.
    # Extract the chunking results. For image chunks, use the image parsing
    # service to extract text before embedding.
    doc_list = (
        [
            {"id": chunk.meta.get("id"), "content": chunk.content}
            for chunk in document_split_result.body.result.chunks
        ]
        + [
            {"id": chunk.meta.get("id"), "content": chunk.content}
            for chunk in document_split_result.body.result.rich_texts
            if chunk.meta.get("type") != "image"
        ]
        + [
            {
                "id": chunk.meta.get("id"),
                "content": await image_analyze(ops_client, chunk.content),
            }
            for chunk in document_split_result.body.result.rich_texts
            if chunk.meta.get("type") == "image"
        ]
    )

    # A maximum of 32 vectors can be generated per request.
    chunk_size = 32
    all_text_embeddings: List[GetTextEmbeddingResponseBodyResultEmbeddings] = []
    for chunk in chunk_list([text["content"] for text in doc_list], chunk_size):
        response = await ops_client.get_text_embedding_async(
            workspace_name,
            service_id_config["text_embedding"],
            GetTextEmbeddingRequest(chunk)
        )
        all_text_embeddings.extend(response.body.result.embeddings)

    all_text_sparse_embeddings: List[
        GetTextSparseEmbeddingResponseBodyResultSparseEmbeddings
    ] = []
    for chunk in chunk_list([text["content"] for text in doc_list], chunk_size):
        response = await ops_client.get_text_sparse_embedding_async(
            workspace_name,
            service_id_config["text_sparse_embedding"],
            GetTextSparseEmbeddingRequest(
                chunk, input_type="document", return_token=True
            ),
        )
        all_text_sparse_embeddings.extend(response.body.result.sparse_embeddings)

    for i in range(len(doc_list)):
        doc_list[i]["embedding"] = all_text_embeddings[i].embedding
        doc_list[i]["sparse_embedding"] = all_text_sparse_embeddings[i].embedding

    print("text-embedding done.")
    # Return the chunks with their vectors so the caller can index them.
    return doc_list


if __name__ == "__main__":
    # Run the asynchronous pipeline.
    # In Jupyter Notebook, uncomment the next two lines first:
    # import nest_asyncio
    # nest_asyncio.apply()
    asyncio.run(document_pipeline_execute(document_url))
    # You can also use a Base64-encoded file to specify the document:
    # asyncio.run(document_pipeline_execute(
    #     document_base64="eHh4eHh4eHg...", file_name="attention.pdf"
    # ))
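
If you want to write the vectors to a search engine afterwards, you first need the stage 3 entries in a serializable form. The sketch below converts entries shaped like `doc_list` into JSON Lines. It assumes that you have those entries available, for example by returning them from the pipeline function, and that each sparse embedding item exposes `token_id` and `weight` attributes. Verify both against the SDK response model. The helper names and the output field layout are hypothetical, and a real index will dictate its own schema.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def to_index_record(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one pipeline entry into a plain, JSON-serializable dict."""
    # Assumption: sparse items expose token_id and weight attributes.
    sparse = {str(item.token_id): item.weight for item in doc["sparse_embedding"]}
    return {
        "id": doc["id"],
        "content": doc["content"],
        "embedding": list(doc["embedding"]),
        "sparse_embedding": sparse,
    }


def to_json_lines(docs: Iterable[Dict[str, Any]]) -> str:
    """Serialize entries as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps(to_index_record(doc), ensure_ascii=False) for doc in docs
    )


# Stand-in entry for illustration; real entries come from the pipeline.
sample = {
    "id": "c1",
    "content": "hello",
    "embedding": [0.1, 0.2],
    "sparse_embedding": [SimpleNamespace(token_id=101, weight=0.5)],
}
```

Calling `to_json_lines([sample])` produces a single line that a bulk-import job could consume after you map the fields to your index schema.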