OpenSearch: Document content parsing

Last Updated: Aug 05, 2025

AI Search Open Platform allows you to call the document content parsing service by using an API. You can integrate the service into your business processing chain to parse unstructured data into structured data and apply the structured data to your business.

| Service name | Service ID | Service description | QPS limit for API calls (Alibaba Cloud account and RAM users) |
| --- | --- | --- | --- |
| Document Parsing Service-001 | ops-document-analyze-001 | Extracts logical hierarchical structures, such as titles and segments, as well as text, tables, images, and other information from unstructured documents, and outputs them in a structured format. Supported document types: TXT, PDF, HTML, DOC, DOCX, PPT, and PPTX. | 10 |

Note

To apply for higher QPS, submit a ticket.

Prerequisites

  • The authentication information is obtained.

    When you call an AI Search Open Platform service by using an API, you need to authenticate the caller's identity.

  • The service access address is obtained.

    You can call a service over the Internet or a virtual private cloud (VPC). For more information, see Get service registration address.

General information

  • The request body cannot exceed 8 MB.

Overview

Document content parsing provides both synchronous and asynchronous interfaces. Because a long-running parse can exceed the HTTP timeout, the synchronous interface is intended for debugging and is not recommended for production environments. For production environments, use the asynchronous interfaces in two steps: first, create an asynchronous extraction task to obtain a task_id; then, call the asynchronous task retrieval interface repeatedly to query the status until the task is completed.

Create an asynchronous extraction task

Request method

POST

URL

{host}/v3/openapi/workspaces/{workspace_name}/document-analyze/{service_id}/async
  • host: the address for invoking the service. You can call an API service over the Internet or a VPC. For more information, see Get service registration address.

  • workspace_name: the name of the workspace, such as default.

  • service_id: the built-in service ID, such as ops-document-analyze-001.

Request parameters

Header parameters

API key authentication

| Parameter | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| Content-Type | String | Yes | The request type. Valid value: application/json. | application/json |
| Authorization | String | Yes | The API key. | Bearer OS-d1**2a |

Body parameters

| Parameter | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| service_id | String | Yes | The built-in service ID. | ops-document-analyze-001 |
| document.url | String | No | The document URL. HTTP and HTTPS are supported. Make sure that the document can be downloaded statelessly over the public network. Either document.content or document.url is required. | http://opensearch-shanghai.oss-cn-shanghai.aliyuncs.com/chatos/***/file-parser/samples/GB10767.pdf |
| document.content | String | No | The document content encoded in Base64. Either document.content or document.url is required. | "aGVsbG8gd29ybGQ=" |
| document.file_name | String | No | The document name. If you omit this parameter, the name is inferred from document.url. If document.url is also omitted, you must specify the document name explicitly. | test.pdf |
| document.file_type | String | No | The document type. If you omit this parameter, the type is inferred from the suffix of the document name. If the type cannot be inferred, you must specify it explicitly. Supported document types: TXT, PDF, HTML, DOC, DOCX, PPT, and PPTX. | pdf |
| output.image_storage | String | No | The image storage method. Valid values: base64 (default) and url (the URL is valid for 3 days). | url |
| strategy.enable_semantic | Boolean | No | Specifies whether to enable semantic hierarchy extraction when parsing TXT documents or documents with an unclear hierarchical structure. Valid values: true and false (default). If you set this parameter to true, the model service returns the document in a hierarchical Markdown format, which helps improve the accuracy of subsequent document slicing. This feature does not support HTML, PPT, or PPTX documents. Enabling it increases the overall parsing time; if parsing takes longer than 400 seconds or the document is very long (more than 100 pages), the system may automatically disable the feature. The usage billing parameter then includes semantic_token_count, which shows the number of tokens used by the model, and you are charged based on this token count. | false |

For documents without a clear distinction between the table of contents and the main text, semantic hierarchy extraction makes the hierarchical structure in the results more accurate.

Note

If a value is returned for the usage.semantic_token_count parameter, semantic hierarchy extraction took effect and you are billed for the semantic token consumption. If no value is returned, the extraction did not take effect and you are not billed for it.
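The check described in this note can be sketched as a small helper that reads usage.semantic_token_count from a parsed JSON response. The function name semantic_tokens_billed is illustrative and not part of the service; the sketch assumes that "no return value" means the field is either absent or null.

```python
from typing import Optional


def semantic_tokens_billed(response: dict) -> Optional[int]:
    """Return the billed semantic token count, or None if the field is missing.

    Assumption: a missing or null usage.semantic_token_count means that
    semantic hierarchy extraction did not take effect.
    """
    usage = response.get("usage") or {}
    return usage.get("semantic_token_count")
```

A return value of None then indicates that the result was produced without semantic hierarchy extraction.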

You can estimate the time and token consumption after enabling semantic hierarchy extraction based on the following table.

| PDF pages | Token count | Time without semantic hierarchy extraction (s) | Time with semantic hierarchy extraction (s) | Semantic tokens |
| --- | --- | --- | --- | --- |
| 7 | 11,504 | 2 | 49 | 36,243 |
| 25 | 10,375 | 1 | 33 | 59,332 |
| 42 | 41,435 | 5 | 68 | 130,717 |
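To make the overhead in the table easier to compare, the following sketch derives the extra parsing time and the ratio of semantic tokens to document tokens for each sample row. The numbers are copied from the table; the helper name overhead is illustrative, and three samples are too few to treat the ratios as a pricing formula.

```python
# Rows copied from the table above:
# (pdf_pages, token_count, seconds_without, seconds_with, semantic_tokens)
SAMPLES = [
    (7, 11_504, 2, 49, 36_243),
    (25, 10_375, 1, 33, 59_332),
    (42, 41_435, 5, 68, 130_717),
]


def overhead(row):
    """Summarize the cost of enabling semantic hierarchy extraction for one row."""
    pages, tokens, t_plain, t_semantic, semantic_tokens = row
    return {
        "pages": pages,
        "extra_seconds": t_semantic - t_plain,
        "semantic_tokens_per_token": round(semantic_tokens / tokens, 2),
    }


for row in SAMPLES:
    print(overhead(row))
```

In these samples the semantic token count is several times the document token count, so it is worth budgeting for both the added latency and the added token charges before enabling the feature.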

Response parameters

| Parameter | Type | Description | Example |
| --- | --- | --- | --- |
| result.task_id | String | The ID of the asynchronous document parsing task. | d5a4019e-853a-****-b5b6-8053d9f5a9fc |

cURL request example

curl --location 'http://****shanghai.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-analyze/ops-document-analyze-001/async/' \
--header 'Authorization: Bearer your API Key' \
--header 'Content-Type: application/json' \
--data '{
  "document": {
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241018/jahnyn/%E8%A7%A3%E6%9E%90%E6%B5%8B%E8%AF%95.doc"
  },
  "output": {
    "image_storage": "base64"
  },
  "strategy": {
    "enable_semantic": true
  }
}'
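The cURL example passes a document URL. For a local file, the request body carries the Base64-encoded content in document.content together with document.file_name. The following Python sketch builds such a request with the standard library only. The function name build_async_request, its parameters, and the interpretation of 8 MB as 8 × 1024 × 1024 bytes are assumptions made for illustration.

```python
import base64
import json
import urllib.request

# Assumption: the 8 MB request body limit is interpreted as 8 * 1024 * 1024 bytes.
MAX_BODY_BYTES = 8 * 1024 * 1024


def build_async_request(host, api_key, file_name, file_bytes,
                        workspace="default",
                        service_id="ops-document-analyze-001",
                        enable_semantic=False):
    """Build (but do not send) a POST request that creates an async parsing task."""
    body = {
        "document": {
            # document.content must be Base64-encoded.
            "content": base64.b64encode(file_bytes).decode("ascii"),
            # Without document.url, the name cannot be inferred and is required.
            "file_name": file_name,
        },
        "output": {"image_storage": "base64"},
        "strategy": {"enable_semantic": enable_semantic},
    }
    data = json.dumps(body).encode("utf-8")
    if len(data) > MAX_BODY_BYTES:
        raise ValueError("request body exceeds 8 MB; consider document.url instead")
    url = (f"{host}/v3/openapi/workspaces/{workspace}"
           f"/document-analyze/{service_id}/async")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to urllib.request.urlopen sends the request; the task ID is then available as result.task_id in the JSON response. Because Base64 encoding inflates the payload by roughly one third, files well under 8 MB on disk can still exceed the limit.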

Response example

Normal response example

{
    "request_id": "D5A4019E-853A-4E20-****-8053D9F5A9FC",
    "latency": 5.0,
    "http_code": 200,
    "result": {
        "task_id": "d5a4019e-853a-****-b5b6-8053d9f5a9fc"
    }
}

Abnormal response example

If a request fails, the response indicates the cause in the code and message fields.

{
    "request_id": "590A7EB8-AA84-****-AF31-8C35DC965972",
    "latency": 0.0,
    "code": "InvalidParameter",
    "http_code": 400,
    "message": "document.file_name required"
}

Get an asynchronous extraction task

Request method

GET

URL

{host}/v3/openapi/workspaces/{workspace_name}/document-analyze/{service_id}/async/task-status?task_id={task_id}
  • host: the address for invoking the service. You can call an API service over the Internet or a VPC. For more information, see Get service registration address.

  • workspace_name: the name of the workspace, such as default.

  • service_id: the built-in service ID, such as ops-document-analyze-001.

  • task_id: the asynchronous task ID returned in the document parsing response, such as d5a4019e-853a-****-b5b6-8053d9f5a9fc.

Request parameters

Header parameters

API key authentication

| Parameter | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| Content-Type | String | Yes | The request type. Valid value: application/json. | application/json |
| Authorization | String | Yes | The API key. | Bearer OS-d1**2a |

Response parameters

| Parameter | Type | Description | Example |
| --- | --- | --- | --- |
| result.task_id | String | The ID of the asynchronous document parsing task. | 24c3ad59-****-40cf-974b-b63d63e0571 |
| result.status | String | The task status. Valid values: PENDING, SUCCESS, and FAIL. | PENDING |
| result.error | String | The error message returned when result.status is FAIL. This parameter is empty if the task succeeds. | Document decryption failed |
| result.data | Object | The document parsing result. | markdown |
| result.data.content | String | The parsed content. PDF documents are output in Markdown format. Other documents are output in HTML format. | "XXX" |
| result.data.content_type | String | The format of the parsed content. Valid values: markdown and html. | markdown |
| result.data.page_num | Int | The number of pages in the parsed document. | 15 |
| request_id | String | The unique identifier that the system assigns to the API call. | B4AB89C8-B135-****-A6F8-2BAB8018688 |
| latency | Float/Int | The request duration. Unit: milliseconds. | 10 |
| usage | Object | The billing information generated by this call. | "usage": {"token_count": 123, "table_count": 5, "image_count": 6, "semantic_token_count": 3068} |
| usage.token_count | Int | The number of characters in the document. | 1234 |
| usage.table_count | Int | The number of tables in the document. | 5 |
| usage.image_count | Int | The number of images in the document. | 6 |
| usage.semantic_token_count | Int | The number of input tokens consumed by the semantic extraction model. | 3068 |

cURL request example

curl -X GET \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer Your API-KEY" \
  "http://****-hangzhou.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-analyze/ops-document-analyze-001/async/task-status?task_id=110d6349-2e51-****-8bfb-25e5de434686"

Response example

Normal response example

{
    "request_id": "27F9CEC3-9052-****-83FF-E7957B680492",
    "latency": 13.0,
    "http_code": 200,
    "result": {
        "status": "SUCCESS",
        "data": {
            "content": "Provided proper attribution is provided, Alibaba hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works....",
            "content_type": "markdown",
            "page_num": 15
        },
        "task_id": "24c3ad59-b196-****-974b-b63d63e05895"
    },
    "usage": {
        "token_count": 31867,
        "table_count": 4,
        "image_count": 8,
        "semantic_token_count":3068
    }
}

Abnormal response example

If a request fails, the response indicates the cause in the code and message fields.

{
    "request_id": "0F94BD89-989C-****-963C-6E4F3FF99445",
    "latency": 3.0,
    "code": "BadRequest.TaskNotExist",
    "http_code": 404,
    "message": "task[2fda34f5-40b4-****-a9a2-3e2c1e807361] not exist"
}
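The two asynchronous steps (create a task, then query its status until it leaves PENDING) can be combined into a polling loop. The following Python sketch uses only the standard library. The function names, the 2-second interval, and the 600-second timeout are assumptions chosen for illustration, not values prescribed by the service. Note that urllib raises urllib.error.HTTPError for non-2xx responses such as the 404 above, so callers may want to catch it.

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_task_status(host, api_key, task_id,
                      workspace="default",
                      service_id="ops-document-analyze-001"):
    """Send one GET request to the task-status endpoint and return the parsed JSON."""
    query = urllib.parse.urlencode({"task_id": task_id})
    url = (f"{host}/v3/openapi/workspaces/{workspace}"
           f"/document-analyze/{service_id}/async/task-status?{query}")
    req = urllib.request.Request(url, headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_task(fetch, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Call fetch() until result.status is SUCCESS or FAIL, or the timeout elapses.

    fetch is any zero-argument callable that returns the parsed response,
    which keeps the loop testable without network access.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = fetch()
        result = response.get("result") or {}
        status = result.get("status")
        if status == "SUCCESS":
            return result["data"]
        if status == "FAIL":
            raise RuntimeError(result.get("error") or "document parsing failed")
        if status != "PENDING":
            raise RuntimeError(f"unexpected response: {response}")
        if time.monotonic() >= deadline:
            raise TimeoutError("task is still PENDING after the timeout")
        sleep(interval)
```

A typical call would be wait_for_task(lambda: fetch_task_status(host, api_key, task_id)), where host, api_key, and task_id come from your own configuration and from the task creation response. Keep the polling interval conservative so that status queries stay within the QPS limit.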

Create a synchronous extraction task

Important

We recommend that you do not use the synchronous interface in production environments because of the HTTP timeout risk. Use this interface for debugging only.

Request method

POST

URL

{host}/v3/openapi/workspaces/{workspace_name}/document-analyze/{service_id}/sync

Parameter description

  • host: the address for invoking the service. You can call an API service over the Internet or a VPC. For more information, see Get service registration address.

  • workspace_name: the name of the workspace, such as default.

  • service_id: the built-in service ID, such as ops-document-analyze-001.

Request parameters

Header parameters

API key authentication

| Parameter | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| Content-Type | String | Yes | The request type. Valid value: application/json. | application/json |
| Authorization | String | Yes | The API key. | Bearer OS-d1**2a |

Body parameters

| Parameter | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| document.url | String | No | The document URL. HTTP and HTTPS are supported. Make sure that the document can be downloaded statelessly over the public network. Either document.content or document.url is required. | http://opensearch-shanghai.oss-cn-shanghai.aliyuncs.com/chatos/***/file-parser/samples/GB10767.pdf |
| document.content | String | No | The document content encoded in Base64. Either document.content or document.url is required. | "aGVsbG8gd29ybGQ=" |
| document.file_name | String | No | The document name. If you omit this parameter, the name is inferred from document.url. If document.url is also omitted, you must specify the document name explicitly. | test.pdf |
| document.file_type | String | No | The document type. If you omit this parameter, the type is inferred from the suffix of the document name. If the type cannot be inferred, you must specify it explicitly. Supported document types: TXT, PDF, HTML, DOC, DOCX, PPT, and PPTX. | pdf |
| output.image_storage | String | No | The image storage method. Valid values: base64 (default) and url (the URL is valid for 3 days). | url |
| strategy.enable_semantic | Boolean | No | Specifies whether to enable semantic hierarchy extraction. If you enable this feature, the document is returned in a hierarchical Markdown format with better accuracy. Enabling it increases the overall parsing time; if parsing takes longer than 400 seconds or the document is very long (more than 100 pages), the system may automatically disable the feature. This feature does not support HTML, PPT, or PPTX documents. | false |

Response parameters

| Parameter | Type | Description | Example |
| --- | --- | --- | --- |
| result.status | String | The task status. Valid values: PENDING, SUCCESS, and FAIL. | PENDING |
| result.error | String | The error message returned when result.status is FAIL. This parameter is empty if the task succeeds. | Document decryption failed |
| result.data | Object | The document parsing result. | markdown |
| result.data.content | String | The parsed content. PDF documents are output in Markdown format. Other documents are output in HTML format. | "XXX" |
| result.data.content_type | String | The format of the parsed content. Valid values: markdown and html. | markdown |
| result.data.page_num | Int | The number of pages in the parsed document. | 15 |
| request_id | String | The unique identifier that the system assigns to the API call. | B4AB89C8-B135-****-A6F8-2BAB801A2CE4 |
| latency | Float/Int | The request duration. Unit: milliseconds. | 10 |
| usage | Object | The billing information generated by this call. | "usage": {"token_count": 123, "table_count": 5, "image_count": 6, "semantic_token_count": 3068} |
| usage.token_count | Int | The number of characters in the document. | 1234 |
| usage.table_count | Int | The number of tables in the document. | 5 |
| usage.image_count | Int | The number of images in the document. | 6 |
| usage.semantic_token_count | Int | The number of input tokens consumed by the semantic extraction model. | 3068 |

cURL request example

curl --location 'http://****shanghai.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-analyze/ops-document-analyze-001/sync/' \
--header 'Authorization: Bearer your API Key' \
--header 'Content-Type: application/json' \
--data '{
  "document": {
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241018/jahnyn/%E8%A7%A3%E6%9E%90%E6%B5%8B%E8%AF%95.doc"
  },
  "output": {
    "image_storage": "base64"
  },
  "strategy": {
    "enable_semantic": true
  }
}'

Response example

Normal response example

{
    "request_id": "27F9CEC3-9052-****-83FF-E7957B689D04",
    "latency": 13.0,
    "http_code": 200,
    "result": {
        "status": "SUCCESS",
        "data": {
            "content": "Provided proper attribution is provided, Alibaba hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works....",
            "content_type": "markdown",
            "page_num": 15
        }
    },
    "usage": {
        "token_count": 31867,
        "table_count": 4,
        "image_count": 8,
        "semantic_token_count":3068
    }
}

Abnormal response example

If a request fails, the response indicates the cause in the code and message fields.

{
    "request_id": "6F33AFB6-A35C-****-AFD2-9EA16CCF4383",
    "latency": 2.0,
    "code": "InvalidParameter",
    "http_code": 400,
    "message": "JSON parse error: Cannot deserialize value of type `ImageStorage` from String \"xxx\""
}

Status code description

| HTTP status code | Error code | Description |
| --- | --- | --- |
| 200 | - | The request is successful. This includes cases in which the task itself failed, so determine the actual task status from result.status. |
| 400 | InvalidParameter | The request is invalid. |
| 404 | BadRequest.TaskNotExist | The task does not exist. |
| 500 | InternalServerError | An internal error occurred. |
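Because an HTTP 200 response can still describe a failed task, client code needs to check both the top-level code field and result.status. The following Python sketch shows one way to do this for an already-parsed JSON response. The exception class, the function name check_response, and the "TaskFailed" label are hypothetical names introduced for this example and are not returned by the service.

```python
class DocumentAnalyzeError(Exception):
    """Raised when a response reports a request error or a failed task."""

    def __init__(self, http_code, code, message):
        super().__init__(f"{http_code} {code}: {message}")
        self.http_code = http_code
        self.code = code
        self.message = message


def check_response(response: dict) -> dict:
    """Return the result object of a response, or raise DocumentAnalyzeError."""
    # Request-level errors carry code and message, for example InvalidParameter.
    if response.get("http_code") != 200 or "code" in response:
        raise DocumentAnalyzeError(
            response.get("http_code"),
            response.get("code"),
            response.get("message"),
        )
    result = response.get("result") or {}
    # HTTP 200 also covers failed tasks, so inspect result.status as well.
    if result.get("status") == "FAIL":
        # "TaskFailed" is a label invented for this sketch, not a service error code.
        raise DocumentAnalyzeError(200, "TaskFailed", result.get("error"))
    return result
```

For a task creation response, the returned object contains task_id. For a status or synchronous response, it contains status and, on success, data.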