Document content parsing

AI Search Open Platform allows you to call the document content parsing service by using an API. You can integrate the service into your business processing chain to parse unstructured data into structured data and apply the structured data to your business.

Service name	Service ID	Service description	QPS limit for API calls (Alibaba Cloud account and RAM users)
Document Parsing Service	ops-document-analyze-001	Extracts logical hierarchical structures such as titles and segments from unstructured documents, together with text, tables, images, and other information, and outputs them in a structured format. The supported document types include TXT, PDF, HTML, DOC, DOCX, PPT, and PPTX.
Document Parsing Service	ops-document-analyze-002	Parses various unstructured document formats, such as PDFs and images. It excels at detecting complex elements, including tables, formulas, and charts, and delivers fast inference.

Note

To apply for higher QPS, submit a ticket.

Prerequisites

The authentication information is obtained.
When you call an AI Search Open Platform service by using an API, you need to authenticate the caller's identity.
The service access address is obtained.
You can call a service over the Internet or a virtual private cloud (VPC). For more information, see Get service registration address.

General information

The request body cannot exceed 8 MB.
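When a document is sent inline as Base64 in document.content, the encoding grows the data by about one third, so the limit is reached well before the raw file is 8 MB. The following is a minimal sketch of a pre-flight size check. It assumes that 8 MB means 8 × 1024 × 1024 bytes and reserves a hypothetical 1 KB margin for the surrounding JSON; neither value is stated in this document.

```python
import base64

MAX_BODY_BYTES = 8 * 1024 * 1024  # assumption: "8 MB" read as 8 MiB
JSON_OVERHEAD_BYTES = 1024        # hypothetical margin for the JSON wrapper


def base64_size(raw_size: int) -> int:
    """Length in bytes of the padded Base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def fits_in_request(raw_size: int) -> bool:
    """Return True if a file of raw_size bytes should fit once Base64-encoded."""
    return base64_size(raw_size) + JSON_OVERHEAD_BYTES <= MAX_BODY_BYTES


# Cross-check the formula against the standard library encoder.
assert base64_size(11) == len(base64.b64encode(b"hello world"))
```

Under these assumptions, a 5 MB file fits but a 6 MB file does not. Larger documents can be passed by URL through document.url instead.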

Overview

Document content parsing provides synchronous and asynchronous interfaces. Because of the risk of HTTP timeouts, use the synchronous interface only for debugging, not in production environments. For production environments, use the asynchronous interface, which involves two steps: create an asynchronous extraction task to obtain a task_id, then call the asynchronous task retrieval interface repeatedly to query the status until the task is completed.
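The second step of the asynchronous flow can be sketched as a polling loop. This is an illustration only: fetch_status is a hypothetical callable that performs the task-status query and returns the decoded JSON response, and the 2-second interval and 600-second timeout are arbitrary values, not documented recommendations.

```python
import time
from typing import Callable


def wait_for_task(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until result.status is no longer PENDING, then return the response.

    Responses without result.status (for example, error responses) are
    returned immediately so that the caller can inspect code and message.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        response = fetch_status()
        status = response.get("result", {}).get("status")
        if status != "PENDING":
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError("task is still PENDING after the timeout")
        sleep(interval_s)
```

Passing the sleep function as a parameter keeps the loop easy to exercise without real delays.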

Create an asynchronous extraction task

Request method

POST

URL

{host}/v3/openapi/workspaces/{workspace_name}/document-analyze/{service_id}/async

host: the address for invoking the service. You can call an API service over the Internet or a VPC. For more information, see Get service registration address.
workspace_name: the name of the workspace, such as default.
service_id: the built-in service ID, such as ops-document-analyze-001.
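As an illustration of how the three placeholders combine, the following sketch assembles the URL in Python. The function name and the example host are hypothetical; the path template is the one shown above.

```python
from urllib.parse import quote


def async_task_url(host: str, workspace_name: str, service_id: str) -> str:
    """Fill the {host}, {workspace_name}, and {service_id} placeholders."""
    base = host.rstrip("/")  # avoid a double slash if the host ends with "/"
    return (
        f"{base}/v3/openapi/workspaces/{quote(workspace_name, safe='')}"
        f"/document-analyze/{quote(service_id, safe='')}/async"
    )


# Hypothetical host; use the address obtained for your own service.
print(async_task_url("http://example.invalid", "default", "ops-document-analyze-001"))
```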

Request parameters

Header parameters

API key authentication

Parameter	Type	Required	Description	Example
Content-Type	String	Yes	The request type. Set the value to application/json.	application/json
Authorization	String	Yes	The API key.	Bearer OS-d1**2a
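A minimal sketch of building these two headers in Python follows. The environment variable name OPENSEARCH_API_KEY is a hypothetical choice for keeping the key out of source code; it is not defined by the platform.

```python
import os
from typing import Optional


def auth_headers(api_key: Optional[str] = None) -> dict:
    """Return the Content-Type and Authorization headers for an API call."""
    # OPENSEARCH_API_KEY is a hypothetical variable name used for illustration.
    key = api_key or os.environ.get("OPENSEARCH_API_KEY")
    if not key:
        raise ValueError("no API key supplied")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }
```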

Body parameters

Parameter	Type	Required	Description	Example
service_id	String	Yes	The built-in service ID.	ops-document-analyze-001
document.url	String	No	The document URL. HTTP and HTTPS are supported. Ensure that the document can be downloaded statelessly from the public network. Either document.content or document.url is required.	http://opensearch-shanghai.oss-cn-shanghai.aliyuncs.com/chatos/***/file-parser/samples/GB10767.pdf
document.content	String	No	The document content encoded in Base64. Either document.content or document.url is required.	"aGVsbG8gd29ybGQ="
document.file_name	String	No	The document name. If you leave this parameter empty, the name is inferred from the URL. If document.url is also empty, you must specify the document name explicitly.	test.pdf
document.file_type	String	No	The document type. If you leave this parameter empty, the document type can be inferred from the suffix of the document name. If it cannot be inferred, the document type needs to be explicitly specified. The supported document types include TXT, PDF, HTML, DOC, DOCX, PPT, and PPTX.	pdf
output.image_storage	String	No	The image storage method. Valid values: base64 (default) and url. If you set this parameter to url, the returned URL is valid for 3 days.	url
strategy.enable_semantic	Boolean	No	Specifies whether to enable semantic hierarchy extraction when parsing TXT documents or documents with an unclear hierarchical structure. Valid values: true: The model service returns the document in a hierarchical Markdown format, which helps improve the accuracy of subsequent document slicing. This feature does not support HTML, PPT, or PPTX documents and increases the overall parsing time. If parsing takes longer than 400 seconds or the document is very long (over 100 pages), the system may automatically disable the feature. When the feature takes effect, the usage billing information includes semantic_token_count, which is the number of tokens used by the model, and you are charged based on this count. false (default): The feature is disabled.	false

For documents without a clear distinction between the table of contents and the main text, semantic hierarchy extraction makes the hierarchical structure in the results more accurate.

Note

If a value is returned for the usage.semantic_token_count parameter, semantic hierarchy extraction took effect and you are billed for the semantic token consumption. If no value is returned, the feature did not take effect and you are not billed for it.
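The rule in this note can be checked in code. The following is a small sketch that operates on a decoded JSON response; the function name is hypothetical.

```python
def semantic_extraction_applied(response: dict) -> bool:
    """Return True if usage.semantic_token_count is present in the response."""
    usage = response.get("usage") or {}
    return usage.get("semantic_token_count") is not None


with_semantic = {"usage": {"token_count": 31867, "semantic_token_count": 3068}}
without_semantic = {"usage": {"token_count": 31867}}
print(semantic_extraction_applied(with_semantic))     # True
print(semantic_extraction_applied(without_semantic))  # False
```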

You can estimate the time and token consumption after you enable semantic hierarchy extraction based on the following table.

PDF pages	Token count	Time without semantic hierarchy extraction (s)	Time with semantic hierarchy extraction (s)	Semantic tokens
7	11,504	2	49	36,243
25	10,375	1	33	59,332
42	41,435	5	68	130,717
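The body parameter rules above can be sketched as a small builder. This is an illustration under stated assumptions: the function name is hypothetical, the sketch mirrors the cURL example by passing service_id in the URL path only, and it rejects requests that set both url and content because the behavior in that case is not documented here.

```python
import base64
from typing import Optional


def build_body(
    url: Optional[str] = None,
    content: Optional[bytes] = None,
    file_name: Optional[str] = None,
    file_type: Optional[str] = None,
    image_storage: str = "base64",
    enable_semantic: bool = False,
) -> dict:
    """Assemble the JSON body for a document parsing request."""
    # Either document.url or document.content is required.
    # Assumption: exactly one is accepted by this sketch.
    if (url is None) == (content is None):
        raise ValueError("provide exactly one of url or content")
    # Without a URL the name cannot be inferred, so it must be explicit.
    if url is None and not file_name:
        raise ValueError("document.file_name required")
    if image_storage not in ("base64", "url"):
        raise ValueError("image_storage must be 'base64' or 'url'")

    document = {}
    if url is not None:
        document["url"] = url
    else:
        document["content"] = base64.b64encode(content).decode("ascii")
    if file_name:
        document["file_name"] = file_name
    if file_type:
        document["file_type"] = file_type

    return {
        "document": document,
        "output": {"image_storage": image_storage},
        "strategy": {"enable_semantic": enable_semantic},
    }
```

For example, build_body(content=b"hello world", file_name="test.txt") produces a document.content value of "aGVsbG8gd29ybGQ=", which matches the Base64 example in the table.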

Response parameters

Parameter	Type	Description	Example
result.task_id	String	The ID of the document parsing asynchronous task.	d5a4019e-853a-****-b5b6-8053d9f5a9fc

cURL request example

curl --location 'http://****shanghai.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-analyze/ops-document-analyze-001/async/' \
--header 'Authorization: Bearer your API Key' \
--header 'Content-Type: application/json' \
--data '{
    "document": {
        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241018/jahnyn/%E8%A7%A3%E6%9E%90%E6%B5%8B%E8%AF%95.doc"
    },
    "output": {
        "image_storage": "base64"
    },
    "strategy": {
        "enable_semantic": true
    }
}'

Response example

Normal response example

{
    "request_id": "D5A4019E-853A-4E20-****-8053D9F5A9FC",
    "latency": 5.0,
    "http_code": 200,
    "result": {
        "task_id": "d5a4019e-853a-****-b5b6-8053d9f5a9fc"
    }
}

Abnormal response example

If a request fails, the code and message fields in the response indicate the cause of the error.

{
    "request_id": "590A7EB8-AA84-****-AF31-8C35DC965972",
    "latency": 0.0,
    "code": "InvalidParameter",
    "http_code": 400,
    "message": "document.file_name required"
}
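A client can turn such a response into an exception. The sketch below assumes that error responses are recognizable by a top-level code field, as in the example above; ApiError and raise_for_error are hypothetical names.

```python
import json


class ApiError(Exception):
    """Carries the code, message, HTTP code, and request ID of a failed call."""

    def __init__(self, code, message, http_code=None, request_id=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.http_code = http_code
        self.request_id = request_id


def raise_for_error(response: dict) -> dict:
    """Raise ApiError if the response has a top-level code, else return it."""
    if "code" in response:
        raise ApiError(
            response["code"],
            response.get("message", ""),
            response.get("http_code"),
            response.get("request_id"),
        )
    return response


error_body = json.loads(
    '{"latency": 0.0, "code": "InvalidParameter", "http_code": 400,'
    ' "message": "document.file_name required"}'
)
try:
    raise_for_error(error_body)
except ApiError as exc:
    print(exc)  # InvalidParameter: document.file_name required
```

Keeping request_id on the exception makes it easier to quote the identifier when reporting a problem.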

Get an asynchronous extraction task

Request method

GET

URL

{host}/v3/openapi/workspaces/{workspace_name}/document-analyze/{service_id}/async/task-status?task_id={task_id}

host: the address for invoking the service. You can call an API service over the Internet or a VPC. For more information, see Get service registration address.
workspace_name: the name of the workspace, such as default.
service_id: the built-in service ID, such as ops-document-analyze-001.
task_id: the asynchronous task ID returned when the asynchronous extraction task was created, such as d5a4019e-853a-****-b5b6-8053d9f5a9fc.
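The query can be prepared in Python with the standard library. The sketch below only builds the request object and does not send it; the function name, host, key, and task ID used in the example are placeholders.

```python
import urllib.request
from urllib.parse import urlencode


def task_status_request(host, workspace_name, service_id, task_id, api_key):
    """Build (but do not send) the GET request for the task-status endpoint."""
    query = urlencode({"task_id": task_id})  # percent-encodes the ID if needed
    url = (
        f"{host.rstrip('/')}/v3/openapi/workspaces/{workspace_name}"
        f"/document-analyze/{service_id}/async/task-status?{query}"
    )
    return urllib.request.Request(
        url,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="GET",
    )


# Placeholder values for illustration only.
req = task_status_request(
    "http://example.invalid", "default", "ops-document-analyze-001",
    "abc-123", "test-key",
)
print(req.get_method(), req.full_url)
```

The request can then be sent with urllib.request.urlopen(req, timeout=...).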

Request parameters

Header parameters

API key authentication

Parameter	Type	Required	Description	Example
Content-Type	String	Yes	The request type. Set the value to application/json.	application/json
Authorization	String	Yes	The API key.	Bearer OS-d1**2a

Response parameters

Parameter	Type	Description	Example
result.task_id	String	The ID of the document parsing asynchronous task.	24c3ad59-****-40cf-974b-b63d63e0571
result.status	String	The task status. Valid values: PENDING, SUCCESS, and FAIL.	PENDING
result.error	String	The error message returned when result.status is FAIL. The value of this parameter is empty if the task succeeds.	Document decryption failed
result.data	Object	The document parsing result.	{ "content": "XXX", "content_type": "markdown", "page_num": 15 }
result.data.content	String	The parsed content. PDF documents are output in Markdown format. Other documents are output in HTML format.	"XXX"
result.data.content_type	String	The format of the parsed content. Valid values: markdown and html.	markdown
result.data.page_num	Int	The number of pages in the document.	15
request_id	String	The unique identifier assigned to an API call by the system.	B4AB89C8-B135-****-A6F8-2BAB8018688
latency	Float/Int	The request duration. Unit: milliseconds.	10
usage	Object	The billing information generated by this call.	"usage": { "token_count": 123, "table_count": 5, "image_count": 6, "semantic_token_count":3068 }
usage.token_count	Int	The number of characters in the document.	1234
usage.table_count	Int	The number of tables in the document.	5
usage.image_count	Int	The number of images in the document.	6
usage.semantic_token_count	Int	The number of input tokens consumed by the semantic extraction model.	3068

cURL request example

curl -X GET \
-H "Content-Type: application/json" \
-H "Authorization: Bearer Your API-KEY" \
"http://****-hangzhou.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-analyze/ops-document-analyze-001/async/task-status?task_id=110d6349-2e51-****-8bfb-25e5de434686"

Response example

Normal response example

{
    "request_id": "27F9CEC3-9052-****-83FF-E7957B680492",
    "latency": 13.0,
    "http_code": 200,
    "result": {
        "status": "SUCCESS",
        "data": {
            "content": "Provided proper attribution is provided, Alibaba hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works....",
            "content_type": "markdown",
            "page_num": 15
        },
        "task_id": "24c3ad59-b196-****-974b-b63d63e05895"
    },
    "usage": {
        "token_count": 31867,
        "table_count": 4,
        "image_count": 8,
        "semantic_token_count":3068
    }
}
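A response of this shape can be unpacked as follows. This is a sketch with hypothetical names (ParsedDocument, extract_document, suggested_extension); the mapping from content_type to a file extension is a convenience for saving the result, not part of the API.

```python
from dataclasses import dataclass


@dataclass
class ParsedDocument:
    content: str
    content_type: str  # "markdown" or "html"
    page_num: int


def extract_document(response: dict) -> ParsedDocument:
    """Return the parsed document, or raise if the task did not succeed."""
    result = response["result"]
    if result.get("status") != "SUCCESS":
        raise RuntimeError(result.get("error") or f"task status is {result.get('status')}")
    data = result["data"]
    return ParsedDocument(data["content"], data["content_type"], data["page_num"])


def suggested_extension(doc: ParsedDocument) -> str:
    """Pick a file extension that matches the returned content format."""
    return {"markdown": ".md", "html": ".html"}[doc.content_type]
```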

Abnormal response example

If a request fails, the code and message fields in the response indicate the cause of the error.

{
    "request_id": "0F94BD89-989C-****-963C-6E4F3FF99445",
    "latency": 3.0,
    "code": "BadRequest.TaskNotExist",
    "http_code": 404,
    "message": "task[2fda34f5-40b4-****-a9a2-3e2c1e807361] not exist"
}

Create a synchronous extraction task

Important

We recommend that you do not use the synchronous interface in a production environment because of the risk of HTTP timeouts. Use this interface only for debugging.

Request method

POST

URL

{host}/v3/openapi/workspaces/{workspace_name}/document-analyze/{service_id}/sync

Parameter description

host: the address for invoking the service. You can call an API service over the Internet or a VPC. For more information, see Get service registration address.
workspace_name: the name of the workspace, such as default.
service_id: the built-in service ID, such as ops-document-analyze-001.

Request parameters

Header parameters

API key authentication

Parameter	Type	Required	Description	Example
Content-Type	String	Yes	The request type. Set the value to application/json.	application/json
Authorization	String	Yes	The API key.	Bearer OS-d1**2a

Body parameters

Parameter	Type	Required	Description	Example
document.url	String	No	The document URL. HTTP and HTTPS are supported. Ensure that the document can be downloaded statelessly from the public network. Either document.content or document.url is required.	http://opensearch-shanghai.oss-cn-shanghai.aliyuncs.com/chatos/***/file-parser/samples/GB10767.pdf
document.content	String	No	The document content encoded in Base64. Either document.content or document.url is required.	"aGVsbG8gd29ybGQ="
document.file_name	String	No	The document name. If you leave this parameter empty, the name is inferred from the URL. If document.url is also empty, you must specify the document name explicitly.	test.pdf
document.file_type	String	No	The document type. If you leave this parameter empty, the document type can be inferred from the suffix of the document name. If it cannot be inferred, the document type needs to be explicitly specified. The supported document types include TXT, PDF, HTML, DOC, DOCX, PPT, and PPTX.	pdf
output.image_storage	String	No	The image storage method. Valid values: base64 (default) and url. If you set this parameter to url, the returned URL is valid for 3 days.	url
strategy.enable_semantic	Boolean	No	Specifies whether to enable semantic hierarchy extraction. If you enable this feature, the document is returned in a hierarchical Markdown format with better accuracy, and the overall parsing time increases. If parsing takes longer than 400 seconds or the document is very long (over 100 pages), the system may automatically disable the feature. This feature does not support HTML, PPT, or PPTX documents.	false

Response parameters

Parameter	Type	Description	Example
result.status	String	The task status. Valid values: PENDING, SUCCESS, and FAIL.	PENDING
result.error	String	The error message returned when result.status is FAIL. The value of this parameter is empty if the task succeeds.	Document decryption failed
result.data	Object	The document parsing result.	{ "content": "XXX", "content_type": "markdown", "page_num": 15 }
result.data.content	String	The parsed content. PDF documents are output in Markdown format. Other documents are output in HTML format.	"XXX"
result.data.content_type	String	The format of the parsed content. Valid values: markdown and html.	markdown
result.data.page_num	Int	The number of pages in the document.	15
request_id	String	The unique identifier assigned to an API call by the system.	B4AB89C8-B135-****-A6F8-2BAB801A2CE4
latency	Float/Int	The request duration. Unit: milliseconds.	10
usage	Object	The billing information generated by this call.	"usage": { "token_count": 123, "table_count": 5, "image_count": 6, "semantic_token_count":3068 }
usage.token_count	Int	The number of characters in the document.	1234
usage.table_count	Int	The number of tables in the document.	5
usage.image_count	Int	The number of images in the document.	6
usage.semantic_token_count	Int	The number of input tokens consumed by the semantic extraction model.	3068

cURL request example

curl --location 'http://****shanghai.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-analyze/ops-document-analyze-001/sync/' \
--header 'Authorization: Bearer your API Key' \
--header 'Content-Type: application/json' \
--data '{
    "document": {
        "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241018/jahnyn/%E8%A7%A3%E6%9E%90%E6%B5%8B%E8%AF%95.doc"
    },
    "output": {
        "image_storage": "base64"
    },
    "strategy": {
        "enable_semantic": true
    }
}'
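For debugging from Python, the same call can be made with an explicit client-side timeout, which makes the HTTP timeout risk visible as an exception instead of a hung request. This is a sketch under assumptions: parse_sync is a hypothetical name, the 60-second default is arbitrary, and the send parameter exists only so that the function can be exercised without network access.

```python
import json
import urllib.request


def parse_sync(url, api_key, body, timeout_s=60.0, send=urllib.request.urlopen):
    """POST the JSON body to the sync endpoint and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses and a
    # timeout-related exception if no response arrives within timeout_s.
    with send(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

If the call times out for large documents, switching to the asynchronous interface is the documented alternative.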

Response example

Normal response example

{
    "request_id": "27F9CEC3-9052-****-83FF-E7957B689D04",
    "latency": 13.0,
    "http_code": 200,
    "result": {
        "status": "SUCCESS",
        "data": {
            "content": "Provided proper attribution is provided, Alibaba hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works....",
            "content_type": "markdown",
            "page_num": 15
        }
    },
    "usage": {
        "token_count": 31867,
        "table_count": 4,
        "image_count": 8,
        "semantic_token_count":3068
    }
}

Abnormal response example

If a request fails, the code and message fields in the response indicate the cause of the error.

{
    "request_id": "6F33AFB6-A35C-****-AFD2-9EA16CCF4383",
    "latency": 2.0,
    "code": "InvalidParameter",
    "http_code": 400,
    "message": "JSON parse error: Cannot deserialize value of type `ImageStorage` from String \"xxx\""
}

Status code description

HTTP status code	Error code	Description
200	-	The request succeeded. This status is also returned when the task itself fails, so check result.status for the actual task status.
400	InvalidParameter	The request is invalid.
404	BadRequest.TaskNotExist	The task does not exist.
500	InternalServerError	An internal error occurred.
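Because HTTP 200 does not by itself mean that parsing succeeded, a client needs to look at both the HTTP status code and result.status. The following sketch shows one way to combine them; the function name and the outcome labels are hypothetical.

```python
def classify_outcome(http_code: int, response: dict) -> str:
    """Map an HTTP status code and response body to a coarse outcome label."""
    if http_code == 200:
        status = response.get("result", {}).get("status")
        if status == "SUCCESS":
            return "done"
        if status == "FAIL":
            return "task_failed"
        # PENDING, or a task-creation response that carries only task_id.
        return "in_progress"
    if http_code == 400:
        return "invalid_request"
    if http_code == 404:
        return "task_not_found"
    if http_code == 500:
        return "server_error"
    return "unexpected"


print(classify_outcome(200, {"result": {"status": "FAIL", "error": "Document decryption failed"}}))
# task_failed
```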