AI Search Open Platform allows you to call the document content parsing service by using an API. You can integrate the service into your business processing chain to parse unstructured data into structured data and apply the structured data to your business.
Service name | Service ID | Service description | QPS limit for API calls (For Alibaba Cloud account and RAM users) |
Document Parsing Service-001 | ops-document-analyze-001 | Extracts logical hierarchical structures such as titles and segments from unstructured documents, as well as text, tables, images, and other information, and outputs them in a structured format. The supported document types include TXT, PDF, HTML, DOC, DOCX, PPT, and PPTX. | 10. Note: To apply for a higher QPS limit, submit a ticket. |
Prerequisites
The authentication information is obtained.
When you call an AI Search Open Platform service by using an API, you need to authenticate the caller's identity.
The service access address is obtained.
You can call a service over the Internet or a virtual private cloud (VPC). For more information, see Get service registration address.
General information
The request body cannot exceed 8 MB.
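Because the limit applies to the whole request body, it can help to check the encoded size before sending. The following is a minimal sketch, not part of the service: it assumes that 8 MB means 8 × 1024 × 1024 bytes, and `encode_body` is a hypothetical helper name. Keep in mind that Base64 encoding grows data by about one third, so a file sent inline through document.content has to be noticeably smaller than the limit.

```python
import json

# Assumption: "8 MB" is interpreted as 8 * 1024 * 1024 bytes.
MAX_BODY_BYTES = 8 * 1024 * 1024


def encode_body(payload):
    """Serialize a request payload and reject it if it exceeds the limit."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError(
            f"request body is {len(body)} bytes, limit is {MAX_BODY_BYTES}"
        )
    return body
```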
Overview
Document content parsing provides both synchronous and asynchronous interfaces. Due to the risk of HTTP timeout, synchronous interfaces are not recommended for production environments and can be used for debugging. Asynchronous interfaces are recommended for production environments and involve two steps: first, create an asynchronous extraction task to obtain the task_id, then call the asynchronous task retrieval interface to continuously query the status until the task is completed.
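The two-step asynchronous flow can be sketched as a polling loop. This is a minimal sketch under stated assumptions, not an official client: `fetch_status` is a hypothetical callable that performs the status request and returns the decoded `result` object, the interval and timeout defaults are arbitrary, and any status other than SUCCESS or FAIL is treated as still in progress.

```python
import time


def wait_for_task(fetch_status, task_id, interval_s=2.0, timeout_s=600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll a task until it succeeds, fails, or the timeout expires.

    fetch_status is a hypothetical callable: it takes a task_id and returns
    the decoded `result` object of the task-status response.
    """
    deadline = clock() + timeout_s
    while True:
        result = fetch_status(task_id)
        status = result.get("status")
        if status == "SUCCESS":
            return result.get("data")
        if status == "FAIL":
            raise RuntimeError(f"task {task_id} failed: {result.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout_s}s")
        # Assumption: anything else (for example PENDING) means "keep waiting".
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test without real delays.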
Create an asynchronous extraction task
Request method
POST
URL
{host}/v3/openapi/workspaces/{workspace_name}/document-analyze/{service_id}/async
host: the address for invoking the service. You can call an API service over the Internet or a VPC. For more information, see Get service registration address.
workspace_name: the name of the workspace, such as default.
service_id: the built-in service ID, such as ops-document-analyze-001.
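As an illustration of how the three placeholders combine, the URL template can be filled in as follows. This is only a sketch: `async_create_url` is a hypothetical helper name, and percent-encoding the path segments is a defensive assumption rather than a documented requirement.

```python
from urllib.parse import quote


def async_create_url(host, workspace_name, service_id):
    """Fill in the URL template for creating an asynchronous task."""
    return (
        f"{host.rstrip('/')}/v3/openapi/workspaces/{quote(workspace_name, safe='')}"
        f"/document-analyze/{quote(service_id, safe='')}/async"
    )
```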
Request parameters
Header parameters
API key authentication
Parameter | Type | Required | Description | Example |
Content-Type | String | Yes | The request content type. Valid value: application/json. | application/json |
Authorization | String | Yes | The API key. | Bearer OS-d1**2a |
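The two headers can be attached to a request object as in the following sketch, which uses only the Python standard library. The helper name `build_post_request` is illustrative, and the URL, API key, and body are supplied by the caller; the request is created but not sent.

```python
import urllib.request


def build_post_request(url, api_key, body):
    """Create (but do not send) a POST request with the two required headers.

    body is the already-encoded JSON request body as bytes.
    """
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
```

The resulting object can be passed to `urllib.request.urlopen` together with an explicit `timeout` argument when you are ready to send it.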
Body parameters
Parameter | Type | Required | Description | Example |
service_id | String | Yes | The built-in service ID. | ops-document-analyze-001 |
document.url | String | No | The document URL. Supported protocols: HTTP and HTTPS. Make sure that the document can be downloaded from the public network without relying on session state. Either document.content or document.url is required. | http://opensearch-shanghai.oss-cn-shanghai.aliyuncs.com/chatos/***/file-parser/samples/GB10767.pdf |
document.content | String | No | The document content encoded in Base64. Either document.content or document.url is required. | "aGVsbG8gd29ybGQ=" |
document.file_name | String | No | The document name. If you leave this parameter empty, the name is inferred from the URL. If document.url is also empty, you must specify the document name explicitly. | test.pdf |
document.file_type | String | No | The document type. If you leave this parameter empty, the document type can be inferred from the suffix of the document name. If it cannot be inferred, the document type needs to be explicitly specified. The supported document types include TXT, PDF, HTML, DOC, DOCX, PPT, and PPTX. | |
output.image_storage | String | No | The image storage method. Values used in this document's examples: url and base64. | url |
strategy.enable_semantic | Boolean | No | Specifies whether to enable semantic hierarchy extraction when parsing TXT documents or documents with an unclear hierarchical structure. Valid values: true and false. | false |
For documents without a clear distinction between the table of contents and the main text, semantic hierarchy extraction makes the hierarchical structure in the results more accurate.
If a value is returned for the usage.semantic_token_count parameter, semantic hierarchy extraction took effect and you are billed for the semantic tokens consumed. If no value is returned, the feature failed and you are not billed.
You can estimate the time and token consumption of semantic hierarchy extraction based on the following table.
PDF pages | Token count | Time without semantic hierarchy extraction (s) | Time with semantic hierarchy extraction (s) | Semantic tokens |
7 | 11,504 | 2 | 49 | 36,243 |
25 | 10,375 | 1 | 33 | 59,332 |
42 | 41,435 | 5 | 68 | 130,717 |
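The body parameters above can be assembled as in the following sketch. It mirrors the nesting used in the cURL example (document, output, strategy) and, like that example, leaves the service ID to the URL. The helper name `build_body` is hypothetical, and treating document.url and document.content as mutually exclusive is an assumption; the table only states that one of them is required.

```python
import base64


def build_body(url=None, content=None, file_name=None, file_type=None,
               image_storage=None, enable_semantic=None):
    """Build the request body as a dict; content is raw bytes, not Base64."""
    if (url is None) == (content is None):
        # Assumption: the two document sources are mutually exclusive.
        raise ValueError("specify exactly one of url or content")
    document = {}
    if url is not None:
        document["url"] = url
    else:
        if not file_name:
            # Without a URL there is nothing to infer the name from.
            raise ValueError("file_name is required when content is used")
        document["content"] = base64.b64encode(content).decode("ascii")
    if file_name:
        document["file_name"] = file_name
    if file_type:
        document["file_type"] = file_type
    body = {"document": document}
    if image_storage is not None:
        body["output"] = {"image_storage": image_storage}
    if enable_semantic is not None:
        body["strategy"] = {"enable_semantic": bool(enable_semantic)}
    return body
```

For example, `build_body(content=b"hello world", file_name="test.txt")` produces the Base64 string shown in the document.content example row.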
Response parameters
Parameter | Type | Description | Example |
result.task_id | String | The ID of the document parsing asynchronous task. | d5a4019e-853a-****-b5b6-8053d9f5a9fc |
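A decoded response can be checked for the error fields before the task ID is read. The sketch below is illustrative only: `ApiError` and `extract_task_id` are hypothetical names, and it assumes that a failed request is recognizable by the presence of a top-level `code` field, as in the abnormal response examples in this document.

```python
class ApiError(Exception):
    """Raised when a response carries the code and message error fields."""

    def __init__(self, http_code, code, message):
        super().__init__(f"{http_code} {code}: {message}")
        self.http_code = http_code
        self.code = code
        self.message = message


def extract_task_id(response):
    """Return result.task_id from a decoded create-task response."""
    if "code" in response:
        # Assumption: a top-level "code" field marks a failed request.
        raise ApiError(response.get("http_code"), response["code"],
                       response.get("message", ""))
    return response["result"]["task_id"]
```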
cURL request example
curl --location 'http://****shanghai.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-analyze/ops-document-analyze-001/async/' \
--header 'Authorization: Bearer your API Key' \
--header 'Content-Type: application/json' \
--data '{
"document":{
"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241018/jahnyn/%E8%A7%A3%E6%9E%90%E6%B5%8B%E8%AF%95.doc"
},
"output" :{
"image_storage":"base64"
},
"strategy": {
"enable_semantic":true
}
}'
Response example
Normal response example
{
"request_id": "D5A4019E-853A-4E20-****-8053D9F5A9FC",
"latency": 5.0,
"http_code": 200,
"result": {
"task_id": "d5a4019e-853a-****-b5b6-8053d9f5a9fc"
}
}
Abnormal response example
If a request fails, the response indicates the cause through the code and message fields.
{
"request_id": "590A7EB8-AA84-****-AF31-8C35DC965972",
"latency": 0.0,
"code": "InvalidParameter",
"http_code": 400,
"message": "document.file_name required"
}
Get an asynchronous extraction task
Request method
GET
URL
{host}/v3/openapi/workspaces/{workspace_name}/document-analyze/{service_id}/async/task-status?task_id={task_id}
host: the address for invoking the service. You can call an API service over the Internet or a VPC. For more information, see Get service registration address.
workspace_name: the name of the workspace, such as default.
service_id: the built-in service ID, such as ops-document-analyze-001.
task_id: the asynchronous task ID returned when the extraction task was created, such as d5a4019e-853a-****-b5b6-8053d9f5a9fc.
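The status URL, including its query string, could be assembled as follows. This is a sketch with a hypothetical helper name, `task_status_url`; encoding the path segments and the query value is a defensive assumption.

```python
from urllib.parse import quote, urlencode


def task_status_url(host, workspace_name, service_id, task_id):
    """Fill in the URL template for querying an asynchronous task."""
    base = (
        f"{host.rstrip('/')}/v3/openapi/workspaces/{quote(workspace_name, safe='')}"
        f"/document-analyze/{quote(service_id, safe='')}/async/task-status"
    )
    # urlencode escapes the task ID so it is safe inside the query string.
    return f"{base}?{urlencode({'task_id': task_id})}"
```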
Request parameters
Header parameters
API key authentication
Parameter | Type | Required | Description | Example |
Content-Type | String | Yes | The request content type. Valid value: application/json. | application/json |
Authorization | String | Yes | The API key. | Bearer OS-d1**2a |
Response parameters
Parameter | Type | Description | Example |
result.task_id | String | The ID of the document parsing asynchronous task. | 24c3ad59-****-40cf-974b-b63d63e0571 |
result.status | String | The task status, such as PENDING, SUCCESS, or FAIL. | PENDING |
result.error | String | The error message returned when result.status is FAIL. This parameter is empty if the task succeeds. | Document decryption failed |
result.data | Object | The document parsing result. | markdown |
result.data.content | String | The content of the document parsing result. | "XXX" |
result.data.content_type | String | The content format of the document parsing result. | markdown |
result.data.page_num | Int | The number of pages in the parsed document. | 15 |
request_id | String | The unique identifier assigned to an API call by the system. | B4AB89C8-B135-****-A6F8-2BAB8018688 |
latency | Float/Int | The request duration. Unit: milliseconds. | 10 |
usage | Object | The billing information generated by this call. | "usage": { "token_count": 123, "table_count": 5, "image_count": 6, "semantic_token_count":3068 } |
usage.token_count | Int | The number of characters in the document. | 1234 |
usage.table_count | Int | The number of tables in the document. | 5 |
usage.image_count | Int | The number of images in the document. | 6 |
usage.semantic_token_count | Int | The number of input tokens consumed by the semantic extraction model. | 3068 |
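The response fields above can be flattened into a small summary, for example to log the outcome of a task. The sketch below uses a hypothetical helper name, `summarize_task`, and relies on the rule stated earlier in this document that a returned usage.semantic_token_count value means semantic hierarchy extraction took effect and was billed.

```python
def summarize_task(response):
    """Flatten a decoded task-status response into a summary dict."""
    result = response.get("result") or {}
    data = result.get("data") or {}
    usage = response.get("usage") or {}
    return {
        "task_id": result.get("task_id"),
        "status": result.get("status"),
        "error": result.get("error"),
        "content": data.get("content"),
        "content_type": data.get("content_type"),
        "page_num": data.get("page_num"),
        # A returned value means semantic extraction took effect and is billed.
        "semantic_billed": usage.get("semantic_token_count") is not None,
    }
```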
cURL request example
curl -X GET \
-H "Content-Type: application/json" \
-H "Authorization: Bearer Your API-KEY" \
"http://****-hangzhou.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-analyze/ops-document-analyze-001/async/task-status?task_id=110d6349-2e51-****-8bfb-25e5de434686"
Response example
Normal response example
{
"request_id": "27F9CEC3-9052-****-83FF-E7957B680492",
"latency": 13.0,
"http_code": 200,
"result": {
"status": "SUCCESS",
"data": {
"content": "Provided proper attribution is provided, Alibaba hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works....",
"content_type": "markdown",
"page_num": 15
},
"task_id": "24c3ad59-b196-****-974b-b63d63e05895"
},
"usage": {
"token_count": 31867,
"table_count": 4,
"image_count": 8,
"semantic_token_count":3068
}
}
Abnormal response example
If a request fails, the response indicates the cause through the code and message fields.
{
"request_id": "0F94BD89-989C-****-963C-6E4F3FF99445",
"latency": 3.0,
"code": "BadRequest.TaskNotExist",
"http_code": 404,
"message": "task[2fda34f5-40b4-****-a9a2-3e2c1e807361] not exist"
}
Create a synchronous extraction task
We recommend that you do not use the synchronous interface in production environments because of the risk of HTTP timeouts. You can use this interface for debugging.
Request method
POST
URL
{host}/v3/openapi/workspaces/{workspace_name}/document-analyze/{service_id}/sync
Parameter description
host: the address for invoking the service. You can call an API service over the Internet or a VPC. For more information, see Get service registration address.
workspace_name: the name of the workspace, such as default.
service_id: the built-in service ID, such as ops-document-analyze-001.
Request parameters
Header parameters
API key authentication
Parameter | Type | Required | Description | Example |
Content-Type | String | Yes | The request content type. Valid value: application/json. | application/json |
Authorization | String | Yes | The API key. | Bearer OS-d1**2a |
Body parameters
Parameter | Type | Required | Description | Example |
document.url | String | No | The document URL. Supported protocols: HTTP and HTTPS. Make sure that the document can be downloaded from the public network without relying on session state. Either document.content or document.url is required. | http://opensearch-shanghai.oss-cn-shanghai.aliyuncs.com/chatos/***/file-parser/samples/GB10767.pdf |
document.content | String | No | The document content encoded in Base64. Either document.content or document.url is required. | "aGVsbG8gd29ybGQ=" |
document.file_name | String | No | The document name. If you leave this parameter empty, the name is inferred from the URL. If document.url is also empty, you must specify the document name explicitly. | test.pdf |
document.file_type | String | No | The document type. If you leave this parameter empty, the document type can be inferred from the suffix of the document name. If it cannot be inferred, the document type needs to be explicitly specified. The supported document types include TXT, PDF, HTML, DOC, DOCX, PPT, and PPTX. | |
output.image_storage | String | No | The image storage method. Values used in this document's examples: url and base64. | url |
strategy.enable_semantic | Boolean | No | Specifies whether to enable semantic hierarchy extraction. When enabled, the document is returned in a hierarchical Markdown format with better accuracy, but the overall parsing time increases. The system may automatically disable the feature if parsing takes longer than 400 seconds or the document is very long (over 100 pages). This feature does not support HTML, PPT, or PPTX documents. | false |
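The constraints in the strategy.enable_semantic description can be checked on the client side before the option is turned on. The following is a rough sketch: the helper name and its return labels are made up for illustration, the file type is guessed from the file name suffix, and the page count must be known to the caller in advance.

```python
import os

# Document types that the description lists as unsupported for this feature.
SEMANTIC_UNSUPPORTED = {"html", "ppt", "pptx"}


def semantic_extraction_hint(file_name, page_count=None):
    """Return 'unsupported', 'may_be_disabled', or 'ok' for a document."""
    suffix = os.path.splitext(file_name)[1].lstrip(".").lower()
    if suffix in SEMANTIC_UNSUPPORTED:
        return "unsupported"
    if page_count is not None and page_count > 100:
        # Very long documents may have the feature disabled automatically.
        return "may_be_disabled"
    return "ok"
```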
Response parameters
Parameter | Type | Description | Example |
result.status | String | The task status, such as PENDING, SUCCESS, or FAIL. | PENDING |
result.error | String | The error message returned when result.status is FAIL. This parameter is empty if the task succeeds. | Document decryption failed |
result.data | Object | The document parsing result. | markdown |
result.data.content | String | The content of the document parsing result. | "XXX" |
result.data.content_type | String | The content format of the document parsing result. | markdown |
result.data.page_num | Int | The number of pages in the parsed document. | 15 |
request_id | String | The unique identifier assigned to an API call by the system. | B4AB89C8-B135-****-A6F8-2BAB801A2CE4 |
latency | Float/Int | The request duration. Unit: milliseconds. | 10 |
usage | Object | The billing information generated by this call. | "usage": { "token_count": 123, "table_count": 5, "image_count": 6, "semantic_token_count":3068 } |
usage.token_count | Int | The number of characters in the document. | 1234 |
usage.table_count | Int | The number of tables in the document. | 5 |
usage.image_count | Int | The number of images in the document. | 6 |
usage.semantic_token_count | Int | The number of input tokens consumed by the semantic extraction model. | 3068 |
cURL request example
curl --location 'http://****shanghai.opensearch.aliyuncs.com/v3/openapi/workspaces/default/document-analyze/ops-document-analyze-001/sync/' \
--header 'Authorization: Bearer your API Key' \
--header 'Content-Type: application/json' \
--data '{
"document":{
"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241018/jahnyn/%E8%A7%A3%E6%9E%90%E6%B5%8B%E8%AF%95.doc"
},
"output" :{
"image_storage":"base64"
},
"strategy": {
"enable_semantic":true
}
}'
Response example
Normal response example
{
"request_id": "27F9CEC3-9052-****-83FF-E7957B689D04",
"latency": 13.0,
"http_code": 200,
"result": {
"status": "SUCCESS",
"data": {
"content": "Provided proper attribution is provided, Alibaba hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works....",
"content_type": "markdown",
"page_num": 15
}
},
"usage": {
"token_count": 31867,
"table_count": 4,
"image_count": 8,
"semantic_token_count":3068
}
}
Abnormal response example
If a request fails, the response indicates the cause through the code and message fields.
{
"request_id": "6F33AFB6-A35C-****-AFD2-9EA16CCF4383",
"latency": 2.0,
"code": "InvalidParameter",
"http_code": 400,
"message": "JSON parse error: Cannot deserialize value of type `ImageStorage` from String \"xxx\""
}
Status code description
HTTP status code | Error code | Description |
200 | - | The request succeeded. This includes cases in which the parsing task itself failed, so check result.status to determine the actual task status. |
404 | BadRequest.TaskNotExist | Task does not exist |
400 | InvalidParameter | Invalid request |
500 | InternalServerError | Internal error |
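The table above can be turned into a small dispatch function. This is a sketch only: `classify_response` and its return labels are hypothetical, and treating a 500 response as retryable is an assumption rather than documented behavior.

```python
def classify_response(response):
    """Map a decoded response to a coarse outcome label."""
    http_code = response.get("http_code")
    if http_code == 200:
        # A 200 response can still describe a failed task.
        status = (response.get("result") or {}).get("status")
        return "task_failed" if status == "FAIL" else "ok"
    if http_code == 404:
        return "task_not_found"
    if http_code == 400:
        return "invalid_request"
    if http_code == 500:
        # Assumption: internal errors are worth retrying later.
        return "retry_later"
    return "unknown"
```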