Document content extraction automatically extracts plain text from documents that come in various formats and languages. This topic describes how to extract text from a document.
Scenarios
Search engine optimization: To improve SEO efficiency, it is sometimes necessary to convert content to plain text, which helps extract keywords and metadata.
Programming and development: In programming and software development, source code and configuration files need to be saved and shared as plain text.
Format clearing: Converting documents to plain text files clears all text formatting.
Limits
Supported document file formats
The following table provides a list of the supported document file formats and their extensions.
Format | Extension |
Word | .doc and .docx |
PPT | .ppt and .pptx |
Excel | .xls and .xlsx |
TXT | .txt |
Limits on document size
The size of the document from which you want to extract text cannot exceed 20 MB.
The extracted text is stored in a plain text file. The file cannot exceed 100 KB in size.
If a document exceeds these size limits, you can first convert it to TXT by using the document format conversion feature.
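The format and size limits above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the IMM SDK: the function name `check_extractable` and both constants are hypothetical, and it assumes that 20 MB means 20 × 1024 × 1024 bytes.

```python
import os

# Extensions taken from the supported-formats table above.
SUPPORTED_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".txt"}
# Assumption: "20 MB" is interpreted as 20 * 1024 * 1024 bytes.
MAX_SOURCE_BYTES = 20 * 1024 * 1024


def check_extractable(object_name: str, size_in_bytes: int) -> list:
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    extension = os.path.splitext(object_name)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_in_bytes > MAX_SOURCE_BYTES:
        problems.append(
            f"document is {size_in_bytes} bytes; the limit is {MAX_SOURCE_BYTES}"
        )
    return problems


print(check_extractable("test-object.docx", 1024))
```

The check only looks at the object name and a size that you supply, so it does not replace the validation performed by the service.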
Prerequisites
An AccessKey pair is created and obtained. For more information, see Create an AccessKey pair.
Object Storage Service (OSS) is activated, a bucket is created, and document objects are uploaded to the bucket. For more information, see Upload objects.
Intelligent Media Management (IMM) is activated. For more information, see Activate IMM.
A project is created in the IMM console. For more information, see Create a project.
Note: You can also call the CreateProject operation to create a project. For more information, see CreateProject.
You can call the ListProjects operation to query the existing projects in a specific region. For more information, see ListProjects.
Usage
Call the ExtractDocumentText operation to extract text from a document.
The operation does not support output in spreadsheet formats, such as Excel.
Example of content extraction
IMM project name: test-project
The URI of the document: oss://test-bucket/test-object.docx
Sample request
{
  "ProjectName": "test-project",
  "SourceURI": "oss://test-bucket/test-object.docx"
}
Sample response
{
  "DocumentText": "Alibaba Cloud Intelligent Media Management provides advanced and intelligent media content management features",
  "RequestId": "5C04D1DD-8B54-5670-9868-C30D186E5E20"
}
The following text is extracted from the document: Alibaba Cloud Intelligent Media Management provides advanced and intelligent media content management features. Line breaks and spaces in original documents are converted to escape characters in the output files.
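To illustrate how escaped line breaks can be turned back into real ones, the sketch below parses a response body with Python's standard `json` module. The body shown is hypothetical (its text and request ID are made up for illustration), and the sketch assumes that the escape characters are standard JSON escapes such as `\n`.

```python
import json

# Hypothetical response body; a real one is returned by ExtractDocumentText.
# The raw string keeps "\n" as two characters, as it would appear on the wire.
raw_body = r'{"DocumentText": "First line\nSecond line", "RequestId": "example-request-id"}'

response = json.loads(raw_body)   # JSON decoding turns the escape into a real line break
text = response["DocumentText"]
lines = text.split("\n")
print(lines)
```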
Sample code
# -*- coding: utf-8 -*-
# This file is auto-generated, don't edit it. Thanks.
import os
import sys
from typing import List
from alibabacloud_imm20200930.client import Client as imm20200930Client
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_imm20200930 import models as imm_20200930_models
from alibabacloud_tea_util import models as util_models
from alibabacloud_tea_util.client import Client as UtilClient
class Sample:
    def __init__(self):
        pass

    @staticmethod
    def create_client() -> imm20200930Client:
        """
        Initialize the client by using the AccessKey ID and AccessKey secret.
        @return: Client
        @throws Exception
        """
        # Leaked project code can expose your AccessKey pair and all the resources in the account to security risks.
        # For security reasons, we recommend that you use temporary access credentials that are provided by Security Token Service (STS). For more information, visit https://www.alibabacloud.com/help/en/sdk/developer-reference/v2-manage-python-access-credentials.
        config = open_api_models.Config(
            # Required. Make sure that the ALIBABA_CLOUD_ACCESS_KEY_ID environment variable is configured.
            access_key_id=os.environ['ALIBABA_CLOUD_ACCESS_KEY_ID'],
            # Required. Make sure that the ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variable is configured.
            access_key_secret=os.environ['ALIBABA_CLOUD_ACCESS_KEY_SECRET']
        )
        # Specify the IMM endpoint. For a list of IMM endpoints for supported regions, visit https://api.alibabacloud.com/product/imm.
        config.endpoint = 'imm.cn-beijing.aliyuncs.com'
        return imm20200930Client(config)

    @staticmethod
    def main(
        args: List[str],
    ) -> None:
        client = Sample.create_client()
        extract_document_text_request = imm_20200930_models.ExtractDocumentTextRequest(
            project_name='test-project',
            source_uri='oss://test-bucket/test-object.docx'
        )
        runtime = util_models.RuntimeOptions()
        try:
            # Write your code to print the response of the API operation if necessary.
            client.extract_document_text_with_options(extract_document_text_request, runtime)
        except Exception as error:
            # Handle exceptions with caution in your actual business scenario and never ignore exceptions in your project. In this example, error messages are printed to the console.
            # Display the error message.
            print(error.message)
            # Show the URL for troubleshooting.
            print(error.data.get("Recommend"))
            UtilClient.assert_as_string(error.message)

    @staticmethod
    async def main_async(
        args: List[str],
    ) -> None:
        client = Sample.create_client()
        extract_document_text_request = imm_20200930_models.ExtractDocumentTextRequest(
            project_name='test-project',
            source_uri='oss://test-bucket/test-object.docx'
        )
        runtime = util_models.RuntimeOptions()
        try:
            # Write your code to print the response of the API operation if necessary.
            await client.extract_document_text_with_options_async(extract_document_text_request, runtime)
        except Exception as error:
            # Handle exceptions with caution in your actual business scenario and never ignore exceptions in your project. In this example, error messages are printed to the console.
            # Display the error message.
            print(error.message)
            # Show the URL for troubleshooting.
            print(error.data.get("Recommend"))
            UtilClient.assert_as_string(error.message)


if __name__ == '__main__':
    Sample.main(sys.argv[1:])
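The sample code discards the return value of `extract_document_text_with_options`. A minimal sketch of reading the result follows. It assumes that the response object exposes the fields of the sample response as `body.document_text` and `body.request_id`, following the SDK's snake_case naming; verify this against the SDK version that you install. The helper name `summarize_response` is hypothetical, and a stand-in object replaces the real API call so that the snippet runs without credentials.

```python
from types import SimpleNamespace


def summarize_response(response) -> str:
    """Format the request ID and extracted text of an ExtractDocumentText response."""
    body = response.body
    # Assumption: attribute names mirror the RequestId and DocumentText response fields.
    return f"[{body.request_id}] {body.document_text}"


# Stand-in for the object returned by client.extract_document_text_with_options(...).
fake_response = SimpleNamespace(
    body=SimpleNamespace(
        document_text="Alibaba Cloud Intelligent Media Management provides advanced and intelligent media content management features",
        request_id="5C04D1DD-8B54-5670-9868-C30D186E5E20",
    )
)
print(summarize_response(fake_response))
```

In the sample code, the same helper could be applied to the value returned inside the `try` block of `main`.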