Intelligent Media Management: Extract text from a document

Last Updated: May 30, 2025

Document content extraction automatically extracts plain text from documents that come in various formats and languages. This topic describes how to extract text from a document.

Scenarios

  • Search engine optimization (SEO): Converting content to plain text makes it easier to extract keywords and metadata.

  • Programming and development: Source code and configuration files need to be saved and shared as plain text.

  • Format clearing: Converting documents to plain text files clears all text formatting.

Limits

Supported document file formats

The following table provides a list of the supported document file formats and their extensions.

Format    Extension
Word      .doc and .docx
PPT       .ppt and .pptx
Excel     .xls and .xlsx
PDF       .pdf
TXT       .txt
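
Before you submit a document, it can help to check its extension on the client side. The following is a minimal sketch that uses only the extensions listed in the table above. The helper name is_supported_document is hypothetical, and the case-insensitive comparison is an assumption rather than documented service behavior.

```python
from pathlib import PurePosixPath

# Extensions copied from the table of supported document file formats.
SUPPORTED_EXTENSIONS = {
    ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".pdf", ".txt",
}


def is_supported_document(object_name: str) -> bool:
    """Return True if the object name ends with a supported extension.

    Hypothetical helper: lowercasing the extension is an assumption,
    not documented behavior of the service.
    """
    return PurePosixPath(object_name).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_document("reports/summary.docx"))
print(is_supported_document("images/photo.png"))
```

A check like this only saves a round trip. The service remains the authority on which documents it accepts.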

Limits on document size

  • The size of the document from which you want to extract text cannot exceed 20 MB.

  • The extracted text is stored in a plain text file. The file cannot exceed 100 KB in size.

Note

If the document from which you want to extract text exceeds these limits, first convert it to TXT by using the document format conversion feature.
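
The two size limits can also be checked locally. The sketch below assumes that MB and KB are binary units (1,024-based) and that the size of the extracted text is measured in UTF-8 bytes. Neither assumption is stated in this topic, and the function names are hypothetical.

```python
# Limits taken from this topic. Treating MB and KB as 1,024-based units
# is an assumption.
MAX_SOURCE_BYTES = 20 * 1024 * 1024
MAX_TEXT_BYTES = 100 * 1024


def source_within_limit(size_in_bytes: int) -> bool:
    """Return True if a source document of this size is within the 20 MB limit."""
    return size_in_bytes <= MAX_SOURCE_BYTES


def text_within_limit(text: str) -> bool:
    """Return True if the extracted text is within the 100 KB limit.

    Measuring the UTF-8 encoded length is an assumption.
    """
    return len(text.encode("utf-8")) <= MAX_TEXT_BYTES


print(source_within_limit(5 * 1024 * 1024))
print(text_within_limit("Alibaba Cloud Intelligent Media Management"))
```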

Prerequisites

  • An AccessKey pair is created and obtained. For more information, see Create an AccessKey pair.

  • Object Storage Service (OSS) is activated, a bucket is created, and document objects are uploaded to the bucket. For more information, see Upload objects.

  • Intelligent Media Management (IMM) is activated. For more information, see Activate IMM.

  • A project is created in the IMM console. For more information, see Create a project.

    Note
    • You can also call the CreateProject operation to create a project. For more information, see CreateProject.

    • You can call the ListProjects operation to query the existing projects in a specific region. For more information, see ListProjects.
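
The sample code in this topic reads the AccessKey pair from the ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables. As a minimal sketch, you could confirm that both variables are set before you create a client. The helper name missing_credentials is hypothetical.

```python
import os

# Variable names used by the sample code in this topic.
REQUIRED_VARIABLES = (
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
)


def missing_credentials(environ=None) -> list:
    """Return the names of required variables that are unset or empty.

    Hypothetical helper: pass a dict to check something other than os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```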

Usage

Call the ExtractDocumentText operation to extract text from a document.

Note

The operation does not support output in spreadsheet formats, such as Excel.

Example of content extraction

  • IMM project name: test-project

  • The URI of the document: oss://test-bucket/test-object.docx

Sample request

{
  "ProjectName": "test-project",
  "SourceURI": "oss://test-bucket/test-object.docx"
}
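
The request body above can be assembled from a project name, a bucket name, and an object name. The following sketch shows one way to do that. The helper names build_oss_uri and build_request are hypothetical, and the oss://bucket/object layout is inferred from the sample URI in this topic.

```python
import json


def build_oss_uri(bucket: str, object_key: str) -> str:
    """Join a bucket name and an object key into an oss:// URI."""
    # Strip a leading slash so the URI does not contain "//" after the bucket.
    return f"oss://{bucket}/{object_key.lstrip('/')}"


def build_request(project_name: str, bucket: str, object_key: str) -> dict:
    """Build the request parameters shown in the sample request."""
    return {
        "ProjectName": project_name,
        "SourceURI": build_oss_uri(bucket, object_key),
    }


request = build_request("test-project", "test-bucket", "test-object.docx")
print(json.dumps(request, indent=2))
```

With the values from this example, the printed JSON matches the sample request.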

Sample response

{
  "DocumentText": "Alibaba Cloud Intelligent Media Management provides advanced and intelligent media content management features",
  "RequestId": "5C04D1DD-8B54-5670-9868-C30D186E5E20"
}

Note

The following text is extracted from the document: Alibaba Cloud Intelligent Media Management provides advanced and intelligent media content management features. Line breaks and spaces in original documents are converted to escape characters in the output files.
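
To illustrate how escaped line breaks can be turned back into real ones, the sketch below parses a made-up response body with the standard json module. The sample text and the request ID are illustrative only, and it is an assumption that the escape characters mentioned above are JSON-style escapes such as \n.

```python
import json

# A made-up response body. The escaped \n and the request ID are
# illustrative and were not returned by the service.
raw_response = (
    '{"DocumentText": "First line\\nSecond line", '
    '"RequestId": "EXAMPLE-REQUEST-ID"}'
)

body = json.loads(raw_response)

# json.loads decodes the \n escape into a real line break.
document_text = body["DocumentText"]
lines = document_text.splitlines()

print(lines)
```

If you use the SDK, the response is already parsed, so you can read the text from the response body directly instead of decoding JSON yourself.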

Sample code

# -*- coding: utf-8 -*-
# This file is auto-generated, don't edit it. Thanks.
import os
import sys

from typing import List

from alibabacloud_imm20200930.client import Client as imm20200930Client
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_imm20200930 import models as imm_20200930_models
from alibabacloud_tea_util import models as util_models
from alibabacloud_tea_util.client import Client as UtilClient


class Sample:
    def __init__(self):
        pass

    @staticmethod
    def create_client() -> imm20200930Client:
        """
        Initialize the client by using the AccessKey ID and AccessKey secret.
        @return: Client
        @throws Exception
        """
        # Leaked project code can expose your AccessKey pair and all the resources in the account to security risks.
        # For security reasons, we recommend that you use temporary access credentials that are provided by Security Token Service (STS). For more information, visit https://www.alibabacloud.com/help/en/sdk/developer-reference/v2-manage-python-access-credentials.
        config = open_api_models.Config(
            # Required. Make sure that the ALIBABA_CLOUD_ACCESS_KEY_ID environment variable is configured.
            access_key_id=os.environ['ALIBABA_CLOUD_ACCESS_KEY_ID'],
            # Required. Make sure that the ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variable is configured.
            access_key_secret=os.environ['ALIBABA_CLOUD_ACCESS_KEY_SECRET']
        )
        # Specify the IMM endpoint. For a list of IMM endpoints for supported regions, visit https://api.alibabacloud.com/product/imm.
        config.endpoint = 'imm.cn-beijing.aliyuncs.com'
        return imm20200930Client(config)

    @staticmethod
    def main(
        args: List[str],
    ) -> None:
        client = Sample.create_client()
        extract_document_text_request = imm_20200930_models.ExtractDocumentTextRequest(
            project_name='test-project',
            source_uri='oss://test-bucket/test-object.docx'
        )
        runtime = util_models.RuntimeOptions()
        try:
            response = client.extract_document_text_with_options(extract_document_text_request, runtime)
            # Print the extracted text, which is returned in the DocumentText field.
            print(response.body.document_text)
        except Exception as error:
            # Handle exceptions with caution in your actual business scenario and never ignore exceptions in your project. In this example, error messages are printed to the console.
            # Not every exception has the message and data attributes of SDK errors, so read them defensively.
            message = getattr(error, 'message', None) or str(error)
            print(message)
            # Show the URL for troubleshooting if the service returned one.
            data = getattr(error, 'data', None) or {}
            print(data.get("Recommend"))
            UtilClient.assert_as_string(message)

    @staticmethod
    async def main_async(
        args: List[str],
    ) -> None:
        client = Sample.create_client()
        extract_document_text_request = imm_20200930_models.ExtractDocumentTextRequest(
            project_name='test-project',
            source_uri='oss://test-bucket/test-object.docx'
        )
        runtime = util_models.RuntimeOptions()
        try:
            response = await client.extract_document_text_with_options_async(extract_document_text_request, runtime)
            # Print the extracted text, which is returned in the DocumentText field.
            print(response.body.document_text)
        except Exception as error:
            # Handle exceptions with caution in your actual business scenario and never ignore exceptions in your project. In this example, error messages are printed to the console.
            # Not every exception has the message and data attributes of SDK errors, so read them defensively.
            message = getattr(error, 'message', None) or str(error)
            print(message)
            # Show the URL for troubleshooting if the service returned one.
            data = getattr(error, 'data', None) or {}
            print(data.get("Recommend"))
            UtilClient.assert_as_string(message)


if __name__ == '__main__':
    Sample.main(sys.argv[1:])