Knowledge base document management - import, status, chunks - Tablestore

Supported document formats

PDF: .pdf
Word: .doc, .docx
Excel: .xls, .xlsx
PowerPoint: .ppt, .pptx
Plain text: .txt
Markdown: .md
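
A local check against this list can catch unsupported files before any request is made. The following is a minimal sketch: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical helper names, not part of the SDK, and the case-insensitive comparison is an assumption about how the service treats extensions.

```python
from pathlib import Path

# Extensions copied from the supported formats list above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".txt", ".md",
}

def is_supported(path):
    """Return True if the file extension is in the supported list.

    Assumption: extensions are compared case-insensitively.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported("/home/user/docs/product_manual.pdf"))
print(is_supported("/home/user/docs/archive.zip"))
```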

Document status lifecycle

A document transitions through the following statuses before becoming searchable:

Status	Description	Actions
`Pending`	The task is queued for processing.	Query status, Delete
`Indexing`	The system is parsing, chunking, and vectorizing the document.	Query status, Delete
`Completed`	Indexing is complete. The document is now searchable.	Search, Update metadata, Delete, View chunks
`Failed`	Indexing failed.	View failure reason, Delete, Re-upload
`Deleting`	The system is deleting the document and its associated chunks.	Wait for the deletion to complete.

Note

Documents in Pending or Indexing status are not searchable. Wait for the status to reach Completed.
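
The status table can be encoded as a lookup so that application code rejects an operation before calling the service. This is only a sketch: `ALLOWED_ACTIONS`, `is_searchable`, and `can_perform` are hypothetical names, and the action labels are copied from the table rather than taken from the SDK.

```python
# Actions permitted in each status, transcribed from the table above.
ALLOWED_ACTIONS = {
    "Pending": {"Query status", "Delete"},
    "Indexing": {"Query status", "Delete"},
    "Completed": {"Search", "Update metadata", "Delete", "View chunks"},
    "Failed": {"View failure reason", "Delete", "Re-upload"},
    "Deleting": set(),  # nothing to do except wait for the deletion to complete
}

def is_searchable(status):
    """Only documents in Completed status are searchable."""
    return status == "Completed"

def can_perform(status, action):
    """Return True if the table lists the action for the given status."""
    return action in ALLOWED_ACTIONS.get(status, set())

print(can_perform("Indexing", "Update metadata"))
print(can_perform("Failed", "Re-upload"))
```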

Add documents

Import documents into the knowledge base for automatic parsing, chunking, vectorization, and indexing. Re-uploading with the same ossKey overwrites the existing document.

The SDK provides three methods for importing documents:

Method	SDK method	Description
Upload local file	`upload_documents()`	Specify the path to a local file. The SDK automatically uploads it to OSS and then adds it to the knowledge base.
Add OSS file	`add_documents()`	Specify the path to an existing OSS file.
Batch import from OSS directory	`add_documents()`	Specify an OSS directory path. The system recursively scans and adds all files in the directory.

Request parameters

Parameter	Type	Description
`knowledgeBaseName`	string	The name of the knowledge base. Required.
`subspace`	string	The name of the subspace. The maximum length is 128 characters. Required if subspaces are enabled for the knowledge base.
`documents`	list<object>	A list of documents. Required. A single request can include up to 10 documents, and each file must not exceed 50 MB. Note: To request higher limits, submit a ticket or contact technical support through the Tablestore technical exchange group (36165029092).
`documents[].filePath`	string	The local file path. Required when using `upload_documents`.
`documents[].ossKey`	string	The path to an OSS file or directory. The length must be between 1 and 256 characters. Required when using `add_documents`.
`documents[].metadata`	object	The document metadata. It must conform to the metadata schema defined for the knowledge base.
`documents[].inclusionFilters`	list<string>	Inclusion filters for scanning OSS directories. Patterns support the `*` wildcard at the beginning and end (for example, `*.pdf`).
`documents[].exclusionFilters`	list<string>	Exclusion filters for scanning OSS directories. Patterns support the `*` wildcard at the beginning and end (for example, `draft*`).
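
Because a request accepts at most 10 documents and each file is limited to 50 MB by default, larger imports need to be split on the client side. The sketch below shows one way to do that. `chunk_documents` and `oversized_files` are hypothetical helpers, and treating 50 MB as 50 × 1024 × 1024 bytes is an assumption.

```python
import os

MAX_DOCS_PER_REQUEST = 10                # default limit from the table above
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024   # assumption: 1 MB = 1024 * 1024 bytes

def chunk_documents(documents, batch_size=MAX_DOCS_PER_REQUEST):
    """Split a list of document entries into request-sized batches."""
    return [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]

def oversized_files(documents, limit=MAX_FILE_SIZE_BYTES):
    """Return the local file paths whose size on disk exceeds the limit."""
    return [
        d["filePath"] for d in documents
        if "filePath" in d and os.path.getsize(d["filePath"]) > limit
    ]

docs = [{"ossKey": f"oss://example-bucket/docs/file_{i}.pdf"} for i in range(23)]
batches = chunk_documents(docs)
print([len(b) for b in batches])  # [10, 10, 3]
```

Each batch can then be passed as the `documents` value of a separate `add_documents` or `upload_documents` request.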

Code examples

Upload a local file

Specify a local file path. The SDK uploads the file to OSS and adds it to the knowledge base automatically.

Note

When you use upload_documents, you must provide both oss_endpoint and oss_bucket_name when initializing the AgentStorageClient. Otherwise, a ValueError is raised.

resp = client.upload_documents({
    "knowledgeBaseName": "product_docs_kb",
    "documents": [
        {
            "filePath": "/home/user/docs/product_manual.pdf",
            "metadata": {"author": "Jane Doe", "category": "Product Manual"}
        },
        {
            "filePath": "/home/user/docs/faq.docx",
            "metadata": {"author": "John Doe", "category": "FAQ"}
        }
    ]
})

Add an OSS file

If the file already exists in OSS, specify its ossKey directly.

resp = client.add_documents({
    "knowledgeBaseName": "product_docs_kb",
    "documents": [
        {
            "ossKey": "oss://example-bucket/docs/product_manual.pdf",
            "metadata": {"author": "Jane Doe"}
        }
    ]
})

Batch import from an OSS directory

Specify an OSS directory path. The system recursively scans all files within the directory. You can use inclusionFilters and exclusionFilters to filter files based on name patterns.

resp = client.add_documents({
    "knowledgeBaseName": "product_docs_kb",
    "documents": [
        {
            "ossKey": "oss://example-bucket/docs/",
            "inclusionFilters": ["*.pdf", "*.docx"],
            "exclusionFilters": ["*draft*"]
        }
    ]
})
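
To get a rough idea of which files a filter pair would select, the patterns can be tried locally against a list of keys. This is a sketch under stated assumptions, not the service's matching logic: it assumes patterns are compared with the file name (the last path segment), that a file is kept when it matches any inclusion pattern and no exclusion pattern, and it uses Python's `fnmatch`, which also accepts `?` and `[]` that the service may not support. `preview_filter` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
import posixpath

def preview_filter(keys, inclusion_filters=None, exclusion_filters=None):
    """Approximate which keys pass the filters (assumed semantics, see lead-in)."""
    selected = []
    for key in keys:
        name = posixpath.basename(key)  # assumption: patterns apply to the file name
        included = not inclusion_filters or any(
            fnmatchcase(name, pattern) for pattern in inclusion_filters
        )
        excluded = any(fnmatchcase(name, pattern) for pattern in (exclusion_filters or []))
        if included and not excluded:
            selected.append(key)
    return selected

keys = [
    "oss://example-bucket/docs/product_manual.pdf",
    "oss://example-bucket/docs/faq.docx",
    "oss://example-bucket/docs/faq_draft.docx",
    "oss://example-bucket/docs/notes.txt",
]
print(preview_filter(keys, ["*.pdf", "*.docx"], ["*draft*"]))
```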

Response

Response fields

Field	Type	Description
`documentDetails`	list<object>	The processing result for each document.
`documentDetails[].docId`	string	The document ID.
`documentDetails[].ossKey`	string	The OSS path of the document.
`documentDetails[].status`	string	`succeed` or `failed`.
`documentDetails[].failureReason`	string	The reason for the failure. This field is present only if the status is `failed`.

Response examples

{
  "code": "SUCCESS",
  "data": {
    "documentDetails": [
      {"docId": "fc6ed97f-...", "status": "succeed", "ossKey": "oss://example-bucket/docs/product_manual.pdf"},
      {"docId": "940f2c5c-...", "status": "succeed", "ossKey": "oss://example-bucket/docs/faq.docx"}
    ]
  },
  "message": "succeed"
}

Example of a partial failure response (the HTTP status code is still 200 and the code is still SUCCESS):

{
  "code": "SUCCESS",
  "data": {
    "documentDetails": [
      {"status": "failed", "failureReason": "Metadata field 'date' date string format is not supported", "ossKey": "oss://..."},
      {"status": "succeed", "ossKey": "oss://...", "docId": "940f2c5c-..."}
    ]
  },
  "message": "succeed"
}
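
A response like the partial failure above can be split into accepted and rejected entries before deciding what to retry. The sketch below operates on a plain dictionary shaped like that example. `split_results` is a hypothetical helper, not an SDK method.

```python
def split_results(resp):
    """Separate per-document results into accepted and rejected lists."""
    details = resp["data"]["documentDetails"]
    accepted = [d for d in details if d["status"] == "succeed"]
    rejected = [d for d in details if d["status"] == "failed"]
    return accepted, rejected

# Sample response shaped like the partial failure example above.
resp = {
    "code": "SUCCESS",
    "data": {
        "documentDetails": [
            {"status": "failed",
             "failureReason": "Metadata field 'date' date string format is not supported",
             "ossKey": "oss://..."},
            {"status": "succeed", "ossKey": "oss://...", "docId": "940f2c5c-..."},
        ]
    },
    "message": "succeed",
}

accepted, rejected = split_results(resp)
for detail in rejected:
    print(f"{detail['ossKey']}: {detail.get('failureReason')}")
```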

Usage notes

A 200 OK HTTP response with code: SUCCESS does not guarantee that all documents were processed successfully. You must check the status field for each document in the documentDetails array.
status: "succeed" means the upload task was accepted, not that indexing finished. The document can only be retrieved after its status reaches Completed.
If subspace is enabled for the knowledge base, you must pass the subspace parameter. Otherwise, an INVALID_PARAMETER error is returned.
Use a supported metadata date format such as yyyy-MM-dd HH:mm:ss. Unsupported formats cause a failed document status.
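
For the date format mentioned in the last note, Python's `strftime` can produce the `yyyy-MM-dd HH:mm:ss` layout. In this sketch the metadata field name `date` is an assumption that depends on the knowledge base's metadata schema, `format_metadata_date` is a hypothetical helper, and time zone handling is not covered.

```python
from datetime import datetime

def format_metadata_date(value):
    """Render a datetime in the yyyy-MM-dd HH:mm:ss layout."""
    return value.strftime("%Y-%m-%d %H:%M:%S")

metadata = {
    "author": "Jane Doe",
    # Assumption: the schema defines a date field named "date".
    "date": format_metadata_date(datetime(2024, 5, 17, 9, 30, 0)),
}
print(metadata["date"])  # 2024-05-17 09:30:00
```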

Check indexing status

Document upload is asynchronous: documents must finish processing before they become searchable. Poll the indexing status with exponential backoff until it reaches Completed or Failed.

import time

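def wait_for_document(client, kb_name, doc_id, max_interval=30, timeout=600):
    """Polls the document status with exponential backoff until indexing is complete.

    Raises RuntimeError if indexing fails, or TimeoutError if the document is not
    Completed within `timeout` seconds (the 600 s default is an arbitrary choice).
    """
    interval = 3
    deadline = time.monotonic() + timeout
    while True:
        resp = client.get_document({
            "knowledgeBaseName": kb_name,
            "docId": doc_id
        })
        doc = resp["data"][0]
        status = doc["status"]
        if status == "Completed":
            print(f"Indexing complete. Number of chunks: {doc.get('chunkNum', 'N/A')}")
            return resp
        elif status == "Failed":
            raise RuntimeError(f"Indexing failed: {doc.get('failedDetails')}")
        # Stop instead of polling forever if the document never completes
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Document {doc_id} is still {status} after {timeout}s")
        print(f"Current status: {status}, retrying in {interval}s...")
        time.sleep(interval)
        # Double the wait after each attempt, capped at max_interval
        interval = min(interval * 2, max_interval)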

Processing time depends on file size, type, and count. Small files typically complete in seconds; large files or batch imports may take several minutes.

Query a document

Call the get_document method to retrieve details for a specific document, including its processing status, number of chunks, and metadata.

Request parameters

Parameter	Type	Description
`knowledgeBaseName`	string	The name of the knowledge base. Required.
`subspace`	string	The name of the subspace. Required if subspaces are enabled for the knowledge base.
`docId`	string	The document ID. You must specify either this parameter or `ossKey`.
`ossKey`	string	The OSS file path. You must specify either this parameter or `docId`.

Code example

resp = client.get_document({
    "knowledgeBaseName": "product_docs_kb",
    "docId": "fc6ed97f-..."
})

doc = resp["data"][0]
print(f"Status: {doc['status']}, Number of chunks: {doc.get('chunkNum', 'N/A')}")

Response

Field	Type	Description
`docId`	string	The document ID.
`ossKey`	string	The OSS path.
`subspace`	string	The subspace.
`chunkNum`	int	The number of chunks.
`status`	string	The document status: `Pending`, `Indexing`, `Completed`, `Failed`, or `Deleting`.
`createdAt`	int	The creation timestamp.
`updatedAt`	int	The update timestamp.
`eTag`	string	The eTag of the document.
`failedDetails`	string	The reason for the failure. This field is present only if the status is Failed.
`metadata`	object	The document metadata.

Usage notes

If the same ossKey is created, deleted, and then created again, get_document may return multiple records, including historical records. To identify the valid document, check the status field and use the record with a Completed status.
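
The selection rule in this note can be sketched as a small helper that scans the returned records for one in Completed status. `pick_valid_record` is a hypothetical name, and the sample records (including their `docId` values and the status of the historical record) are placeholders for illustration.

```python
def pick_valid_record(resp):
    """Return the first record with Completed status, or None if there is none."""
    for record in resp["data"]:
        if record.get("status") == "Completed":
            return record
    return None

# Placeholder response with one historical record and one valid record.
resp = {
    "data": [
        {"docId": "doc-old", "ossKey": "oss://example-bucket/docs/faq.docx", "status": "Deleting"},
        {"docId": "doc-new", "ossKey": "oss://example-bucket/docs/faq.docx", "status": "Completed"},
    ]
}

record = pick_valid_record(resp)
print(record["docId"] if record else "No completed record")
```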

List documents

Call the list_documents method to retrieve a paginated list of documents in a knowledge base.

Request parameters

Parameter	Type	Description
`knowledgeBaseName`	string	The name of the knowledge base. Required.
`subspace`	list<string>	A list of subspace names. You can specify up to 10 subspaces. Required if subspaces are enabled for the knowledge base.
`maxResults`	int	The number of results to return. The default value is 10, and the maximum is 1000.
`nextToken`	string	The pagination token for retrieving the next page of results.

Code example

resp = client.list_documents({
    "knowledgeBaseName": "product_docs_kb",
    "maxResults": 20
})

for doc in resp["data"]["documentDetails"]:
    print(f"[{doc['status']}] {doc['ossKey']} (Number of chunks: {doc.get('chunkNum', '-')})")

Usage notes

The subspace parameter supports a list of up to 10 values. If you exceed this limit, an error is returned.
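
To walk through every page, the `nextToken` value can be fed back into the next request. The sketch below assumes the response returns `nextToken` under `data` and omits it (or leaves it empty) on the last page, which the tables above do not confirm. `iter_documents` is a hypothetical helper, and `FakeClient` is a stand-in so the example runs without a service.

```python
def iter_documents(client, kb_name, page_size=100):
    """Yield every document by following nextToken until no token is returned."""
    token = None
    while True:
        request = {"knowledgeBaseName": kb_name, "maxResults": page_size}
        if token:
            request["nextToken"] = token
        data = client.list_documents(request)["data"]
        yield from data.get("documentDetails", [])
        # Assumption: nextToken sits under "data" and is absent on the last page.
        token = data.get("nextToken")
        if not token:
            break

class FakeClient:
    """Stand-in for the real client; serves pre-built pages keyed by token."""
    def __init__(self, pages):
        self.pages = pages

    def list_documents(self, request):
        index = int(request.get("nextToken", 0))
        data = {"documentDetails": self.pages[index]}
        if index + 1 < len(self.pages):
            data["nextToken"] = str(index + 1)
        return {"code": "SUCCESS", "data": data}

client = FakeClient([[{"ossKey": "a"}, {"ossKey": "b"}], [{"ossKey": "c"}]])
print([doc["ossKey"] for doc in iter_documents(client, "product_docs_kb")])
```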

Update document metadata

Call the update_document method to update the metadata of a specific document.

Note

You can only update the metadata for documents that are in the Completed status. Calling this method for documents in any other status returns an error.

Request parameters

Parameter	Type	Description
`knowledgeBaseName`	string	The name of the knowledge base. Required.
`subspace`	string	The name of the subspace. Required if subspaces are enabled for the knowledge base.
`ossKey`	string	The OSS path of the document. You must specify either this parameter or `docId`.
`docId`	string	The document ID. You must specify either this parameter or `ossKey`.
`metadata`	map	The new metadata. Required.

Code example

resp = client.update_document({
    "knowledgeBaseName": "product_docs_kb",
    "docId": "fc6ed97f-...",
    "metadata": {"author": "Jane Doe", "category": "Technical Docs", "version": 2}
})

print(f"Update status: {resp['data']['updateStatus']}")  # UPDATED or NO_OP

Response

Field	Type	Description
`docId`	string	The document ID.
`ossKey`	string	The OSS path.
`updatedAt`	long	The update timestamp.
`updateStatus`	string	`NO_OP` or `UPDATED`.

Usage notes

Metadata updates are full replacements. To update a single field, include all other existing fields in the request.
Passing "metadata": null (None in Python) clears all metadata.
If the metadata field is not specified, the original value is retained.
Limits: metadata cannot exceed 4 KB in size or contain more than 200 fields.
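
Since updates replace the whole metadata object, changing one field safely means merging it into the existing metadata first. The sketch below builds the full object and checks it against the limits above. `build_full_metadata` is a hypothetical helper, and measuring size as UTF-8 encoded JSON and counting only top-level keys are assumptions about how the service applies the limits.

```python
import json

MAX_METADATA_BYTES = 4 * 1024   # 4 KB limit from the notes above
MAX_METADATA_FIELDS = 200       # field limit from the notes above

def build_full_metadata(existing, changes):
    """Merge changes into the existing metadata and validate the assumed limits."""
    merged = {**(existing or {}), **changes}
    if len(merged) > MAX_METADATA_FIELDS:
        raise ValueError(f"Metadata has {len(merged)} fields, limit is {MAX_METADATA_FIELDS}")
    # Assumption: size is measured on the UTF-8 encoded JSON representation.
    size = len(json.dumps(merged, ensure_ascii=False).encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        raise ValueError(f"Metadata is {size} bytes, limit is {MAX_METADATA_BYTES}")
    return merged

existing = {"author": "Jane Doe", "category": "Product Manual"}
print(build_full_metadata(existing, {"category": "Technical Docs"}))
```

In practice, `existing` would come from the `metadata` field returned by `get_document`, and the merged result would be passed as `metadata` to `update_document`.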

Delete documents

Call the delete_documents method to delete specified documents and all their associated chunks.

Request parameters

Parameter	Type	Description
`knowledgeBaseName`	string	The name of the knowledge base. Required.
`subspace`	string	The name of the subspace. Required if subspaces are enabled for the knowledge base.
`documents`	list<object>	A list of documents to delete. Required.
`documents[].docId`	string	The document ID. You must specify either this parameter or `ossKey`.
`documents[].ossKey`	string	The OSS path. You must specify either this parameter or `docId`.

Code example

resp = client.delete_documents({
    "knowledgeBaseName": "product_docs_kb",
    "documents": [
        {"docId": "fc6ed97f-..."},
        {"ossKey": "oss://example-bucket/docs/faq.docx"}
    ]
})

# Check the deletion result for each document
for detail in resp["data"]["documentDetails"]:
    # Fall back to docId in case an entry carries no ossKey
    print(f"{detail.get('ossKey') or detail.get('docId')}: {detail['status']}")

Usage notes

As with add_documents, check the status of each document in documentDetails individually.
