Import documents into a knowledge base and manage documents and chunks (upload, status, chunk) - Tablestore

Use the following API operations to manage documents in a knowledge base: import documents, query status, list documents, update metadata, and delete documents.

Supported document formats

PDF: .pdf
Word: .doc, .docx
Excel: .xls, .xlsx
PowerPoint: .ppt, .pptx
Plain text: .txt
Markdown: .md

Document status lifecycle

After a document is uploaded, it transitions through the following statuses before it becomes searchable:

Status	Description	Actions
`Pending`	The task is queued for processing.	Query status, Delete
`Indexing`	The system is parsing, chunking, and vectorizing the document.	Query status, Delete
`Completed`	Indexing is complete. The document is now searchable.	Search, Update metadata, Delete, View chunks
`Failed`	Indexing failed.	View failure reason, Delete, Re-upload
`Deleting`	The system is deleting the document and its associated chunks.	Wait for the deletion to complete.

Note

A document cannot be searched while its status is Pending or Indexing. Wait until the status changes to Completed before you search for it.
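The status table can be encoded as a small lookup so that calling code checks whether an action is allowed before it sends a request. This is a minimal sketch: `ALLOWED_ACTIONS`, `can_perform`, and `is_searchable` are hypothetical names that are not part of the SDK, and the action labels are copied from the table.

```python
# Hypothetical helper that mirrors the status table; not part of the SDK.
ALLOWED_ACTIONS = {
    "Pending": {"Query status", "Delete"},
    "Indexing": {"Query status", "Delete"},
    "Completed": {"Search", "Update metadata", "Delete", "View chunks"},
    "Failed": {"View failure reason", "Delete", "Re-upload"},
    "Deleting": set(),  # no actions: wait for the deletion to complete
}


def can_perform(status: str, action: str) -> bool:
    """Return True if the table lists `action` for a document in `status`."""
    return action in ALLOWED_ACTIONS.get(status, set())


def is_searchable(status: str) -> bool:
    """Only documents in the Completed status are searchable."""
    return can_perform(status, "Search")
```

For example, `is_searchable("Indexing")` is false, so a caller would keep polling instead of issuing a search.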

Add documents

Import a document into the knowledge base. The system automatically completes parsing, chunking, embedding vectorization, and index building. Uploading a document with the same ossKey overwrites the existing document.

The SDK supports three ways to import documents, using two methods:

Method	SDK method	Description
Upload local file	`upload_documents()`	Specify the path to a local file. The SDK automatically uploads it to OSS and then adds it to the knowledge base.
Add OSS file	`add_documents()`	Specify the path to an existing OSS file.
Batch import from OSS directory	`add_documents()`	Specify an OSS directory path. The system recursively scans and adds all files in the directory.

Request parameters

Parameter	Type	Description
`knowledgeBaseName`	string	The name of the knowledge base. Required.
`subspace`	string	The name of the subspace. The maximum length is 128 characters. Required if subspaces are enabled for the knowledge base.
`documents`	list<object>	A list of documents. Required. You can include up to 10 documents in a single request, and each file must not exceed 50 MB. To request higher limits, submit a ticket or contact technical support through the Tablestore technical exchange group (36165029092).
`documents[].filePath`	string	The local file path. Required when using `upload_documents`.
`documents[].ossKey`	string	The path to an OSS file or directory. The length must be between 1 and 256 characters. Required when using `add_documents`.
`documents[].metadata`	object	The document metadata. It must conform to the metadata schema defined for the knowledge base.
`documents[].inclusionFilters`	list<string>	Inclusion filters for scanning OSS directories. The `*` wildcard is supported at the beginning and end of a pattern (for example, `*.pdf`).
`documents[].exclusionFilters`	list<string>	Exclusion filters for scanning OSS directories. The `*` wildcard is supported at the beginning and end of a pattern (for example, `*draft*`).
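The limits in the parameter table can be checked on the client before a request is sent. The sketch below is illustrative only: `validate_documents` is a hypothetical helper, not an SDK function, and it assumes that 50 MB means 50 × 1024 × 1024 bytes, that the ossKey length is counted in characters, and that the default limits (not raised ones) apply.

```python
import os

MAX_DOCS_PER_REQUEST = 10           # default limit from the parameter table
MAX_FILE_BYTES = 50 * 1024 * 1024   # 50 MB, assuming 1 MB = 1024 * 1024 bytes


def validate_documents(documents, local=False):
    """Return a list of problems; an empty list means the batch looks valid.

    Hypothetical client-side check, not part of the SDK.
    Use local=True for upload_documents (filePath) and
    local=False for add_documents (ossKey).
    """
    problems = []
    if not documents:
        problems.append("documents is required")
    elif len(documents) > MAX_DOCS_PER_REQUEST:
        problems.append(f"at most {MAX_DOCS_PER_REQUEST} documents per request, got {len(documents)}")
    for i, doc in enumerate(documents):
        if local:
            path = doc.get("filePath")
            if not path:
                problems.append(f"documents[{i}]: filePath is required")
            elif os.path.isfile(path) and os.path.getsize(path) > MAX_FILE_BYTES:
                problems.append(f"documents[{i}]: file exceeds 50 MB")
        else:
            key = doc.get("ossKey") or ""
            if not 1 <= len(key) <= 256:
                problems.append(f"documents[{i}]: ossKey must be 1-256 characters")
    return problems
```

The service remains the authority on what is accepted; a check like this only catches obvious mistakes earlier.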

Code examples

Upload a local file

Specify the path to a local file. The SDK automatically handles the two-step process: uploading the file to OSS and then adding it to the knowledge base.

Note

When you use upload_documents, you must provide both oss_endpoint and oss_bucket_name when initializing the AgentStorageClient. Otherwise, a ValueError is raised.

resp = client.upload_documents({
    "knowledgeBaseName": "product_docs_kb",
    "documents": [
        {
            "filePath": "/home/user/docs/product_manual.pdf",
            "metadata": {"author": "Jane Doe", "category": "Product Manual"}
        },
        {
            "filePath": "/home/user/docs/faq.docx",
            "metadata": {"author": "John Doe", "category": "FAQ"}
        }
    ]
})

Add an OSS file

If the file already exists in OSS, specify its ossKey directly.

resp = client.add_documents({
    "knowledgeBaseName": "product_docs_kb",
    "documents": [
        {
            "ossKey": "oss://example-bucket/docs/product_manual.pdf",
            "metadata": {"author": "Jane Doe"}
        }
    ]
})

Batch import from an OSS directory

Specify an OSS directory path. The system recursively scans all files within the directory. You can use inclusionFilters and exclusionFilters to filter files based on name patterns.

resp = client.add_documents({
    "knowledgeBaseName": "product_docs_kb",
    "documents": [
        {
            "ossKey": "oss://example-bucket/docs/",
            "inclusionFilters": ["*.pdf", "*.docx"],
            "exclusionFilters": ["*draft*"]
        }
    ]
})
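To preview which files a pair of filters would select, the patterns can be approximated locally with Python's `fnmatch` module. This is a rough sketch under stated assumptions: `preview_filters` is a hypothetical helper, it matches patterns against the file name only, and `fnmatch` supports more syntax (`?`, `[...]`) than the documented `*` wildcard, so the service's actual matching may differ.

```python
from fnmatch import fnmatchcase


def preview_filters(keys, inclusion_filters=None, exclusion_filters=None):
    """Return the keys whose file name passes the inclusion and exclusion filters."""
    selected = []
    for key in keys:
        name = key.rsplit("/", 1)[-1]  # match on the file name, not the full path
        included = not inclusion_filters or any(
            fnmatchcase(name, pattern) for pattern in inclusion_filters
        )
        excluded = any(fnmatchcase(name, pattern) for pattern in exclusion_filters or [])
        if included and not excluded:
            selected.append(key)
    return selected


keys = [
    "oss://example-bucket/docs/manual.pdf",
    "oss://example-bucket/docs/faq.docx",
    "oss://example-bucket/docs/draft_v2.pdf",
    "oss://example-bucket/docs/notes.txt",
]
selected = preview_filters(keys, ["*.pdf", "*.docx"], ["*draft*"])
```

With these sample keys, `selected` contains only `manual.pdf` and `faq.docx`: the draft is excluded and the `.txt` file matches no inclusion pattern.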

Response

Response fields

Field	Type	Description
`documentDetails`	list<object>	The processing result for each document.
`documentDetails[].docId`	string	The document ID.
`documentDetails[].ossKey`	string	The OSS path of the document.
`documentDetails[].status`	string	`succeed` or `failed`.
`documentDetails[].failureReason`	string	The reason for the failure. This field is present only if the status is `failed`.

Response examples

{
  "code": "SUCCESS",
  "data": {
    "documentDetails": [
      {"docId": "fc6ed97f-...", "status": "succeed", "ossKey": "oss://example-bucket/docs/product_manual.pdf"},
      {"docId": "940f2c5c-...", "status": "succeed", "ossKey": "oss://example-bucket/docs/faq.docx"}
    ]
  },
  "message": "succeed"
}

Example of a partial failure response (the HTTP status code is still 200 and the code is still SUCCESS):

{
  "code": "SUCCESS",
  "data": {
    "documentDetails": [
      {"status": "failed", "failureReason": "Metadata field 'date' date string format is not supported", "ossKey": "oss://..."},
      {"status": "succeed", "ossKey": "oss://...", "docId": "940f2c5c-..."}
    ]
  },
  "message": "succeed"
}
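Because one request can succeed for some documents and fail for others, it helps to split `documentDetails` by status before continuing. The sketch below assumes the response behaves like the dictionaries shown in the examples; `split_results` is a hypothetical helper name, and the sample data is copied from the partial failure example.

```python
def split_results(resp):
    """Split documentDetails into (succeeded, failed) lists."""
    details = resp.get("data", {}).get("documentDetails", [])
    succeeded = [d for d in details if d.get("status") == "succeed"]
    failed = [d for d in details if d.get("status") != "succeed"]
    return succeeded, failed


# Sample response copied from the partial failure example.
sample = {
    "code": "SUCCESS",
    "data": {
        "documentDetails": [
            {"status": "failed",
             "failureReason": "Metadata field 'date' date string format is not supported",
             "ossKey": "oss://..."},
            {"status": "succeed", "ossKey": "oss://...", "docId": "940f2c5c-..."},
        ]
    },
    "message": "succeed",
}

succeeded, failed = split_results(sample)
for detail in failed:
    print(f"{detail['ossKey']}: {detail.get('failureReason')}")
```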

Usage notes

A 200 OK HTTP response with code: SUCCESS does not guarantee that all documents were processed successfully. You must check the status field for each document in the documentDetails array.
status: "succeed" indicates only that the import task was accepted, not that indexing is complete. The document becomes searchable only after its document status changes to Completed.
If subspace is enabled for the knowledge base, you must pass the subspace parameter. Otherwise, an INVALID_PARAMETER error is returned.
Use a supported date format for metadata date fields, such as yyyy-MM-dd HH:mm:ss. A document whose metadata contains a date in an unsupported format is returned with the failed status.
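In Python, the yyyy-MM-dd HH:mm:ss pattern corresponds to the `strftime` format `%Y-%m-%d %H:%M:%S`. The following sketch formats a `datetime` before placing it in metadata. The `date` field name is taken from the failure message in the example response and must exist in your own metadata schema; `format_metadata_date` is a hypothetical helper, and how the service interprets time zones is not covered here.

```python
from datetime import datetime


def format_metadata_date(value: datetime) -> str:
    """Format a datetime in the yyyy-MM-dd HH:mm:ss pattern."""
    return value.strftime("%Y-%m-%d %H:%M:%S")


metadata = {
    "author": "Jane Doe",
    "date": format_metadata_date(datetime(2024, 3, 5, 9, 7, 0)),
}
```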

Check indexing status

Document upload is an asynchronous process. After a document is uploaded, it must be processed before it can be searched. We recommend using a polling strategy with exponential backoff to check if indexing is complete.

import time

def wait_for_document(client, kb_name, doc_id, max_interval=30):
    """Polls the document status with exponential backoff until indexing is complete."""
    interval = 3
    while True:
        resp = client.get_document({
            "knowledgeBaseName": kb_name,
            "docId": doc_id
        })
        doc = resp["data"][0]  # first record returned for this docId
        status = doc["status"]
        if status == "Completed":
            print(f"Indexing complete. Number of chunks: {doc.get('chunkNum', 'N/A')}")
            return resp
        elif status == "Failed":
            # failedDetails carries the failure reason reported by the service
            raise RuntimeError(f"Indexing failed: {doc.get('failedDetails')}")
        # Still Pending or Indexing: wait, then double the interval up to max_interval
        print(f"Current status: {status}, retrying in {interval}s...")
        time.sleep(interval)
        interval = min(interval * 2, max_interval)

The processing time depends on the size, type, and number of files. Small files are typically processed in a few seconds, while large files or batch imports may take several minutes.
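Since processing can take minutes, a caller may want an upper bound on how long to wait. The sketch below separates the backoff schedule from the polling loop and adds a timeout. All names here (`backoff_intervals`, `poll_until_done`, `fetch_status`) are hypothetical; `fetch_status` stands for any zero-argument callable that returns a status string, for example a thin wrapper around `client.get_document`.

```python
import time


def backoff_intervals(initial=3, factor=2, cap=30):
    """Yield wait times that grow geometrically until they reach `cap`."""
    interval = initial
    while True:
        yield interval
        interval = min(interval * factor, cap)


def poll_until_done(fetch_status, timeout=600, sleep=time.sleep):
    """Call fetch_status() until it returns Completed or Failed.

    Raises TimeoutError when the next wait would push the total time
    spent sleeping past `timeout` seconds (request time is not counted).
    """
    waited = 0
    for interval in backoff_intervals():
        status = fetch_status()
        if status in ("Completed", "Failed"):
            return status
        if waited + interval > timeout:
            raise TimeoutError(f"document still {status} after waiting {waited}s")
        sleep(interval)
        waited += interval
```

With the defaults, the waits are 3, 6, 12, 24, and then 30 seconds for every later attempt. Passing `sleep` as a parameter makes the loop easy to exercise in tests without real delays.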

Query a document

Call the get_document method to retrieve details for a specific document, including its processing status, number of chunks, and metadata.

Request parameters

Parameter	Type	Description
`knowledgeBaseName`	string	The name of the knowledge base. Required.
`subspace`	string	The name of the subspace. Required if subspaces are enabled for the knowledge base.
`docId`	string	The document ID. You must specify either this parameter or `ossKey`.
`ossKey`	string	The OSS file path. You must specify either this parameter or `docId`.

Code example

resp = client.get_document({
    "knowledgeBaseName": "product_docs_kb",
    "docId": "fc6ed97f-..."
})

doc = resp["data"][0]
print(f"Status: {doc['status']}, Number of chunks: {doc.get('chunkNum', 'N/A')}")

Response

Field	Type	Description
`docId`	string	The document ID.
`ossKey`	string	The OSS path.
`subspace`	string	The subspace.
`chunkNum`	int	The number of chunks.
`status`	string	The document status: `Pending`, `Indexing`, `Completed`, `Failed`, or `Deleting`.
`createdAt`	int	The creation timestamp.
`updatedAt`	int	The update timestamp.
`eTag`	string	The eTag of the document.
`failedDetails`	string	The reason for the failure. This field is present only if the status is Failed.
`metadata`	object	The document metadata.

Usage notes

If a document with the same ossKey is added, deleted, and then added again, get_document may return multiple records, including historical ones. To identify the valid document, check the status field of each record and use the one whose status is Completed.
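Selecting the valid record from a multi-record result can be sketched as follows. `pick_current_record` is a hypothetical helper; choosing the largest `updatedAt` when more than one record is Completed is an assumption made for this sketch, not documented behavior.

```python
def pick_current_record(records):
    """Return the record with a Completed status, or None if there is none.

    If several records are Completed, the one with the largest updatedAt
    is returned (an assumed tie-break, not documented behavior).
    """
    completed = [r for r in records if r.get("status") == "Completed"]
    if not completed:
        return None
    return max(completed, key=lambda r: r.get("updatedAt", 0))


# Usage, assuming resp is the result of client.get_document(...):
#     doc = pick_current_record(resp["data"])
```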

List documents

Call the list_documents method to retrieve a paginated list of documents in a knowledge base.

Request parameters

Parameter	Type	Description
`knowledgeBaseName`	string	The name of the knowledge base. Required.
`subspace`	list<string>	A list of subspace names. You can specify up to 10 subspaces. Required if subspaces are enabled for the knowledge base.
`maxResults`	int	The number of results to return. The default value is 10, and the maximum is 1000.
`nextToken`	string	The pagination token for retrieving the next page of results.

Code example

resp = client.list_documents({
    "knowledgeBaseName": "product_docs_kb",
    "maxResults": 20
})

for doc in resp["data"]["documentDetails"]:
    print(f"[{doc['status']}] {doc['ossKey']} (Number of chunks: {doc.get('chunkNum', '-')})")

Usage notes

The subspace parameter supports a list of up to 10 values. If you exceed this limit, an error is returned.
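To walk through every page of results, a loop can pass the returned token back as `nextToken` until no token remains. This is a sketch under assumptions: the response fields are not listed in this section, so reading the token from `resp["data"]["nextToken"]` is a guess based on the request parameter name, and `iter_documents` is a hypothetical helper.

```python
def iter_documents(client, kb_name, page_size=100, subspace=None):
    """Yield every document in a knowledge base, fetching one page at a time."""
    if subspace and len(subspace) > 10:
        raise ValueError("subspace accepts at most 10 values")
    request = {"knowledgeBaseName": kb_name, "maxResults": page_size}
    if subspace:
        request["subspace"] = list(subspace)
    while True:
        resp = client.list_documents(request)
        data = resp["data"]
        yield from data.get("documentDetails", [])
        token = data.get("nextToken")  # assumed location of the pagination token
        if not token:
            break
        request["nextToken"] = token


# Usage:
#     for doc in iter_documents(client, "product_docs_kb"):
#         print(doc["ossKey"], doc["status"])
```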

Update document metadata

Call the update_document method to update the metadata of a specific document.

Note

You can only update the metadata for documents that are in the Completed status. Calling this method for documents in any other status returns an error.

Request parameters

Parameter	Type	Description
`knowledgeBaseName`	string	The name of the knowledge base. Required.
`subspace`	string	The name of the subspace. Required if subspaces are enabled for the knowledge base.
`ossKey`	string	The OSS path of the document. You must specify either this parameter or `docId`.
`docId`	string	The document ID. You must specify either this parameter or `ossKey`.
`metadata`	map	The new metadata. Required.

Code example

resp = client.update_document({
    "knowledgeBaseName": "product_docs_kb",
    "docId": "fc6ed97f-...",
    "metadata": {"author": "Jane Doe", "category": "Technical Docs", "version": 2}
})

print(f"Update status: {resp['data']['updateStatus']}")  # UPDATED or NO_OP

Response

Field	Type	Description
`docId`	string	The document ID.
`ossKey`	string	The OSS path.
`updatedAt`	long	The update timestamp.
`updateStatus`	string	`NO_OP` or `UPDATED`.

Usage notes

Metadata updates are overwrite operations. The new metadata you provide completely replaces the existing metadata. If you only want to update a single field, you must include all other existing fields in the request.
Passing "metadata": null will clear all metadata.
If the metadata field is not specified, the original value is retained.
Limitations: The total size of the metadata (keys and values) cannot exceed 4 KB. The maximum number of fields is 200.
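Because an update replaces the whole metadata object, changing one field safely means reading the current metadata, merging the change, and sending the full result. The sketch below shows the merge step with a local check against the limits. `merge_metadata` is a hypothetical helper, and the size calculation (UTF-8 bytes of each key plus each JSON-encoded value, with 4 KB taken as 4096 bytes) is only an approximation of how the service may measure size.

```python
import json

MAX_METADATA_BYTES = 4 * 1024  # 4 KB, assuming 1 KB = 1024 bytes
MAX_METADATA_FIELDS = 200


def merge_metadata(existing, changes):
    """Return the existing metadata with `changes` applied, checked against the limits."""
    merged = {**(existing or {}), **changes}
    if len(merged) > MAX_METADATA_FIELDS:
        raise ValueError(f"{len(merged)} fields exceeds the limit of {MAX_METADATA_FIELDS}")
    # Approximate size: UTF-8 bytes of every key plus every JSON-encoded value.
    size = sum(
        len(key.encode("utf-8")) + len(json.dumps(value, ensure_ascii=False).encode("utf-8"))
        for key, value in merged.items()
    )
    if size > MAX_METADATA_BYTES:
        raise ValueError(f"metadata is about {size} bytes; the limit is {MAX_METADATA_BYTES}")
    return merged


# Usage with the SDK calls shown in this document:
#     doc = client.get_document({"knowledgeBaseName": kb, "docId": doc_id})["data"][0]
#     new_metadata = merge_metadata(doc.get("metadata"), {"version": 3})
#     client.update_document({"knowledgeBaseName": kb, "docId": doc_id, "metadata": new_metadata})
```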

Delete documents

Call the delete_documents method to delete specified documents and all their associated chunks.

Request parameters

Parameter	Type	Description
`knowledgeBaseName`	string	The name of the knowledge base. Required.
`subspace`	string	The name of the subspace. Required if subspaces are enabled for the knowledge base.
`documents`	list<object>	A list of documents to delete. Required.
`documents[].docId`	string	The document ID. You must specify either this parameter or `ossKey`.
`documents[].ossKey`	string	The OSS path. You must specify either this parameter or `docId`.

Code example

resp = client.delete_documents({
    "knowledgeBaseName": "product_docs_kb",
    "documents": [
        {"docId": "fc6ed97f-..."},
        {"ossKey": "oss://example-bucket/docs/faq.docx"}
    ]
})

# Check the deletion result for each document
for detail in resp["data"]["documentDetails"]:
    print(f"{detail['ossKey']}: {detail['status']}")

Usage notes

As with add_documents, a successful response does not mean that every document was deleted. Check the status of each document in documentDetails individually.
