Object Storage Service: Data indexing

Last Updated: Aug 27, 2024

Object Storage Service (OSS) provides the data indexing feature to allow you to index the metadata of objects. You can specify the metadata of objects as index conditions to query objects. Data indexing can help you understand and manage data structures in a more efficient manner. Data indexing also facilitates object queries, statistics, and management.

Scenarios

To meet data audit or data supervision requirements, you may need to query specific objects from an OSS bucket in which hundreds of millions of objects are stored. An object contains a large amount of metadata, including the name, ETag value, storage class, size, tags, and last modified time. The data indexing feature allows you to combine simple query conditions and data aggregation methods based on your business requirements to improve query performance.

Usage notes

  • Supported regions

    The data indexing feature is supported only for buckets that are located in the China (Hangzhou), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), and Singapore regions.

  • Object quantity

    By default, the metadata management feature is supported only for buckets that contain fewer than 10 billion objects.

  • Billing rules

    Currently, the data indexing feature is in public preview. To use the data indexing feature, you must enable the metadata management feature. After you enable the metadata management feature, you are charged for object metadata management and bucket queries. However, you are not charged during the public preview. For more information about the billable items of the data indexing feature, see Data indexing fees.

  • Time required for indexing

    After you enable the metadata management feature, OSS creates an index. The time required to create the index is proportional to the number of objects stored in the bucket. In most cases, creating an index for the first time takes approximately 1 hour for 10 million objects, approximately 1 day for 1 billion objects, and approximately 2 to 3 days for 10 billion objects. These durations are provided only for reference.

  • Multipart upload

    If a bucket contains objects that are uploaded by using multipart upload, the query results include only the complete objects that are combined by calling the CompleteMultipartUpload operation. Parts that belong to multipart upload tasks that have been initiated but neither completed nor canceled are not included in the query results.

Methods

Use the OSS console

  1. Log on to the OSS console.

  2. In the left-side navigation pane, click Buckets. On the Buckets page, find and click the desired bucket.

  3. In the left-side navigation tree, choose Object Management > Data Indexing.

  4. On the Data Indexing page, turn on Metadata Management.

    The time required for metadata management to take effect varies based on the number of objects in the bucket.

  5. Specify basic conditions to filter objects.

    In the Basic Filtering Conditions section, specify the basic filtering conditions based on your business requirements. The following table describes the basic filtering conditions.

    Filtering condition

    Description

    Storage Class

    By default, the following storage classes supported by OSS are selected: Standard, IA, Archive, Cold Archive, and Deep Cold Archive. You can specify the storage class based on your business requirements.

    ACL

    By default, the following access control lists (ACLs) supported by OSS are selected: Inherited from Bucket, Private, Public Read, and Public Read/Write. You can specify the ACL based on your business requirements.

    File Name

    You can select Fuzzy Match or Equal To. For example, to return an object named exampleobject.txt in the query results, use one of the following methods to match the object name:

    • Select Equal To and enter the full name of the object. Example: exampleobject.txt.

    • Select Fuzzy Match and enter the prefix or suffix of the object name. Example: example or .txt.

      Important

      Fuzzy match can match all object names that contain the specified characters. For example, if you enter test next to Fuzzy Match, localfolder/test/.example.jpg and localfolder/test.jpg meet the query condition, and are displayed in the query results.

    Upload Type

    By default, the following upload types are selected. You can specify the upload type based on your business requirements.

    • Normal: returns objects uploaded by using simple upload in the query results.

    • Multipart: returns objects uploaded by using multipart upload in the query results.

    • Appendable: returns objects uploaded by using append upload in the query results.

    • Symlink: returns symbolic links.

    Last Modified At

    You can specify Start Date and End Date for Last Modified At. The values of Start Date and End Date are accurate to seconds.

    Object Size

    You can select Equal To, Greater Than, Greater Than or Equal To, Less Than, or Less Than or Equal To for Object Size. The object size is in KB.

    Object Versions

    You can query only the current versions of objects by using data indexing.

  6. (Optional) Specify other filtering conditions.

    If you want to sort objects in the query results or use tags to filter objects, click Show more filtering conditions.

    • Specify the order in which you want to sort objects in the query results

      In the Object Sort Order section, select Ascending or Descending to sort the objects by Last Modified At, File Name, or Object Size.

    • Specify tag-based filtering conditions

      In the Tag-based Filtering Conditions section, specify the ETags or tags that you want to use to filter objects.

      • ETags support only exact match. An ETag must be enclosed in quotation marks. Example: "5B3C1A2E0563E1B002CC607C6689". If you want to specify multiple ETags, separate them with line feeds.

      • Specify Object Tags by using key-value pairs. The keys and values of object tags are case-sensitive. For more information about tag rules, see Add tags to an object.

    • Specify the methods that you want to use to aggregate object data

      If you want to classify the query results and collect statistics on each category, you can specify data aggregation methods. For example, you can specify data aggregation methods to collect statistics on the sizes of all objects and obtain the number of distinct storage classes of objects in the query results.
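
The Fuzzy Match behavior described for File Name can be pictured as a substring test. The following Python sketch is only a local illustration. It assumes that a fuzzy match is equivalent to checking whether the keyword appears anywhere in the object name, it does not call OSS, and the function name fuzzy_match is hypothetical.

```python
def fuzzy_match(keyword, object_names):
    """Return the names that contain keyword anywhere (assumed fuzzy-match semantics)."""
    return [name for name in object_names if keyword in name]


# Sample object names taken from the examples in this topic.
names = [
    "localfolder/test/.example.jpg",
    "localfolder/test.jpg",
    "exampleobject.txt",
]

# Both names that contain "test" are matched, regardless of where it appears.
matched = fuzzy_match("test", names)
print(matched)
```

Because the keyword can appear anywhere in the name, a short keyword may return more objects than expected. Use Equal To when you need exactly one object.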

Use OSS SDKs

Only OSS SDK for Java, OSS SDK for Python, and OSS SDK for Go allow you to use the data indexing feature to query objects that meet specific conditions. Before you use the data indexing feature to query objects in a bucket, you must enable the metadata management feature for the bucket. For the sample code of data indexing, see Overview.

Java

import com.aliyun.oss.ClientException;
import com.aliyun.oss.OSS;
import com.aliyun.oss.common.auth.*;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.OSSException;
import com.aliyun.oss.model.*;
import java.util.ArrayList;
import java.util.List;

public class Demo {

    // In this example, the endpoint of the China (Hangzhou) region is used. Specify your actual endpoint. 
    private static String endpoint = "https://oss-cn-hangzhou.aliyuncs.com";
    // Specify the name of the bucket. Example: examplebucket. 
    private static String bucketName = "examplebucket";

    public static void main(String[] args) throws Exception {
        // Obtain access credentials from environment variables. Before you run the sample code, make sure that the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables are configured. 
        EnvironmentVariableCredentialsProvider credentialsProvider = CredentialsProviderFactory.newEnvironmentVariableCredentialsProvider();
        // Create an OSSClient instance. 
        OSS ossClient = new OSSClientBuilder().build(endpoint, credentialsProvider);

        try {
            // Query objects that meet specific conditions and list information about the objects based on specific fields and sorting methods. 
            int maxResults = 20;
            // Query objects that are smaller than 1,048,576 bytes in size, return up to 20 objects at a time, and sort the objects in ascending order. 
            String query = "{\"Field\": \"Size\",\"Value\": \"1048576\",\"Operation\": \"lt\"}";
            String sort = "Size";
            DoMetaQueryRequest doMetaQueryRequest = new DoMetaQueryRequest(bucketName, maxResults, query, sort);
            Aggregation aggregationRequest = new Aggregation();
            Aggregations aggregations = new Aggregations();
            List<Aggregation> aggregationList = new ArrayList<Aggregation>();
            // Specify the name of the field that is used in the aggregate operation. 
            aggregationRequest.setField("Size");
            // Specify the operator that is used in the aggregate operation. max indicates the maximum value. 
            aggregationRequest.setOperation("max");
            aggregationList.add(aggregationRequest);
            aggregations.setAggregation(aggregationList);

            // Specify the aggregate operation. 
            doMetaQueryRequest.setAggregations(aggregations);
            doMetaQueryRequest.setOrder(SortOrder.ASC);
            DoMetaQueryResult doMetaQueryResult = ossClient.doMetaQuery(doMetaQueryRequest);
            if(doMetaQueryResult.getFiles() != null){
                for(ObjectFile file : doMetaQueryResult.getFiles().getFile()){
                    System.out.println("Filename: " + file.getFilename());
                    // Query the ETag values that are used to identify the content of the objects. 
                    System.out.println("ETag: " + file.getETag());
                    // Query the access control list (ACL) of the objects.
                    System.out.println("ObjectACL: " + file.getObjectACL());
                    // Query the type of the objects. 
                    System.out.println("OssObjectType: " + file.getOssObjectType());
                    // Query the storage class of the objects. 
                    System.out.println("OssStorageClass: " + file.getOssStorageClass());
                    // Query the number of tags of the objects. 
                    System.out.println("TaggingCount: " + file.getOssTaggingCount());
                    if(file.getOssTagging() != null){
                        for(Tagging tag : file.getOssTagging().getTagging()){
                            System.out.println("Key: " + tag.getKey());
                            System.out.println("Value: " + tag.getValue());
                        }
                    }
                    if(file.getOssUserMeta() != null){
                        for(UserMeta meta : file.getOssUserMeta().getUserMeta()){
                            System.out.println("Key: " + meta.getKey());
                            System.out.println("Value: " + meta.getValue());
                        }
                    }
                }
            }
            if(doMetaQueryResult.getAggregations() != null){
                for(Aggregation aggre : doMetaQueryResult.getAggregations().getAggregation()){
                    // Query the name of the aggregation field. 
                    System.out.println("Field: " + aggre.getField());
                    // Query the aggregation operator. 
                    System.out.println("Operation: " + aggre.getOperation());
                    // Query the values of the aggregate operations. 
                    System.out.println("Value: " + aggre.getValue());
                    if(aggre.getGroups() != null && aggre.getGroups().getGroup().size() > 0){
                        // Query the values of the aggregation operations by group. 
                        System.out.println("Groups value: " + aggre.getGroups().getGroup().get(0).getValue());
                        // Query the total number of the aggregation operations by group. 
                        System.out.println("Groups count: " + aggre.getGroups().getGroup().get(0).getCount());
                    }
                }
            }
            // Display the token that is returned for obtaining the next page of results. 
            System.out.println("NextToken: " + doMetaQueryResult.getNextToken());
        } catch (OSSException oe) {
            System.out.println("Error Message:" + oe.getErrorMessage());
            System.out.println("Error Code:" + oe.getErrorCode());
            System.out.println("Request ID:" + oe.getRequestId());
            System.out.println("Host ID:" + oe.getHostId());
        } catch (ClientException ce) {
            System.out.println("Error Message: " + ce.getMessage());
        } finally {
            // Shut down the OSSClient instance. 
            ossClient.shutdown();
        }
    }
}

Python

# -*- coding: utf-8 -*-
import oss2
from oss2.credentials import EnvironmentVariableCredentialsProvider
from oss2.models import MetaQuery, AggregationsRequest
# Obtain access credentials from environment variables. Before you run the sample code, make sure that the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables are configured. 
auth = oss2.ProviderAuth(EnvironmentVariableCredentialsProvider())

# In this example, the endpoint of the China (Hangzhou) region is used. Specify your actual endpoint. 
# Specify the name of the bucket. Example: examplebucket. 
bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'examplebucket')

# Query objects that meet specific conditions and list the object information based on specific fields and sorting methods. 
# Query objects that are smaller than 1 MB, return up to 10 objects at a time, and sort the objects in ascending order. 
do_meta_query_request = MetaQuery(max_results=10, query='{"Field": "Size","Value": "1048576","Operation": "lt"}', sort='Size', order='asc')
result = bucket.do_bucket_meta_query(do_meta_query_request)

# Display information about each object in the query results. 
# Iterating avoids an IndexError when no objects match the query. 
for file in result.files:
    # Display the object name. 
    print(file.file_name)
    # Display the ETag of the object. 
    print(file.etag)
    # Display the type of the object. 
    print(file.oss_object_type)
    # Display the storage class of the object. 
    print(file.oss_storage_class)
    # Display the CRC-64 value of the object. 
    print(file.oss_crc64)
    # Display the access control list (ACL) of the object. 
    print(file.object_acl)
Go

package main

import (
    "fmt"
    "github.com/aliyun/aliyun-oss-go-sdk/oss"
    "os"
)
func main() {
    // Obtain access credentials from environment variables. Before you run the sample code, make sure that the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables are configured. 
    provider, err := oss.NewEnvironmentVariableCredentialsProvider()
    if err != nil {
        fmt.Println("Error:", err)
        os.Exit(-1)
    }

    // Create an OSSClient instance. 
    // Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com. Specify your actual endpoint. 
    client, err := oss.New("yourEndpoint", "", "", oss.SetCredentialsProvider(&provider))
    if err != nil {
        fmt.Println("Error:", err)
        os.Exit(-1)
    }
    // Query objects that are larger than 30 bytes in size, return up to 10 objects at a time, and sort the objects in ascending order. 
    query := oss.MetaQuery{
        NextToken: "",
        MaxResults: 10,
        Query: `{"Field": "Size","Value": "30","Operation": "gt"}`,
        Sort: "Size",
        Order: "asc",
    }
    // Query objects that match the specified conditions and list object information based on the specified fields and sorting methods. 
    result, err := client.DoMetaQuery("examplebucket", query)
    if err != nil {
        fmt.Println("Error:", err)
        os.Exit(-1)
    }
    fmt.Printf("NextToken:%s\n", result.NextToken)
    for _, file := range result.Files {
        fmt.Printf("File name: %s\n", file.Filename)
        fmt.Printf("size: %d\n", file.Size)
        fmt.Printf("File Modified Time:%s\n", file.FileModifiedTime)
        fmt.Printf("Oss Object Type:%s\n", file.OssObjectType)
        fmt.Printf("Oss Storage Class:%s\n", file.OssStorageClass)
        fmt.Printf("Object ACL:%s\n", file.ObjectACL)
        fmt.Printf("ETag:%s\n", file.ETag)
        fmt.Printf("Oss CRC64:%s\n", file.OssCRC64)
        fmt.Printf("Oss Tagging Count:%d\n", file.OssTaggingCount)
        for _, tagging := range file.OssTagging {
            fmt.Printf("Oss Tagging Key:%s\n", tagging.Key)
            fmt.Printf("Oss Tagging Value:%s\n", tagging.Value)
        }
        for _, userMeta := range file.OssUserMeta {
            fmt.Printf("Oss User Meta Key:%s\n", userMeta.Key)
            fmt.Printf("Oss User Meta Value:%s\n", userMeta.Value)
        }
    }
}
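
A query returns at most the specified maximum number of results, and the result carries a NextToken value that can be passed to a subsequent request. The loop below is a minimal Python sketch of that pagination pattern. It assumes that an empty NextToken means no more pages are available. To keep the logic runnable without network access, it takes a hypothetical run_query callable that is not part of any SDK; in real code, run_query would wrap a call such as do_bucket_meta_query.

```python
def iter_all_files(run_query):
    """Yield every file from a paginated query.

    run_query(next_token) is a hypothetical callable that performs one
    query and returns a (files, next_token) tuple.
    """
    next_token = ""
    while True:
        files, next_token = run_query(next_token)
        yield from files
        if not next_token:  # Assumption: an empty token marks the last page.
            break


# Stand-in for a real query: three pages keyed by the token that requests them.
_PAGES = {
    "": (["a.txt", "b.txt"], "t1"),
    "t1": (["c.txt"], "t2"),
    "t2": (["d.txt"], ""),
}


def fake_query(next_token):
    """Return the canned page for the given token."""
    return _PAGES[next_token]


all_files = list(iter_all_files(fake_query))
print(all_files)
```

With OSS SDK for Python, run_query would likely build a MetaQuery that carries the current token and return the files and next token of the result. Check the exact attribute names against the SDK version that you use.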

Use the OSS API

If your business requires a high level of customization, you can directly call RESTful APIs. To directly call an API, you must include the signature calculation in your code. For more information, see DoMetaQuery.
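
Whether you call the API directly or pass a query string to an SDK, a query condition is a JSON document with Field, Value, and Operation keys, as in the SDK samples. The helper below is a minimal Python sketch for building such strings. The function names are hypothetical. The nested form that uses SubQueries with the and operation is an assumption based on the DoMetaQuery reference; verify it there before you rely on it.

```python
import json


def condition(field, operation, value):
    """Build one simple condition, for example Size lt 1048576."""
    return {"Field": field, "Value": str(value), "Operation": operation}


def all_of(*conditions):
    """Combine conditions with a logical AND (assumed SubQueries syntax)."""
    return {"Operation": "and", "SubQueries": list(conditions)}


# Objects smaller than 1 MB, as in the SDK samples.
simple = json.dumps(condition("Size", "lt", 1048576))

# Objects larger than 30 bytes and smaller than 1 MB.
combined = json.dumps(all_of(
    condition("Size", "gt", 30),
    condition("Size", "lt", 1024 * 1024),
))

print(simple)
print(combined)
```

Building the string with json.dumps avoids the hand-escaped quotation marks that appear in the Java sample and keeps values such as sizes as strings, which is the form the samples use.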

FAQ

When hundreds of millions of objects are stored in a bucket, why are data indexes not created for a long period of time?

Indexes can be created for 600 objects in approximately 1 second. You can estimate the period of time required to create indexes based on the number of objects in the bucket.
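
As a rough worked example of that estimate, the Python sketch below divides the object count by the rate of 600 objects per second quoted above. The function name is hypothetical, and the result is only an order-of-magnitude guide; the actual time can differ, as the reference durations in the usage notes show.

```python
def estimate_index_hours(object_count, objects_per_second=600):
    """Estimate index creation time in hours at a fixed indexing rate."""
    return object_count / objects_per_second / 3600


# Estimate for a bucket that contains 100 million objects.
hours = estimate_index_hours(100_000_000)
print(f"{hours:.1f} hours")
```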

References

The data indexing feature supports multiple filtering conditions, such as the last modified time, storage class, ACL, and size of objects. If you want to filter OSS objects whose last modified time is within a specific period of time from a large number of objects in a bucket, see How to filter OSS objects whose last modified time is within a specific period of time.