Query objects - Object Storage Service - Alibaba Cloud Documentation Center

You can use the SelectObject API to execute SQL statements on objects and retrieve the results.

Background

Hadoop 3.0 supports running services such as Spark, Hive, and Presto on E-MapReduce (EMR) to process data stored in Object Storage Service (OSS). Alibaba Cloud services such as MaxCompute and Data Lake Analytics (DLA) also support processing data directly from OSS.

Traditionally, big data platforms use the GetObject API to download entire objects from OSS before they are filtered and analyzed.

The SelectObject API solves this problem by pushing down the query logic, such as projections and filters, to the OSS layer. This lets you retrieve only the data subset you need, reducing data transfer, lowering costs, and saving client-side CPU and memory. This efficiency makes data analysis on OSS more compelling.

Billing

When you use the SelectObject API to query data, you are charged based on the actual amount of data scanned from the source object. For more information, see Data processing fees.

Supported object formats

The SelectObject API supports specific object formats and SQL syntax.

CSV objects that comply with RFC 4180. This includes CSV-like objects, such as TSV. You can specify custom row delimiters, column delimiters, and quote characters.
JSON objects encoded in UTF-8. Two formats are supported: DOCUMENT and LINES.
- DOCUMENT: The entire object is a single, valid JSON object.
- LINES: The object consists of multiple lines, where each line is a self-contained JSON object. The lines are separated by a newline character, such as \n or \r\n. You do not need to specify the delimiter. The entire object, however, is not a valid JSON object.
Objects in the Standard and Infrequent Access storage classes. Objects in the Archive, Cold Archive, or Deep Cold Archive storage class must be restored before you can query them.
Objects encrypted using server-side encryption with OSS-managed keys or a CMK managed by KMS.

Supported SQL syntax

SQL clauses: SELECT, FROM, WHERE
Data types: string, int (64-bit), double (64-bit), decimal (128-bit), timestamp, and bool
Operators: Logical (AND, OR, NOT), Arithmetic (+, -, *, /, %), Comparison (>, =, <, >=, <=, !=), and String (LIKE, ||)
Important
The LIKE operator performs case-sensitive pattern matching.

Supported data types

By default, all data in a CSV object is treated as strings. You can use the CAST function to perform explicit type conversions.

The following example shows how to use a SQL query to cast the values in columns _1 and _2 to integers:Select * from OSSOBject where cast (_1 as int) > cast(_2 as int)

The SelectObject API also supports implicit type conversion in a WHERE clause. For example, in the following statement, the values in the first and second columns are implicitly converted to integers:

Select _1 from ossobject where _1 + _2 > 100

For JSON objects, if you do not use the CAST function, the data type is determined by the object's content. Standard built-in JSON data types include null, bool, int64, double, and string.

SQL examples

This section provides SQL examples for CSV and JSON objects.

CSV

Use case	SQL statement
Return the first 10 rows of data.	select * from ossobject limit 10
Return integers from the first and third columns where the value in the first column is greater than the value in the third column.	select _1, _3 from ossobject where cast(_1 as int) > cast(_3 as int)
Return the count of records in the first column that start with 'Chen'. (Note: The Chinese character after 'like' must be encoded in UTF-8.)	select count(*) from ossobject where _1 like 'Chen%'
Return all records where the timestamp in the second column is later than 2018-08-09 11:30:25 and the value in the third column is greater than 200.	select * from ossobject where _2 > cast('2018-08-09 11:30:25' as timestamp) and _3 > 200
Return the average, sum, maximum, and minimum of the floating-point numbers in the second column.	select AVG(cast(_6 as double)), SUM(cast(_6 as double)), MAX(cast(_6 as double)), MIN(cast(_6 as double)) from ossobject
Return all records where the concatenated string from the first and third columns starts with 'Tom' and ends with 'Anderson'.	select * from ossobject where (_1 \|\| _3) like 'Tom%Anderson'
Return all records where the value in the first column is divisible by 3.	select * from ossobject where (_1 % 3) = 0
Return all records where the value in the first column is between 1995 and 2012.	select * from ossobject where _1 between 1995 and 2012
Return all records where the value in the fifth column is 'N', 'M', 'G', or 'L'.	select * from ossobject where _5 in ('N', 'M', 'G', 'L')
Return all records where the product of the values in the second and third columns is greater than the sum of 100 and the value in the fifth column.	select * from ossobject where _2 * _3 > _5 + 100

JSON

Assume you have the following JSON object:

{
  "contacts":[
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
},... # Additional contact nodes omitted for brevity.
]}

The following table provides SQL examples.

Use case	SQL statement
Return all records where the value of age is 27.	select * from ossobject.contacts[*] s where s.age = 27
Return all home phone numbers.	select s.number from ossobject.contacts[].phoneNumbers[] s where s.type = "home"
Return all records where the spouse is null.	select * from ossobject s where s.spouse is null
Return all records with no children.	select * from ossobject s where s.children[0] is null Note There is no dedicated syntax to represent an empty array. Use the preceding statement instead.

Use cases

Common use cases for SelectObject include querying large objects in parallel, querying JSON objects, and analyzing log files.

Query large objects in parallel
Similar to the byte-range multipart download feature provided by the GetObject API, the SelectObject API supports parallel queries. You can split data in one of two ways:
- By row: This is a common method for splitting data. However, for sparse data, splitting by row can lead to an unbalanced load across query workers.
- By split: A split is an OSS concept for data sharding. A split contains multiple rows, and each split is approximately the same size.
Note
Splitting data by split is more efficient than by row.
If you can confirm that the columns in your CSV object do not contain newline characters, splitting by byte range is simpler because it does not require metadata creation. If the columns contain newline characters or if you are querying a JSON object, follow these steps:
1. Call the CreateSelectObjectMeta API to get the total number of splits for the object. To reduce scan time, we recommend calling this API asynchronously before you start the query.
2. Based on your client-side resources, determine the degree of parallelism (n). Then, divide the total number of splits by n to calculate how many splits each parallel query should process.
3. In the request body, specify the range of splits to query, such as split-range=1-20.
4. Merge the results from all parallel queries.
Query JSON objects
When you query JSON objects, narrow the JSON path in the FROM clause to improve performance.
Consider the following JSON object example:
```
{
  "contacts":[
{
  "firstName": "John",
  "lastName": "Smith",
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ]
}
]}
```
To find all street addresses where the postal code starts with '10021', you can write the SQL statement as either select s.address.streetAddress from ossobject.contacts[*] s where s.address.postalCode like '10021%' or select s.streetAddress from ossobject.contacts[*].address s where s.postalCode like '10021%'.
The second statement, select s.streetAddress from ossobject.contacts[*].address s where s.postalCode like '10021%', performs better because its JSON path is more precise.
Handle high-precision floating-point numbers in JSON objects
When you need to perform calculations with high-precision floating-point numbers in a JSON object, we recommend setting the ParseJsonNumberAsString option to true and then casting the value to the decimal type. For example, if an attribute a has the value 123456789.123456789, you can use the statement select s.a from ossobject s where cast(s.a as decimal) > 123456789.12345 to maintain the original data's precision.

Procedure

OSS console

Important

In the OSS console, you can extract a maximum of 40 MB of data from an object that is no larger than 128 MB.

Log on to the OSS console.
In the left-side navigation pane, click Buckets. On the Buckets page, find and click the desired bucket.
In the left-side navigation tree, choose Object Management > Objects.
Find the target object and choose > Select Content in the Actions column.

In the Select Content panel, configure the parameters based on the following descriptions.

Parameter	Description
Object Type	Only CSV and JSON object types are supported.
Delimiter	This parameter applies only to CSV objects. You can select a comma (,) or a custom delimiter.
Title Row	This parameter applies only to CSV objects. Specify whether the first row of the object contains column headers.
JSON Format	This parameter applies only to JSON objects. Select the format that matches your JSON object.
Compression Format	Specify whether your object is compressed. Only GZIP compression is supported.

Click Preview.
Important
Previewing an object in the Standard storage class incurs SelectObject scan fees. For objects in the Infrequent Access, Archive, Cold Archive, or Deep Cold Archive storage classes, previewing incurs both SelectObject scan fees and data retrieval fees. For more information, see Data processing fees.
Click Next, then enter and execute a SQL statement.
Assume you have a CSV object named People with three columns: Name, Company, and Age.
- To find people who are older than 50 and whose names start with 'Lora', you can run the following SQL statement. In the statement, _1, _2, and _3 are column indexes representing the first, second, and third columns, respectively.
```
select * from ossobject where _1 like 'Lora*' and _3 > 50
```
- To count the number of rows in the object and find the maximum and minimum ages, you can run the following SQL statement:
```
select count(*), max(cast(_3 as int)), min(cast(_3 as int)) from ossobject
```
View the execution results.
You can also click Download to save the selected content locally.

OSS SDKs

You can query objects using OSS SDKs for Java, Python, and Go.

Java

import com.aliyun.oss.ClientBuilderConfiguration;
import com.aliyun.oss.common.comm.SignVersion;
import com.aliyun.oss.model.*;
import com.aliyun.oss.OSS;
import com.aliyun.oss.common.auth.*;
import com.aliyun.oss.OSSClientBuilder;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

/**
 * Examples of create select object metadata and select object.
 *
 */
public class SelectObjectSample {
    // Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com. 
    private static String endpoint = "https://oss-cn-hangzhou.aliyuncs.com";
    // Specify the name of the bucket. Example: examplebucket. 
    private static String bucketName = "examplebucket";

    public static void main(String[] args) throws Exception {
      
      	// Obtain access credentials from environment variables. Before you run the sample code, make sure that the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables are configured. 
        EnvironmentVariableCredentialsProvider credentialsProvider = CredentialsProviderFactory.newEnvironmentVariableCredentialsProvider();
        // Specify the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the region to cn-hangzhou.
        String region = "cn-hangzhou";

        // Create an OSSClient instance. 
        // Call the shutdown method to release associated resources when the OSSClient is no longer in use.
        ClientBuilderConfiguration clientBuilderConfiguration = new ClientBuilderConfiguration();
        clientBuilderConfiguration.setSignatureVersion(SignVersion.V4);        
        OSS ossClient = OSSClientBuilder.create()
        .endpoint(endpoint)
        .credentialsProvider(credentialsProvider)
        .clientConfiguration(clientBuilderConfiguration)
        .region(region)               
        .build();
        // Specify the full path of the object that you want to query, and then query the data of the object by using SELECT statements. Do not include the bucket name in the full path. 
        // Specify the full path of the CSV object. 
        selectCsvSample("test.csv", ossClient);
        // Specify the full path of the JSON object. 
        selectJsonSample("test.json", ossClient);
        ossClient.shutdown();
    }

    private static void selectCsvSample(String key, OSS ossClient) throws Exception {
        // Specify the content of the object that you want to upload. 
        String content = "name,school,company,age\r\n" +
                "Lora Francis,School A,Staples Inc,27\r\n" +
                "Eleanor Little,School B,\"Conectiv, Inc\",43\r\n" +
                "Rosie Hughes,School C,Western Gas Resources Inc,44\r\n" +
                "Lawrence Ross,School D,MetLife Inc.,24";

        ossClient.putObject(bucketName, key, new ByteArrayInputStream(content.getBytes()));

        SelectObjectMetadata selectObjectMetadata = ossClient.createSelectObjectMetadata(
                new CreateSelectObjectMetadataRequest(bucketName, key)
                        .withInputSerialization(
                                new InputSerialization().withCsvInputFormat(
                                        // Specify the delimiter that is used to separate different records in the content. Example: \r\n. 
                                        new CSVFormat().withHeaderInfo(CSVFormat.Header.Use).withRecordDelimiter("\r\n"))));
        System.out.println(selectObjectMetadata.getCsvObjectMetadata().getTotalLines());
        System.out.println(selectObjectMetadata.getCsvObjectMetadata().getSplits());

        SelectObjectRequest selectObjectRequest =
                new SelectObjectRequest(bucketName, key)
                        .withInputSerialization(
                                new InputSerialization().withCsvInputFormat(
                                        new CSVFormat().withHeaderInfo(CSVFormat.Header.Use).withRecordDelimiter("\r\n")))
                        .withOutputSerialization(new OutputSerialization().withCsvOutputFormat(new CSVFormat()));
        // Use a SELECT statement to query all records whose values are greater than 40 in the 4th column. 
        selectObjectRequest.setExpression("select * from ossobject where _4 > 40");
        OSSObject ossObject = ossClient.selectObject(selectObjectRequest);

        // Read the content of the object. 
        BufferedReader reader = new BufferedReader(new InputStreamReader(ossObject.getObjectContent()));
        while (true) {
            String line = reader.readLine();
            if (line == null) {
                break;
            }
            System.out.println(line);
        }
        reader.close();

        ossClient.deleteObject(bucketName, key);
    }

    private static void selectJsonSample(String key, OSS ossClient) throws Exception {
        // Specify the content of the object that you want to upload. 
        final String content = "{\n" +
                "\t\"name\": \"Lora Francis\",\n" +
                "\t\"age\": 27,\n" +
                "\t\"company\": \"Staples Inc\"\n" +
                "}\n" +
                "{\n" +
                "\t\"name\": \"Eleanor Little\",\n" +
                "\t\"age\": 43,\n" +
                "\t\"company\": \"Conectiv, Inc\"\n" +
                "}\n" +
                "{\n" +
                "\t\"name\": \"Rosie Hughes\",\n" +
                "\t\"age\": 44,\n" +
                "\t\"company\": \"Western Gas Resources Inc\"\n" +
                "}\n" +
                "{\n" +
                "\t\"name\": \"Lawrence Ross\",\n" +
                "\t\"age\": 24,\n" +
                "\t\"company\": \"MetLife Inc.\"\n" +
                "}";

        ossClient.putObject(bucketName, key, new ByteArrayInputStream(content.getBytes()));

        SelectObjectRequest selectObjectRequest =
                new SelectObjectRequest(bucketName, key)
                        .withInputSerialization(new InputSerialization()
                                .withCompressionType(CompressionType.NONE)
                                .withJsonInputFormat(new JsonFormat().withJsonType(JsonType.LINES)))
                        .withOutputSerialization(new OutputSerialization()
                                .withCrcEnabled(true)
                                .withJsonOutputFormat(new JsonFormat()))
                        .withExpression("select * from ossobject as s where s.age > 40"); // Use the SELECT statement to query data in the object. 

        OSSObject ossObject = ossClient.selectObject(selectObjectRequest);

        // Read the content of the object. 
        BufferedReader reader = new BufferedReader(new InputStreamReader(ossObject.getObjectContent()));
        while (true) {
            String line = reader.readLine();
            if (line == null) {
                break;
            }
            System.out.println(line);
        }
        reader.close();

        ossClient.deleteObject(bucketName, key);
    }
}

Python

import oss2
from oss2.credentials import EnvironmentVariableCredentialsProvider

def select_call_back(consumed_bytes, total_bytes =  None):
        print('Consumed Bytes:' + str(consumed_bytes) + '\n')

# Obtain access credentials from environment variables. Before you run the sample code, make sure that the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables are configured. 
auth = oss2.ProviderAuthV4(EnvironmentVariableCredentialsProvider())

# Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com. 
endpoint = "https://oss-cn-hangzhou.aliyuncs.com"

# Specify the ID of the region that maps to the endpoint. Example: cn-hangzhou. This parameter is required if you use the signature algorithm V4.
region = "cn-hangzhou"

# Specify the name of your bucket.
bucket = oss2.Bucket(auth, endpoint, "yourBucketName", region=region)

key ='python_select.csv'
content ='Tom Hanks,USA,45\r\n'*1024
filename ='python_select.csv'

# Upload a CSV file. 
bucket.put_object(key, content)
# Configure the parameters for the SelectObject operation. 
csv_meta_params = {'RecordDelimiter': '\r\n'}
select_csv_params = {'CsvHeaderInfo': 'None',
                    'RecordDelimiter': '\r\n',
                    'LineRange': (500, 1000)}

csv_header = bucket.create_select_object_meta(key, csv_meta_params)
print(csv_header.rows)
print(csv_header.splits)
result = bucket.select_object(key, "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
bucket.delete_object(key)

###JSON DOCUMENT
key =  'python_select.json'
content =  "{\"contacts\":[{\"key1\":1,\"key2\":\"hello world1\"},{\"key1\":2,\"key2\":\"hello world2\"}]}"
filename =  'python_select.json'
# Upload a JSON DOCUMENT object. 
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'DOCUMENT'}
result = bucket.select_object(key, "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

###JSON LINES
key =  'python_select_lines.json'
content =  "{\"key1\":1,\"key2\":\"hello world1\"}\n{\"key1\":2,\"key2\":\"hello world2\"}"
filename =  'python_select.json'
# Upload a JSON LINES object. 
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'LINES'}
json_header = bucket.create_select_object_meta(key,select_json_params)
print(json_header.rows)
print(json_header.splits)

result = bucket.select_object(key, "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
select_content =  result.read()
print(select_content)
result = bucket.select_object_to_file(key, filename,
           "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

package main

import (
	"context"
	"flag"
	"io"
	"log"

	"github.com/aliyun/alibabacloud-oss-go-sdk-v2/oss"
	"github.com/aliyun/alibabacloud-oss-go-sdk-v2/oss/credentials"
)

// Define global variables.
var (
	region     string // The storage region.
	bucketName string // The bucket name.
	objectName string // The object name.
)

// The init function is used to initialize command line parameters.
func init() {
	flag.StringVar(&region, "region", "", "The region in which the bucket is located.")
	flag.StringVar(&bucketName, "bucket", "", "The name of the bucket.")
	flag.StringVar(&objectName, "object", "", "The name of the object.")
}

func main() {
	// Parse command line parameters.
	flag.Parse()

	// Check whether the bucket name is empty.
	if len(bucketName) == 0 {
		flag.PrintDefaults()
		log.Fatalf("invalid parameters, bucket name required")
	}

	// Check whether the region is empty.
	if len(region) == 0 {
		flag.PrintDefaults()
		log.Fatalf("invalid parameters, region required")
	}

	// Check whether the object name is empty.
	if len(objectName) == 0 {
		flag.PrintDefaults()
		log.Fatalf("invalid parameters, object name required")
	}

	// Load the default configurations and set the credential provider and region.
	cfg := oss.LoadDefaultConfig().
		WithCredentialsProvider(credentials.NewEnvironmentVariableCredentialsProvider()).
		WithRegion(region)

	// Create an OSS client.
	client := oss.NewClient(cfg)

	// Create a request to select an object.
	request := &oss.SelectObjectRequest{
		Bucket: oss.Ptr(bucketName), // The bucket name.
		Key:    oss.Ptr(objectName), // The object name.
		SelectRequest: &oss.SelectRequest{
			Expression: oss.Ptr("select * from ossobject limit 10"), // Define an SQL expression to query the first 10 rows of data in the object.
			InputSerializationSelect: oss.InputSerializationSelect{
				CsvBodyInput: &oss.CSVSelectInput{
					FileHeaderInfo: oss.Ptr("Use"),
				},
			},
			OutputSerializationSelect: oss.OutputSerializationSelect{
				OutputHeader: oss.Ptr(true),
			},
		},
	}

	// Execute the request to select the object.
	result, err := client.SelectObject(context.TODO(), request)
	if err != nil {
		log.Fatalf("failed to select object %v", err)
	}

	content, err := io.ReadAll(result.Body)
	if err != nil {
		log.Fatalf("failed to read object %v", err)
	}

	// Print the result of selecting the object.
	log.Printf("select object result:%#v\n", string(content))
}

ossutil CLI

You can use the command-line tool ossutil to query objects. For installation instructions, see Install ossutil.

The following command executes a SQL statement on the object named exampleobject in the bucket named examplebucket. The request uses CSV syntax.

ossutil api select-object --bucket examplebucket --key exampleobject --select-request "{\"Expression\":\"c2VsZWN0IFllYXIsU3RhdGVBYmJyLCBDaXR5TmFtZSwgU2hvcnRfUXVlc3Rpb25fVGV4dCBmcm9tIG9zc29iamVjdA==\",\"InputSerialization\":{\"CSV\":{\"FileHeaderInfo\":\"Use\",\"Range\":\"line-range=0-100\"}},\"OutputSerialization\":{\"JSON\":{\"RecordDelimiter\":\",\"}}}"

For more information about this command, see select-object.

API reference

The operations described above are based on the underlying SelectObject API. If you have advanced requirements, you can make direct REST API requests. When you make direct requests, you must manually calculate the authentication signature. For more information, see SelectObject.