You can call the SelectObject operation to execute SQL statements on an object and obtain the execution results.

Background information

You can directly process data in Object Storage Service (OSS) when you run Hadoop 3.0 or use services such as Spark, Hive, and Presto in E-MapReduce. Alibaba Cloud services, such as MaxCompute and Data Lake Analytics (DLA), are also supported in OSS.

The GetObject operation that is provided by OSS requires the big data platform to download OSS data to your local PC for analysis and filtering. As a result, large amounts of bandwidth and client resources are wasted in many query scenarios.

To resolve this issue, OSS provides the SelectObject operation. SelectObject allows OSS to preliminarily filter data by using conditions and projections that are provided by the big data platform. As a result, only useful data is returned to the big data platform. This way, the client can consume fewer bandwidth resources and process lower amounts of data to minimize the CPU utilization and memory usage. This makes OSS-based data warehousing and data analytics a better option.

Supported object types

This section describes the object types that are supported by the SelectObject operation.

  • CSV objects (and CSV-like objects such as TSV objects) that conform to RFC 4180. You can customize row and column delimiters and quote characters in CSV objects.
  • UTF-8 encoded JSON objects. The SelectObject operation supports JSON objects in DOCUMENT and LINES formats.
    • A JSON DOCUMENT object contains a single object.
    • A JSON LINES object consists of lines of objects that are separated by row delimiters. A complete JSON object is not a valid JSON object. SelectObject supports common delimiters, such as \n and \r\n. You do not need to specify the delimiters.
  • Standard and Infrequent Access (IA) objects. You must restore Archive and Cold Archive objects before you access the objects.
  • Objects that are fully managed and encrypted by OSS or encrypted by using customer master keys (CMKs) that are managed by Key Management Service (KMS).

Supported SQL syntax

  • SQL statement: SELECT FROM WHERE
  • Data types: string, int(64bit), double(64bit), decimal(128bit), timestamp, and bool.
  • Operations: logical operations (AND, OR, and NOT), arithmetic operations (+, -, *, /, and %), comparison operations (>, =, <, >=, <=, and !=), and String operations (LIKE and ||).
    Notice LIKE-based fuzzy matches are case-sensitive.

Supported data types

By default, CSV data in OSS is of the String type. You can use the CAST function to convert the data type.

The following example shows the SQL query statement that is used to convert the data in the first and second columns into data of the Integer type: Select * from OSSOBject where cast (_1 as int) > cast(_2 as int).

The SelectObject operation allows you to implicitly convert the data type by using a WHERE clause. For example, the following SQL statement converts the data in the first and second columns into data of the Integer type:

Select _1 from ossobject where _1 + _2 > 100

If you do not use the CAST function, the data type of a JSON object is determined by the type of the data in the object. A standard JSON object can support various data types, including Null, Boolean, Int64, Double, and String.

Sample SQL statements

The following table describes the sample SQL statements for CSV objects.

  • CSV
    Description SQL statement
    Return the first 10 rows. select * from ossobject limit 10
    Return integers in the first and third columns. The values of the integers in the first column are greater than the values of the integers in the third column. select _1, _3 from ossobject where cast(_1 as int) > cast(_3 as int)
    Return the number of records in which the data in the first column starts with X. A Chinese character that is specified after LIKE must be UTF-8 encoded. SELECT COUNT(*) FROM ossobject WHERE _1 LIKE 'X%'
    Return all records in which the time of the data in the second column is later than 2018-08-09 11:30:25 and the value in the third column is greater than 200. select * from ossobject where _2 > cast('2018-08-09 11:30:25' as timestamp) and _3 > 200
    Return the average value, sum, maximum value, and minimum value of the floating-point numbers in the second column.

    select AVG(cast(_2 as double)), SUM(cast(_2 as double)), MAX(cast(_2 as double)), MIN(cast(_2 as double))

    Return all records in which the strings that are concatenated by the data in the first and third columns start with Tom and end with Anderson. select * from ossobject where (_1 || _3) like 'Tom%Anderson'
    Return all records in which the data in the first column is divisible by 3. select * from ossobject where (_1 % 3) = 0
    Return all records in which the value of data in the first column is between 1995 and 2012. select * from ossobject where _1 between 1995 and 2012
    Return all records in which the value of data in the fifth column is N, M, G, or L. select * from ossobject where _5 in ('N', 'M', 'G', 'L')
    Return all records in which the product of the values of data in the second and third columns is greater than the sum of 100 and the value of data in the fifth column. select * from ossobject where _2 * _3 > _5 + 100
  • JSON

    Sample JSON object:

    {
      "contacts":[
    {
      "firstName": "John",
      "lastName": "Smith",
      "isAlive": true,
      "age": 27,
      "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
      },
      "phoneNumbers": [
        {
          "type": "home",
          "number": "212 555-1234"
        },
        {
          "type": "office",
          "number": "646 555-4567"
        },
        {
          "type": "mobile",
          "number": "123 456-7890"
        }
      ],
      "children": [],
      "spouse": null
    }, …… # Other similar nodes are omitted.
    ]}

    The following table describes the sample SQL statements for JSON objects.

    Description SQL statement
    Return all records in which the value of age is 27. select * from ossobject.contacts[*] s where s.age = 27
    Return all home phone numbers. select s.number from ossobject.contacts[*].phoneNumbers[*] s where s.type = “home”
    Return all records in which the value of spouse is null. select * from ossobject s where s.spouse is null
    Return all records in which the value of children is 0. select * from ossobject s where s.children[0] is null
    Note The preceding statement is used to specify an empty array because no other methods are available.

Scenario

In most cases, SelectObject is used for multipart query, query of JSON objects, and analysis of log objects.

  • Query large objects by using multipart query

    The multipart query feature that is provided by the SelectObject operation is similar to the byte-based multipart download feature that is provided by the GetObject operation. Data is split into parts by row or by split.

    • By row: This method is used in most cases. However, unbalanced loads may occur when sparse data is split.
    • By split: A split includes multiple rows. Each split has approximately the same size.
    Note The method of splitting data by split is more efficient.

    If columns in a CSV object do not include line feeds, you can divide the object into parts based on bytes. This method is simple because you do not need to create Meta for the object. If you want to query a JSON object or a CSV object in which columns include line feeds, perform the following steps:

    1. Call the CreateSelectObjectMeta operation to obtain the total number of splits for the object. Before you call SelectObject, asynchronously call the CreateSelectObjectMeta operation to reduce the scanning time.
    2. Select the appropriate concurrency level (n) based on the resources on the client. Divide the total number of splits by the concurrency level (n) to obtain the number of splits in each query.
    3. Configure the parameters in the request body to perform multipart query. For example, you can set split-range to 1-20.
    4. Combine the results.
  • Query JSON objects

    When you query a JSON object, narrow down the JSON path range in the FROM clause.

    Sample JSON object:

    {
      "contacts":[
    {
      "firstName": "John",
      "lastName": "Smith",
      "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
      },
      "phoneNumbers": [
        {
          "type": "home",
          "number": "212 555-1234"
        },
        {
          "type": "office",
          "number": "646 555-4567"
        },
        {
          "type": "mobile",
          "number": "123 456-7890"
        }
      ]
    }
    ]}

    If you want to query all streetAddress data of records in which the postal code starts with 10021, execute the following SQL statement: select s.address.streetAddress from ossobject.contacts[*] s where s.address.postalCode like ‘10021%’ or select s.streetAddress from ossobject.contacts[*].address s where s.postalCode like ‘10021%’.

    The select s.streetAddress from ossobject.contacts[*].address s where s.postalCode like ‘10021%’ statement provides better performance because the JSON path in the statement is more accurate.

  • Process high-precision floating-point numbers in JSON objects

    If you want to calculate high-precision floating-point numbers in a JSON object, we recommend that you set the ParseJsonNumberAsString parameter to true and use the CAST function to convert the parsed data to DECIMAL type. For example, if the value of attribute a is 123456789.123456789, you can execute the SELECT s.a FROM ossobject s WHERE CAST(s.a AS DECIMAL) > 123456789.12345 statement to maintain the accuracy of attribute a.

Use the OSS console

Notice You can select and obtain up to 40 MB of data from an object whose size does not exceed 128 MB in the OSS console.
  1. Log on to the OSS console.
  2. In the left-side navigation pane, click Buckets. On the Buckets page, click the name of the bucket from which you want to select object content.
  3. In the left-side navigation pane, choose Files > Files.
  4. On the Files page, choose More > Select Content in the Actions column corresponding to the object from which you want to select content.
  5. In the Select Content panel, specify the following parameters:
    • File Type: Select the content type of the object. Valid values: CSV and JSON.
    • Delimiter: Specify this parameter for CSV files. Valid values: comma (,) and Custom.
    • Title Line: Specify this parameter for CSV files. You can configure this option to specify whether the first row of the object contains a column heading.
    • JSON Display Mode: Select the display mode for JSON objects.
    • Compression Format: Specify whether to compress the current object. Currently, only GZIP-based compression is supported.
  6. Click Preview to preview the object.
    Notice When you preview Standard objects, you are charged for data scanning by OSS Select. When you preview IA, Archive, and Cold Archive objects, you are charged for data scanning by OSS Select and data retrieval.
  7. Click Next Step. Enter and execute an SQL statement.
  8. View the query result. Click Download to download the selected content to the local device.
    For example, a CSV file named People contains the following columns: Name, Company, and Age.
    • To query people who are above 50 years old and whose names start with Lora, execute the following SQL statement. In the statement, _1, _2, and _3 specify column indexes. _1 specifies the index of the first column. _2 specifies the index of the second column. _3 specifies the index of the third column.
      select * from ossobject where _1 like 'Lora*' and _3 > 50
    • To query the number of rows in the object, maximum age, and minimum age, execute the following SQL statement:
      select count(*), max(cast(_3 as int)), min(cast(_3 as int)) from oss_object

Use OSS SDKs

You can use only OSS SDKs for Java and Python to query objects.

import com.aliyun.oss.model.*;
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

/**
 * Examples of create select object metadata and select object.
 *
 */
public class SelectObjectSample {
    // Set yourEndpoint to the endpoint of the region in which the bucket is located. For example, if your bucket is located in the China (Hangzhou) region, set yourEndpoint to https://oss-cn-hangzhou.aliyuncs.com. 
    private static String endpoint = "yourEndpoint";
    // Security risks may arise if you use the AccessKey pair of an Alibaba Cloud account to access OSS because the account has permissions on all API operations. We recommend that you use a Resource Access Management (RAM) user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
    private static String accessKeyId = "yourAccessKeyId";
    private static String accessKeySecret = "yourAccessKeySecret";
    // Specify the name of the bucket. Example: examplebucket. 
    private static String bucketName = "examplebucket";

    public static void main(String[] args) throws Exception {
        OSS ossClient = new OSSClientBuilder().build(endpoint, accessKeyId, accessKeySecret);
        // Specify the full path of the object that you want to query, and then query the data of the object by using SELECT statements. The full path of the object cannot contain bucket names. 
        // Specify the full path of the CSV object. 
        selectCsvSample("test.csv", ossClient);
        // Specify the full path of the JSON object. 
        selectJsonSample("test.json", ossClient);
        ossClient.shutdown();
    }

    private static void selectCsvSample(String key, OSS ossClient) throws Exception {
        // Specify the content of the sample object you want to upload. 
        String content = "name,school,company,age\r\n" +
                "Lora Francis,School A,Staples Inc,27\r\n" +
                "Eleanor Little,School B,\"Conectiv, Inc\",43\r\n" +
                "Rosie Hughes,School C,Western Gas Resources Inc,44\r\n" +
                "Lawrence Ross,School D,MetLife Inc.,24";

        ossClient.putObject(bucketName, key, new ByteArrayInputStream(content.getBytes()));

        SelectObjectMetadata selectObjectMetadata = ossClient.createSelectObjectMetadata(
                new CreateSelectObjectMetadataRequest(bucketName, key)
                        .withInputSerialization(
                                new InputSerialization().withCsvInputFormat(
                                        // Specify the delimiter between different records in the content. Example: \r\n. 
                                        new CSVFormat().withHeaderInfo(CSVFormat.Header.Use).withRecordDelimiter("\r\n"))));
        System.out.println(selectObjectMetadata.getCsvObjectMetadata().getTotalLines());
        System.out.println(selectObjectMetadata.getCsvObjectMetadata().getSplits());

        SelectObjectRequest selectObjectRequest =
                new SelectObjectRequest(bucketName, key)
                        .withInputSerialization(
                                new InputSerialization().withCsvInputFormat(
                                        new CSVFormat().withHeaderInfo(CSVFormat.Header.Use).withRecordDelimiter("\r\n")))
                        .withOutputSerialization(new OutputSerialization().withCsvOutputFormat(new CSVFormat()));
       
        selectObjectRequest.setExpression("select * from ossobject where _4 > 40"); // Use SELECT statements to query data in the object. 
        OSSObject ossObject = ossClient.selectObject(selectObjectRequest);

        // Read the content of the object. 
        BufferedReader reader = new BufferedReader(new InputStreamReader(ossObject.getObjectContent()));
        while (true) {
            String line = reader.readLine();
            if (line == null) {
                break;
            }
            System.out.println(line);
        }
        reader.close();

        ossClient.deleteObject(bucketName, key);
    }

    private static void selectJsonSample(String key, OSS ossClient) throws Exception {
        // Specify the content to upload. 
        final String content = "{\n" +
                "\t\"name\": \"Lora Francis\",\n" +
                "\t\"age\": 27,\n" +
                "\t\"company\": \"Staples Inc\"\n" +
                "}\n" +
                "{\n" +
                "\t\"name\": \"Eleanor Little\",\n" +
                "\t\"age\": 43,\n" +
                "\t\"company\": \"Conectiv, Inc\"\n" +
                "}\n" +
                "{\n" +
                "\t\"name\": \"Rosie Hughes\",\n" +
                "\t\"age\": 44,\n" +
                "\t\"company\": \"Western Gas Resources Inc\"\n" +
                "}\n" +
                "{\n" +
                "\t\"name\": \"Lawrence Ross\",\n" +
                "\t\"age\": 24,\n" +
                "\t\"company\": \"MetLife Inc.\"\n" +
                "}";

        ossClient.putObject(bucketName, key, new ByteArrayInputStream(content.getBytes()));

        SelectObjectRequest selectObjectRequest =
                new SelectObjectRequest(bucketName, key)
                        .withInputSerialization(new InputSerialization()
                                .withCompressionType(CompressionType.NONE)
                                .withJsonInputFormat(new JsonFormat().withJsonType(JsonType.LINES)))
                        .withOutputSerialization(new OutputSerialization()
                                .withCrcEnabled(true)
                                .withJsonOutputFormat(new JsonFormat()))
                        .withExpression("select * from ossobject as s where s.age > 40"); // Use SELECT statements to query data in the object. 

        OSSObject ossObject = ossClient.selectObject(selectObjectRequest);

        // Read the content of the object. 
        BufferedReader reader = new BufferedReader(new InputStreamReader(ossObject.getObjectContent()));
        while (true) {
            String line = reader.readLine();
            if (line == null) {
                break;
            }
            System.out.println(line);
        }
        reader.close();

        ossClient.deleteObject(bucketName, key);
    }
}
import oss2

def select_call_back(consumed_bytes, total_bytes =  None):
        print('Consumed Bytes:' + str(consumed_bytes) + '\n')

# The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations in OSS is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
access_key_id = 'yourAccessKeyId'
access_key_secret = 'yourAccessKeySecret'
# Specify the bucket name. Example: examplebucket. 
bucket_name = 'yourtBucketName'
# Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com. 
endpoint = 'https://oss-cn-hangzhou.aliyuncs.com'

# Create the bucket. All object-related methods are called based on the bucket. 
bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name)
key =  'python_select.csv'
content =  'Tom Hanks,USA,45\r\n'*1024
filename =  'python_select.csv'

# Upload a CSV file. 
bucket.put_object(key, content)
# Configure the parameters of the SelectObject API operation. 
csv_meta_params = {'RecordDelimiter': '\r\n'}
select_csv_params = {'CsvHeaderInfo': 'None',
                    'RecordDelimiter': '\r\n',
                    'LineRange': (500, 1000)}

csv_header = bucket.create_select_object_meta(key, csv_meta_params)
print(csv_header.rows)
print(csv_header.splits)
result = bucket.select_object(key, "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
bucket.delete_object(key)

###JSON DOCUMENT
key =  'python_select.json'
content =  "{\"contacts\":[{\"key1\":1,\"key2\":\"hello world1\"},{\"key1\":2,\"key2\":\"hello world2\"}]}"
filename =  'python_select.json'
# Upload a JSON DOCUMENT object. 
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'DOCUMENT'}
result = bucket.select_object(key, "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

###JSON LINES
key =  'python_select_lines.json'
content =  "{\"key1\":1,\"key2\":\"hello world1\"}\n{\"key1\":2,\"key2\":\"hello world2\"}"
filename =  'python_select.json'
# Upload a JSON LINES object. 
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'LINES'}
json_header = bucket.create_select_object_meta(key,select_json_params)
print(json_header.rows)
print(json_header.splits)

result = bucket.select_object(key, "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
select_content =  result.read()
print(select_content)
result = bucket.select_object_to_file(key, filename,
           "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

Use the RESTful API

If your program requires more custom options, you can call RESTful API operations. If you call RESTful API operations, you must manually write code to calculate the signature. For more information, see SelectObject.