You can call the SelectObject operation to execute SQL statements on an object and obtain the execution results.
Background information
You can directly process data in Object Storage Service (OSS) when you run Hadoop 3.0 or use services such as Spark, Hive, and Presto on E-MapReduce. Alibaba Cloud services, such as MaxCompute and Data Lake Analytics (DLA), are also supported in OSS.
The GetObject operation that is provided by OSS requires the big data platform to download OSS objects to your local PC for analysis and filtering. As a result, large amounts of bandwidth and client resources are wasted in many query scenarios.
To resolve this issue, OSS provides the SelectObject operation. SelectObject allows OSS to preliminarily filter data by using conditions and projections that are provided by the big data platform. As a result, only useful data is returned to the big data platform. This way, the client can consume fewer bandwidth resources and process smaller amounts of data to minimize CPU utilization and memory usage. This makes OSS-based data warehousing and data analytics a better option.
Billing
You are charged based on the size of the objects that are scanned when you call the SelectObject operation. For more information, see Data processing fees.
Supported object types
This section describes the object types that are supported by the SelectObject operation.
- CSV objects (and CSV-like objects such as TSV objects) that conform to RFC 4180. You can specify custom row and column delimiters and quote characters in CSV objects.
- UTF-8 encoded JSON objects. The SelectObject operation supports JSON objects in the
DOCUMENT and LINES formats.
- A JSON DOCUMENT object contains a single object.
- A JSON LINES object consists of lines of objects that are separated by line feeds. However, the entire object itself is not a valid JSON object. SelectObject supports common delimiters, such as \n and \r\n. You do not need to specify the delimiters.
- Standard and Infrequent Access (IA) objects. You must restore Archive and Cold Archive objects before you access the objects.
- Objects that are fully managed and encrypted by OSS or encrypted by using customer master keys (CMKs) that are managed by Key Management Service (KMS).
Supported SQL syntax
- SQL statement: SELECT FROM WHERE
- Data types: string, int(64bit), double(64bit), decimal(128bit), timestamp, and bool.
- Operators: logical operators (AND, OR, and NOT), arithmetic operators (+, -, *, /,
and %), comparison operators (>, =, <, >=, <=, and !=), and string operators (LIKE
and ||).
Important LIKE-based fuzzy matches are case-sensitive.
Supported data types
By default, CSV data in OSS is of the String type. You can use the CAST function to convert the data type.
The following example shows the SQL query statement that is used to convert the data
in the first and second columns into data of the Integer type: Select * from OSSOBject where cast (_1 as int) > cast(_2 as int)
.
The SelectObject operation allows you to implicitly convert the data type by using a WHERE clause. For example, the following SQL statement converts the data in the first and second columns into data of the Integer type:
Select _1 from ossobject where _1 + _2 > 100
If you do not use the CAST function, the type of data in a JSON object remains unchanged. A standard JSON object can support various data types, including Null, Bool, Int64, Double, and String.
Sample SQL statements
SQL statement examples are provided for CSV and JSON objects.
- CSV
Scenario SQL statement Return the first 10 rows. select * from ossobject limit 10 Return integers in the first and third columns. The integer values in the first column are greater than the integer values in the third column. select _1, _3 from ossobject where cast(_1 as int) > cast(_3 as int) Return the number of records in which the data in the first column starts with X. A Chinese character that is specified after LIKE must be UTF-8 encoded. select count(*) from ossobject where _1 like 'X%' Return all records in which the time value in the second column is later than 2018-08-09 11:30:25 and the value in the third column is greater than 200. select * from ossobject where _2 > cast('2018-08-09 11:30:25' as timestamp) and _3 > 200 Return the average value, sum, maximum value, and minimum value of the floating-point numbers in the second column. select AVG(cast(_6 as double)), SUM(cast(_6 as double)), MAX(cast(_6 as double)), MIN(cast(_6 as double)) from ossobject
Return all records in which the strings that are concatenated by the data in the first and third columns start with Tom and end with Anderson. select * from ossobject where (_1 || _3) like 'Tom%Anderson' Return all records in which the value in the first column is divisible by 3. select * from ossobject where (_1 % 3) = 0 Return all records in which the value in the first column is between 1995 and 2012. select * from ossobject where _1 between 1995 and 2012 Return all records in which the value in the fifth column is N, M, G, or L. select * from ossobject where _5 in ('N', 'M', 'G', 'L') Return all records in which the product of the values in the second and third columns is greater than the sum of 100 and the value in the fifth column. select * from ossobject where _2 * _3 > _5 + 100 - JSON
Sample JSON object:
{ "contacts":[ { "firstName": "John", "lastName": "Smith", "isAlive": true, "age": 27, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021-3100" }, "phoneNumbers": [ { "type": "home", "number": "212 555-1234" }, { "type": "office", "number": "646 555-4567" }, { "type": "mobile", "number": "123 456-7890" } ], "children": [], "spouse": null }, …… # Other similar nodes are omitted. ]}
The following table describes the sample SQL statements for JSON objects.
Scenario SQL statement Return all records in which the value of age is 27. select * from ossobject.contacts[*] s where s.age = 27 Return all home phone numbers. select s.number from ossobject.contacts[*].phoneNumbers[*] s where s.type = "home" Return all records in which the value of spouse is null. select * from ossobject s where s.spouse is null Return all records in which the value of children is 0. select * from ossobject s where s.children[0] is null Note The preceding statement is used to specify an empty array because no other methods are available.
Scenarios
In most cases, SelectObject is used for multipart query, query of JSON objects, and analysis of log objects.
- Query large objects by using multipart query
The multipart query feature that is provided by the SelectObject operation is similar to the byte-based multipart download feature that is provided by the GetObject operation. Data is split into parts by row or by split.
- By row: This method is used in most cases. However, unbalanced loads may occur when sparse data is split.
- By split: A split includes multiple rows. Each split has approximately the same size.
Note The method of splitting data by split is more efficient.If columns in a CSV object do not include line feeds, you can divide the object into parts based on bytes. This method is simple because you do not need to create Meta for the object. If you want to query a JSON object or a CSV object in which columns include line feeds, perform the following steps:
- Call the CreateSelectObjectMeta operation to obtain the total number of splits for the object. To reduce the scanning time, we recommend that you call the SelectObject operation asynchronously before you query the object.
- Select the appropriate concurrency level (n) based on the resources on the client. Divide the total number of splits by the concurrency level (n) to obtain the number of splits in each query.
- Configure the parameters in the request body to perform multipart query. For example, you can set split-range to 1-20.
- Combine the results.
- Query JSON objects
When you query a JSON object, narrow down the JSON path range in the FROM clause.
Sample JSON object:
{ "contacts":[ { "firstName": "John", "lastName": "Smith", "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021-3100" }, "phoneNumbers": [ { "type": "home", "number": "212 555-1234" }, { "type": "office", "number": "646 555-4567" }, { "type": "mobile", "number": "123 456-7890" } ] } ]}
If you want to query all streetAddress data of records in which the postal code starts with 10021, execute the following SQL statement:
select s.address.streetAddress from ossobject.contacts[*] s where s.address.postalCode like '10021%'
orselect s.streetAddress from ossobject.contacts[*].address s where s.postalCode like '10021%'
.The
select s.streetAddress from ossobject.contacts[*].address s where s.postalCode like '10021%'
statement provides better performance because the JSON path in the statement is more accurate. - Process high-precision floating-point numbers in JSON objects
If you want to calculate high-precision floating-point numbers in a JSON object, we recommend that you set the ParseJsonNumberAsString parameter to true and use the CAST function to convert the parsed data to the Decimal type. For example, if the value of attribute a is 123456789.123456789, you can execute the
SELECT s.a FROM ossobject s WHERE CAST(s.a AS DECIMAL) > 123456789.12345
statement to maintain the accuracy of attribute a.
Use the OSS console
Use OSS SDKs
You can use only OSS SDK for Java and Python to query objects.
import com.aliyun.oss.model.*;
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
/**
* Examples of create select object metadata and select object.
*
*/
public class SelectObjectSample {
// Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com.
private static String endpoint = "yourEndpoint";
// The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations in OSS is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console.
private static String accessKeyId = "yourAccessKeyId";
private static String accessKeySecret = "yourAccessKeySecret";
// Specify the name of the bucket. Example: examplebucket.
private static String bucketName = "examplebucket";
public static void main(String[] args) throws Exception {
OSS ossClient = new OSSClientBuilder().build(endpoint, accessKeyId, accessKeySecret);
// Specify the full path of the object that you want to query, and then query the data of the object by using SELECT statements. Do not include the bucket name in the full path.
// Specify the full path of the CSV object.
selectCsvSample("test.csv", ossClient);
// Specify the full path of the JSON object.
selectJsonSample("test.json", ossClient);
ossClient.shutdown();
}
private static void selectCsvSample(String key, OSS ossClient) throws Exception {
// Specify the content of the object that you want to upload.
String content = "name,school,company,age\r\n" +
"Lora Francis,School A,Staples Inc,27\r\n" +
"Eleanor Little,School B,\"Conectiv, Inc\",43\r\n" +
"Rosie Hughes,School C,Western Gas Resources Inc,44\r\n" +
"Lawrence Ross,School D,MetLife Inc.,24";
ossClient.putObject(bucketName, key, new ByteArrayInputStream(content.getBytes()));
SelectObjectMetadata selectObjectMetadata = ossClient.createSelectObjectMetadata(
new CreateSelectObjectMetadataRequest(bucketName, key)
.withInputSerialization(
new InputSerialization().withCsvInputFormat(
// Specify the delimiter that is used to separate different records in the content. Example: \r\n.
new CSVFormat().withHeaderInfo(CSVFormat.Header.Use).withRecordDelimiter("\r\n"))));
System.out.println(selectObjectMetadata.getCsvObjectMetadata().getTotalLines());
System.out.println(selectObjectMetadata.getCsvObjectMetadata().getSplits());
SelectObjectRequest selectObjectRequest =
new SelectObjectRequest(bucketName, key)
.withInputSerialization(
new InputSerialization().withCsvInputFormat(
new CSVFormat().withHeaderInfo(CSVFormat.Header.Use).withRecordDelimiter("\r\n")))
.withOutputSerialization(new OutputSerialization().withCsvOutputFormat(new CSVFormat()));
// Use a SELECT statement to query all records whose values are greater than 40 in the fourth column.
selectObjectRequest.setExpression("select * from ossobject where _4 > 40");
OSSObject ossObject = ossClient.selectObject(selectObjectRequest);
// Read the content of the object.
BufferedReader reader = new BufferedReader(new InputStreamReader(ossObject.getObjectContent()));
while (true) {
String line = reader.readLine();
if (line == null) {
break;
}
System.out.println(line);
}
reader.close();
ossClient.deleteObject(bucketName, key);
}
private static void selectJsonSample(String key, OSS ossClient) throws Exception {
// Specify the content of the object that you want to upload.
final String content = "{\n" +
"\t\"name\": \"Lora Francis\",\n" +
"\t\"age\": 27,\n" +
"\t\"company\": \"Staples Inc\"\n" +
"}\n" +
"{\n" +
"\t\"name\": \"Eleanor Little\",\n" +
"\t\"age\": 43,\n" +
"\t\"company\": \"Conectiv, Inc\"\n" +
"}\n" +
"{\n" +
"\t\"name\": \"Rosie Hughes\",\n" +
"\t\"age\": 44,\n" +
"\t\"company\": \"Western Gas Resources Inc\"\n" +
"}\n" +
"{\n" +
"\t\"name\": \"Lawrence Ross\",\n" +
"\t\"age\": 24,\n" +
"\t\"company\": \"MetLife Inc.\"\n" +
"}";
ossClient.putObject(bucketName, key, new ByteArrayInputStream(content.getBytes()));
SelectObjectRequest selectObjectRequest =
new SelectObjectRequest(bucketName, key)
.withInputSerialization(new InputSerialization()
.withCompressionType(CompressionType.NONE)
.withJsonInputFormat(new JsonFormat().withJsonType(JsonType.LINES)))
.withOutputSerialization(new OutputSerialization()
.withCrcEnabled(true)
.withJsonOutputFormat(new JsonFormat()))
.withExpression("select * from ossobject as s where s.age > 40"); // Use the SELECT statement to query data in the object.
OSSObject ossObject = ossClient.selectObject(selectObjectRequest);
// Read the content of the object.
BufferedReader reader = new BufferedReader(new InputStreamReader(ossObject.getObjectContent()));
while (true) {
String line = reader.readLine();
if (line == null) {
break;
}
System.out.println(line);
}
reader.close();
ossClient.deleteObject(bucketName, key);
}
}
import oss2
def select_call_back(consumed_bytes, total_bytes = None):
print('Consumed Bytes:' + str(consumed_bytes) + '\n')
# The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations in Object Storage Service (OSS) is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console.
access_key_id = 'yourAccessKeyId'
access_key_secret = 'yourAccessKeySecret'
# Specify the bucket name. Example: examplebucket.
bucket_name = 'yourtBucketName'
# Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com.
endpoint = 'https://oss-cn-hangzhou.aliyuncs.com'
# Create the bucket. All object-related methods need to be called based on the bucket.
bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name)
key = 'python_select.csv'
content = 'Tom Hanks,USA,45\r\n'*1024
filename = 'python_select.csv'
# Upload a CSV object.
bucket.put_object(key, content)
# Configure the parameters of the SelectObject API operation.
csv_meta_params = {'RecordDelimiter': '\r\n'}
select_csv_params = {'CsvHeaderInfo': 'None',
'RecordDelimiter': '\r\n',
'LineRange': (500, 1000)}
csv_header = bucket.create_select_object_meta(key, csv_meta_params)
print(csv_header.rows)
print(csv_header.splits)
result = bucket.select_object(key, "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
select_content = result.read()
print(select_content)
result = bucket.select_object_to_file(key, filename,
"select * from ossobject where _3 > 44", select_call_back, select_csv_params)
bucket.delete_object(key)
###JSON DOCUMENT
key = 'python_select.json'
content = "{\"contacts\":[{\"key1\":1,\"key2\":\"hello world1\"},{\"key1\":2,\"key2\":\"hello world2\"}]}"
filename = 'python_select.json'
# Upload a JSON DOCUMENT object.
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'DOCUMENT'}
result = bucket.select_object(key, "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
select_content = result.read()
print(select_content)
result = bucket.select_object_to_file(key, filename,
"select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)
###JSON LINES
key = 'python_select_lines.json'
content = "{\"key1\":1,\"key2\":\"hello world1\"}\n{\"key1\":2,\"key2\":\"hello world2\"}"
filename = 'python_select.json'
# Upload a JSON LINES object.
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'LINES'}
json_header = bucket.create_select_object_meta(key,select_json_params)
print(json_header.rows)
print(json_header.splits)
result = bucket.select_object(key, "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
select_content = result.read()
print(select_content)
result = bucket.select_object_to_file(key, filename,
"select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)
package main
import (
"fmt"
"github.com/aliyun/aliyun-oss-go-sdk/oss"
"io/ioutil"
"os"
)
func main() {
// Create an OSSClient instance.
// Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com. Specify your actual endpoint.
// The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to access Object Storage Service (OSS) is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console.
client, err := oss.New("yourEndpoint", "yourAccessKeyId", "yourAccessKeySecret")
if err != nil {
fmt.Println("Error:", err)
os.Exit(-1)
}
// Specify the bucket name. Example: examplebucket.
bucket,err := client.Bucket("examplebucket")
if err != nil {
fmt.Println("Error:", err)
os.Exit(-1)
}
// Specify the full path of the object that you want to query. The full path of the object cannot contain the bucket name. Example: exampledir/exampledata.csv.
key := "exampledir/exampledata.csv"
// Specify the full path of the local CSV file. Example: D:\\localpath\\exampledata.csv.
localCsvFile := "D:\\localpath\\exampledata.csv"
err = bucket.PutObjectFromFile(key, localCsvFile)
if err != nil {
fmt.Println("Error:", err)
os.Exit(-1)
}
selReq := oss.SelectRequest{}
// Execute the SELECT statement to query data in the object.
selReq.Expression = `select * from ossobject`
body,err := bucket.SelectObject(key, selReq)
if err != nil {
fmt.Println("Error:", err)
os.Exit(-1)
}
// Read the content of the object.
fc,err := ioutil.ReadAll(body)
if err != nil {
fmt.Println("Error:", err)
os.Exit(-1)
}
defer body.Close()
fmt.Println(string(fc))
}
Use RESTful APIs
If your business requires a high level of customization, you can directly call RESTful APIs. To directly call an API, you must include the signature calculation in your code. For more information, see SelectObject.