This topic describes how to use SelectObject in OSS SDK for Python to query CSV and JSON objects.

Example

The following code provides an example on how to query CSV and JSON objects:

import oss2

def select_call_back(consumed_bytes, total_bytes =  None):
        print('Consumed Bytes:' + str(consumed_bytes) + '\n')

# The AccessKey pair of an Alibaba Cloud account has permissions on all API operations. Using these credentials to perform operations in Object Storage Service (OSS) is a high-risk operation. We recommend that you use a RAM user to call API operations or perform routine O&M. To create a RAM user, log on to the RAM console. 
access_key_id = 'yourAccessKeyId'
access_key_secret = 'yourAccessKeySecret'
# Specify the bucket name. Example: examplebucket. 
bucket_name = 'yourtBucketName'
# Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com. 
endpoint = 'https://oss-cn-hangzhou.aliyuncs.com'

# Create the bucket. All object-related methods need to be called based on the bucket. 
bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name)
key =  'python_select.csv'
content =  'Tom Hanks,USA,45\r\n'*1024
filename =  'python_select.csv'

# Upload a CSV object. 
bucket.put_object(key, content)
# Configure the parameters of the SelectObject API operation. 
csv_meta_params = {'RecordDelimiter': '\r\n'}
select_csv_params = {'CsvHeaderInfo': 'None',
                    'RecordDelimiter': '\r\n',
                    'LineRange': (500, 1000)}

csv_header = bucket.create_select_object_meta(key, csv_meta_params)
print(csv_header.rows)
print(csv_header.splits)
result = bucket.select_object(key, "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
bucket.delete_object(key)

###JSON DOCUMENT
key =  'python_select.json'
content =  "{\"contacts\":[{\"key1\":1,\"key2\":\"hello world1\"},{\"key1\":2,\"key2\":\"hello world2\"}]}"
filename =  'python_select.json'
# Upload a JSON DOCUMENT object. 
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'DOCUMENT'}
result = bucket.select_object(key, "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

###JSON LINES
key =  'python_select_lines.json'
content =  "{\"key1\":1,\"key2\":\"hello world1\"}\n{\"key1\":2,\"key2\":\"hello world2\"}"
filename =  'python_select.json'
# Upload a JSON LINES object. 
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'LINES'}
json_header = bucket.create_select_object_meta(key,select_json_params)
print(json_header.rows)
print(json_header.splits)

result = bucket.select_object(key, "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
select_content =  result.read()
print(select_content)
result = bucket.select_object_to_file(key, filename,
           "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

Python SelectObject

The following section describes the elements used for the SelectObject API operation in Python in detail. These elements include select_object, select_object_to_file, and create_select_object_meta.

  • select_object
    • The following code provides an example on how to specify parameters for select_object:
      def select_object(self, key, sql,
                         progress_callback=None,
                         select_params=None
                         byte_range=None
                         headers=None
                         ):

      The preceding code example is used to execute an SQL statement on the object that has the specified key and return the query results.

      • sql: the SQL statement that does not need to be Base64-encoded.
      • Progress_callback: optional. This parameter specifies a callback function used to report the query progress.
      • select_params: the parameters and actions of the SelectObject operation.
      • headers: the header information included in the request. The header information functions the same as that for the GetObject operation. For example, you can configure the bytes field in the request header to specify the range of a CSV object to query.
    • The following table describes the parameters supported by the select_params parameter.
      Parameter Description
      Json_Type
      • By default, the object is a CSV object if this parameter is not specified.
      • The object is a JSON Document object if this parameter is set to DOCUMENT.
      • The object is a JSON LINES object if this parameter is set to LINES.
      CsvHeaderInfo The header information of the CSV object.

      Valid values: None, Ignore, and Use.

      • None: This object has no header information configured.
      • Ignore: This object has header information configured, but the information is not used when the SQL statement is executed.
      • Use: This object has header information configured, and the column names in the header information are used when the SQL statement is executed.
      CommentCharacter The comment character in the CSV object. The value of this parameter can be only one character in length. The default value is None, which indicates that no comment characters are allowed.
      RecordDelimiter The row delimiter in the CSV object. The value of this parameter can be only one or two characters in length. Default value: \n.
      OutputRecordDelimiter The row delimiter in the output result of the SELECT statement. Default value: \n.
      FieldDelimiter The column delimiter in the CSV object. The value of this parameter can be only one character in length. The default value is a comma (,).
      OutputFieldDelimiter The column delimiter in the output result of the SELECT statement. The default value is a comma (,).
      QuoteCharacter The quote character for the columns in the CSV object. The value of this parameter can be only one character in length. The default value is a double quotation mark ("). Row and column delimiters enclosed in quote characters are processed as normal characters.
      SplitRange The split range in multipart query. The value of this parameter is a closed interval in the (start, end) format, which indicates the range of splits to query.
      LineRange The row range in multipart query. The value of this parameter is a closed interval in the (start, end) format, which indicates the range of rows to query.
      CompressionType The format in which the object is compressed. Valid values: GZIP and None. Default value: None.
      KeepAllColumns If this parameter is set to true, columns that are excluded by the SELECT statement in the CSV object are left empty in the output result. However, the column positions are kept. Default value: False.
      For example, the columns in the CSV object are firstname, lastname, and age. The SQL statement is select firstname, age from ossobject.
      • If KeepAllColumns is set to true, the output result is firstname,,age, in which an extra comma (,) is added to indicate the position of the excluded lastname column.
      • If KeepAllColumns is set to false, the output result is firstname,age.
      Note This parameter is intended to make the code that is used to process GetObject available to process SelectObject without modifications.
      OutputRawData
      • If this parameter is set to True, SelectObject directly returns the raw data. If data is not returned for a long time, a timeout error may occur.
      • If this parameter is set to False, the output data is encapsulated in frames. Default value: False.
      EnablePayloadCrc Specifies whether a cyclic redundancy check (CRC) value is calculated for each frame. Default value: False.
      OutputHeader The header information in the first line of the output result. This parameter applies only to CSV objects.
      SkipPartialDataRecord If this parameter is set to True for a CSV object, the current record is skipped when a column of this record has no data. If this parameter is set to True for a JSON object, the current record is skipped when a key of this record does not exist. If this parameter is set to False, a column without data is left empty in the output result.
      For example, a row includes the firstname, lastname, and age columns. The SQL statement is select _1, _4 from ossobject.
      • If this parameter is set to True, this row is skipped.
      • If this parameter is set to False, firstname,\n is returned.
      MaxSkippedRecordsAllowed The maximum number of skipped rows. The default value is 0, which indicates that an error is returned if a row is skipped.
      ParseJsonNumberAsString
      • If this parameter is set to True, all numbers in the JSON object are parsed as strings.
      • If this parameter is set to False, all numbers in the JSON object are parsed as integers or floating-point numbers. Default value: False.

      High-precision floating-point numbers in a JSON object suffer loss of precision when they are parsed as floating-point numbers. To keep the precision, you can set this parameter to True and use the CAST function to convert the parsed data to the decimal type.

    • Returned results of select_object: A SelectObjectResult object is returned. You can use the read() function or the _iter_ method to obtain all results.
      Note If you call the read() function to read all results when multiple results are returned, excess memory resources are used and you have to wait for a long time. We recommend that you use the _iter_ method (foreach chunk in result) to obtain results and process each chunk in the results. The _iter_ method reduces memory usage and enables the client to process each chunk request processed by the OSS server in a timely manner. This way, the client does not need to wait until all results are returned.
  • select_object_to_file
    def select_object_to_file(self, key, filename, sql,
                       progress_callback=None,
                       select_params=None
                       headers=None
                       ):

    The preceding code example is used to execute an SQL statement on the object that has the specified key and write the query results to another specified object.

    Other parameters are the same as those of select_object.

  • create_select_object_meta
    • Syntax of create_select_object_meta
      def create_select_object_meta(self, key, select_meta_params=None, header=None):

      The preceding code example is used to create Select Meta for the object that has the specified key or obtain Select Meta from such an object. Select Meta includes the total number of rows, total number of columns (for CSV objects), and total number of splits in an object.

      If Select Meta has been created for an object, this function does not re-create Select Meta unless the OverwriteIfExists parameter is set to true.

      To create Select Meta for an object, the object must be completely scanned.

    • The following table describes the parameters supported by select_meta_params.
      Parameter Description
      Json_Type
      • By default, the object is a CSV object if this parameter is not specified.
      • If this parameter is specified, the value of this parameter must be LINES, which indicates that the object is a JSON LINES object.
      Note This operation does not apply to JSON Document objects.
      RecordDelimiter The row delimiter in the CSV object.
      FieldDelimiter The column delimiter in the CSV object.
      QuoteCharacter The quote character in the CSV object. Row and column delimiters enclosed in quote characters are processed as normal characters.
      CompressionType The format in which the object is compressed. No compression formats are supported. The default value is None.
      OverwriteIfExists Specifies whether the created Select Meta overwrites the original Select Meta. You do not need to set this parameter in most cases.
    • Returned results of create_select_object_meta: A GetSelectObjectMetaResult object is returned, which includes the rows and splits attributes. For a CSV object, the select_resp object in the result includes the columns attribute, which indicates the number of columns in the CSV object.

References

  • For the complete sample code that is used to query CSV and JSON objects, visit GitHub.
  • For more information about the API operation that you can call to query CSV and JSON objects, see SelectObject.