This topic describes how to use SelectObject in the Python SDK to query CSV and JSON objects.

Note For more information about SelectObject, see SelectObject in the OSS Developer Guide and the SelectObject API operation.

Python SDK examples

The following code provides an example of how to query CSV and JSON objects:

import oss2

def select_call_back(consumed_bytes, total_bytes =  None):
        print('Consumed Bytes:' + str(consumed_bytes) + '\n')

access_key_id = '<yourAccessKeyId>'
access_key_secret = '<yourAccessKeySecret>'
bucket_name = '<yourtBucketName>'
endpoint = '<yourEndpoint>'

# Create an OSS bucket. All object-related methods must be called through the bucket.
bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name)
key =  'python_select.csv'
content =  'Tom Hanks,USA,45\r\n'*1024
filename =  'python_select.csv'

# Upload a CSV object.
bucket.put_object(key, content)
# Set the parameters for the SelectObject API operation.
csv_meta_params = {'RecordDelimiter': '\r\n'}
select_csv_params = {'CsvHeaderInfo': 'None',
                    'RecordDelimiter': '\r\n',
                    'LineRange': (500, 1000)}

csv_header = bucket.create_select_object_meta(key, csv_meta_params)
print(csv_header.rows)
print(csv_header.splits)
result = bucket.select_object(key, "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select * from ossobject where _3 > 44", select_call_back, select_csv_params)
bucket.delete_object(key)

###JSON DOCUMENT
key =  'python_select.json'
content =  "{\"contacts\":[{\"key1\":1,\"key2\":\"hello world1\"},{\"key1\":2,\"key2\":\"hello world2\"}]}"
filename =  'python_select.json'
# Upload a JSON DOCUMENT object.
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'DOCUMENT'}
result = bucket.select_object(key, "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
select_content = result.read()
print(select_content)

result = bucket.select_object_to_file(key, filename,
      "select s.key2 from ossobject.contacts[*] s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

###JSON LINES
key =  'python_select_lines.json'
content =  "{\"key1\":1,\"key2\":\"hello world1\"}\n{\"key1\":2,\"key2\":\"hello world2\"}"
filename =  'python_select.json'
# Upload a JSON LINES object.
bucket.put_object(key, content)
select_json_params = {'Json_Type': 'LINES'}
json_header = bucket.create_select_object_meta(key,select_json_params)
print(json_header.rows)
print(json_header.splits)

result = bucket.select_object(key, "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
select_content =  result.read()
print(select_content)
result = bucket.select_object_to_file(key, filename,
           "select s.key2 from ossobject s where s.key1 = 1", None, select_json_params)
bucket.delete_object(key)

SelectObject in Python

The following section describes the elements used for the SelectObject API operation in Python in detail. These elements include select_object, select_object_to_file, and create_select_object_meta.

  • select_object
    • The following code provides an example of how to set parameters for select_object:
      def select_object(self, key, sql,
                         progress_callback=None,
                         select_params=None
                         byte_range=None
                         headers=None                
                         ):

      The preceding code example is used to execute an SQL statement on the object that has the specified key and return the query results.

      • sql: the SQL statement that can be directly used as the value of the sql parameter and does not need to be Base64-encoded.
      • Progress_callback: optional. This parameter specifies a callback function used to report the query progress.
      • select_params: the parameters and actions of the SelectObject operation.
      • headers: the header information included in the request. The header information functions the same as that for the GetObject API operation. For example, you can configure the bytes field in the request header to specify the range of a CSV object to query.
    • The following table describes the parameters supported by the select_params parameter.
      Parameter Description
      Json_Type
      • By default, the object is a CSV object if this parameter is not specified.
      • The object is a JSON DOCUMENT object if this parameter is set to DOCUMENT.
      • The object is a JSON LINES object if this parameter is set to LINES.
      CsvHeaderInfo The header information of the CSV object.

      Valid values: None, Ignore, and Use

      • None: This object has no header information configured.
      • Ignore: This object has header information configured, which is not used when the SQL statement is executed.
      • Use: This object has header information configured. The column names in the header information are used when the SQL statement is executed.
      CommentCharacter The comment character within the CSV object. This parameter value can be only one character in length. The default value is None, indicating that no comment characters are allowed.
      RecordDelimiter The row delimiter in the CSV object. This parameter value can be only one or two characters in length. Default value: \n.
      OutputRecordDelimiter The row delimiter in the output result of the SELECT statement. Default value: \n.
      FieldDelimiter The column delimiter in the CSV object. This parameter value can be only one character in length. Default value: comma (,).
      OutputFieldDelimiter The column delimiter in the output result of the SELECT statement. Default value: comma (,).
      QuoteCharacter The quote character for the columns in the CSV object. This parameter value can be only one character in length. Default value: double quotation marks ("). Row and column delimiters enclosed in quotation marks are processed as normal characters.
      SplitRange The split range in multipart query. This parameter value is a closed interval in (start, end) format, indicating that indicating the range of aplits to query.
      LineRange The row range in multipart query. The parameter value is a closed interval in (start, end) format, indicating the range of rows to query.
      CompressionType The format in which the object is compressed. Default value: None. Valid value: GZIP.
      KeepAllColumns If this parameter is set to true, columns that are excluded by the SELECT statement in the CSV object are left empty in the output result. However, the column positions are kept. Default value: false.
      Example: The columns in the CSV object are firstname, lastname, and age. The SQL statement is select firstname, age from ossobject.
      • If KeepAllColumns is set to true, the output result is firstname,,age, in which a comma (,) is added to indicate the position of the excluded lastname column.
      • If KeepAllColumns is set to false, the output result is firstname,age.
      Note By using this parameter, you can directly use the code that is used to process GetObject to process SelectObject without modifications.
      OutputRawData
      • If this parameter is set to True, SelectObject directly returns the original data. If data is not returned for a long time, a timeout error may occur.
      • If this parameter is set to False, the output data is encapsulated in frames. Default value: False.
      EnablePayloadCrc Specifies whether a cyclic redundancy check (CRC) value is calculated for each frame. Default value: False.
      OutputHeader The header information in the first line of the output result. This parameter applies only to CSV objects.
      SkipPartialDataRecord If this parameter is set to True, the current record is skipped when a column in a CSV object has no data or when a key in a JSON object does not exist. If this parameter is set to False, a column without data is left empty in the output result.
      Example: A row includes the columns firstname, lastname, and age. The SQL statement is select _1, _4 from ossobject.
      • If this parameter is set to True, this row is skipped.
      • If this parameter is set to False, firstname,\n is returned.
      MaxSkippedRecordsAllowed The maximum number of skipped rows. The default value is 0, indicating that an error is returned if a row is skipped.
      ParseJsonNumberAsString
      • If this parameter is set to True, all numbers in the JSON object are parsed as strings.
      • If this parameter is set to False, all numbers in the JSON object are parsed as integers or floating-point numbers. Default value: False.

      High-accuracy floating-point numbers in a JSON object suffer loss of accuracy when they are parsed as floating-point numbers. To keep the accuracy, you can set this parameter to True and use the CAST function to convert the parsed data to the decimal type.

    • Returned result of select_object: A SelectObjectResult object is returned. You can use the read() function or the __iter__ method to obtain all results.
      Note If you call the read() function to read all results when there are multiple results to return, excess memory resources are used and you have to wait for a long time. We recommend that you use the __iter__ method (foreach chunk in result) to obtain results and process each chunk in the results. The __iter__ method reduces memory usage and enables the client to process each chunk request processed by the OSS server in a timely manner. This way, the client does not need to wait until all results are returned.
  • select_object_to_file
    def select_object_to_file(self, key, filename, sql,
                       progress_callback=None,
                       select_params=None
                       headers=None
                       ):

    The preceding code example is used to execute an SQL statement on the object that has the specified key and write the query results to another specified object.

    Other parameters are the same as those of select_object.

  • create_select_object_meta
    • Syntax of create_select_object_meta
      def create_select_object_meta(self, key, select_meta_params=None, header=None):

      The preceding code example is used to create Select Meta for the object that has the specified key or obtain Select Meta from such an object. Select Meta includes the total number of rows, total number of columns (for CSV objects), and total number of splits in an object.

      If Select Meta has been created for an object, this function does not re-create Select Meta unless the value of the OverwriteIfExists parameter is set to true.

      To create Select Meta for an object, the object must be completely scanned.

    • The following table describes the parameters supported by the select_meta_params parameter.
      Parameter Description
      Json_Type
      • By default, the object is a CSV object if this parameter is not specified.
      • If this parameter is specified, the parameter value must be LINES, indicating that the object is a JSON LINES object.
      Note This operation does not apply to JSON DOCUMENT objects.
      RecordDelimiter The row delimiter within in the CSV object.
      FieldDelimiter The column delimiter within the CSV object.
      QuoteCharacter The quote character within the CSV object. Row and column delimiters enclosed in quote characters are processed as normal characters.
      CompressionType The format in which the object is compressed. Currently, no compression formats are supported. The default value is None.
      OverwriteIfExists Specifies whether the created Select Meta overwrites the original Select Meta. You do not need to set this parameter in most cases.
    • Returned result of create_select_object_meta: A GetSelectObjectMetaResult object is returned, which includes the rows and splits attributes. For a CSV object, the select_resp object in the result includes the columns attribute, indicating the number of columns in the CSV object.