MaxCompute allows you to access unstructured data on Object Storage Service (OSS) by using a built-in extractor. The unstructured data includes data in CSV or TSV text files and data in gzip-compressed CSV or TSV text files. This topic describes how to create an OSS external table by using a built-in extractor of MaxCompute to access unstructured OSS data.

Prerequisites

  • MaxCompute is granted the permissions to access OSS data.

    For more information about authorization, see STS authorization.

  • An OSS bucket, OSS directories, and OSS data files are prepared.

    For more information about operations on OSS buckets, see Create buckets.

    For more information about how to create an OSS directory, see Create directories.

    For more information about how to upload data files, see Simple upload.

Background information

OSS is a secure, cost-effective, and highly reliable cloud storage service that can store a large number of files of any type.

Before you access OSS data in CSV or TSV files or gzip-compressed CSV or TSV files, you must create an OSS external table by using a built-in extractor of MaxCompute. The OSS external table can be a partitioned table or a non-partitioned table.

The methods to access OSS data by using OSS external tables vary based on the table type.
  • If you create a non-partitioned OSS external table by using a CREATE EXTERNAL TABLE statement, you can directly query OSS data from the OSS external table.
  • If you create a partitioned OSS external table by using a CREATE EXTERNAL TABLE statement, you must add partition data from OSS to the OSS external table before you can query OSS data from this table.
MaxCompute provides the following built-in extractors:
  • com.aliyun.odps.CsvStorageHandler: defines how to read data from and write data to CSV files. When you use this extractor, the column delimiter is a comma (,) and the line feed is \n.
  • com.aliyun.odps.TsvStorageHandler: defines how to read data from and write data to TSV files. When you use this extractor, the column delimiter is \t and the line feed is \n.
Note You can create, search for, query, configure, process, and analyze external tables in a visualized manner by using MaxCompute with DataWorks. For more information, see External table.

Limits

  • Built-in extractors of MaxCompute cannot be used to access data of the DATETIME type in OSS data files. You can use a custom extractor to access data of the DATETIME type in OSS data files. For more information about custom extractors, see Use a custom extractor of MaxCompute to access data of the DATETIME type in OSS data files.
  • If OSS data is stored in compressed files, you can use built-in extractors of MaxCompute to access the OSS data. These built-in extractors allow you to read data only from gzip-compressed CSV or TSV files.
  • The schemas of OSS external tables must be the same as those of OSS data files.
  • You can perform operations on OSS external tables only by using MaxCompute SQL.

Usage notes

When you use OSS external tables, take note of the following items:

  • If the OSS external table that you created is a partitioned table, the OSS directory where partition data is stored must comply with the format requirements. For more information about the format requirements, see Partition paths of an OSS external table.
  • OSS external tables record only the mappings between these tables and OSS directories. If you delete an OSS external table, data in the OSS directory that is mapped to the table is not deleted.
  • If a data file stored in OSS is an object of the Archive storage class, you must first restore the file. For more information about the restoration operation, see Restore objects.

Syntax used to create an external table

create external table [if not exists] <mc_oss_extable_name>
(
<col_name> <date_type>,
...
)
[partitioned by (<col_name> <data_type>, ...)]
-- Specify a built-in extractor.
stored by '<StorageHandler>' 
-- Specify the parameters that are related to the OSS external table.
with serdeproperties (
 'odps.properties.rolearn'='<ram_arn>'
-- Set odps.text.option.gzip.input.enabled to true when OSS data files are compressed in the GZIP format.
 [,'odps.text.option.gzip.input.enabled'='true']
-- Specify the properties that are related to the OSS external table based on your business requirements.
 [,'<property_name>'='<property_value>'[,'<property_name>'='<property_value>'...]]
) 
location '<oss_location>';
  • if not exists: optional. If you create a table by using the name of an existing table but do not specify the if not exists parameter, an error is returned. If you specify the if not exists parameter, a success message is returned when you create a table by using the name of an existing table. The success message is returned even if the schema of the existing table is different from that of the table you want to create. If you create a table by using the name of an existing table, the table is not created and the metadata of the existing table is not changed.
  • mc_oss_extable_name: required. The name of the OSS external table that you want to create.
  • col_name: required. The name of the column in the OSS external table.
  • date_type: required. The data type of the column in the OSS external table.
  • partitioned by (<col_name> <data_type>, ...): optional. The partition information of the OSS external table when the table is a partitioned table.
    • col_name: required. The name of the partition key column.
    • date_type: required. The data type of the partition key column.
  • StorageHandler: required. The built-in extractor. To read data from and write data to CSV files, set this parameter to com.aliyun.odps.CsvStorageHandler. To read data from and write data to TSV files, set this parameter to com.aliyun.odps.TsvStorageHandler. You need only to specify this parameter based on your business requirements. The logic that is used to read and write data is implemented by MaxCompute.
  • odps.properties.rolearn'='<ram_arn>: required. The Alibaba Cloud Resource Name (ARN) of AliyunODPSDefaultRole in Resource Access Management (RAM). You can obtain its value from the role details page in the RAM console.
  • location: required. The OSS directory where data files are stored. The OSS directory is in the format of oss://<oss_endpoint>/<Bucket name>/<Directory name>/. MaxCompute automatically reads data from all files in the OSS directory that you specified.
    • oss_endpoint: the OSS endpoint. We recommend that you use an internal endpoint of OSS to avoid extra fees that are incurred by Internet traffic. For more information about the internal endpoints of OSS, see Regions and endpoints. We recommend that OSS for storing data is deployed in the same region as the MaxCompute project. MaxCompute can be deployed only in some regions. In this case, cross-region data connectivity issues may occur.
    • Bucket name: the name of the OSS bucket. For more information about how to view the bucket name, see List buckets.
    • Directory name: the name of the OSS directory. File names do not need to be included in directory names. Examples of incorrect usage:
      http://oss-odps-test.oss-cn-shanghai-internal.aliyuncs.com/Demo/                  -- HTTP connections are not supported.
      https://oss-odps-test.oss-cn-shanghai-internal.aliyuncs.com/Demo/                 -- HTTPS connections are not supported.
      oss://oss-odps-test.oss-cn-shanghai-internal.aliyuncs.com/Demo                    -- The endpoint is incorrect.
      oss://oss://oss-cn-shanghai-internal.aliyuncs.com/oss-odps-test/Demo/vehicle.csv  -- The file name does not need to be included in the directory name. 
  • 'odps.text.option.gzip.input.enabled'='true': required when OSS data files are compressed in the GZIP format.
  • <property_name>'='<property_value>': optional. property_name indicates the property name and property_value indicates the property value. The following table describes the properties that are related to OSS external tables.
    Property name Description Property value Default value
    odps.text.option.gzip.input.enabled Required when you create an OSS external table to read data from OSS data files in the GZIP format. This property specifies whether data can be read from OSS files in the GZIP format.
    • True
    • False
    False
    odps.text.option.gzip.output.enabled Required when you create an OSS external table to write data to OSS files in the GZIP format. This property specifies whether data can be written to OSS files in the GZIP format.
    • True
    • False
    False
    odps.text.option.header.lines.count Specifies the number of rows to skip in the file. Non-negative integer 0
    odps.text.option.null.indicator Specifies the strings in the file to be parsed as NULL in an SQL statement. For example, if you use \N to represent NULL, a,\N,b is parsed as a, NULL, b. String Empty string
    odps.text.option.ignore.empty.lines Specifies whether to ignore empty rows.
    • True
    • False
    True
    odps.text.option.encoding Specifies the encoding format of the text file
    • UTF-8
    • UTF-16
    • US-ASCII
    • GBK
    UTF-8
    odps.text.option.delimiter Specifies the column delimiter of the text file Single character Comma (,)
    odps.text.option.use.quote Specifies whether to recognize column delimiters in a CSV file if it uses double quotation marks (") as column delimiters. If fields in the CSV file contain specified symbols that are used to separate multiple values, these fields must be enclosed in double quotation marks ("). The symbols include a carriage return, line feed (CRLF) pair, double quotation mark ("), and comma (,). If a field contains a double quotation mark ("), replace the double quotation mark (") with two double quotation marks ("") for escaping.
    • True
    • False
    False

Syntax used to add partition data to OSS external tables

After you create a partitioned OSS external table, MaxCompute automatically adds partition data to the OSS external table based on the OSS directory that is specified when you create this table. Syntax:
  • Syntax 1 (recommended): MaxCompute automatically parses the OSS directory structure, identifies partitions, and adds partition data to the OSS external table.
    msck repair table <mc_oss_extable_name> add partitions;
    mc_oss_extable_name: required. The name of the OSS external table to which you want to add partition data.
  • Syntax 2: Manually add partition data to the OSS external table.
    alter table <mc_oss_extable_name> add partition (<col_name>=<col_value>) [add partition (<col_name>=<col_value>)...]
    • mc_oss_extable_name: required. The name of the OSS external table to which you want to add partition data.
    • col_name: required. The name of the partition key column.
    • col_value: required. The value of the partition key column.
    The values of col_name and col_value must be the same as those in the names of the OSS directories where partition data is stored. For example, if the directory structure shown in the following figure is used, the value of col_name is direction, and the values of col_value are N, NE, S, SW, and W. One add partition clause in the ALTER TABLE statement is used to add a partition that is mapped to an OSS directory. The number of add partition clauses must be the same as that of OSS directories. Directories

Sample data

Sample source data is provided for you to better understand the examples in this topic. The following source data is prepared on OSS:
  • oss_endpoint: oss-cn-hangzhou-internal.aliyuncs.com, which indicates that OSS is deployed in the China (Hangzhou) region.
  • Bucket name: oss-mc-test.
  • Directory name: Demo1/ (associated with a non-partitioned OSS external table), Demo2/ (associated with a partitioned OSS external table), and Demo3/(associated with a compressed file). Directories
  • The following data files are uploaded to the preceding directories:
    • The vehicle.csv file is uploaded to the Demo1/ directory. The file contains the following data:
      1,1,51,1,46.81006,-92.08174,9/14/2014 0:00,S
      1,2,13,1,46.81006,-92.08174,9/14/2014 0:00,NE
      1,3,48,1,46.81006,-92.08174,9/14/2014 0:00,NE
      1,4,30,1,46.81006,-92.08174,9/14/2014 0:00,W
      1,5,47,1,46.81006,-92.08174,9/14/2014 0:00,S
      1,6,9,1,46.81006,-92.08174,9/15/2014 0:00,S
      1,7,53,1,46.81006,-92.08174,9/15/2014 0:00,N
      1,8,63,1,46.81006,-92.08174,9/15/2014 0:00,SW
      1,9,4,1,46.81006,-92.08174,9/15/2014 0:00,NE
      1,10,31,1,46.81006,-92.08174,9/15/2014 0:00,N
      Demo1
    • The Demo2/ directory contains the following subdirectories: direction=N, direction=NE, direction=S, direction=SW, and direction=W. The vehicle1.csv, vehicle2.csv, vehicle3.csv, vehicle4.csv, and vehicle5.csv files are separately uploaded to the five subdirectories. These files contain the following data:
      --vehicle1.csv
      1,7,53,1,46.81006,-92.08174,9/15/2014 0:00
      1,10,31,1,46.81006,-92.08174,9/15/2014 0:00
      
      --vehicle2.csv
      1,2,13,1,46.81006,-92.08174,9/14/2014 0:00
      1,3,48,1,46.81006,-92.08174,9/14/2014 0:00
      1,9,4,1,46.81006,-92.08174,9/15/2014 0:00
      
      --vehicle3.csv
      1,6,9,1,46.81006,-92.08174,9/15/2014 0:00
      1,5,47,1,46.81006,-92.08174,9/14/2014 0:00
      1,6,9,1,46.81006,-92.08174,9/15/2014 0:00
      
      --vehicle4.csv
      1,8,63,1,46.81006,-92.08174,9/15/2014 0:00
      
      --vehicle5.csv
      1,4,30,1,46.81006,-92.08174,9/14/2014 0:00
      Demo2
    • The vehicle.csv.gz file is uploaded to the Demo3/ directory. The vehicle.csv file is included in the vehicle.csv.gz file and has the same content as the vehicle.csv file in the Demo1/ directory. Compressed file

Examples

  • Example 1: Create a non-partitioned OSS external table based on the OSS data file in the Demo1/ directory and view data in the OSS external table.
    1. Log on to the MaxCompute client and create an OSS external table. Sample statement:
      create external table if not exists mc_oss_csv_external1
      (
      vehicleId int,
      recordId int,
      patientId int,
      calls int,
      locationLatitute double,
      locationLongtitue double,
      recordTime string,
      direction string
      )
      stored by 'com.aliyun.odps.CsvStorageHandler' 
      with serdeproperties (
       'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole'
      ) 
      location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo1/';

      You can execute the DESC EXTENDED mc_oss_csv_external1; statement to view the schema of the OSS external table that you created.

    2. Use the MaxCompute client to read and process OSS data based on the OSS external table that you created. Sample statement:
      select recordId, patientId, direction from mc_oss_csv_external1 where patientId > 25;
      The following result is returned:
      +------------+------------+------------+
      | recordid   | patientid  | direction  |
      +------------+------------+------------+
      | 1          | 51         | S          |
      | 3          | 48         | NE         |
      | 4          | 30         | W          |
      | 5          | 47         | S          |
      | 7          | 53         | N          |
      | 8          | 63         | SW         |
      | 10         | 31         | N          |
      +------------+------------+------------+
  • Example 2: Create a partitioned OSS external table based on the OSS data files in the Demo2/ directory and view the data in the OSS external table.
    1. Log on to the MaxCompute client and create an OSS external table. Sample statement:
      create external table if not exists mc_oss_csv_external2
      (
      vehicleId int,
      recordId int,
      patientId int,
      calls int,
      locationLatitute double,
      locationLongtitue double,
      recordTime string
      )
      partitioned by (
      direction string
      )
      stored by 'com.aliyun.odps.CsvStorageHandler' 
      with serdeproperties (
       'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole'
      ) 
      location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo2/';

      You can execute the DESC EXTENDED mc_oss_csv_external2; statement to view the schema of the OSS external table that you created.

    2. Use the MaxCompute client to add partition data to the OSS external table. Sample statement:
      msck repair table mc_oss_csv_external2 add partitions;
      -- The preceding statement is equivalent to the following statement:
      alter table mc_oss_csv_external2 add partition (direction = 'N') partition (direction = 'NE') partition (direction = 'S') partition (direction = 'SW') partition (direction = 'W');
    3. Use the MaxCompute client to read and process OSS data based on the OSS external table that you created. Sample statement:
      select recordId, patientId, direction from mc_oss_csv_external1 where direction = 'NE';
      The following result is returned:
      +------------+------------+------------+
      | recordid   | patientid  | direction  |
      +------------+------------+------------+
      | 2          | 13         | NE         |
      | 3          | 48         | NE         |
      | 9          | 4          | NE         |
      +------------+------------+------------+
  • Example 3: Create an OSS external table based on the GZIP file in the Demo3/ directory and view data in the external table.
    1. Log on to the MaxCompute client and create an OSS external table. Sample statement:
      create external table if not exists mc_oss_csv_external3
      (
      vehicleId int,
      recordId int,
      patientId int,
      calls int,
      locationLatitute double,
      locationLongtitue double,
      recordTime string,
      direction string
      )
      stored by 'com.aliyun.odps.CsvStorageHandler' 
      with serdeproperties (
       'odps.properties.rolearn'='acs:ram::xxxxxx:role/aliyunodpsdefaultrole',
       'odps.text.option.gzip.input.enabled'='true'
      ) 
      location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/Demo3/';

      You can execute the DESC EXTENDED mc_oss_csv_external3; statement to view the schema of the OSS external table that you created.

    2. Use the MaxCompute client to read and process OSS data based on the OSS external table that you created. Sample statement:
      select recordId, patientId, direction from mc_oss_csv_external3 where patientId > 25;
      The following result is returned:
      +------------+------------+------------+
      | recordid   | patientid  | direction  |
      +------------+------------+------------+
      | 1          | 51         | S          |
      | 3          | 48         | NE         |
      | 4          | 30         | W          |
      | 5          | 47         | S          |
      | 7          | 53         | N          |
      | 8          | 63         | SW         |
      | 10         | 31         | N          |
      +------------+------------+------------+
Note To obtain data by using HTTPS at the underlying layer, submit the set odps.sql.unstructured.data.oss.use.https=true; command with the SELECT statement that you want to execute.