MaxCompute allows you to create partitioned Object Storage Service (OSS) external tables to access data that is stored in partitions. This method reduces the volume of data that needs to be read and increases the data processing efficiency. This topic describes the standard and custom formats of partition paths supported by MaxCompute.

Background

After you associate MaxCompute with an OSS external table, MaxCompute scans all the data in the OSS directories and data files in the subdirectories. If the volume of data in OSS is large, a full text scan causes extra I/O operations and prolongs the data processing. You can address this issue by using the following methods:
  • Method 1 (recommended): Store data in a standard or custom partition path in OSS. When you create an OSS external partitioned table in MaxCompute, you can associate the partitioned OSS external table with an OSS directory in the CREATE TABLE statement. We recommend that you use standard partition paths. For more information, see Standard partition paths and Custom partition paths.
  • Method 2: Plan multiple paths to store data. You can create multiple OSS external tables to read data that is stored in each OSS path. This way, each OSS external table points to a subset of OSS data. This method is complex and has poor data management performance. We recommend that you do not use this method.

This topic describes the partition path format used in method 1. For more information, see Access OSS data by using a built-in extractor, Use a custom extractor to access OSS, or Open source data formats supported by OSS external tables.

Standard partition paths

The following example shows the format of a standard partition path.
oss://<oss_endpoint>/<Bucket name>/<Directory name>/<partitionKey1=value1 >\< partitionKey2=value2>\...
  • oss_endpoint: the OSS endpoint. We recommend that you use an internal endpoint of OSS to avoid extra fees that are incurred by Internet traffic. For more information about the internal endpoints of OSS, see Regions and endpoints. We recommend that OSS for storing data is deployed in the same region as the MaxCompute project. MaxCompute can be deployed only in some regions. In this case, cross-region data connectivity issues may occur.
  • Bucket name: the name of the OSS bucket. For more information about how to view the bucket name, see List buckets.
  • Directory name: the name of the OSS directory. File names do not need to be included in directory names. Examples of incorrect usage:
    http://oss-odps-test.oss-cn-shanghai-internal.aliyuncs.com/Demo/                  -- HTTP connections are not supported.
    https://oss-odps-test.oss-cn-shanghai-internal.aliyuncs.com/Demo/                 -- HTTPS connections are not supported.
    oss://oss-odps-test.oss-cn-shanghai-internal.aliyuncs.com/Demo                    -- The endpoint is incorrect.
    oss://oss-cn-shanghai-internal.aliyuncs.com/oss-odps-test/Demo/vehicle.csv  -- The file name does not need to be included in the directory name. 
  • partitionKey1=value1 and partitionKey2=value2 are the formats that are used to name the subdirectories in an OSS directory. These subdirectories correspond to the partitions in the OSS external table. partitionKey corresponds to the name of a partition key column in the partitioned OSS external table. value corresponds to the value of a partition key column.
Note Data is uploaded to OSS by using either the OSS console or other OSS tools. Therefore, the format of the path where data is saved depends on the format of the data that is uploaded.
For example, a company stores log files in OSS and uses MaxCompute to process data in these log files on a daily basis. If the log files are in the CSV format and can be accessed by using a built-in extractor, you can create a partitioned OSS external table that has the year, month, and day partition fields. Sample statement:
create external table log_table_external (
    click STRING,
    ip STRING,
    url STRING
  )
partitioned by (
    year STRING,
    month STRING,
    day STRING
  )
stored by 'com.aliyun.odps.CsvStorageHandler'
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data/';
Specify the following standard partition paths for OSS data:
oss://oss-odps-test/log_data/year=2016/month=06/day=01/logfile
oss://oss-odps-test/log_data/year=2016/month=06/day=02/logfile
oss://oss-odps-test/log_data/year=2016/month=07/day=10/logfile
oss://oss-odps-test/log_data/year=2016/month=08/day=08/logfile
...
Then, you can load OSS data to the OSS external table and analyze the data. Sample statements:
-- Load OSS data to the OSS external table.
alter table log_table_external add partition (year = '2016', month = '06', day = '01')
alter table log_table_external add partition (year = '2016', month = '06', day = '02')
alter table log_table_external add partition (year = '2016', month = '07', day = '10')
alter table log_table_external add partition (year = '2016', month = '08', day = '08')
...
-- Analyze the data. MaxCompute accesses only the file (logfile) in the log_data/year=2016/month=06/day=01 subdirectory. It does not scan all data in the log_data directory.
select count(distinct(ip)) from log_table_external where year = '2016' and month = '06' and day = '01';

Custom partition paths

If data in OSS is stored in partitions in a non-standard partition path, MaxCompute allows you to associate different subdirectories with different partitions.

The following example shows the format of custom partition paths. A custom partition path has only values of partition key columns but does not have their names.
oss://oss-odps-test/log_data_customized/2016/06/01/logfile
oss://oss-odps-test/log_data_customized/2016/06/02/logfile
oss://oss-odps-test/log_data_customized/2016/07/10/logfile
oss://oss-odps-test/log_data_customized/2016/08/08/logfile
...
After you create an OSS external table, execute the alter table...add partition...location statement to specify subdirectories and associate these subdirectories with different partitions. Sample statements:
alter table log_table_external add partition (year = '2016', month = '06', day = '01')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data_customized/2016/06/01/';
alter table log_table_external add partition (year = '2016', month = '06', day = '02')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data_customized/2016/06/02/';
alter table log_table_external add partition (year = '2016', month = '07', day = '10')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data_customized/2016/07/10/';
alter table log_table_external add partition (year = '2016', month = '08', day = '08')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data_customized/2016/08/08/';

After you specify custom partition paths, you can access data in the subdirectories regardless of whether the data is saved in data partitions in the standard format.