MaxCompute allows you to create Object Storage Service (OSS) external tables as partitioned tables to access OSS data that is stored in partitions. This method reduces the volume of data that must be read and improves the data processing efficiency. This topic describes the standard and custom formats of OSS partition directories that are supported by MaxCompute.
Background information
- Method 1 (recommended): Store data in a standard or custom partition directory in OSS. When you create an OSS external table by using MaxCompute, you must specify the partitions and oss_location information in the statement that is used to create the table. We recommend that you use standard partition directories. For more information, see Standard partition directories and Custom partition directories.
- Method 2: Plan multiple directories to store data. You can create multiple OSS external tables to read the data that is stored in each OSS directory. This way, each OSS external table points to a subset of OSS data. This method is complex and has poor data management performance. We recommend that you do not use this method.
Standard partition directories
oss://<oss_endpoint>/<Bucket name>/<Directory name>/<partitionKey1=value1>/<partitionKey2=value2>/...
Parameter | Required | Description |
---|---|---|
oss_endpoint | Yes | The OSS endpoint. We recommend that you use an internal endpoint of OSS to prevent
extra fees that are incurred by Internet traffic. For more information about the internal
endpoints of OSS, see Regions and endpoints.
Note We recommend that the OSS bucket for storing data is deployed in the same region as
the MaxCompute project. MaxCompute can be deployed only in some regions. Therefore,
cross-region data connectivity issues may occur.
|
Bucket Name | Yes | The name of the OSS bucket. For more information about how to view the bucket name, see List buckets. |
Directory Name | Yes | The name of the OSS directory. You do not need to include file names in directory names. |
partitionKey | Yes | The name of the partition key column in the OSS external table. |
value | Yes | The value of the partition key column in the OSS external table. |
oss://oss-odps-test/log_data/year=2016/month=06/day=01/logfile
oss://oss-odps-test/log_data/year=2016/month=06/day=02/logfile
oss://oss-odps-test/log_data/year=2016/month=07/day=10/logfile
oss://oss-odps-test/log_data/year=2016/month=08/day=08/logfile
...
The following statements show how to create an OSS external table and import and analyze partition data based on the preceding partition directories:
-- Create an OSS external table.
create external table log_table_external (
click STRING,
ip STRING,
url STRING
)
partitioned by (
year STRING,
month STRING,
day STRING
)
stored by 'com.aliyun.odps.CsvStorageHandler'
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-mc-test/log_data/';
-- Import OSS partition data.
alter table log_table_external add partition (year = '2016', month = '06', day = '01');
alter table log_table_external add partition (year = '2016', month = '06', day = '02');
alter table log_table_external add partition (year = '2016', month = '07', day = '10');
alter table log_table_external add partition (year = '2016', month = '08', day = '08');
...
-- Analyze the data. MaxCompute accesses only the file (logfile) in the log_data/year=2016/month=06/day=01 subdirectory. MaxCompute does not scan all data in the log_data directory.
select count(distinct(ip)) from log_table_external where year = '2016' and month = '06' and day = '01';
For more information about how to read data in a standard partition directory, see Access OSS data by using a built-in extractor.
Custom partition directories
If data in OSS is stored in partitions in a non-standard partition directory, MaxCompute allows you to associate different subdirectories with different partitions.
oss://oss-odps-test/log_data_customized/2016/06/01/logfile
oss://oss-odps-test/log_data_customized/2016/06/02/logfile
oss://oss-odps-test/log_data_customized/2016/07/10/logfile
oss://oss-odps-test/log_data_customized/2016/08/08/logfile
...
alter table ... add partition ... location ...
statement to specify subdirectories and associate these subdirectories with different
partitions. Sample statements: alter table log_table_external add partition (year = '2016', month = '06', day = '01')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data_customized/2016/06/01/';
alter table log_table_external add partition (year = '2016', month = '06', day = '02')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data_customized/2016/06/02/';
alter table log_table_external add partition (year = '2016', month = '07', day = '10')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data_customized/2016/07/10/';
alter table log_table_external add partition (year = '2016', month = '08', day = '08')
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/oss-odps-test/log_data_customized/2016/08/08/';
After you specify custom partition directories, you can access data in the subdirectories regardless of whether the data is saved in standard partition directories.