MaxCompute allows you to export data from MaxCompute projects to Object Storage Service (OSS). This provides an easy way to store structured data in OSS. This also allows other computing engines to use the data that is exported from MaxCompute to OSS. This topic describes how to use UNLOAD statements to export data in the CSV format or another open source format from MaxCompute to OSS.
Prerequisites
- OSS is activated.
For more information about how to activate OSS, see Activate OSS.
- You have the SELECT permission on the table that you want to export from a MaxCompute project.
For more information about authorization, see Authorize users.
Limits
- MaxCompute automatically splits the file that is exported to OSS into multiple parts and generates a name for the file. You cannot customize the name or file name extension for the exported file.
- File name extensions cannot be added to the exported files in an open source format.
- If you repeatedly export data, the previously exported file is not overwritten. Instead, a new file is generated.
Usage notes
- You are not charged for UNLOAD statements. The subquery clauses in the UNLOAD statements need to scan data and use computing resources to calculate the results. Therefore, the subquery clauses are charged as common SQL jobs.
- In some scenarios, you can store structured data in OSS to reduce storage costs. However, you must estimate the costs in advance.
The MaxCompute storage fee is USD 0.018 per GB per month. For more information about storage fees, see Storage pricing (pay-as-you-go). The data compression ratio is about 5:1 for the data that is imported into MaxCompute. You are charged based on the size of data after compression.
If you use the Standard storage class of OSS to store your data, the unit price is USD 0.018 per GB per month. For more information about the fees for the Infrequent Access (IA), Archive, and Cold Archive storage classes, see Storage fees.
If you want to export data to reduce storage costs, we recommend that you: (1) Measure the data compression ratio by testing with data that is representative of your workload. (2) Estimate the cost of the UNLOAD statement based on the query statement that you use to export data. (3) Evaluate how the exported data will be accessed to avoid extra costs caused by unnecessary data migration.
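The storage comparison described above can be sketched as a small calculation. This is a minimal illustration, not a billing tool: it uses only the unit prices and the approximate 5:1 compression ratio quoted in this topic, and the `oss_compression_ratio` parameter is a hypothetical input that you would measure yourself for your export format. It ignores the cost of the UNLOAD query and any OSS request or traffic fees.

```python
# Unit prices quoted in this topic (USD per GB per month).
MAXCOMPUTE_PRICE = 0.018
OSS_STANDARD_PRICE = 0.018


def monthly_storage_costs(raw_gb, mc_compression_ratio=5.0, oss_compression_ratio=1.0):
    """Return (maxcompute_cost, oss_cost) in USD per month.

    raw_gb is the uncompressed data size. mc_compression_ratio defaults to the
    approximate 5:1 ratio mentioned in this topic. oss_compression_ratio is an
    assumption: 1.0 means the exported files are not compressed at all.
    """
    mc_cost = raw_gb / mc_compression_ratio * MAXCOMPUTE_PRICE
    oss_cost = raw_gb / oss_compression_ratio * OSS_STANDARD_PRICE
    return mc_cost, oss_cost


mc, oss = monthly_storage_costs(1000)
print(f"MaxCompute: {mc:.2f} USD/month, OSS Standard: {oss:.2f} USD/month")
```

With these inputs, uncompressed files in the Standard storage class cost more than keeping the data in MaxCompute, which is why the compression ratio of the exported files and the OSS storage class matter for the estimate.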
Authorization for access to OSS
Specify the odps.properties.rolearn property in the WITH SERDEPROPERTIES clause of UNLOAD statements, and then use RAM roles to complete OSS authorization. Authorization procedure:
- Log on to the RAM console. In the left-side navigation pane, choose RAM Roles. On the page that appears, click Create RAM Role. In the Create RAM Role panel, specify Alibaba Cloud Service for Trusted entity type in the Select Role Type step. In the Configure Role step, specify Normal Service Role for Role Type, specify RAM Role Name, such as unload2oss, and then select MaxCompute from the Select Trusted Service drop-down list.
For more information, see Create a RAM role for a trusted Alibaba Cloud service.
- After the role is created, attach the system policy AliyunOSSFullAccess to the role.
For more information, see Grant permissions to a RAM role.
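The role created in this procedure is referenced in UNLOAD statements by its ARN, in the form that the examples in this topic use ('acs:ram::<account_id>:role/<role_name>'). The following sketch shows how the property value can be composed. The helper names and the account ID are hypothetical placeholders, not part of any MaxCompute or RAM SDK.

```python
def role_arn(account_id: str, role_name: str) -> str:
    """Compose a RAM role ARN in the form used by the examples in this topic."""
    if not account_id.isdigit():
        raise ValueError("account_id must contain only digits")
    if not role_name:
        raise ValueError("role_name must not be empty")
    return f"acs:ram::{account_id}:role/{role_name}"


def rolearn_property(account_id: str, role_name: str) -> str:
    """Render the key-value pair that is passed in WITH SERDEPROPERTIES."""
    return f"'odps.properties.rolearn'='{role_arn(account_id, role_name)}'"


# Placeholder account ID; replace it with the ID of your Alibaba Cloud account.
print(rolearn_property("1234567890123456", "unload2oss"))
```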
Use a built-in extractor to export data
- Syntax
unload from {<select_statement>|<table_name> [partition (<pt_spec>)]}
into location <external_location>
[stored by <StorageHandler>]
[with serdeproperties ('<property_name>'='<property_value>',...)];
- Parameters
- select_statement: a SELECT clause. This clause queries the source table for the data that is to be written to the destination OSS directory. The source table can be a partitioned table or a non-partitioned table. For more information about SELECT clauses, see SELECT syntax.
- table_name and pt_spec: You can use the table name or the combination of the table and partition names to specify the data that you want to export. This export method does not automatically generate query statements. Therefore, no fees are incurred. The value of pt_spec is in the (partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...) format.
- external_location: required. The destination OSS directory to which you want to export data. The value of this parameter is in the 'oss://<oss_endpoint>/<object>' format. For more information about OSS directories, see OSS domain names.
- StorageHandler: required. The name of the storage handler that serves as the built-in extractor. Set this parameter to com.aliyun.odps.CsvStorageHandler or com.aliyun.odps.TsvStorageHandler. The storage handler processes CSV and TSV files and defines how data is read from or written to these files. You only need to specify this parameter based on your business requirements. The related logic is implemented by the system. If you use a built-in extractor to export data, the file name extension .csv or .tsv is automatically added to the exported files. You can use this parameter in the same way as you use it for MaxCompute external tables. For more information, see Access OSS data by using a built-in extractor.
- '<property_name>'='<property_value>': optional. property_name specifies the name of a property and property_value specifies the value of a property. You can use this clause in the same way as you use it for MaxCompute external tables. For more information about the properties, see Access OSS data by using a built-in extractor.
- Examples
This section demonstrates how to export data of the sale_detail table from a MaxCompute project to OSS. The sale_detail table contains the following data:
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
| null       | c5          | NULL        | 2014       | shanghai   |
| s6         | c6          | 100.4       | 2014       | shanghai   |
| s7         | c7          | 100.5       | 2014       | shanghai   |
+------------+-------------+-------------+------------+------------+
- Log on to the OSS console and create the directory mc-unload/data_location/ in the OSS bucket in the oss-cn-hangzhou region and organize the OSS directory. For more information about how to create an OSS bucket, see Create buckets.
The following OSS directory is organized based on the bucket, region, and endpoint:
oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location
- Log on to the MaxCompute client and execute the UNLOAD statement to export data of the sale_detail table to OSS.
The following examples are provided:
- Example 1: Export the data of the sale_detail table as a CSV file and package the file into a GZIP file. Sample statements:
-- Control the number of exported files: Set the size of data of the MaxCompute table read by a single worker. Unit: megabytes.
-- The MaxCompute table is compressed before you export it. The size of the exported data is about four times the data size before the export.
set odps.stage.mapper.split.size=256;
-- Export data.
unload from
(select * from sale_detail)
into location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
stored by 'com.aliyun.odps.CsvStorageHandler'
with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss', 'odps.text.option.gzip.output.enabled'='true');
-- The preceding statements are equivalent to the following statements:
set odps.stage.mapper.split.size=256;
unload from sale_detail
into location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
stored by 'com.aliyun.odps.CsvStorageHandler'
with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss', 'odps.text.option.gzip.output.enabled'='true');
- Example 2: Export data from the partition (sale_date='2013', region='china') in the sale_detail table as a TSV file to OSS and package the file into a GZIP file. Sample statements:
-- Control the number of exported files: Set the size of data of the MaxCompute table read by a single worker. Unit: megabytes.
-- The MaxCompute table is compressed before you export it. The size of the exported data is about four times the data size before the export.
set odps.stage.mapper.split.size=256;
-- Export data.
unload from sale_detail partition (sale_date='2013',region='china')
into location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
stored by 'com.aliyun.odps.TsvStorageHandler'
with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss', 'odps.text.option.gzip.output.enabled'='true');
'odps.text.option.gzip.output.enabled'='true' specifies that the exported file is compressed into a GZIP file. Only the GZIP format is supported.
- Log on to the OSS console to view the export result in the destination OSS directory.
- Export result of Example 1
- Export result of Example 2
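The comments in the sample statements above tie the number of exported files to odps.stage.mapper.split.size and note that exported data is roughly four times the size of the compressed table data. A rough estimate can be sketched as follows. The assumption that each worker writes exactly one file is ours and is not stated in this topic, the function name is hypothetical, and the estimate applies before any GZIP compression of the output.

```python
import math


def estimate_export(compressed_mb: float, split_size_mb: int = 256, expansion: float = 4.0):
    """Estimate (file_count, approx_mb_per_file) for an UNLOAD job.

    Assumption: each worker reads up to split_size_mb of compressed table data
    and writes one output file. expansion is the rough 4x figure quoted in the
    sample statement comments.
    """
    workers = max(1, math.ceil(compressed_mb / split_size_mb))
    per_file_mb = compressed_mb / workers * expansion
    return workers, per_file_mb


# A table that occupies 10 GB (10240 MB) in MaxCompute, with the split size
# used in the examples.
files, size = estimate_export(10240)
print(files, size)
```

Raising the split size reduces the estimated number of files and makes each file larger. Treat the result as a starting point and check the actual output in the OSS console.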
Export data in another open source format
- Syntax
unload from {<select_statement>|<table_name> [partition (<pt_spec>)]}
into location <external_location>
[row format serde '<serde_class>'
  [with serdeproperties ('<property_name>'='<property_value>',...)]
]
stored as <file_format>
[properties('<tbproperty_name>'='<tbproperty_value>')];
- Parameters
- select_statement: a SELECT clause. This clause queries the source table for the data that is to be written to the destination OSS directory. The source table can be a partitioned table or a non-partitioned table. For more information about SELECT clauses, see SELECT syntax.
- table_name and pt_spec: You can use the table name or the combination of the table and partition names to specify the data that you want to export. This export method does not automatically generate query statements. Therefore, no fees are incurred. The value of pt_spec is in the (partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...) format.
- external_location: required. The destination OSS directory to which you want to export data. The value of this parameter is in the 'oss://<oss_endpoint>/<object>' format. For more information about OSS directories, see OSS domain names.
- serde_class: optional. You can use this clause in the same way as you use it for MaxCompute external tables. For more information, see Open source data formats supported by OSS external tables.
- '<property_name>'='<property_value>': optional. property_name specifies the name of a property and property_value specifies the value of a property. The supported properties are the same as those supported by MaxCompute external tables. For more information about the properties, see Open source data formats supported by OSS external tables.
- file_format: required. The format of the exported file, such as ORC, PARQUET, RCFILE, SEQUENCEFILE, or TEXTFILE. You can use this parameter in the same way as you use it for MaxCompute external tables. For more information, see Open source data formats supported by OSS external tables.
- '<tbproperty_name>'='<tbproperty_value>': optional. tbproperty_name specifies the name of a property in the extended information of the external table. tbproperty_value specifies the value of a property in the extended information of the external table. For example, if you want to export data in an open source format as a file compressed by using Snappy or LZO, you can set the mcfed.parquet.compression property to SNAPPY or LZO.
- Examples
This section demonstrates how to export data of the sale_detail table from a MaxCompute project to OSS. The sale_detail table contains the following data:
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
| null       | c5          | NULL        | 2014       | shanghai   |
| s6         | c6          | 100.4       | 2014       | shanghai   |
| s7         | c7          | 100.5       | 2014       | shanghai   |
+------------+-------------+-------------+------------+------------+
- Log on to the OSS console and create the directory mc-unload/data_location/ in the OSS bucket in the oss-cn-hangzhou region and organize the OSS directory. For more information about how to create an OSS bucket, see Create buckets.
The following OSS directory is organized based on the bucket, region, and endpoint:
oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location
- Log on to the MaxCompute client and execute the UNLOAD statement to export data of the sale_detail table to OSS.
The following examples are provided:
- Example 1: Export data of the sale_detail table as a file in the PARQUET format and compress the file by using Snappy. Sample statements:
-- Control the number of exported files: Set the size of data of the MaxCompute table read by a single worker. Unit: megabytes.
-- The MaxCompute table is compressed before you export it. The size of the exported data is about four times the data size before the export.
set odps.stage.mapper.split.size=256;
-- Export data.
unload from
(select * from sale_detail)
into location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss')
stored as parquet
properties('mcfed.parquet.compression'='SNAPPY');
- Example 2: Export data from the partition (sale_date='2013', region='china') in the sale_detail table as a PARQUET file to OSS and compress the file by using Snappy. Sample statements:
-- Control the number of exported files: Set the size of data of the MaxCompute table read by a single worker. Unit: megabytes.
-- The MaxCompute table is compressed before you export it. The size of the exported data is about four times the data size before the export.
set odps.stage.mapper.split.size=256;
-- Export data.
unload from sale_detail partition (sale_date='2013',region='china')
into location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss')
stored as parquet
properties('mcfed.parquet.compression'='SNAPPY');
- Log on to the OSS console to view the export result in the destination OSS directory.
- Export result of Example 1
- Export result of Example 2
Note: If the exported data is compressed by using Snappy or LZO, the file name extension .snappy or .lzo is not displayed for the exported file.
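If you generate export jobs from a script, the two syntax forms in this topic can be assembled from their clauses. The following is a minimal sketch under stated assumptions: `build_unload` is a hypothetical helper, not part of any MaxCompute SDK. It performs no quoting or escaping of its inputs, and for the open source form it requires a SerDe class because the role ARN is passed through the nested WITH SERDEPROPERTIES clause.

```python
def build_unload(source, location, role_arn, *, storage_handler=None,
                 serde_class=None, file_format=None, properties=None):
    """Assemble an UNLOAD statement from the clauses described in this topic.

    Pass storage_handler for the built-in extractor form, or file_format
    (with serde_class) for the open source format form. Inputs are inserted
    as-is, so they must not contain single quotes.
    """
    if (storage_handler is None) == (file_format is None):
        raise ValueError("pass exactly one of storage_handler or file_format")
    serdeprops = f"with serdeproperties ('odps.properties.rolearn'='{role_arn}')"
    lines = [f"unload from {source}", f"into location '{location}'"]
    if storage_handler is not None:
        lines += [f"stored by '{storage_handler}'", serdeprops]
    else:
        if serde_class is None:
            raise ValueError("this sketch needs serde_class to carry the role ARN")
        lines += [f"row format serde '{serde_class}'", serdeprops,
                  f"stored as {file_format}"]
        if properties:
            pairs = ", ".join(f"'{k}'='{v}'" for k, v in properties.items())
            lines.append(f"properties({pairs})")
    return "\n".join(lines) + ";"


# Rebuilds the PARQUET export of the whole sale_detail table shown above.
print(build_unload(
    "(select * from sale_detail)",
    "oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location",
    "acs:ram::139699392458****:role/unload2oss",
    serde_class="org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
    file_format="parquet",
    properties={"mcfed.parquet.compression": "SNAPPY"},
))
```

The helper only returns text. Run the resulting statement on the MaxCompute client as described in the examples.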