MaxCompute allows you to export data from MaxCompute projects to Object Storage Service (OSS). This way, you can store structured data in OSS and allow other compute engines that can access OSS to use the exported data. This topic describes how to use UNLOAD statements to export data from MaxCompute to OSS in the CSV format or another open source format.

Prerequisites

  • OSS is activated.

    For more information about how to activate OSS, see Activate OSS.

  • You have the SELECT permission on the table that you want to export from a MaxCompute project.

    For more information about authorization, see Authorize users.

Limits

UNLOAD statements have the following limits:
  • MaxCompute automatically splits the exported data into multiple files in OSS and generates a name for each file.
  • You cannot customize the names or file name extensions of the exported files.
  • If you export data repeatedly, the previously exported files are not overwritten. Instead, new files are generated, as shown in the path example below.
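    For example, to keep the outputs of repeated exports separate, you can direct each run to its own destination subdirectory, such as oss://<oss_endpoint>/mc-unload/data_location/20240101/. The dated subdirectory is a hypothetical naming convention, not a requirement of the syntax.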

Usage notes

  • UNLOAD statements themselves are free of charge. However, the subquery clause in an UNLOAD statement must scan data and use computing resources to compute the result. Therefore, the subquery is billed as a standard SQL job.
  • In some scenarios, you can store structured data in OSS to reduce storage costs. However, you must estimate the costs in advance.

    The MaxCompute storage fee is USD 0.018 per GB per month. For more information about storage fees, see Storage pricing (pay-as-you-go). The data compression ratio is about 5:1 for the data that is imported into MaxCompute. You are charged based on the size of data after compression.

    If you use the Standard storage class of OSS to store your data, the unit price is USD 0.018 per GB per month. For more information about the fees for the Infrequent Access, Archive, and Cold Archive storage classes, see Storage fees.

    If you only want to reduce storage costs by exporting data, we recommend that you perform the following operations (see the worked example after this list):
    • Estimate the data compression ratio by testing data that has the same characteristics as your production data.
    • Estimate the costs of using UNLOAD statements based on the query statement used when you export data.
    • Evaluate the method for accessing the exported data to avoid extra costs caused by unnecessary data migration.
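
    For example, suppose that you import 1 TB of raw data into MaxCompute. At the approximate 5:1 compression ratio described above, the data occupies about 200 GB and incurs about 200 GB × USD 0.018 per GB per month = USD 3.6 per month in MaxCompute storage fees. Because the OSS Standard storage class has the same unit price, exporting the data reduces storage costs only if the exported files, such as GZIP-compressed CSV files, are smaller than 200 GB, or if you use a cheaper storage class such as Infrequent Access or Archive. The 1 TB figure is illustrative.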

Authorization for access to OSS

If you want to export data to OSS, you must authorize MaxCompute to access OSS. The authorization methods for UNLOAD statements are the same as those for MaxCompute external tables. You can use one of the following authorization methods:
  • Method 1 (recommended): Configure WITH SERDEPROPERTIES in the UNLOAD statement and authorize MaxCompute to access OSS by using a RAM role.
    1. Log on to the RAM console. In the left-side navigation pane, choose RAM Roles. On the page that appears, click Create RAM Role. In the Create RAM Role panel, select Alibaba Cloud Service as the trusted entity type. In the Configure Role step, set Role Type to Normal Service Role, enter a RAM Role Name such as unload2oss, and then select MaxCompute from the Select Trusted Service drop-down list.

      For more information, see Create a RAM role for a trusted Alibaba Cloud service.

    2. After the role is created, attach the system policy AliyunOSSFullAccess to the role.

      For more information, see Grant permissions to a RAM role.

  • Method 2: In an UNLOAD statement, specify the AccessKey ID and AccessKey secret for the destination OSS directory to authorize MaxCompute to access OSS.
    Note If you use this authorization method, the AccessKey ID and AccessKey secret appear in plaintext in the statement. This may pose security risks. We recommend that you do not use this method.
    You must specify the destination OSS directory in the following format. For more information, see OSS domain names.
    'oss://<AccessKey_Id>:<AccessKey_Secret>@<oss_endpoint>/<object>'

    oss_endpoint indicates the internal endpoint for the region where the destination OSS bucket resides. This endpoint is marked Classic Network Access from ECS (Internal Network) in the OSS console. To obtain the endpoint, perform the following operations: Log on to the OSS console. In the left-side navigation pane, choose Buckets. On the page that appears, click the name of the specified bucket in the Bucket column. Then, view the endpoint of the bucket in the Domain Names section on the Overview page.

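    For example, for a bucket that resides in the oss-cn-hangzhou region, the internal endpoint is oss-cn-hangzhou-internal.aliyuncs.com, and a complete destination directory looks like the following. The AccessKey pair in this example is a masked placeholder.
    'oss://LTAI4t********:D6Gk********@oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'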
After the authorization is complete, you must select one of the following export methods based on the format of the data that you want to export:

Use a built-in extractor or a storage handler to export data

  • Syntax
    unload from (<select_statement>)
    into 
    location <external_location>
    [stored by <StorageHandler>]
    [with serdeproperties (<Options>)];
  • Parameter description
    • select_statement: the SELECT clause that queries the data you want to export from the source table to the destination OSS directory. The source table can be a partitioned table or a non-partitioned table. For more information about SELECT statements, see SELECT syntax.
    • external_location: required. The destination OSS directory to which you want to export data. If you use Method 1, specify the value in the 'oss://<oss_endpoint>/<object>' format. If you use Method 2, specify the value in the 'oss://<AccessKey_Id>:<AccessKey_Secret>@<oss_endpoint>/<object>' format. For more information about OSS directories, see OSS domain names.
    • StorageHandler: required. The name of the built-in storage handler. Set the value to com.aliyun.odps.CsvStorageHandler, a built-in storage handler that defines how CSV files are read and written. The read and write logic is implemented by the system. By default, the file name extension .csv is added to the exported files. You use this parameter in the same way as for a MaxCompute external table. For more information, see Access OSS data by using the built-in extractor.
    • Options: optional. This parameter specifies the properties of the export. The supported properties are the same as those of MaxCompute external tables. For more information about the properties, see Access OSS data by using the built-in extractor. See also the sketch below.
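      The following sketch shows one additional text option. The property name odps.text.option.null.indicator is taken from the built-in extractor topic and is an assumption here; verify it in that topic before use:
      unload from
      ( select * from sale_detail )
      into
      location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
      stored by 'com.aliyun.odps.CsvStorageHandler'
      -- Assumed option: write the string NULL for null values instead of an empty field.
      with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss', 'odps.text.option.null.indicator'='NULL');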
  • Example
    This example demonstrates how to export data of the sale_detail table from a MaxCompute project to OSS. The sale_detail table contains the following data:
    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    | s1         | c1          | 100.1       | 2013       | china      |
    | s2         | c2          | 100.2       | 2013       | china      |
    | s3         | c3          | 100.3       | 2013       | china      |
    | null       | c5          | NULL        | 2014       | shanghai   |
    | s6         | c6          | 100.4       | 2014       | shanghai   |
    | s7         | c7          | 100.5       | 2014       | shanghai   |
    +------------+-------------+-------------+------------+------------+
    1. Log on to the OSS console, and create the directory mc-unload/data_location/ in an OSS bucket that resides in the oss-cn-hangzhou region. Then, organize the OSS directory path. For more information about how to create an OSS bucket, see Create buckets.
      You must organize the path to the OSS directory based on the bucket, region, and endpoint. The following path is an example:
      oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location
    2. Log on to the MaxCompute client and execute the UNLOAD statement to export data of the sale_detail table as a CSV file to OSS. Sample statement:
      unload from
      ( select * from sale_detail )
      into
      location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
      stored by 'com.aliyun.odps.CsvStorageHandler'
      with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss', 'odps.text.option.gzip.output.enabled'='true'); 

      'odps.text.option.gzip.output.enabled'='true' specifies that the exported files are compressed by using GZIP. Only the GZIP format is supported. In this case, the exported files typically carry the .csv.gz file name extension.

    3. Log on to the OSS console to view the export result in the destination OSS directory.

Export data in another open source format

  • Syntax
    unload from (<select_statement>)
    into 
    location <external_location>
    [row format serde '<serde_class>'
      [with serdeproperties (<Options>)]
    ]
    stored as <file_format>;
  • Parameter description
    • select_statement: the SELECT clause that queries the data you want to export from the source table to the destination OSS directory. The source table can be a partitioned table or a non-partitioned table. For more information about SELECT statements, see SELECT syntax.
    • external_location: required. The destination OSS directory to which you want to export data. If you use Method 1, specify the value in the 'oss://<oss_endpoint>/<object>' format. If you use Method 2, specify the value in the 'oss://<AccessKey_Id>:<AccessKey_Secret>@<oss_endpoint>/<object>' format. For more information about OSS directories, see OSS domain names.
    • serde_class: optional. You can use this clause in the same way as you use it for a MaxCompute external table. For more information, see Open source data formats supported by OSS external tables.
      Note File name extensions cannot be added to the files that are exported in an open source format.
      Mappings between file formats and SerDe classes:
      • SEQUENCEFILE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      • TEXTFILE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      • RCFILE: org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
      • ORC: org.apache.hadoop.hive.ql.io.orc.OrcSerde
      • ORCFILE: org.apache.hadoop.hive.ql.io.orc.OrcSerde
      • PARQUET: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
      • AVRO: org.apache.hadoop.hive.serde2.avro.AvroSerDe
    • Options: required. This parameter specifies the properties of the export. The supported properties are the same as those of MaxCompute external tables. For more information, see Open source data formats supported by OSS external tables.
    • file_format: required. The format of the data that you want to export, such as ORC, PARQUET, RCFILE, SEQUENCEFILE, or TEXTFILE. You use this parameter in the same way as for a MaxCompute external table. For more information, see Open source data formats supported by OSS external tables. See also the TEXTFILE sketch below.
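      The following sketch combines TEXTFILE with LazySimpleSerDe to export tab-delimited text files. The field.delim property is a standard LazySimpleSerDe property and is an assumption in this context; verify it against the linked topic:
      unload from
      ( select * from sale_detail )
      into
      location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
      row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      -- Assumed property: field.delim sets the column separator; \t produces tab-delimited files.
      with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss', 'field.delim'='\t')
      stored as textfile;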
  • Example
    This example demonstrates how to export data of the sale_detail table from a MaxCompute project to OSS. The sale_detail table contains the following data:
    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    | s1         | c1          | 100.1       | 2013       | china      |
    | s2         | c2          | 100.2       | 2013       | china      |
    | s3         | c3          | 100.3       | 2013       | china      |
    | null       | c5          | NULL        | 2014       | shanghai   |
    | s6         | c6          | 100.4       | 2014       | shanghai   |
    | s7         | c7          | 100.5       | 2014       | shanghai   |
    +------------+-------------+-------------+------------+------------+
    1. Log on to the OSS console, and create the directory mc-unload/data_location/ in an OSS bucket that resides in the oss-cn-hangzhou region. Then, organize the OSS directory path. For more information about how to create an OSS bucket, see Create buckets.
      You must organize the path to the OSS directory based on the bucket, region, and endpoint. The following path is an example:
      oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location
    2. Log on to the MaxCompute client and execute the UNLOAD statement to export data of the sale_detail table as a PARQUET file to OSS. Sample statement:
      unload from
      ( select * from sale_detail )
      into 
      location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location' 
      row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
      with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss') 
      stored as parquet;
    3. Log on to the OSS console to view the export result in the destination OSS directory.
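
To verify the exported files, you can map an OSS external table onto the destination directory and query it from MaxCompute. The following is a minimal sketch: the table name sale_detail_unload_check is hypothetical, and the column types are assumptions based on the sample data shown above. For the exact external table syntax, see Open source data formats supported by OSS external tables.
create external table if not exists sale_detail_unload_check
(
  shop_name   string,
  customer_id string,
  total_price double,
  sale_date   string,
  region      string
)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss')
stored as parquet
location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location/';
-- Query the external table to check the exported rows.
select * from sale_detail_unload_check;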