MaxCompute allows you to export data from MaxCompute projects to Object Storage Service (OSS). This provides an easy way to store structured data in OSS and makes the exported data available to other compute engines. This topic describes how to use UNLOAD statements to export data from MaxCompute to OSS in the CSV format or another open source format.

Prerequisites

  • OSS is activated.

    For more information about how to activate OSS, see Activate OSS.

  • You have the SELECT permission on the table that you want to export from a MaxCompute project.

    For more information about authorization, see Permissions.

Limits

The use of UNLOAD statements has the following limits:
  • MaxCompute automatically splits the data that is exported to OSS into multiple files and generates a name for each file. You cannot customize the names or file name extensions of the exported files.
  • File name extensions cannot be added to files that are exported in an open source format.
  • If you repeatedly export data, the previously exported files are not overwritten. Instead, new files are generated.

Precautions

  • You are not charged for UNLOAD statements themselves. However, the subquery clauses in UNLOAD statements must scan data and use computing resources to calculate the results. Therefore, the subquery clauses are charged as common SQL jobs.
  • In some scenarios, you can store structured data in OSS to reduce storage costs. However, you must estimate the costs in advance.

    The storage fee for MaxCompute is USD 0.018 per GB per month. For more information about storage fees, see Storage pricing (pay-as-you-go). The data compression ratio for data that is imported into MaxCompute is about 5:1. You are charged based on the size of data after compression.

    If you use the Standard storage class of OSS to store your data, the unit price is USD 0.018 per GB per month. For more information about the fees for the Infrequent Access (IA), Archive, and Cold Archive storage classes, see Storage fees.

    If you want to export data to reduce storage costs, we recommend that you: (1) evaluate the data compression ratio based on a test with your actual data; (2) estimate the costs of the UNLOAD statements based on the query statement that you use to export data; (3) evaluate how the exported data will be accessed, to avoid extra costs caused by unnecessary data migration. A worked estimate and a cost-check sketch are shown below.
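    As an illustrative estimate based on the figures above (your actual compression ratio will differ): 1 TB of raw data at a 5:1 compression ratio occupies about 200 GB in MaxCompute and costs about 200 GB × USD 0.018 = USD 3.6 per month. The exported data is about four times the compressed size, so the same data occupies roughly 800 GB in OSS and costs about USD 14.4 per month in the Standard storage class. Savings therefore typically come from the IA, Archive, or Cold Archive storage classes. To estimate the compute cost of a subquery before you export data, you can run a check like the following on the MaxCompute client, assuming that your client version supports the cost sql command. The sale_detail table is the sample table used in the examples in this topic.

      -- Estimate the amount of data that the export subquery would scan,
      -- without running the actual UNLOAD statement.
      cost sql select * from sale_detail;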

Authorization for access to OSS

If you want to export data to OSS, you must grant MaxCompute the permissions to access OSS. The authorization methods for UNLOAD statements are the same as those for MaxCompute external tables: you configure the properties specified by WITH SERDEPROPERTIES in UNLOAD statements and use a RAM role to complete OSS authorization, as shown in the sketch after the following steps. To grant MaxCompute the permissions to access OSS, perform the following steps:
  1. Log on to the RAM console. In the left-side navigation pane, click Roles. On the page that appears, click Create Role. In the Create Role panel, select Alibaba Cloud Service for Select Trusted Entity in the Select Role Type step and click Next. In the Configure Role step, select Normal Service Role for Role Type, specify RAM Role Name, such as unload2oss, and then select MaxCompute from the Select Trusted Service drop-down list.

    For more information, see Create a RAM role for a trusted Alibaba Cloud service.

  2. After the role is created, attach the system policy AliyunOSSFullAccess to the role.

    For more information, see Grant permissions to a RAM role.

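The RAM role is referenced in UNLOAD statements through the odps.properties.rolearn property. The following minimal sketch shows the form of the role Alibaba Cloud Resource Name (ARN); the placeholders for the account ID, table, and OSS directory are illustrative, and complete examples appear in the following sections.

    unload from (select * from <table_name>)
    into
    location 'oss://<oss_endpoint>/<object>'
    stored by 'com.aliyun.odps.CsvStorageHandler'
    with serdeproperties ('odps.properties.rolearn'='acs:ram::<account_id>:role/unload2oss');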
After the authorization is complete, you must select one of the following export methods based on the format of the data that you want to export:

Use a built-in extractor to export data

  • Syntax
    unload from {<select_statement>|<table_name> [partition (<pt_spec>)]} 
    into 
    location <external_location>
    [stored by <StorageHandler>]
    [with serdeproperties ('<property_name>'='<property_value>',...)];
  • Parameters
    • select_statement: a SELECT clause. This clause is used to query the data that needs to be inserted into a table in the destination OSS directory from the source table. The source table can be a partitioned table or a non-partitioned table. For more information about SELECT clauses, see SELECT syntax.
    • table_name and pt_spec: You can use the table name or the combination of the table and partition names to specify the data that you want to export. This export method does not automatically generate query statements. Therefore, no fees are incurred. The value of pt_spec is in the (partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...) format.
    • external_location: required. The destination OSS directory to which you want to export data. The value of this parameter is in the 'oss://<oss_endpoint>/<object>' format. For more information about OSS directories, see OSS domain names.
    • StorageHandler: required. The name of the storage handler, which acts as a built-in extractor. Set this parameter to com.aliyun.odps.CsvStorageHandler or com.aliyun.odps.TsvStorageHandler. The storage handler processes CSV and TSV files and defines how to read data from and write data to these files. You need to specify only this parameter; the related logic is implemented by the system. If you use a built-in extractor to export data, the file name extension .csv or .tsv is automatically added to the exported files. You can use this parameter in the same way as you use it for MaxCompute external tables. For more information, see Create an OSS external table.
    • '<property_name>'='<property_value>': optional. property_name specifies the name of a property and property_value specifies the value of a property. You can use this clause in the same way as you use it for MaxCompute external tables. For more information about properties, see Create an OSS external table.
  • Examples:
    This section demonstrates how to export data of the sale_detail table from a MaxCompute project to OSS. The sale_detail table contains the following data:
    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    | s1         | c1          | 100.1       | 2013       | china      |
    | s2         | c2          | 100.2       | 2013       | china      |
    | s3         | c3          | 100.3       | 2013       | china      |
    | null       | c5          | NULL        | 2014       | shanghai   |
    | s6         | c6          | 100.4       | 2014       | shanghai   |
    | s7         | c7          | 100.5       | 2014       | shanghai   |
    +------------+-------------+-------------+------------+------------+
    1. Log on to the OSS console, create the directory mc-unload/data_location/ in an OSS bucket in the oss-cn-hangzhou region, and organize the OSS directory as shown below. For more information about how to create an OSS bucket, see Create buckets.
      The following OSS directory is organized based on the bucket, region, and endpoint.
      oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location
    2. Log on to the MaxCompute client and execute the UNLOAD statement to export data of the sale_detail table to OSS. The following examples are provided:
      • Example 1: Export the data of the sale_detail table as a CSV file compressed by using GZIP. Sample statements:
        -- Control the number of exported files: set the size, in MB, of the MaxCompute table data that a single worker reads. The MaxCompute table is stored compressed, and the size of the exported data is about four times the size before export. 
        set odps.stage.mapper.split.size=256;
        -- Export data. 
        unload from
        (select * from sale_detail)
        into
        location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
        stored by 'com.aliyun.odps.CsvStorageHandler'
        with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss', 'odps.text.option.gzip.output.enabled'='true');
        -- The preceding statements are equivalent to the following statements: 
        set odps.stage.mapper.split.size=256;
        unload from sale_detail 
        into
        location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
        stored by 'com.aliyun.odps.CsvStorageHandler'
        with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss', 'odps.text.option.gzip.output.enabled'='true');                                 
      • Example 2: Export data from the partition (sale_date='2013', region='china') in the sale_detail table as a TSV file compressed by using GZIP. Sample statements:
        -- Control the number of exported files: set the size, in MB, of the MaxCompute table data that a single worker reads. The MaxCompute table is stored compressed, and the size of the exported data is about four times the size before export. 
        set odps.stage.mapper.split.size=256;
        -- Export data. 
        unload from sale_detail partition (sale_date='2013',region='china')
        into
        location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location'
        stored by 'com.aliyun.odps.TsvStorageHandler'
        with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss', 'odps.text.option.gzip.output.enabled'='true');

      'odps.text.option.gzip.output.enabled'='true' specifies that the exported file is compressed as a GZIP file. Only the GZIP compression format is supported.

    3. Log on to the OSS console to view the export result in the destination OSS directory.
      • Export result of Example 1
      • Export result of Example 2

Export data in another open source format

  • Syntax
    unload from {<select_statement>|<table_name> [partition (<pt_spec>)]}
    into 
    location <external_location>
    [row format serde '<serde_class>'
      [with serdeproperties ('<property_name>'='<property_value>',...)]
    ]
    stored as <file_format>
    [properties('<tbproperty_name>'='<tbproperty_value>')];
  • Parameters
    • select_statement: a SELECT clause. This clause is used to query the data that needs to be inserted into a table in the destination OSS directory from the source table. The source table can be a partitioned table or a non-partitioned table. For more information about SELECT clauses, see SELECT syntax.
    • table_name and pt_spec: You can use the table name or the combination of the table and partition names to specify the data that you want to export. This export method does not automatically generate query statements. Therefore, no fees are incurred. The value of pt_spec is in the (partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...) format.
    • external_location: required. The destination OSS directory to which you want to export data. The value of this parameter is in the 'oss://<oss_endpoint>/<object>' format. For more information about OSS directories, see OSS domain names.
    • serde_class: optional. You can use this clause in the same way as you use it for MaxCompute external tables. For more information, see Create an OSS external table.
    • '<property_name>'='<property_value>': optional. property_name specifies the name of a property. property_value specifies the value of a property. You can use this clause in the same way as you use it for MaxCompute external tables. For more information about properties, see Create an OSS external table.
    • file_format: required. The format of the exported file, such as ORC, PARQUET, RCFILE, SEQUENCEFILE, or TEXTFILE. You can use this clause in the same way as you use it for MaxCompute external tables. For more information, see Create an OSS external table.
    • '<tbproperty_name>'='<tbproperty_value>': optional. tbproperty_name specifies the name of a property in the extended information of the external table. tbproperty_value specifies the value of a property in the extended information of the external table. For example, if you want to export data in an open source format as a file compressed by using Snappy or LZO, you can set the mcfed.parquet.compression property to SNAPPY or LZO.
  • Examples:
    This section demonstrates how to export data of the sale_detail table from a MaxCompute project to OSS. The sale_detail table contains the following data:
    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    | s1         | c1          | 100.1       | 2013       | china      |
    | s2         | c2          | 100.2       | 2013       | china      |
    | s3         | c3          | 100.3       | 2013       | china      |
    | null       | c5          | NULL        | 2014       | shanghai   |
    | s6         | c6          | 100.4       | 2014       | shanghai   |
    | s7         | c7          | 100.5       | 2014       | shanghai   |
    +------------+-------------+-------------+------------+------------+
    1. Log on to the OSS console, create the directory mc-unload/data_location/ in an OSS bucket in the oss-cn-hangzhou region, and organize the OSS directory as shown below. For more information about how to create an OSS bucket, see Create buckets.
      The following OSS directory is organized based on the bucket, region, and endpoint.
      oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location
    2. Log on to the MaxCompute client and execute the UNLOAD statement to export data of the sale_detail table to OSS. The following examples are provided:
      • Example 1: Export data of the sale_detail table as a PARQUET file compressed by using Snappy. Sample statements:
        -- Control the number of exported files: set the size, in MB, of the MaxCompute table data that a single worker reads. The MaxCompute table is stored compressed, and the size of the exported data is about four times the size before export. 
        set odps.stage.mapper.split.size=256;
        -- Export data. 
        unload from
        (select * from sale_detail)
        into 
        location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location' 
        row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
        with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss') 
        stored as parquet 
        properties('mcfed.parquet.compression'='SNAPPY');
      • Example 2: Export data from the partition (sale_date='2013', region='china') in the sale_detail table as a PARQUET file to OSS and compress the file by using Snappy. Sample statements:
        -- Control the number of exported files: set the size, in MB, of the MaxCompute table data that a single worker reads. The MaxCompute table is stored compressed, and the size of the exported data is about four times the size before export. 
        set odps.stage.mapper.split.size=256;
        -- Export data. 
        unload from sale_detail partition (sale_date='2013',region='china') 
        into 
        location 'oss://oss-cn-hangzhou-internal.aliyuncs.com/mc-unload/data_location' 
        row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
        with serdeproperties ('odps.properties.rolearn'='acs:ram::139699392458****:role/unload2oss') 
        stored as parquet 
        properties('mcfed.parquet.compression'='SNAPPY');
    3. Log on to the OSS console to view the export result in the destination OSS directory.
      • Export result of Example 1
      • Export result of Example 2
      Note: If the exported data is compressed by using Snappy or LZO, the exported files do not display the .snappy or .lzo file name extension.

Specify the prefix and suffix of the exported file

In specific business scenarios, when you use UNLOAD statements to export a MaxCompute table as files, you must specify the prefix and suffix of the exported files. You can perform the following operations to customize the file prefix and the default suffix that corresponds to the file format.

  • Syntax
    • Use a built-in extractor to export a CSV or TSV file.
      -- Use a built-in extractor to export a CSV or TSV file.
      unload from {<select_statement>|<table_name> [partition (<pt_spec>)]}
      into
      location <external_location>
      [stored by <StorageHandler>]
      [with serdeproperties ('<property_name>'='<property_value>',...)];
      • To add a custom prefix, set property_name to odps.external.data.prefix and set the value to the prefix that you want to use. The value can be up to 10 characters in length.
      • To add a file name extension, set property_name to odps.external.data.enable.extension. If the value is true, a suffix that corresponds to the file format is added to the exported files.
      • For more information about other parameters, see Use a built-in extractor to export data.
    • Export a file in an open source format, such as ORC or Parquet.
      unload from {<select_statement>|<table_name> [partition (<pt_spec>)]}
      into
      location <external_location>
      [row format serde '<serde_class>'
        [with serdeproperties ('<property_name>'='<property_value>',...)]
      ]
      stored as <file_format>
      [properties('<tbproperty_name>'='<tbproperty_value>')];
      • To add a custom prefix, set tbproperty_name to odps.external.data.prefix and set the value to the prefix that you want to use. The value can be up to 10 characters in length.
      • To add a file name extension, set tbproperty_name to odps.external.data.enable.extension. If the value is true, a suffix that corresponds to the file format is added to the exported files.
      • For more information about other parameters, see Export data in another open source format.
  • Suffix description
    +--------------+----------------------------------------------------------------+---------------+
    | File format  | SerDe                                                          | Suffix        |
    +--------------+----------------------------------------------------------------+---------------+
    | SEQUENCEFILE | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe             | .sequencefile |
    | TEXTFILE     | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe             | .txt          |
    | RCFILE       | org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe | .rcfile       |
    | ORC          | org.apache.hadoop.hive.ql.io.orc.OrcSerde                      | .orc          |
    | PARQUET      | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    | .parquet      |
    | AVRO         | org.apache.hadoop.hive.serde2.avro.AvroSerDe                   | .avro         |
    | JSON         | org.apache.hive.hcatalog.data.JsonSerDe                        | .json         |
    | CSV          | org.apache.hadoop.hive.serde2.OpenCSVSerde                     | .csv          |
    +--------------+----------------------------------------------------------------+---------------+
  • Examples
    • Export a TXT file and add the mf_ prefix and a suffix.
      set odps.sql.hive.compatible=true;
      set odps.sql.split.hive.bridge=true;
      unload from (select col_tinyint,col_binary from mf_fun_datatype limit 1)
      into
      location 'oss://oss-cn-beijing-internal.aliyuncs.com/mfosscostfee/demo6/'
      stored as textfile
      properties ('odps.external.data.prefix'='mf_', 'odps.external.data.enable.extension'='true');
      View the export result in the destination OSS directory.
    • Export a CSV file and add the mf_ prefix and a suffix.
      set odps.sql.hive.compatible=true;
      set odps.sql.split.hive.bridge=true;
      unload from (select col_tinyint,col_binary from mf_fun_datatype limit 2)
      into
      location 'oss://oss-cn-beijing-internal.aliyuncs.com/mfosscostfee/demo6/'
      stored by 'com.aliyun.odps.CsvStorageHandler'
      properties ('odps.external.data.prefix'='mf_', 'odps.external.data.enable.extension'='true');
      View the export result in the destination OSS directory.
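    • Export an ORC file and add the mf_ prefix and a suffix. This is a hedged sketch rather than an example from the original set: it reuses the mf_fun_datatype table and the OSS directory from the preceding examples and assumes that the same flags apply. Based on the suffix table above, the exported files carry the mf_ prefix and the .orc suffix.
      set odps.sql.hive.compatible=true;
      set odps.sql.split.hive.bridge=true;
      unload from (select col_tinyint,col_binary from mf_fun_datatype limit 1)
      into
      location 'oss://oss-cn-beijing-internal.aliyuncs.com/mfosscostfee/demo6/'
      stored as orc
      properties ('odps.external.data.prefix'='mf_', 'odps.external.data.enable.extension'='true');
      View the export result in the destination OSS directory.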