ApsaraDB for SelectDB: Export data by using the SELECT INTO OUTFILE statement

Last Updated: Apr 10, 2025

Alibaba Cloud allows you to use the SELECT INTO OUTFILE statement to export data from ApsaraDB for SelectDB instances and generate files. You can store the files in remote storage systems such as Object Storage Service (OSS) and Hadoop Distributed File System (HDFS) over the Amazon Simple Storage Service (Amazon S3) and HDFS protocols.

Overview

The SELECT INTO OUTFILE statement is used to export query results to files. You can store the files in remote storage systems such as OSS and HDFS over the Amazon S3 and HDFS protocols.

The SELECT INTO OUTFILE statement is executed in a synchronous manner. The execution ends when the results are returned, and each row in the returned results shows the result of one data export request. Because the statement is executed synchronously, you cannot always confirm whether a data export job is complete as expected or whether all queried data is exported. In this case, you can specify the success_file_name parameter to generate an identifier file in the directory in which the exported files are stored after the job succeeds. You can check for this file to determine whether the data export job is complete as expected.
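For reference, the following sketch shows how the success_file_name parameter might be specified. The table name, destination path, credential values, and the identifier file name SUCCESS are placeholders.

SELECT * FROM tbl1
INTO OUTFILE "s3://my_bucket/export/result_"
FORMAT AS CSV
PROPERTIES
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou",
    -- Generate an identifier file named SUCCESS in the export directory after the job succeeds.
    "success_file_name" = "SUCCESS"
);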

This statement is executed by performing an SQL query. By default, only one thread is used during the execution. The duration of an entire data export process includes the time consumed to perform the query and the time consumed to return the result set. If you want to query a large amount of data, configure the query_timeout session variable to extend the timeout period for the query.
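For example, the following statement extends the query timeout period for the current session. The value of 3600 seconds is only an illustration.

SET query_timeout = 3600;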

This statement does not check whether the exported files and file path exist. In addition, this statement does not automatically create a path or overwrite an existing file. The preceding operations are decided by the semantics of the destination remote storage system. ApsaraDB for SelectDB does not manage the exported files. You need to manage the files generated by successful data export jobs and residual files generated by failed data export jobs in remote storage systems.

Usage method

Syntax

query_stmt
INTO OUTFILE "file_path"
[format_as]
[properties]

Request parameters

  • file_path: The path in which the exported files are stored and the prefix of the exported files. Examples: s3://bucket_name/to/my_file_ and hdfs://path/to/my_file_.

    If the prefix is my_file_, each exported file name consists of the my_file_ prefix, a file sequence number, and a suffix that indicates the file format. The sequence number starts from 0 and increments by 1 for each segment file. Examples:

    my_file_abcdefg_0.csv
    my_file_abcdefg_1.csv
    my_file_abcdefg_2.csv

    You can also omit the file name prefix and specify only the file directory. Example: hdfs://path/to/.

  • format_as: The format of the exported files. Valid values: CSV, PARQUET, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, and ORC. Default value: CSV.

  • properties: The properties that you can specify based on your business requirements. For more information, see the Syntax of the properties object section of this topic.

You can export data over the Amazon S3 or HDFS protocol.

Note
  • Data of basic types can be exported to files of all supported types.

  • Data of the ARRAY, MAP, and STRUCT types can be exported to files of the following types: CSV, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, and ORC.

  • Data of nested complex types cannot be exported.

Syntax of the properties object

The following sample code shows the syntax of the properties object. You can specify properties related to the exported files, the HDFS protocol, and the Amazon S3 protocol.

[PROPERTIES ("key"="value", ...)]
  • File-related properties (for a usage example, see the sketch after this list)

    column_separator: The column delimiter. This property applies only to files of the CSV type.

    line_delimiter: The row delimiter. This property applies only to files of the CSV type.

    max_file_size: The maximum size of a single file. If the size of the query result exceeds this value, multiple segment files are generated. Valid values: [5 MB, 2 GB]. Default value: 1 GB. If the exported files are of the ORC type, the maximum size of a segment file is rounded up to a multiple of 64 MB. For example, if you set the max_file_size parameter to 5 MB, the maximum size of a segment file is 64 MB. If you set the max_file_size parameter to 65 MB, the maximum size of a segment file is 128 MB.

    delete_existing_files: Default value: false. If you set this parameter to true, all files and directories in the directory specified by the file_path parameter are deleted before data is exported to that directory. For example, if you set the file_path parameter to /user/, all files and directories under the /user/ directory are deleted. If you set the file_path parameter to /user/tmp/, all files and directories under the /user/tmp/ directory are deleted.

    Warning: Risks may occur if you set the delete_existing_files parameter to true. We recommend that you perform this operation only in a test environment. To specify the delete_existing_files parameter, submit a ticket to contact Alibaba Cloud technical support. The delete_existing_files parameter takes effect only after the enable_delete_existing_files parameter is set to true in the fe.conf file and the frontend is restarted.

    file_suffix: The suffix of the exported files. If this parameter is not specified, the default suffix of the specified file format is used.

  • HDFS-related properties

    fs.defaultFS: The NameNode endpoint.

    hadoop.username: The username used to log on to HDFS.

    dfs.nameservices: The name service ID. Set this property to the same value as in the hdfs-site.xml file.

    dfs.ha.namenodes.[nameservice ID]: The IDs of the NameNodes in the name service. Set this property to the same value as in the hdfs-site.xml file.

    dfs.namenode.rpc-address.[nameservice ID].[name node ID]: The Remote Procedure Call (RPC) address of each NameNode. Set this property to the same value as in the hdfs-site.xml file.

    dfs.client.failover.proxy.provider.[nameservice ID]: The Java class that the HDFS client uses to connect to the active NameNode. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.

    If Kerberos authentication is enabled for the Hadoop cluster, you must also specify the following properties:

    dfs.namenode.kerberos.principal: The Kerberos principal name of the HDFS NameNode.

    hadoop.security.authentication: The authentication method. Set this property to kerberos to enable Kerberos authentication.

    hadoop.kerberos.principal: The principal used for Kerberos authentication.

    hadoop.kerberos.keytab: The path to the Kerberos keytab file.

  • Amazon S3-related properties

    The following properties apply to storage systems that support the Amazon S3 protocol, such as Amazon S3 and OSS.

    s3.endpoint: Required. The endpoint of the destination storage system.

    s3.access_key: Required. The access key used to authenticate requests to the destination storage system.

    s3.secret_key: Required. The secret key used to authenticate requests to the destination storage system. Type: STRING.

    s3.region: Required. The region in which the destination storage system resides.

    s3.session_token: The session token used for temporary session credentials. This property is required if temporary session authentication is enabled.

    use_path_style: Optional. Specifies whether to access the destination storage system in the path style. Default value: false. By default, Amazon S3 SDKs access the destination storage system in the virtual-hosted style. However, some object storage systems may not support the virtual-hosted style. In this case, you can set the use_path_style parameter to true to force access in the path style.

    Note: The following URI schemes are supported: http://, https://, and s3://.

    1. If you use http:// or https://, the use_path_style parameter determines whether the destination storage system is accessed in the path style.

    2. If you use s3://, the destination storage system is accessed in the virtual-hosted style.
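The following sketch shows how the file-related properties might be combined with the HDFS-related properties in a CSV export. The table name, destination path, and property values are placeholders.

SELECT * FROM tbl1
INTO OUTFILE "hdfs://path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "fs.defaultFS" = "hdfs://ip:port",
    "hadoop.username" = "work",
    -- File-related properties: comma-separated columns, newline-terminated rows,
    -- and segment files of at most 100 MB each.
    "column_separator" = ",",
    "line_delimiter" = "\n",
    "max_file_size" = "100MB"
);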

Response parameters

The data is exported in a synchronous manner. The statement execution ends when the results are returned. Each row in the returned results shows the result of the data export.

  • Sample success response:

    SELECT * FROM tbl1 LIMIT 10 into outfile "file:///home/work/path/result_";
    +------------+-----------+----------+--------------------------------------------------------------------+
    | FileNumber | TotalRows | FileSize | URL                                                                |
    +------------+-----------+----------+--------------------------------------------------------------------+
    |          1 |         2 |        8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
    +------------+-----------+----------+--------------------------------------------------------------------+
    1 row in set (0.05 sec)

    The following list describes the response parameters:

    FileNumber: The number of generated files.

    TotalRows: The number of rows in the result set.

    FileSize: The total size of the exported files. Unit: bytes.

    URL: The compute node to which the data is exported if the data is exported to a local disk.

  • If parallel scan is performed, multiple rows of data are returned.

    +------------+-----------+----------+--------------------------------------------------------------------+
    | FileNumber | TotalRows | FileSize | URL                                                                |
    +------------+-----------+----------+--------------------------------------------------------------------+
    |          1 |         3 |        7 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
    |          1 |         2 |        4 | file:///192.168.1.11/home/work/path/result_{fragment_instance_id}_ |
    +------------+-----------+----------+--------------------------------------------------------------------+
    2 rows in set (2.218 sec)
  • If a request fails, an error message is returned. The following example shows a failed request.

    SELECT * FROM tbl INTO OUTFILE ...
    ERROR 1064 (HY000): errCode = 2, detailMessage = ...

Parallel scan

By default, a single thread on a backend (BE) node is used to export data and return the result set. Therefore, the export duration is positively correlated with the size of the result set. You can use parallel scan to speed up the data export process. To enable parallel scan, the following conditions must be met:

  1. The enable_parallel_outfile session variable is set to true.

  2. The query meets the requirements of parallel scan. For example, the top-level query plan does not require the returned data to be processed in a specific order.

The number of export requests that a single ApsaraDB for SelectDB instance can concurrently process is be_instance_num * parallel_fragment_exec_instance_num.
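For example, the following statements show one way to enable parallel scan for the current session. The parallelism value of 8 is only an illustration.

SET enable_parallel_outfile = true;
-- Number of fragment instances that each BE runs for a query (example value).
SET parallel_fragment_exec_instance_num = 8;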

Check whether parallel scan is enabled

After you set the session variable to enable parallel scan, you can execute the following statement to check whether parallel scan is enabled for the current request:

EXPLAIN SELECT xxx FROM xxx WHERE xxx INTO outfile "s3://xxx" format AS csv properties ("AWS_ENDPOINT" = "xxx", ...);

After you execute the EXPLAIN statement on the query, ApsaraDB for SelectDB returns a plan.

  • If RESULT FILE SINK is displayed in PLAN FRAGMENT 1, parallel scan is enabled for the query.

  • If RESULT FILE SINK is displayed in PLAN FRAGMENT 0, parallel scan is disabled because the query does not meet the conditions for parallel scan.

Sample plan:
+-----------------------------------------------------------------------------+
| Explain String                                                              |
+-----------------------------------------------------------------------------+
| PLAN FRAGMENT 0                                                             |
|  OUTPUT EXPRS:<slot 2> | <slot 3> | <slot 4> | <slot 5>                     |
|   PARTITION: UNPARTITIONED                                                  |
|                                                                             |
|   RESULT SINK                                                               |
|                                                                             |
|   1:EXCHANGE                                                                |
|                                                                             |
| PLAN FRAGMENT 1                                                             |
|  OUTPUT EXPRS:`k1` + `k2`                                                   |
|   PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1`   |
|                                                                             |
|   RESULT FILE SINK                                                          |
|   FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_                        |
|   STORAGE TYPE: S3                                                          |
|                                                                             |
|   0:OlapScanNode                                                            |
|      TABLE: multi_tablet                                                    |
+-----------------------------------------------------------------------------+

Usage notes

  • If the result set of the SELECT INTO OUTFILE statement is empty, a file is still generated.

  • File segmentation ensures that each segment file stores complete rows of data. Therefore, the size of a segment file may slightly differ from the value of the max_file_size parameter.

  • The output of some functions, such as BITMAP and HyperLogLog (HLL) functions, consists of invisible characters. Such output is exported as \N, which indicates a null value.

  • The output of some geospatial functions, such as ST_Point, is declared as the VARCHAR type, but the actual values are encoded binary characters. As a result, the exported values are garbled text. To export geospatial data, use the ST_AsText function, as shown in the following example.
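The following sketch converts point data to text before it is exported. The table name poi and the column names lng and lat are hypothetical.

SELECT ST_AsText(ST_Point(lng, lat))
FROM poi
INTO OUTFILE "s3://my_bucket/export/poi_"
FORMAT AS CSV
PROPERTIES
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou"
);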

Examples

  • Export data over the HDFS protocol. Export the results of a simple query to files with the hdfs://${host}:${fileSystem_port}/path/to/result_ prefix. Specify that the exported files are in the CSV format, the HDFS username is work, the column delimiter is a comma (,), and the row delimiter is \n. Sample code:

    -- The default value of the fileSystem_port parameter is 9000.
    SELECT * FROM tbl
    INTO OUTFILE "hdfs://${host}:${fileSystem_port}/path/to/result_"
    FORMAT AS CSV
    PROPERTIES
    (
        "fs.defaultFS" = "hdfs://ip:port",
        "hadoop.username" = "work",
        "column_separator" = ",",
        "line_delimiter" = "\n"
    );

    The following code provides an example on how to execute the SELECT INTO OUTFILE statement if high availability is enabled for the Hadoop cluster:

    -- The default value of the fileSystem_port parameter is 8020 for a Hadoop cluster with high availability enabled.
    SELECT * FROM tbl
    INTO OUTFILE "hdfs:///path/to/result_"
    FORMAT AS CSV
    PROPERTIES
    (
        'fs.defaultFS'='hdfs://hacluster/',
        'dfs.nameservices'='hacluster',
        'dfs.ha.namenodes.hacluster'='n1,n2',
        'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
        'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
        'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
    );

    If the total file size is less than or equal to 1 GB, the result_0.csv file is generated. If the total file size is greater than 1 GB, multiple segment files named result_0.csv, result_1.csv, ... are generated.

  • Export the query results of the UNION statement to files with the s3://oss-bucket/result_ prefix. Specify that the exported data is stored in the OSS bucket named oss-bucket that resides in the China (Hangzhou) region and that the exported files are in the PARQUET format, which does not require a column delimiter. Specify that a file is generated as an identifier after the data is exported so that you can check whether the data export job is successful.

    SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
    INTO OUTFILE "s3://oss-bucket/result_"
    FORMAT AS PARQUET
    PROPERTIES
    (
        "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
        "s3.access_key" = "****",
        "s3.secret_key" = "****",
        "s3.region" = "cn-hangzhou",
        -- Generate an identifier file after the export succeeds. The file name SUCCESS is an example.
        "success_file_name" = "SUCCESS"
    );
    Important

    If you export data over the Amazon S3 protocol, you must use http://, https://, or s3:// as the URI scheme at the beginning of the URL of the destination storage system regardless of the cloud service provider. Otherwise, the ERROR 1105 (HY000): errCode = 2, detailMessage = Unknown properties: [s3.region, s3.endpoint, s3.secret_key, s3.access_key] error message appears.

  • Export data to OSS over the Amazon S3 protocol and enable parallel scan. The prefix of a generated file is my_file_{fragment_instance_id}_. Sample code:

    SET enable_parallel_outfile = true;
    
    SELECT k1 FROM tb1 LIMIT 1000
    INTO outfile "s3://my_bucket/export/my_file_"
    format AS csv
    properties
    (
        "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
        "s3.access_key" = "****",
        "s3.secret_key" = "****",
        "s3.region" = "cn-hangzhou"
    );
  • Export data to OSS over the Amazon S3 protocol. In this example, the ORDER BY k1 clause requires the returned results to be sorted by the k1 column. Therefore, even if parallel scan is enabled for this request, the data is not concurrently scanned. Sample code:

    SET enable_parallel_outfile = true;
    
    SELECT k1 FROM tb1 ORDER BY k1 LIMIT 1000
    INTO outfile "s3://my_bucket/export/my_file_"
    format AS csv
    properties
    (
        "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
        "s3.access_key" = "****",
        "s3.secret_key" = "****",
        "s3.region" = "cn-hangzhou"
    );

Data type mappings

Parquet files and ORC files support different data types. When you export data, ApsaraDB for SelectDB automatically converts the data to the data types that are supported by Parquet files or ORC files.

  • The following table describes the data type mappings between ApsaraDB for SelectDB data and data in ORC files.

    Data type of ApsaraDB for SelectDB -> Data type in ORC files
    BOOLEAN -> BOOLEAN
    TINYINT -> TINYINT
    SMALLINT -> SMALLINT
    INT -> INT
    BIGINT -> BIGINT
    LARGEINT -> STRING
    DATE -> STRING
    DATEV2 -> STRING
    DATETIME -> STRING
    DATETIMEV2 -> TIMESTAMP
    FLOAT -> FLOAT
    DOUBLE -> DOUBLE
    CHAR / VARCHAR / STRING -> STRING
    DECIMAL -> DECIMAL
    STRUCT -> STRUCT
    MAP -> MAP
    ARRAY -> ARRAY

  • If you export data from ApsaraDB for SelectDB to Parquet files, the data in memory is first mapped to the data types supported by Apache Arrow, and Apache Arrow then writes the data to the Parquet files. The following table describes the data type mappings between ApsaraDB for SelectDB data and Apache Arrow data.

    Data type of ApsaraDB for SelectDB -> Data type of Apache Arrow
    BOOLEAN -> BOOLEAN
    TINYINT -> INT8
    SMALLINT -> INT16
    INT -> INT32
    BIGINT -> INT64
    LARGEINT -> UTF8
    DATE -> UTF8
    DATEV2 -> UTF8
    DATETIME -> UTF8
    DATETIMEV2 -> UTF8
    FLOAT -> FLOAT32
    DOUBLE -> FLOAT64
    CHAR / VARCHAR / STRING -> UTF8
    DECIMAL -> DECIMAL128
    STRUCT -> STRUCT
    MAP -> MAP
    ARRAY -> LIST
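For example, the following sketch exports a query result in the ORC format. According to the preceding mappings, a column of the LARGEINT type, such as the hypothetical id column in this example, is written as an ORC STRING column. The table name and destination path are placeholders.

SELECT id, created_at FROM tbl1
INTO OUTFILE "hdfs://path/to/result_"
FORMAT AS ORC
PROPERTIES
(
    "fs.defaultFS" = "hdfs://ip:port",
    "hadoop.username" = "work"
);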