
ApsaraDB for SelectDB: Export data by using the SELECT INTO OUTFILE statement

Last Updated: Jun 17, 2025

To back up or migrate data of an ApsaraDB for SelectDB instance, you can use the SELECT INTO OUTFILE statement to export data from the SelectDB instance to a file. You can store the file in remote storage systems such as Object Storage Service (OSS) and Hadoop Distributed File System (HDFS) over the Amazon Simple Storage Service (Amazon S3) and HDFS protocols.

Overview

You can use the SELECT INTO OUTFILE statement to back up or migrate data of a SelectDB instance. The statement has the following characteristics:

  • This statement is executed in a synchronous manner.

    • After the statement is executed, a row that describes the status of the data export job is returned.

    • Because the statement is executed synchronously, the status of the data export job is returned only after the statement finishes executing. If the connection is interrupted, the status of the data export job is not returned, and you cannot determine whether all queried data is exported. To prevent this issue, we recommend that you specify the "success_file_name" = "SUCCESS" setting in PROPERTIES when you create the data export job. This way, if the data export job succeeds, an identifier file is generated in the directory in which the exported files are stored, and you can use this file to determine whether all queried data is exported. For an example, see the sketch after this list.

  • This statement is executed by performing an SQL query.

    By default, only one thread is used during the execution.

  • The export duration is calculated by using the following formula: Total execution duration = Query execution duration + Time consumed to return the result set.

    If you want to query a large amount of data, you must configure the query_timeout session variable to extend the timeout period for the query. For more information about how to configure session variables, see Variable management.

  • This statement exports the query results into files.

    You can store the files in remote storage systems, such as OSS and HDFS, over the Amazon S3 and HDFS protocols.

  • This statement does not check whether the exported files and file path exist.

    This statement does not automatically create a path or overwrite an existing file. Whether these operations can be performed is determined by the remote storage system, such as Amazon S3 or HDFS.

    SelectDB does not manage the exported files. You need to manage the files generated by successful data export jobs and residual files generated by failed data export jobs in remote storage systems.
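
The following sketch combines the two recommendations above: it extends the query timeout before a large export and requests an identifier file in PROPERTIES. This is an illustrative example only; the table name tbl1, the OSS bucket path, and the credential placeholders are assumptions, and the Amazon S3-related properties are described later in this topic.

-- Extend the query timeout (in seconds) for a long-running export query.
SET query_timeout = 3600;

-- Export the query result. If the job succeeds, an identifier file named SUCCESS
-- is generated in the directory in which the exported files are stored.
SELECT * FROM tbl1
INTO OUTFILE "s3://oss-bucket/backup/result_"
FORMAT AS CSV
PROPERTIES
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou",
    "success_file_name" = "SUCCESS"
);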

Limits

This statement can be used to export data only over the Amazon S3 and HDFS protocols.

Usage notes

  • The supported export file formats vary based on the data type.

    • Basic data types: The CSV, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, Parquet, and Optimized Row Columnar (ORC) file formats are supported.

    • Complex data types:

      • Data of the ARRAY, MAP, and STRUCT types can be exported to files of the following formats: CSV, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, and ORC.

      • Nested data cannot be exported.

  • Usage notes for result files:

    • If the result set is empty, a file is still generated.

    • File segmentation ensures that each segment file stores complete rows of data. Therefore, the size of a segment file may slightly differ from the value of the max_file_size parameter.

  • Usage notes for the result file content:

    • If the output of a function, such as a BITMAP function or HyperLogLog (HLL) function, contains invisible characters, \N is returned. This indicates that the function returns a null value.

    • The output of some geospatial functions is encoded binary data. We recommend that you use the ST_AsText function to convert the output to text. The geospatial functions include ST_CIRCLE, ST_POINT, ST_POLYGON, ST_GEOMETRYFROMTEXT, ST_LINEFROMTEXT, and ST_GEOMETRYFROMWKB.

Syntax

query_stmt
INTO OUTFILE "file_path"
[format_as]
[properties]

Parameters

  • query_stmt (required): The dataset that you want to export. You can use a SELECT statement to query the dataset.

  • file_path (required): The path in which the exported files are stored and the prefix of the exported file names.

    You can omit the file name prefix and specify only the file directory. Example: "hdfs://path/to/".

    If you omit the file name prefix, the exported files are named in the following format: Instance name_File sequence number+File name extension.

      • The file sequence number starts from 0 and identifies the segment file.

        If a single file is exported, the file sequence number may be omitted.

      • The default file name extension is .csv. You can use format_as to specify other file name extensions.

    Example values of file_path:

      • "s3://bucket_name/to/my_file_"

      • "hdfs://path/to/my_file_"

    Description:

      • The file prefix is my_file_.

      • If you set format_as to CSV, the file name extension is .csv.

      • The file names depend on whether a single file or multiple files are exported:

        • If a single file is exported, the file name is my_file_Instance name_0.csv.

        • If multiple files are exported, the file sequence numbers start from 0 and increment by 1, and each sequence number identifies one segment file. Examples:

          my_file_abcdefg_0.csv
          my_file_abcdefg_1.csv
          my_file_abcdefg_2.csv
          ...

  • format_as (optional): The format of an exported file.

    • CSV, Parquet, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, and ORC are supported.

    • If you leave this parameter empty, CSV files are exported.

  • properties (optional): The properties related to the exported files and to the HDFS and Amazon S3 protocols. For more information, see Syntax of the properties object. For an example that combines all four parameters, see the sketch at the end of this section.

    Note

    You can export data over the Amazon S3 or HDFS protocol.
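
To put the four parameters together, the following minimal sketch exports a table to an OSS bucket. The table name tbl1 and the bucket path are illustrative assumptions; the properties used here are described in the next section.

SELECT * FROM tbl1                                 -- query_stmt
INTO OUTFILE "s3://oss-bucket/export/my_file_"     -- file_path: directory plus file name prefix
FORMAT AS CSV                                      -- format_as
PROPERTIES                                         -- properties
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou"
);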

Syntax of the properties object

The following sample code shows the syntax of the properties object. You can specify properties related to the exported files and to the HDFS and Amazon S3 protocols.

[PROPERTIES ("key"="value", ...)]
  • File-related properties

    • column_separator (optional): The column delimiter, which is used only for files of the CSV type.

    • line_delimiter (optional): The row delimiter, which is used only for files of the CSV type.

    • max_file_size (optional): The maximum size of a single file. If the size of the query result exceeds this value, multiple segment files are generated.

      • Valid values: [5 MB, 2 GB].

      • Default value: 1 GB.

      If you set format_as to ORC, the size of a segment file is calculated by using the following formula: ceil(max_file_size/64) × 64 MB.

    • delete_existing_files (optional): Specifies whether to delete all files in the directory specified by the file_path parameter. Valid values:

      • false (default): does not delete the files in the directory specified by the file_path parameter and directly exports data to the directory.

      • true: deletes all files in the directory specified by the file_path parameter and then adds the exported data to the directory. Examples:

        • If you use the "file_path" = "/user/tmp" setting, all files and directories in the "/user/" path are deleted.

        • If you use the "file_path" = "/user/tmp/" setting, all files and directories in the "/user/tmp/" path are deleted.

      Warning
      • The delete_existing_files = true setting is risky. We recommend that you use it only in a test environment.

      • To use the delete_existing_files parameter, submit a ticket to contact Alibaba Cloud technical support. SelectDB technical support adds the enable_delete_existing_files = true setting to the fe.conf file and restarts the frontend for the delete_existing_files parameter to take effect.

    • file_suffix (optional): The file name extension of the exported files. If this parameter is not specified, the default file name extension of the file format is used.

  • HDFS-related properties

    • fs.defaultFS (required): The NameNode endpoint and port.

    • hadoop.username (required): The username used to log on to HDFS.

    • dfs.nameservices (required): The name of the name service. The value of this parameter is the same as that in the hdfs-site.xml file.

    • dfs.ha.namenodes.[nameservice ID] (required): The NameNode IDs in the name service. The value of this parameter is the same as that in the hdfs-site.xml file.

    • dfs.namenode.rpc-address.[nameservice ID].[name node ID] (required): The Remote Procedure Call (RPC) addresses of the NameNodes. Each RPC address corresponds to a NameNode. The value of this parameter is the same as that in the hdfs-site.xml file.

    • dfs.client.failover.proxy.provider.[nameservice ID] (required): The Java class that the HDFS client uses to connect to the active NameNode. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.

    If Kerberos authentication is enabled for the Hadoop cluster, you must also specify the following properties. For an illustrative example, see the sketch after this list.

    • dfs.namenode.kerberos.principal (required): The Kerberos principal name of the HDFS NameNode.

    • hadoop.security.authentication (required): The authentication method. Set this parameter to kerberos to enable Kerberos authentication.

    • hadoop.kerberos.principal (required): The principal used for Kerberos authentication.

    • hadoop.kerberos.keytab (required): The path to the Kerberos keytab file.

  • Amazon S3-related properties

    The following properties can be set for storage systems that support the Amazon S3 protocol, such as Amazon S3 and OSS.

    • s3.endpoint (required): The endpoint of the destination storage system.

    • s3.access_key (required): The access key used to authenticate your requests to access the destination storage system.

    • s3.secret_key (required): The secret key used to authenticate your requests to access the destination storage system.

    • s3.region (required): The region in which the destination storage system resides.

    • s3.session_token (required only if temporary session authentication is enabled): The session token of the temporary session credentials used to access the destination storage system.

    • use_path_style (optional): Specifies whether to access the destination storage system in the path style. Default value: false.

      By default, Amazon S3 SDKs use the virtual-hosted style to access the destination storage system. However, some object storage systems may not support the virtual-hosted style. In this case, you can set the use_path_style parameter to true to forcibly access the destination storage system in the path style.

    Note

    The following URI schemes are supported: http://, https://, and s3://.

    1. If you use http:// or https://, the use_path_style parameter determines whether the destination storage system is accessed in the path style.

    2. If you use s3://, the destination storage system is accessed in the virtual-hosted style.
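
The following sketch shows, for illustration only, how the file-related properties and the Kerberos-related HDFS properties might be combined in one statement. The table name, NameNode host, principals, and keytab path are placeholder assumptions, not values from this topic.

SELECT * FROM tbl1
INTO OUTFILE "hdfs://namenode_host:8020/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    -- File-related properties.
    "column_separator" = ",",
    "line_delimiter" = "\n",
    "max_file_size" = "512MB",
    "file_suffix" = ".csv",
    -- HDFS-related properties for a Kerberos-enabled cluster (placeholder values).
    "fs.defaultFS" = "hdfs://namenode_host:8020",
    "hadoop.username" = "work",
    "hadoop.security.authentication" = "kerberos",
    "hadoop.kerberos.principal" = "selectdb@EXAMPLE.COM",
    "hadoop.kerberos.keytab" = "/path/to/selectdb.keytab",
    "dfs.namenode.kerberos.principal" = "nn/_HOST@EXAMPLE.COM"
);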

Response parameters

The data is exported in a synchronous manner. When the results are returned, the statement execution ends. A row in the returned results shows the result of the data export.

  • If the execution succeeds, the following information is returned:

    SELECT * FROM tbl1 LIMIT 10 into outfile "file:///home/work/path/result_";
    +------------+-----------+----------+--------------------------------------------------------------------+
    | FileNumber | TotalRows | FileSize | URL                                                                |
    +------------+-----------+----------+--------------------------------------------------------------------+
    |          1 |         2 |        8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
    +------------+-----------+----------+--------------------------------------------------------------------+
    1 row in set (0.05 sec)

    The following list describes the response parameters.

    • FileNumber: The number of generated files.

    • TotalRows: The number of rows in the result set.

    • FileSize: The total size of the exported files. Unit: bytes.

    • URL: The compute node to which the data is exported if the data is exported to a local disk.

  • If parallel export is performed, multiple rows of data are returned.

    +------------+-----------+----------+--------------------------------------------------------------------+
    | FileNumber | TotalRows | FileSize | URL                                                                |
    +------------+-----------+----------+--------------------------------------------------------------------+
    |          1 |         3 |        7 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
    |          1 |         2 |        4 | file:///192.168.1.11/home/work/path/result_{fragment_instance_id}_ |
    +------------+-----------+----------+--------------------------------------------------------------------+
    2 rows in set (2.218 sec)
  • If the execution fails, an error message is returned. Example:

    SELECT * FROM tbl INTO OUTFILE ...
    ERROR 1064 (HY000): errCode = 2, detailMessage = ...

Parallel export

By default, the query result set is exported in a non-parallel manner: a single thread on a backend (BE) node exports the data and returns the result set. Therefore, the export duration is positively correlated with the size of the result set.

You can use parallel export to speed up data export. To enable parallel export, perform the following steps:

  1. Set the enable_parallel_outfile session variable to true.

  2. Check whether the query result set supports parallel export.

    After you enable parallel export, you can use the EXPLAIN statement to check whether parallel export is supported. Syntax:

    EXPLAIN <select_into_outfile>;

    Execute the SELECT INTO OUTFILE statement. For more information about the syntax, see Syntax.

    After you execute the EXPLAIN statement, SelectDB immediately returns a query plan. You must analyze the query plan and check whether parallel export is supported.

  • If RESULT FILE SINK is displayed in PLAN FRAGMENT 1, parallel export is supported.

    If parallel export is supported, you can use be_instance_num * parallel_fragment_exec_instance_num to calculate the number of requests that can be concurrently processed by a single SelectDB instance.

    • be_instance_num specifies the number of nodes in a BE cluster.

      • If the cluster provides up to 32 CPU cores, the cluster contains only one node.

      • If the cluster provides more than 32 CPU cores, the cluster contains multiple nodes, and each node has 32 CPU cores.

    • parallel_fragment_exec_instance_num specifies the parallel parameter.

      You can execute the following statement to view the parameter:

      SHOW VARIABLES LIKE '%parallel_fragment_exec_instance_num%';

      You can execute the following statement to configure the parameter:

      SET parallel_fragment_exec_instance_num = <parallel_fragment_exec_instance_num>;
  • If RESULT FILE SINK is displayed in PLAN FRAGMENT 0, parallel export is not supported.

    In contrast, in the following sample query plan of a SELECT INTO OUTFILE job, RESULT FILE SINK is displayed in PLAN FRAGMENT 1. In this case, parallel export is supported.

    Sample parallel export plan:
    +-----------------------------------------------------------------------------+
    | Explain String                                                              |
    +-----------------------------------------------------------------------------+
    | PLAN FRAGMENT 0                                                             |
    |  OUTPUT EXPRS:<slot 2> | <slot 3> | <slot 4> | <slot 5>                     |
    |   PARTITION: UNPARTITIONED                                                  |
    |                                                                             |
    |   RESULT SINK                                                               |
    |                                                                             |
    |   1:EXCHANGE                                                                |
    |                                                                             |
    | PLAN FRAGMENT 1                                                             |
    |  OUTPUT EXPRS:`k1` + `k2`                                                   |
    |   PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1`   |
    |                                                                             |
    |   RESULT FILE SINK                                                          |
    |   FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_                        |
    |   STORAGE TYPE: S3                                                          |
    |                                                                             |
    |   0:OlapScanNode                                                            |
    |      TABLE: multi_tablet                                                    |
    +-----------------------------------------------------------------------------+
  3. Create a data export job based on the syntax to export data, as shown in the sketch below.
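
The following minimal sketch shows the check in step 2 with placeholder table and bucket names: enable the session variable, then run EXPLAIN on the complete SELECT INTO OUTFILE statement and look for RESULT FILE SINK in PLAN FRAGMENT 1 of the returned plan.

SET enable_parallel_outfile = true;

EXPLAIN
SELECT k1 FROM tbl1
INTO OUTFILE "s3://oss-bucket/export/my_file_"
FORMAT AS CSV
PROPERTIES
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou"
);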

Examples

Export data over the HDFS protocol

Export the results of a simple query to a CSV file. Examples:

Note

When you export data over the HDFS protocol, the PROPERTIES values of SELECT INTO OUTFILE vary based on the status of high availability for the Hadoop cluster.

High availability is disabled for the Hadoop cluster

-- fileSystem_port: The default value is 9000.
SELECT * FROM tbl
INTO OUTFILE "hdfs://${host}:${fileSystem_port}/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "fs.defaultFS" = "hdfs://ip:port",
    "hadoop.username" = "work"
);

High availability is enabled for the Hadoop cluster

-- If high availability is enabled, the default value of fileSystem_port is 8020.
SELECT * FROM tbl
INTO OUTFILE "hdfs:///path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    'fs.defaultFS'='hdfs://hacluster/',
    'dfs.nameservices'='hacluster',
    'dfs.ha.namenodes.hacluster'='n1,n2',
    'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
    'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
    'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);

If the total file size is less than or equal to 1 GB, the result_0.csv file is generated. If the total file size is greater than 1 GB, multiple segment files named result_0.csv, result_1.csv, ... are generated.

Export data to a file over the Amazon S3 protocol

Export the query results of the UNION statement to a file. Specify that the exported data is stored in the OSS bucket named oss-bucket that resides in the China (Hangzhou) region. Specify that the exported files are in the PARQUET format. You do not need to specify the column delimiter. Specify that a file is generated as an identifier after the data is exported. Sample code:

SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
INTO OUTFILE "s3://oss-bucket/result_"
FORMAT AS PARQUET
PROPERTIES
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou"
);
Important

If you export data over the Amazon S3 protocol, you must use http://, https://, or s3:// as the URI scheme at the beginning of the URL of the destination storage system, regardless of the cloud service provider. Otherwise, the ERROR 1105 (HY000): errCode = 2, detailMessage = Unknown properties: [s3.region, s3.endpoint, s3.secret_key, s3.access_key] error message is returned.

Export data to an OSS bucket over the Amazon S3 protocol

Enable parallel export and export data to OSS over the Amazon S3 protocol. Sample code:

SET enable_parallel_outfile = true;

SELECT k1 FROM tb1 LIMIT 1000
INTO outfile "s3://my_bucket/export/my_file_"
format AS csv
properties
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou"
);

In the following example, the query contains an ORDER BY k1 clause, which requires the result set to be sorted. Therefore, even if parallel export is enabled for this session, the data cannot be exported in parallel. Sample code:

SET enable_parallel_outfile = true;

SELECT k1 FROM tb1 ORDER BY k1 LIMIT 1000
INTO outfile "s3://my_bucket/export/my_file_"
format AS csv
properties
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou"
);

Data type mappings

Parquet and ORC files support data types that are different from the data types of SelectDB. When SelectDB exports data, it automatically converts the data to data types that are supported by Parquet and ORC files.

  • The following list describes the data type mappings between SelectDB data types and the data types supported by ORC files (SelectDB → ORC):

    • BOOLEAN → BOOLEAN
    • TINYINT → TINYINT
    • SMALLINT → SMALLINT
    • INT → INT
    • BIGINT → BIGINT
    • LARGEINT → STRING
    • DATE → STRING
    • DATEV2 → STRING
    • DATETIME → STRING
    • DATETIMEV2 → TIMESTAMP
    • FLOAT → FLOAT
    • DOUBLE → DOUBLE
    • CHAR / VARCHAR / STRING → STRING
    • DECIMAL → DECIMAL
    • STRUCT → STRUCT
    • MAP → MAP
    • ARRAY → ARRAY

  • If you export data from the memory of a SelectDB instance to Parquet files, the data is first mapped to the data types supported by Apache Arrow and then written to Parquet files by Apache Arrow. The following list describes the data type mappings between SelectDB data types and Apache Arrow data types (SelectDB → Arrow):

    • BOOLEAN → BOOLEAN
    • TINYINT → INT8
    • SMALLINT → INT16
    • INT → INT32
    • BIGINT → INT64
    • LARGEINT → UTF8
    • DATE → UTF8
    • DATEV2 → UTF8
    • DATETIME → UTF8
    • DATETIMEV2 → UTF8
    • FLOAT → FLOAT32
    • DOUBLE → FLOAT64
    • CHAR / VARCHAR / STRING → UTF8
    • DECIMAL → DECIMAL128
    • STRUCT → STRUCT
    • MAP → MAP
    • ARRAY → LIST