Use the SELECT INTO OUTFILE statement to export query results from ApsaraDB for SelectDB to remote storage—Object Storage Service (OSS) or Hadoop Distributed File System (HDFS)—for backup or data migration.
How it works
- Write a SELECT query to define the data to export.
- Specify a destination path in OSS or HDFS and set the output format and properties.
- Run the statement. SelectDB executes the query and writes the results to the destination in a single synchronous operation.
- Check the returned row for file count, row count, and file size to confirm success.
Usage notes
Synchronous execution and result integrity
SELECT INTO OUTFILE is synchronous. The statement returns only after the export finishes—or fails. If the connection drops during export, no status is returned and you cannot tell whether the export completed.
To confirm export success reliably, add "success_file_name" = "SUCCESS" to the PROPERTIES clause. When the export succeeds, SelectDB writes a marker file named SUCCESS to the output directory.
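For example, a sketch of an export with a success marker (the table name and OSS bucket are placeholders):

```sql
-- Hypothetical table and bucket. After the export succeeds, SelectDB
-- writes a marker file named SUCCESS to s3://my-bucket/export/.
SELECT * FROM sales_orders
INTO OUTFILE "s3://my-bucket/export/orders_"
FORMAT AS CSV
PROPERTIES
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou",
    "success_file_name" = "SUCCESS"
);
```

If the connection drops mid-export, reconnect and check the output directory: the export completed only if the SUCCESS file exists.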
Single-threaded execution
By default, the export uses one thread on a single BE node. Export duration equals query execution time plus result-set write time:
Total duration = Query execution time + Result write time
For large datasets, extend the timeout before running the export:
SET query_timeout = <seconds>;
For details on session variables, see Variable management.
File path and overwrite behavior
SELECT INTO OUTFILE does not check whether the destination path or files already exist. It does not create paths automatically or overwrite existing files. Path creation and overwrite behavior are controlled by the remote storage system (OSS or HDFS).
File management responsibility
SelectDB does not manage exported files. Clean up output files from successful jobs and residual files from failed jobs directly in your storage system.
Supported file formats
| Data type | Supported formats |
|---|---|
| Basic types | CSV, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, Parquet, ORC (Optimized Row Columnar) |
| Complex types (ARRAY, MAP, STRUCT) | CSV, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, ORC |
| Nested data | Not supported |
Special output values
- BITMAP and HyperLogLog (HLL) functions: If the output contains invisible characters, \N (null) is returned instead.
- Geospatial functions: Output is encoded binary. Use ST_AsText to get readable text. Affected functions: ST_CIRCLE, ST_POINT, ST_POLYGON, ST_GEOMETRYFROMTEXT, ST_LINEFROMTEXT, ST_GEOMETRYFROMWKB.
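To apply this in an export, a sketch (the table and column names are placeholders):

```sql
-- Hypothetical table. Wrapping the geometry in ST_AsText makes the
-- exported CSV contain readable WKT text instead of encoded binary.
SELECT id, ST_AsText(ST_Point(lng, lat)) AS point_wkt
FROM locations
INTO OUTFILE "s3://my-bucket/export/points_"
FORMAT AS CSV
PROPERTIES
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou"
);
```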
Result file behavior
- An empty result set still produces an output file.
- Segment files always contain complete rows, so actual file size may differ slightly from max_file_size.
Limitations
SELECT INTO OUTFILE supports the Amazon Simple Storage Service (Amazon S3) protocol and the HDFS protocol only.
Syntax
query_stmt
INTO OUTFILE "file_path"
[FORMAT AS format_name]
[PROPERTIES ("key"="value", ...)]
Parameters
| Parameter | Required | Description |
|---|---|---|
| query_stmt | Yes | A SELECT statement that defines the data to export. |
| file_path | Yes | Destination path and optional filename prefix. Omit the prefix to use the directory only, for example "hdfs://path/to/". |
| FORMAT AS | No | Output format. Options: CSV (default), CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, PARQUET, ORC. |
| PROPERTIES | No | File, HDFS, or S3 properties. See File properties, HDFS properties, and S3 properties. |
File naming
When file_path includes a prefix (for example "s3://bucket_name/to/my_file_"), output files are named using this pattern:
<prefix><instance-name>_<sequence>.csv
- The sequence number starts from 0 and increments for each segment file.
- For a single output file, the sequence number is omitted.
- The default extension is .csv. Override it with file_suffix.
Example: With prefix my_file_ and FORMAT AS CSV, a multi-file export produces:
my_file_abcdefg_0.csv
my_file_abcdefg_1.csv
my_file_abcdefg_2.csv
...
File properties
| Property | Required | Description |
|---|---|---|
| column_separator | No | Column delimiter. Applies to CSV formats only. |
| line_delimiter | No | Row delimiter. Applies to CSV formats only. |
| max_file_size | No | Maximum size per segment file. Valid range: 5 MB–2 GB. Default: 1 GB. For ORC files, the actual size is ceil(max_file_size / 64) × 64 MB. |
| delete_existing_files | No | Whether to delete all files in the destination directory before exporting. false (default): export without deleting existing files. true: delete all files and subdirectories in the destination directory first. |
| success_file_name | No | Name of the marker file written to the destination directory after a successful export, for example "SUCCESS". Use this to confirm export completion when the connection may be interrupted. |
| file_suffix | No | File extension for output files. Overrides the default extension. |
delete_existing_files = true permanently deletes data. Use this setting in test environments only. To enable it, submit a ticket to Alibaba Cloud technical support to set enable_delete_existing_files = true in fe.conf and restart the frontend.
`delete_existing_files` scope examples:
- "file_path" = "/user/tmp" deletes all files and directories under /user/.
- "file_path" = "/user/tmp/" deletes all files and directories under /user/tmp/.
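A sketch that combines several file properties (the cluster address and table name are placeholders):

```sql
-- Split output into segments of at most 256 MB, use explicit
-- delimiters, and name files with a .txt extension instead of .csv.
SELECT * FROM events
INTO OUTFILE "hdfs://ip:port/tmp/events_"
FORMAT AS CSV
PROPERTIES
(
    "fs.defaultFS" = "hdfs://ip:port",
    "hadoop.username" = "work",
    "column_separator" = ",",
    "line_delimiter" = "\n",
    "max_file_size" = "256MB",
    "file_suffix" = ".txt"
);
```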
HDFS properties
Set the following properties when exporting over HDFS. The nameservice-related properties apply to clusters with high availability (HA) enabled.
| Property | Required | Description |
|---|---|---|
| fs.defaultFS | Yes | NameNode endpoint and port. |
| hadoop.username | Yes | Username for HDFS authentication. |
| dfs.nameservices | Yes (HA clusters) | Name service name. Must match the value in hdfs-site.xml. |
| dfs.ha.namenodes.[nameservice ID] | Yes (HA clusters) | NameNode IDs in the name service. Must match hdfs-site.xml. |
| dfs.namenode.rpc-address.[nameservice ID].[name node ID] | Yes (HA clusters) | Remote Procedure Call (RPC) address for each NameNode. Must match hdfs-site.xml. |
| dfs.client.failover.proxy.provider.[nameservice ID] | Yes (HA clusters) | Java class for active NameNode failover. Default: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |
Kerberos authentication (if enabled)
| Property | Required | Description |
|---|---|---|
| dfs.namenode.kerberos.principal | Yes | Kerberos principal name of the HDFS NameNode. |
| hadoop.security.authentication | Yes | Authentication method. Set to kerberos. |
| hadoop.kerberos.principal | Yes | Principal for Kerberos authentication. |
| hadoop.kerberos.keytab | Yes | Path to the Kerberos keytab file. |
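A sketch of a Kerberos-secured HDFS export; the principals, keytab path, and cluster address are placeholders and must match your cluster's configuration:

```sql
-- Hypothetical Kerberos setup. The keytab file must be readable by
-- the SelectDB service on the nodes that perform the export.
SELECT * FROM tbl
INTO OUTFILE "hdfs://ip:port/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "fs.defaultFS" = "hdfs://ip:port",
    "hadoop.username" = "work",
    "hadoop.security.authentication" = "kerberos",
    "hadoop.kerberos.principal" = "doris@EXAMPLE.COM",
    "hadoop.kerberos.keytab" = "/etc/security/keytabs/doris.keytab",
    "dfs.namenode.kerberos.principal" = "nn/_HOST@EXAMPLE.COM"
);
```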
S3 properties
These properties apply to any storage system that supports the S3 protocol, including OSS.
| Property | Required | Description |
|---|---|---|
| s3.endpoint | Yes | Endpoint of the destination storage system. |
| s3.access_key | Yes | Access key for authentication. |
| s3.secret_key | Yes | Secret key for authentication. |
| s3.region | Yes | Region of the destination storage system. |
| s3.session_token | If using temporary credentials | Session token for temporary session authentication. |
| use_path_style | No | Access style. Default: false (virtual-hosted style). Set to true to force path-style access for storage systems that do not support virtual-hosted style. |
URI scheme: Use http://, https://, or s3:// in file_path.
- http:// or https://: use_path_style determines the access style.
- s3://: always uses virtual-hosted style, regardless of use_path_style.
Always include a URI scheme (http://, https://, or s3://) in the destination URL. Omitting the scheme causes the error: ERROR 1105 (HY000): errCode = 2, detailMessage = Unknown properties: [s3.region, s3.endpoint, s3.secret_key, s3.access_key].
Response
SELECT INTO OUTFILE is synchronous. When the statement finishes, it returns one row per export thread.
Success response:
SELECT * FROM tbl1 LIMIT 10 INTO OUTFILE "file:///home/work/path/result_";
+------------+-----------+----------+--------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL |
+------------+-----------+----------+--------------------------------------------------------------------+
| 1 | 2 | 8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
+------------+-----------+----------+--------------------------------------------------------------------+
1 row in set (0.05 sec)
| Field | Description |
|---|---|
| FileNumber | Number of files generated. |
| TotalRows | Number of rows in the result set. |
| FileSize | Total size of exported files, in bytes. |
| URL | Location of the exported files. If data is exported to a local disk, the URL shows the compute node that the data was exported to. |
Parallel export response (multiple rows, one per thread):
+------------+-----------+----------+--------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL |
+------------+-----------+----------+--------------------------------------------------------------------+
| 1 | 3 | 7 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
| 1 | 2 | 4 | file:///192.168.1.11/home/work/path/result_{fragment_instance_id}_ |
+------------+-----------+----------+--------------------------------------------------------------------+
2 rows in set (2.218 sec)
Error response:
ERROR 1064 (HY000): errCode = 2, detailMessage = ...
Enable parallel export
By default, export runs on a single BE node thread. Enable parallel export to distribute the workload across multiple BE nodes.
Step 1: Enable the session variable.
SET enable_parallel_outfile = true;
Step 2: Verify that your query supports parallel export.
Run EXPLAIN on your SELECT INTO OUTFILE statement:
EXPLAIN <select_into_outfile_statement>;
Check the output:
- Parallel export supported: RESULT FILE SINK appears in PLAN FRAGMENT 1.
- Parallel export not supported: RESULT FILE SINK appears in PLAN FRAGMENT 0.

A query with ORDER BY cannot run in parallel even when enable_parallel_outfile = true, because ordering requires a single-threaded merge step.
Example query plan (parallel export supported):
+-----------------------------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS:<slot 2> | <slot 3> | <slot 4> | <slot 5> |
| PARTITION: UNPARTITIONED |
| |
| RESULT SINK |
| |
| 1:EXCHANGE |
| |
| PLAN FRAGMENT 1 |
| OUTPUT EXPRS:`k1` + `k2` |
| PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` |
| |
| RESULT FILE SINK |
| FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ |
| STORAGE TYPE: S3 |
| |
| 0:OlapScanNode |
| TABLE: multi_tablet |
+-----------------------------------------------------------------------------+
Step 3: Calculate maximum concurrency (optional).
When parallel export is supported, the maximum number of concurrent requests is:
be_instance_num × parallel_fragment_exec_instance_num
- be_instance_num: number of BE nodes. Clusters with fewer than 32 CPU cores use one node; clusters with 32 or more CPU cores use multiple nodes, each with 32 CPU cores.
- parallel_fragment_exec_instance_num: parallelism setting per node.
To view the current parallelism setting:
SHOW VARIABLES LIKE '%parallel_fragment_exec_instance_num%';
To set it:
SET parallel_fragment_exec_instance_num = <value>;
Examples
Export to HDFS
The required HDFS properties differ depending on whether high availability (HA) is enabled for the Hadoop cluster.
Hadoop cluster without high availability
-- fileSystem_port default: 9000
SELECT * FROM tbl
INTO OUTFILE "hdfs://${host}:${fileSystem_port}/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
"fs.defaultFS" = "hdfs://ip:port",
"hadoop.username" = "work"
);
Hadoop cluster with high availability
-- fileSystem_port default: 8020
SELECT * FROM tbl
INTO OUTFILE "hdfs:///path/to/result_"
FORMAT AS CSV
PROPERTIES
(
"fs.defaultFS" = "hdfs://hacluster/",
"dfs.nameservices" = "hacluster",
"dfs.ha.namenodes.hacluster" = "n1,n2",
"dfs.namenode.rpc-address.hacluster.n1" = "192.168.0.1:8020",
"dfs.namenode.rpc-address.hacluster.n2" = "192.168.0.2:8020",
"dfs.client.failover.proxy.provider.hacluster" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);
If the total result is 1 GB or less, SelectDB writes a single file: result_0.csv. If the total exceeds 1 GB, it writes multiple segment files: result_0.csv, result_1.csv, and so on.
Export to OSS over the S3 protocol
Export the result of a UNION query to Parquet format in an OSS bucket in the China (Hangzhou) region:
SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
INTO OUTFILE "s3://oss-bucket/result_"
FORMAT AS PARQUET
PROPERTIES
(
"s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "cn-hangzhou"
);
Export to OSS with parallel export enabled
Enable the enable_parallel_outfile session variable for the session, then run the export:
SET enable_parallel_outfile = true;
SELECT k1 FROM tb1 LIMIT 1000
INTO OUTFILE "s3://my_bucket/export/my_file_"
FORMAT AS CSV
PROPERTIES
(
"s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "cn-hangzhou"
);
If the query includes ORDER BY, parallel export is disabled automatically even with enable_parallel_outfile = true:
SET enable_parallel_outfile = true;
SELECT k1 FROM tb1 ORDER BY k1 LIMIT 1000
INTO OUTFILE "s3://my_bucket/export/my_file_"
FORMAT AS CSV
PROPERTIES
(
"s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "cn-hangzhou"
);
Data type mappings
When exporting to Parquet or ORC, SelectDB automatically maps its data types to the formats supported by each file type.
SelectDB to ORC
| SelectDB type | ORC type |
|---|---|
| BOOLEAN | BOOLEAN |
| TINYINT | TINYINT |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
| LARGEINT | STRING |
| DATE | STRING |
| DATEV2 | STRING |
| DATETIME | STRING |
| DATETIMEV2 | TIMESTAMP |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| CHAR / VARCHAR / STRING | STRING |
| DECIMAL | DECIMAL |
| STRUCT | STRUCT |
| MAP | MAP |
| ARRAY | ARRAY |
SelectDB to Parquet (via Apache Arrow)
Parquet export routes data through Apache Arrow. SelectDB types map to Arrow types first, then Arrow writes them to Parquet.
| SelectDB type | Apache Arrow type |
|---|---|
| BOOLEAN | BOOLEAN |
| TINYINT | INT8 |
| SMALLINT | INT16 |
| INT | INT32 |
| BIGINT | INT64 |
| LARGEINT | UTF8 |
| DATE | UTF8 |
| DATEV2 | UTF8 |
| DATETIME | UTF8 |
| DATETIMEV2 | UTF8 |
| FLOAT | FLOAT32 |
| DOUBLE | FLOAT64 |
| CHAR / VARCHAR / STRING | UTF8 |
| DECIMAL | DECIMAL128 |
| STRUCT | STRUCT |
| MAP | MAP |
| ARRAY | LIST |