Alibaba Cloud allows you to use the SELECT INTO OUTFILE
statement to export data from ApsaraDB for SelectDB instances and generate files. You can store the files in remote storage systems such as Object Storage Service (OSS) and Hadoop Distributed File System (HDFS) over the Amazon Simple Storage Service (Amazon S3) and HDFS protocols.
Overview
The SELECT INTO OUTFILE
statement is used to export query results to files. You can store the files in remote storage systems such as OSS and HDFS over the Amazon S3 and HDFS protocols.
The SELECT INTO OUTFILE
statement is executed in a synchronous manner: the execution ends when the results are returned, and each row in the returned results shows the result of a data export request. Because the statement is executed synchronously, you cannot confirm whether a data export job is complete as expected or whether all queried data is exported. In this case, you can specify the success_file_name parameter to require that a marker file be generated in the directory in which the exported files are stored after the job succeeds. You can use this file to determine whether the data export job is complete as expected.
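For example, the following statement is a minimal sketch of this usage. The table name, destination path, and HDFS settings are placeholders, and the marker file name SUCCESS is an illustrative value.

SELECT * FROM tbl
INTO OUTFILE "hdfs://host:port/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "fs.defaultFS" = "hdfs://host:port",
    "hadoop.username" = "work",
    -- Generate a marker file named SUCCESS in the target directory after the job succeeds.
    "success_file_name" = "SUCCESS"
);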
This statement is executed by performing an SQL query. By default, only one thread is used during the execution. The duration of an entire data export process includes the time consumed to perform the query and the time consumed to return the result set. If you want to query a large amount of data, you must configure the query_timeout session variable to extend the timeout period of the query.
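For example, you can extend the timeout for the current session before you export a large result set. The value is specified in seconds; 3600 is an illustrative value.

-- Extend the query timeout of the current session to one hour.
SET query_timeout = 3600;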
This statement does not check whether the exported files and the file path exist. It does not automatically create the path or overwrite existing files; whether these operations are performed depends on the semantics of the destination remote storage system. ApsaraDB for SelectDB does not manage the exported files. You must manage the files generated by successful data export jobs and the residual files generated by failed data export jobs in the remote storage systems.
Usage method
Syntax
query_stmt
INTO OUTFILE "file_path"
[format_as]
[properties]
Request parameters
| Parameter | Description |
| --- | --- |
| file_path | The directory in which the exported files are stored and the prefix of the file names. You can also omit the file name prefix and specify only the file directory. |
| format_as | The format of an exported file. Valid values: CSV, PARQUET, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, and ORC. Default value: CSV. |
| properties | The properties that you can specify based on your business requirements. For more information, see the Syntax of the properties object section of this topic. You can export data over the Amazon S3 or HDFS protocol. |
Data of basic types can be exported to files of all supported types.
Data of the ARRAY, MAP, and STRUCT types can be exported to files of the following types: CSV, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, and ORC.
Data of the NESTED type cannot be exported.
Syntax of the properties object
The following sample code shows the syntax of the properties object. You can specify properties related to the exported files, the HDFS protocol, and the Amazon S3 protocol.
[PROPERTIES ("key"="value", ...)]
File-related properties

| Parameter | Description |
| --- | --- |
| column_separator | The column delimiter, which is used only for files of the CSV type. |
| line_delimiter | The row delimiter, which is used only for files of the CSV type. |
| max_file_size | The maximum size of a single file. If the size of the query result exceeds this value, multiple segment files are generated. Valid values: 5 MB to 2 GB. Default value: 1 GB. If the exported files are of the ORC type, the maximum size of a segment file is rounded up to a multiple of 64 MB. For example, if you set the max_file_size parameter to 5 MB, the maximum size of a segment file is 64 MB. If you set the max_file_size parameter to 65 MB, the maximum size of a segment file is 128 MB. |
| delete_existing_files | Specifies whether to delete all files in the directory specified by the file_path parameter before data is exported to the directory. Default value: false. Examples: If you set the file_path parameter to /user/, all files and directories under the /user/ directory are deleted. If you set the file_path parameter to /user/tmp/, all files and directories under the /user/tmp/ directory are deleted. Warning: Risks may occur if you set the delete_existing_files parameter to true. We recommend that you perform this operation only in a test environment. To use the delete_existing_files parameter, submit a ticket to contact Alibaba Cloud technical support. The delete_existing_files parameter takes effect only after the enable_delete_existing_files parameter is set to true in the fe.conf file and the FE node is restarted. |
| file_suffix | The suffix of the exported file. If this parameter is not specified, the default suffix of the file format is used. |
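For example, the following statement is a sketch that combines the file-related properties. The destination path, delimiters, and file size are illustrative values.

SELECT * FROM tbl
INTO OUTFILE "hdfs://host:port/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "fs.defaultFS" = "hdfs://host:port",
    "hadoop.username" = "work",
    -- Use a comma as the column delimiter and \n as the row delimiter.
    "column_separator" = ",",
    "line_delimiter" = "\n",
    -- Split the result into segment files of at most 100 MB each.
    "max_file_size" = "100MB"
);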
HDFS-related properties

| Parameter | Description |
| --- | --- |
| fs.defaultFS | The NameNode endpoint. |
| hadoop.username | The username used to log on to HDFS. |
| dfs.nameservices | The name service ID. The value of this parameter is the same as that in the hdfs-site.xml file. |
| dfs.ha.namenodes.[nameservice ID] | The NameNode IDs in the name service. The value of this parameter is the same as that in the hdfs-site.xml file. |
| dfs.namenode.rpc-address.[nameservice ID].[name node ID] | The Remote Procedure Call (RPC) URLs of the NameNodes. Each RPC URL corresponds to a NameNode. The value of this parameter is the same as that in the hdfs-site.xml file. |
| dfs.client.failover.proxy.provider.[nameservice ID] | The Java class that implements active NameNode connection for the HDFS client. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |

If Kerberos authentication is enabled for the Hadoop cluster, you must also specify the properties described in the following table.
| Parameter | Description |
| --- | --- |
| dfs.namenode.kerberos.principal | The Kerberos principal name of the HDFS NameNode. |
| hadoop.security.authentication | The authentication method. Set this parameter to kerberos to enable Kerberos authentication. |
| hadoop.kerberos.principal | The principal for Kerberos authentication. |
| hadoop.kerberos.keytab | The path to the Kerberos keytab file. |
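The following statement is a sketch that shows how the Kerberos-related properties can be combined with the other HDFS properties. The principal names, keytab path, and cluster addresses are placeholders.

SELECT * FROM tbl
INTO OUTFILE "hdfs://host:port/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "fs.defaultFS" = "hdfs://host:port",
    -- Enable Kerberos authentication.
    "hadoop.security.authentication" = "kerberos",
    "hadoop.kerberos.principal" = "selectdb@EXAMPLE.COM",
    "hadoop.kerberos.keytab" = "/path/to/selectdb.keytab",
    "dfs.namenode.kerberos.principal" = "nn/_HOST@EXAMPLE.COM"
);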
Amazon S3-related properties
The following table describes the properties that you can set for storage systems that support the Amazon S3 protocol. These storage systems include Amazon S3 and OSS.
| Parameter | Description |
| --- | --- |
| s3.endpoint | Required. The endpoint of the destination storage system. |
| s3.access_key | Required. The access key used to authenticate your requests to access the destination storage system. |
| s3.secret_key | Required. The secret key used to authenticate your requests to access the destination storage system. Type: STRING. |
| s3.region | Required. The region in which the destination storage system resides. |
| s3.session_token | The session token used for temporary session credentials to access the destination storage system. This parameter is required if temporary session authentication is enabled. |
| use_path_style | Optional. Specifies whether to access the destination storage system in the path style. Default value: false. By default, Amazon S3 SDKs use the virtual-hosted style to access the destination storage system. However, some object storage systems may not support the virtual-hosted style. In this case, you can set the use_path_style parameter to true to forcibly access the destination storage system in the path style. |

Note: The following URI schemas are supported: http://, https://, and s3://. If you use http:// or https://, the use_path_style parameter determines whether the destination storage system is accessed in the path style. If you use s3://, the destination storage system is accessed in the virtual-hosted style.
Response parameters
The data is exported in a synchronous manner. The statement execution ends when the results are returned, and each row in the returned results shows the result of the data export.
Sample success response:
SELECT * FROM tbl1 LIMIT 10 into outfile "file:///home/work/path/result_";
+------------+-----------+----------+---------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL                                                                 |
+------------+-----------+----------+---------------------------------------------------------------------+
|          1 |         2 |        8 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
+------------+-----------+----------+---------------------------------------------------------------------+
1 row in set (0.05 sec)
The following table describes the response parameters.
| Parameter | Description |
| --- | --- |
| FileNumber | The number of generated files. |
| TotalRows | The number of returned rows in the result set. |
| FileSize | The total size of the exported files. Unit: bytes. |
| URL | The compute node to which the data is exported if the data is exported to a local disk. |
If parallel scan is performed, multiple rows of data are returned.
+------------+-----------+----------+---------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL                                                                 |
+------------+-----------+----------+---------------------------------------------------------------------+
|          1 |         3 |        7 | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
|          1 |         2 |        4 | file:///192.168.1.11/home/work/path/result_{fragment_instance_id}_ |
+------------+-----------+----------+---------------------------------------------------------------------+
2 rows in set (2.218 sec)
If a request fails, an error message is returned. The following sample code shows an example.
SELECT * FROM tbl INTO OUTFILE ...
ERROR 1064 (HY000): errCode = 2, detailMessage = ...
Parallel scan
By default, a single thread on a BE node is used to export data and return the result set. Therefore, the export duration is positively correlated with the size of the result set. You can use parallel scan to speed up the data export process. To enable parallel scan, the following conditions must be met:

The enable_parallel_outfile session variable is set to true.

The query meets the requirements of parallel scan. For example, the top-level query plan does not require the returned data to be processed in a specific order.

The value of be_instance_num * parallel_fragment_exec_instance_num indicates the number of export requests that can be concurrently processed by a single ApsaraDB for SelectDB instance.
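For example, the following sketch enables parallel export for the current session and checks the per-BE parallelism by using the session variables mentioned above; the statements assume the default variable names.

-- Enable parallel export for the current session.
SET enable_parallel_outfile = true;
-- Check how many fragment instances each BE runs for a query.
SHOW VARIABLES LIKE 'parallel_fragment_exec_instance_num';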
Check whether parallel scan is enabled
After you set the session variable to enable parallel scan, you can execute the following statement to check whether parallel scan is enabled for the current request:
EXPLAIN SELECT xxx FROM xxx WHERE xxx INTO outfile "s3://xxx" format AS csv properties ("AWS_ENDPOINT" = "xxx", ...);
After you execute the EXPLAIN statement on the query, ApsaraDB for SelectDB returns a plan.

If RESULT FILE SINK is displayed in PLAN FRAGMENT 1, parallel scan is enabled for the query.

If RESULT FILE SINK is displayed in PLAN FRAGMENT 0, parallel scan is disabled because the query does not meet the conditions to enable parallel scan.
Sample plan:
+-----------------------------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS:<slot 2> | <slot 3> | <slot 4> | <slot 5> |
| PARTITION: UNPARTITIONED |
| |
| RESULT SINK |
| |
| 1:EXCHANGE |
| |
| PLAN FRAGMENT 1 |
| OUTPUT EXPRS:`k1` + `k2` |
| PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1` |
| |
| RESULT FILE SINK |
| FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_ |
| STORAGE TYPE: S3 |
| |
| 0:OlapScanNode |
| TABLE: multi_tablet |
+-----------------------------------------------------------------------------+
Usage notes
If the result set of the SELECT INTO OUTFILE statement is empty, a file is still generated.
File segmentation ensures that each segment file stores complete rows of data. Therefore, the size of a segment file may slightly differ from the value of the max_file_size parameter.

If the output of a function, such as a BITMAP or HyperLogLog (HLL) function, contains invisible characters, \N is returned. This indicates that the function returns a null value.

The output of some geospatial functions is binary-encoded data. For example, the output of the ST_Point function is of the VARCHAR type, but the content is encoded binary data. Therefore, garbled text is returned if you export the output directly. In this case, use the ST_AsText function to convert the output to text.
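For example, the following sketch exports a geospatial value as readable text. The table name geo_tbl, the columns lng and lat, and the HDFS settings are hypothetical.

-- Convert the binary POINT value to WKT text before exporting it.
SELECT ST_AsText(ST_Point(lng, lat)) AS point_text
FROM geo_tbl
INTO OUTFILE "hdfs://host:port/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "fs.defaultFS" = "hdfs://host:port",
    "hadoop.username" = "work"
);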
Examples
Export data over the HDFS protocol. Export the results of a simple query to the hdfs://${host}:${fileSystem_port}/path/to/result.txt file. Specify that the exported files are in the CSV format, the HDFS username is work, the column delimiter is a comma (,), and the row delimiter is \n. Sample code:

-- The default value of the fileSystem_port parameter is 9000.
SELECT * FROM tbl
INTO OUTFILE "hdfs://${host}:${fileSystem_port}/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "fs.defaultFS" = "hdfs://ip:port",
    "hadoop.username" = "work"
);
The following code provides an example on how to execute the SELECT INTO OUTFILE statement if high availability is enabled for the Hadoop cluster:

-- The default value of the fileSystem_port parameter is 8020 for a Hadoop cluster for which high availability is enabled.
SELECT * FROM tbl
INTO OUTFILE "hdfs:///path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    'fs.defaultFS'='hdfs://hacluster/',
    'dfs.nameservices'='hacluster',
    'dfs.ha.namenodes.hacluster'='n1,n2',
    'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
    'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
    'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
If the total file size is less than or equal to 1 GB, a file named result_0.csv is generated. If the total file size is greater than 1 GB, multiple segment files named result_0.csv, result_1.csv, and so on are generated.

Export the query results of the UNION statement to the s3://oss-bucket/result.txt file. Specify that the exported data is stored in the OSS bucket named oss-bucket that resides in the China (Hangzhou) region. Specify that the exported files are in the PARQUET format. You do not need to specify the column delimiter. Specify that a file is generated as an identifier after the data is exported to check whether the data export job is successful. Sample code:

SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
INTO OUTFILE "s3://oss-bucket/result_"
FORMAT AS PARQUET
PROPERTIES
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou"
);
Important: If you export data over the Amazon S3 protocol, you must use http://, https://, or s3:// as the URI schema at the beginning of the URL of the destination storage system regardless of the cloud service provider. Otherwise, the ERROR 1105 (HY000): errCode = 2, detailMessage = Unknown properties: [s3.region, s3.endpoint, s3.secret_key, s3.access_key] error message appears.

Export data to OSS over the Amazon S3 protocol and enable parallel scan. The prefix of a generated file is my_file_{fragment_instance_id}_. Sample code:

SET enable_parallel_outfile = true;
SELECT k1 FROM tb1 LIMIT 1000
INTO OUTFILE "s3://my_bucket/export/my_file_"
FORMAT AS CSV
PROPERTIES
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou"
);
Export data to OSS over the Amazon S3 protocol. In this example, ORDER BY k1 indicates that the returned results are sorted by the specified column. Therefore, even if parallel scan is enabled for this request, the data is not concurrently scanned. Sample code:

SET enable_parallel_outfile = true;
SELECT k1 FROM tb1 ORDER BY k1 LIMIT 1000
INTO OUTFILE "s3://my_bucket/export/my_file_"
FORMAT AS CSV
PROPERTIES
(
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.access_key" = "****",
    "s3.secret_key" = "****",
    "s3.region" = "cn-hangzhou"
);
Data type mappings
Parquet and ORC files support data types that are different from the data types of ApsaraDB for SelectDB. When you export data, ApsaraDB for SelectDB automatically converts the data to the types that are supported by Parquet and ORC files.
The following table describes the data type mappings between ApsaraDB for SelectDB data and data in ORC files.
| Data type of ApsaraDB for SelectDB | Data type supported by ORC files |
| --- | --- |
| BOOLEAN | BOOLEAN |
| TINYINT | TINYINT |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
| LARGEINT | STRING |
| DATE | STRING |
| DATEV2 | STRING |
| DATETIME | STRING |
| DATETIMEV2 | TIMESTAMP |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| CHAR / VARCHAR / STRING | STRING |
| DECIMAL | DECIMAL |
| STRUCT | STRUCT |
| MAP | MAP |
| ARRAY | ARRAY |
If you export data from the memory of ApsaraDB for SelectDB to Parquet files, the data is first mapped to the data types supported by Apache Arrow. The data is then written to Parquet files by Apache Arrow. The following table describes the data type mappings between ApsaraDB for SelectDB data and Apache Arrow data.
| Data type of ApsaraDB for SelectDB | Data type of Apache Arrow |
| --- | --- |
| BOOLEAN | BOOLEAN |
| TINYINT | INT8 |
| SMALLINT | INT16 |
| INT | INT32 |
| BIGINT | INT64 |
| LARGEINT | UTF8 |
| DATE | UTF8 |
| DATEV2 | UTF8 |
| DATETIME | UTF8 |
| DATETIMEV2 | UTF8 |
| FLOAT | FLOAT32 |
| DOUBLE | FLOAT64 |
| CHAR / VARCHAR / STRING | UTF8 |
| DECIMAL | DECIMAL128 |
| STRUCT | STRUCT |
| MAP | MAP |
| ARRAY | LIST |