To back up or migrate data of an ApsaraDB for SelectDB instance, you can use the SELECT INTO OUTFILE statement to export data from the SelectDB instance to files. The files can be stored in remote storage systems such as Object Storage Service (OSS) and Hadoop Distributed File System (HDFS) over the Amazon Simple Storage Service (Amazon S3) and HDFS protocols.
Overview
You can use the SELECT INTO OUTFILE statement to back up or migrate data of a SelectDB instance. The statement has the following characteristics:
This statement is executed in a synchronous manner.
After you execute the statement, a row that describes the status of the data export job is returned.
Because of this synchronous mechanism, the status of the export job is returned only after the statement finishes. If the connection is interrupted, the status of the export job is not returned, and you cannot determine whether all queried data is exported. To prevent this issue, we recommend that you specify the "success_file_name" = "SUCCESS" property when you create the data export job. This way, if the export job succeeds, an identifier file is generated in the directory in which the exported files are stored, and you can use this file to determine whether all queried data is exported.
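For example, the following statement exports a query result to an OSS bucket and generates an identifier file named SUCCESS in the target directory after the export succeeds. The table name, bucket, endpoint, and credentials are placeholders that you must replace with your own values.
SELECT * FROM tbl1
INTO OUTFILE "s3://oss-bucket/export/result_"
FORMAT AS CSV
PROPERTIES
(
"s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "cn-hangzhou",
"success_file_name" = "SUCCESS"
);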
This statement is executed as an SQL query.
By default, only one thread is used during the execution.
The export duration is calculated by using the following formula:
Total execution duration = Query execution duration + Time consumed to return the result set. If you want to export a large amount of data, configure the query_timeout session variable to extend the timeout period for the query. For more information about how to configure the session variable, see Variable management.
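For example, you can extend the query timeout of the current session before you run a large export. The value of 3600 seconds is only an illustration; choose a value that matches your data volume.
SET query_timeout = 3600;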
This statement exports the query results to files.
You can store the files in remote storage systems, such as OSS and HDFS, over the Amazon S3 and HDFS protocols.
This statement does not check whether the exported files or the file path exist.
This statement does not automatically create a path or overwrite existing files. Whether these operations are performed is determined by the remote storage system that you access over the Amazon S3 or HDFS protocol.
SelectDB does not manage the exported files. You must manage the files generated by successful data export jobs and the residual files generated by failed data export jobs in the remote storage systems.
Limits
This statement can export data only over the Amazon S3 and HDFS protocols.
Usage notes
The supported file formats vary based on the data types in the query result.
Basic data types: The CSV, CSV_with_names, CSV_with_names_and_types, Parquet, and Optimized Row Columnar (ORC) file formats are supported.
Complex data types:
Data of the ARRAY, MAP, and STRUCT types can be exported to files of the following formats: CSV, CSV_WITH_NAMES, CSV_WITH_NAMES_AND_TYPES, and ORC. A sample statement is provided after this list.
Nested data cannot be exported.
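The following sketch shows how a column of a complex type might be exported to ORC files, assuming a table named array_tbl that contains an ARRAY column named tags. The table name, column names, bucket, endpoint, and credentials are placeholders.
SELECT id, tags FROM array_tbl
INTO OUTFILE "s3://oss-bucket/export/array_result_"
FORMAT AS ORC
PROPERTIES
(
"s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "cn-hangzhou"
);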
Usage notes for result files:
If the result set is empty, a file is still generated.
File segmentation ensures that each segment file stores complete rows of data. Therefore, the size of a segment file may slightly differ from the value of the max_file_size parameter.
Usage notes for the result file content:
If the output of a function, such as a BITMAP or HyperLogLog (HLL) function, contains invisible characters, \N is returned. This indicates that the function returns a null value.
The output of some geospatial functions, such as ST_CIRCLE, ST_POINT, ST_POLYGON, ST_GEOMETRYFROMTEXT, ST_LINEFROMTEXT, and ST_GEOMETRYFROMWKB, is binary-encoded. We recommend that you wrap the output in the ST_AsText function, as shown in the sketch below.
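The following sketch converts a geometry value to its text representation before the value is exported, assuming a table named geo_tbl with lng and lat columns. The table name, column names, bucket, endpoint, and credentials are placeholders.
SELECT ST_AsText(ST_Point(lng, lat)) AS point_wkt FROM geo_tbl
INTO OUTFILE "s3://oss-bucket/export/geo_result_"
FORMAT AS CSV
PROPERTIES
(
"s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "cn-hangzhou"
);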
Syntax
query_stmt
INTO OUTFILE "file_path"
[format_as]
[properties]
Parameters
| Parameter | Required | Description |
| --- | --- | --- |
| query_stmt | Yes | The dataset that you want to export. You can use a SELECT statement to query the dataset. |
| file_path | Yes | The path in which the exported files are stored and the prefix of the exported file names. You can omit the file name prefix and specify only the file directory. If you omit the file name prefix, the exported files are named in the following format: Instance name_File sequence number + File name extension. |
| format_as | No | The format of the exported files. For the supported formats, see Usage notes. |
| properties | No | The properties related to the exported files and to the HDFS and Amazon S3 protocols. For more information, see Syntax of the properties object. Note: You can export data over the Amazon S3 or HDFS protocol. |
Syntax of the properties object
The following sample code shows the syntax of the properties object. You can specify properties related to the exported files, the HDFS protocol, and the Amazon S3 protocol.
[PROPERTIES ("key"="value", ...)]
File-related properties
| Parameter | Required | Description |
| --- | --- | --- |
| column_separator | No | The column delimiter, which is used only for files in the CSV format. |
| line_delimiter | No | The row delimiter, which is used only for files in the CSV format. |
| max_file_size | No | The maximum size of a single file. If the size of the query result exceeds this value, multiple segment files are generated. Valid values: [5 MB, 2 GB]. Default value: 1 GB. If you set format_as to ORC, the size of a segment file is calculated by using the following formula: ceil(max_file_size/64) × 64 MB. |
| delete_existing_files | No | Specifies whether to delete all files in the directory specified by the file_path parameter before data is exported. Valid values: false (default): does not delete the existing files and directly exports the data. true: deletes all files in the directory specified by the file_path parameter and then writes the exported data to the directory. For example, if you use the "file_path" = "/user/tmp" setting, all files and directories in the "/user/" path are deleted. If you use the "file_path" = "/user/tmp/" setting, all files and directories in the "/user/tmp/" path are deleted. Warning: Setting delete_existing_files = true is risky. We recommend that you perform this operation only in a test environment. To use the delete_existing_files parameter, submit a ticket to contact Alibaba Cloud technical support. SelectDB technical support adds the enable_delete_existing_files = true setting to the fe.conf file and restarts the frontend so that the delete_existing_files parameter takes effect. |
| file_suffix | No | The file name extension of the exported files. If this parameter is not specified, the default file name extension of the file format is used. |
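For example, a sketch that combines several file-related properties in an export to HDFS might look as follows. The HDFS host, port, path, and username, as well as the 500MB segment size, are placeholders chosen for illustration.
SELECT * FROM tbl1
INTO OUTFILE "hdfs://host:9000/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
"fs.defaultFS" = "hdfs://host:9000",
"hadoop.username" = "work",
"column_separator" = ",",
"line_delimiter" = "\n",
"max_file_size" = "500MB",
"file_suffix" = ".csv"
);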
HDFS-related properties
| Parameter | Required | Description |
| --- | --- | --- |
| fs.defaultFS | Yes | The NameNode endpoint and port. |
| hadoop.username | Yes | The username used to log on to HDFS. |
| dfs.nameservices | Yes | The name service name. The value of this parameter is the same as that in the hdfs-site.xml file. |
| dfs.ha.namenodes.[nameservice ID] | Yes | The NameNode IDs in the name service. The value of this parameter is the same as that in the hdfs-site.xml file. |
| dfs.namenode.rpc-address.[nameservice ID].[name node ID] | Yes | The Remote Procedure Call (RPC) URLs of the NameNodes. Each RPC URL corresponds to a NameNode. The value of this parameter is the same as that in the hdfs-site.xml file. |
| dfs.client.failover.proxy.provider.[nameservice ID] | Yes | The Java class that implements active NameNode connection for the HDFS client. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |

If Kerberos authentication is enabled for the Hadoop cluster, you must also specify the following properties.
| Parameter | Required | Description |
| --- | --- | --- |
| dfs.namenode.kerberos.principal | Yes | The Kerberos principal name of the HDFS NameNode. |
| hadoop.security.authentication | Yes | The authentication method. Set this parameter to kerberos to enable Kerberos authentication. |
| hadoop.kerberos.principal | Yes | The principal for Kerberos authentication. |
| hadoop.kerberos.keytab | Yes | The path to the Kerberos keytab file. |
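The following sketch shows how these properties might be combined when you export data to a Kerberos-enabled HDFS cluster with high availability. The name service, NameNode addresses, principals, and keytab path are placeholders that you must replace with the values of your own cluster.
SELECT * FROM tbl
INTO OUTFILE "hdfs:///path/to/result_"
FORMAT AS CSV
PROPERTIES
(
"fs.defaultFS" = "hdfs://hacluster/",
"dfs.nameservices" = "hacluster",
"dfs.ha.namenodes.hacluster" = "n1,n2",
"dfs.namenode.rpc-address.hacluster.n1" = "192.168.0.1:8020",
"dfs.namenode.rpc-address.hacluster.n2" = "192.168.0.2:8020",
"dfs.client.failover.proxy.provider.hacluster" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
"hadoop.security.authentication" = "kerberos",
"hadoop.kerberos.principal" = "selectdb@EXAMPLE.COM",
"hadoop.kerberos.keytab" = "/path/to/selectdb.keytab",
"dfs.namenode.kerberos.principal" = "nn/_HOST@EXAMPLE.COM"
);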
Amazon S3-related properties
The following table describes the properties that you can set for storage systems that support the Amazon S3 protocol. These storage systems include Amazon S3 and OSS.
| Parameter | Required | Description |
| --- | --- | --- |
| s3.endpoint | Yes | The endpoint of the destination storage system. |
| s3.access_key | Yes | The access key used to authenticate your requests to access the destination storage system. |
| s3.secret_key | Yes | The secret key used to authenticate your requests to access the destination storage system. |
| s3.region | Yes | The region in which the destination storage system resides. |
| s3.session_token | No | The session token used for temporary session credentials to access the destination storage system. This parameter is required only if temporary session authentication is enabled. |
| use_path_style | No | Specifies whether to access the destination storage system in the path style. Default value: false. By default, Amazon S3 SDKs use the virtual-hosted style to access the destination storage system. However, some object storage systems may not support the virtual-hosted style. In this case, you can set the use_path_style parameter to forcibly access the destination storage system in the path style. Note: The following URI schemas are supported: http://, https://, and s3://. If you use http:// or https://, the use_path_style parameter determines whether the destination storage system is accessed in the path style. If you use s3://, the destination storage system is accessed in the virtual-hosted style. |
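As an illustration of the use_path_style parameter, the following sketch exports data to an S3-compatible system that supports only path-style access over the https:// schema. The endpoint, bucket, region, and credentials are placeholders.
SELECT * FROM tbl1
INTO OUTFILE "https://s3.example.com/my-bucket/export/result_"
FORMAT AS CSV
PROPERTIES
(
"s3.endpoint" = "s3.example.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "my-region",
"use_path_style" = "true"
);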
Response parameters
The data is exported in a synchronous manner. When the results are returned, the statement execution is complete. A row in the returned result describes the outcome of the data export.
If the execution succeeds, the following information is returned:
SELECT * FROM tbl1 LIMIT 10 into outfile "file:///home/work/path/result_";
+------------+-----------+----------+---------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL                                                                 |
+------------+-----------+----------+---------------------------------------------------------------------+
| 1          | 2         | 8        | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
+------------+-----------+----------+---------------------------------------------------------------------+
1 row in set (0.05 sec)

The following table describes the response parameters.
| Parameter | Description |
| --- | --- |
| FileNumber | The number of generated files. |
| TotalRows | The number of rows in the result set. |
| FileSize | The total size of the exported files. Unit: bytes. |
| URL | The compute node to which the data is exported if the data is exported to a local disk. |
If parallel export is performed, multiple rows of data are returned.
+------------+-----------+----------+---------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL                                                                 |
+------------+-----------+----------+---------------------------------------------------------------------+
| 1          | 3         | 7        | file:///192.168.1.10/home/work/path/result_{fragment_instance_id}_ |
| 1          | 2         | 4        | file:///192.168.1.11/home/work/path/result_{fragment_instance_id}_ |
+------------+-----------+----------+---------------------------------------------------------------------+
2 rows in set (2.218 sec)

If the execution fails, an error message is returned. The following sample statement shows an example.

SELECT * FROM tbl INTO OUTFILE ...
ERROR 1064 (HY000): errCode = 2, detailMessage = ...
Parallel export
By default, the query result set is exported by a single thread on a single BE node, and the export is not parallel. Therefore, the export duration is positively correlated with the size of the result set.
You can use parallel export to speed up data export. To enable parallel export, perform the following steps:
1. Set the enable_parallel_outfile session variable to true.
2. Check whether the query result set supports parallel export.
After you enable parallel export, you can execute the EXPLAIN statement to check whether parallel export is supported. Syntax:
EXPLAIN <select_into_outfile>;
In the syntax, <select_into_outfile> specifies the SELECT INTO OUTFILE statement that you want to execute. For more information, see Syntax.
After you execute the EXPLAIN statement, SelectDB immediately returns a query plan. You must analyze the query plan and check whether parallel export is supported.
If RESULT FILE SINK is displayed in PLAN FRAGMENT 1, parallel export is supported.
If parallel export is supported, you can use the formula be_instance_num × parallel_fragment_exec_instance_num to calculate the number of requests that a single SelectDB instance can concurrently process.
be_instance_num specifies the number of nodes in a BE cluster.
If the cluster provides up to 32 CPU cores, the cluster contains only one node.
If the cluster provides more than 32 CPU cores, the cluster contains multiple nodes, and each node has 32 CPU cores.
parallel_fragment_exec_instance_num specifies the degree of parallelism.
You can execute the following statement to view the parameter:
SHOW VARIABLES LIKE '%parallel_fragment_exec_instance_num%';
You can execute the following statement to configure the parameter:
SET parallel_fragment_exec_instance_num = <parallel_fragment_exec_instance_num>;
If RESULT FILE SINK is displayed in PLAN FRAGMENT 0, parallel export is not supported.
In the following sample query plan of a SELECT INTO OUTFILE job, RESULT FILE SINK is displayed in PLAN FRAGMENT 1. In this case, parallel export is supported.
Sample parallel export plan:
+-----------------------------------------------------------------------------+
| Explain String                                                              |
+-----------------------------------------------------------------------------+
| PLAN FRAGMENT 0                                                             |
|  OUTPUT EXPRS:<slot 2> | <slot 3> | <slot 4> | <slot 5>                     |
|   PARTITION: UNPARTITIONED                                                  |
|                                                                             |
|   RESULT SINK                                                               |
|                                                                             |
|   1:EXCHANGE                                                                |
|                                                                             |
| PLAN FRAGMENT 1                                                             |
|  OUTPUT EXPRS:`k1` + `k2`                                                   |
|   PARTITION: HASH_PARTITIONED: `default_cluster:test`.`multi_tablet`.`k1`   |
|                                                                             |
|   RESULT FILE SINK                                                          |
|   FILE PATH: s3://ml-bd-repo/bpit_test/outfile_1951_                        |
|   STORAGE TYPE: S3                                                          |
|                                                                             |
|   0:OlapScanNode                                                            |
|      TABLE: multi_tablet                                                    |
+-----------------------------------------------------------------------------+
3. Create a data export job to export data based on the syntax.
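The following sketch, assuming a table named tbl1 and an OSS bucket named oss-bucket, illustrates the whole procedure. The table name, bucket, endpoint, and credentials are placeholders.
-- Step 1: Enable parallel export for the current session.
SET enable_parallel_outfile = true;
-- Step 2: Check the query plan. If RESULT FILE SINK appears in PLAN FRAGMENT 1, parallel export is supported.
EXPLAIN SELECT k1, k2 FROM tbl1
INTO OUTFILE "s3://oss-bucket/export/result_"
FORMAT AS CSV
PROPERTIES
(
"s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "cn-hangzhou"
);
-- Step 3: If parallel export is supported, remove the EXPLAIN keyword and execute the statement to export the data.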
Examples
Export data over the HDFS protocol
The following examples export the results of a simple query to CSV files.
When you export data over the HDFS protocol, the PROPERTIES values of the SELECT INTO OUTFILE statement vary based on whether high availability is enabled for the Hadoop cluster.
High availability is disabled for the Hadoop cluster
-- The default value of fileSystem_port is 9000.
SELECT * FROM tbl
INTO OUTFILE "hdfs://${host}:${fileSystem_port}/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
"fs.defaultFS" = "hdfs://ip:port",
"hadoop.username" = "work"
);

High availability is enabled for the Hadoop cluster
-- When high availability is enabled, the default value of fileSystem_port is 8020.
SELECT * FROM tbl
INTO OUTFILE "hdfs:///path/to/result_"
FORMAT AS CSV
PROPERTIES
(
'fs.defaultFS'='hdfs://hacluster/',
'dfs.nameservices'='hacluster',
'dfs.ha.namenodes.hacluster'='n1,n2',
'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);

If the total file size is less than or equal to 1 GB, a single file named result_0.csv is generated. If the total file size is greater than 1 GB, multiple segment files named result_0.csv, result_1.csv, and so on are generated.
Export data to a file over the Amazon S3 protocol
Export the query results of a UNION statement to a file. Store the exported data in the OSS bucket named oss-bucket that resides in the China (Hangzhou) region. Export the files in the Parquet format, for which you do not need to specify a column delimiter. Generate an identifier file after the data is exported. Sample code:
SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
INTO OUTFILE "s3://oss-bucket/result_"
FORMAT AS PARQUET
PROPERTIES
(
"s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "cn-hangzhou",
"success_file_name" = "SUCCESS"
);

If you export data over the Amazon S3 protocol, you must use http://, https://, or s3:// as the URI schema at the beginning of the URL of the destination storage system, regardless of the cloud service provider. Otherwise, the ERROR 1105 (HY000): errCode = 2, detailMessage = Unknown properties: [s3.region, s3.endpoint, s3.secret_key, s3.access_key] error message is displayed.
Export data to an OSS bucket over the Amazon S3 protocol
Enable parallel export and export data to an OSS bucket over the Amazon S3 protocol. Sample code:
SET enable_parallel_outfile = true;
SELECT k1 FROM tb1 LIMIT 1000
INTO outfile "s3://my_bucket/export/my_file_"
format AS csv
properties
(
"s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "cn-hangzhou"
);

In the following example, the query contains an ORDER BY k1 clause, which means that the result set must be returned in the specified order. Therefore, even though parallel export is enabled for this request, the data is not exported in parallel. Sample code:
SET enable_parallel_outfile = true;
SELECT k1 FROM tb1 ORDER BY k1 LIMIT 1000
INTO outfile "s3://my_bucket/export/my_file_"
format AS csv
properties
(
"s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
"s3.access_key" = "****",
"s3.secret_key" = "****",
"s3.region" = "cn-hangzhou"
);

Data type mappings
Parquet and ORC files support data types that differ from the data types of SelectDB. The data export feature of SelectDB automatically maps SelectDB data types to the corresponding data types that are supported by Parquet and ORC files.
The following table describes the data type mappings between SelectDB data and data in ORC files.
| Data type of ApsaraDB for SelectDB | Data type supported by ORC files |
| --- | --- |
| BOOLEAN | BOOLEAN |
| TINYINT | TINYINT |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
| LARGEINT | STRING |
| DATE | STRING |
| DATEV2 | STRING |
| DATETIME | STRING |
| DATETIMEV2 | TIMESTAMP |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| CHAR / VARCHAR / STRING | STRING |
| DECIMAL | DECIMAL |
| STRUCT | STRUCT |
| MAP | MAP |
| ARRAY | ARRAY |
If you export data from the memory of a SelectDB instance to Parquet files, the data is first mapped to the data types supported by Apache Arrow and then written to Parquet files by Apache Arrow. The following table describes the data type mappings between SelectDB and Apache Arrow data.
| Data type of ApsaraDB for SelectDB | Data type of Apache Arrow |
| --- | --- |
| BOOLEAN | BOOLEAN |
| TINYINT | INT8 |
| SMALLINT | INT16 |
| INT | INT32 |
| BIGINT | INT64 |
| LARGEINT | UTF8 |
| DATE | UTF8 |
| DATEV2 | UTF8 |
| DATETIME | UTF8 |
| DATETIMEV2 | UTF8 |
| FLOAT | FLOAT32 |
| DOUBLE | FLOAT64 |
| CHAR / VARCHAR / STRING | UTF8 |
| DECIMAL | DECIMAL128 |
| STRUCT | STRUCT |
| MAP | MAP |
| ARRAY | LIST |