You can migrate data from an Apache Doris database to an ApsaraDB for SelectDB instance using Object Storage Service (OSS) or catalogs. You can choose a migration method based on your business scenario. This topic describes how to migrate offline data from an Apache Doris database to an ApsaraDB for SelectDB instance and provides rules for method selection.
Method selection
Select a data migration method based on your business scenario. The following table describes the available methods and their use cases.
| Method | Scenario | Operation |
| --- | --- | --- |
| Export and import through OSS | Your data is stored on the Alibaba Cloud platform, and the data volume is large. | See Use an OSS bucket to migrate data. |
| Catalog | Your data is stored on the Alibaba Cloud platform, and the data volume is small. Note: This includes scenarios such as data in Alibaba Cloud EMR clusters. | See Use a catalog to migrate data. |
This topic describes how to use an OSS bucket or a catalog to migrate offline data from a Doris database to SelectDB.
Use an OSS bucket to migrate data
Prerequisites
OSS requirements
OSS is activated. For more information, see Activate OSS.
The RAM user that you use to access OSS has read and write permissions for OSS objects. For more information, see Permissions and access control.
Cluster requirements
A database account is created. For more information, see Account management.
Preparations
Create a bucket in the OSS console. For more information, see Create buckets in the console.
Make sure that the OSS bucket and your ApsaraDB for SelectDB instance are in the same region.
In this topic, the bucket is named `test-selectdb`.
Step 1: Export source data
Log on to the source cluster.
For more information about how to log on to a self-managed Doris database, see Connecting by MySQL Protocol - Apache Doris.
Optional: Prepare sample data.
If you already have data to migrate, you can skip this step.
Create a database.
Run the following statement to create a database.
```sql
CREATE DATABASE source_db;
```
Create a table.
Run the following statement to create a table.
```sql
CREATE TABLE IF NOT EXISTS source_tb (
    `c1` int(11) NULL,
    `c2` string NULL,
    `c3` bigint NULL
)
DISTRIBUTED BY HASH(c1) BUCKETS 20
PROPERTIES("replication_num" = "1");
```
Insert sample data.
```sql
INSERT INTO source_tb VALUES (1, 'doris', 18), (2, 'nereids', 20), (3, 'pipelibe', 99999), (4, 'Apache', 122123455), (5, null, null);
```
Verify the data.
Run the following statement to verify the sample data.
```sql
SELECT * FROM `source_tb` LIMIT 10;
```
The following result is returned.
```
+------+----------+-----------+
| c1   | c2       | c3        |
+------+----------+-----------+
|    1 | doris    |        18 |
|    3 | pipelibe |     99999 |
|    5 | NULL     |      NULL |
|    2 | nereids  |        20 |
|    4 | Apache   | 122123455 |
+------+----------+-----------+
```
Back up the `CREATE TABLE` statement for the table that contains the data to be migrated.
You can run the `SHOW CREATE TABLE` statement to view and back up the `CREATE TABLE` statement for the source table. The following example shows how to run the statement.
```sql
SHOW CREATE TABLE source_tb;
```
The following result is returned. Back up the statement in the `Create Table` column.
```sql
CREATE TABLE `source_tb` (
  `c1` int NULL,
  `c2` text NULL,
  `c3` bigint NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
DISTRIBUTED BY HASH(`c1`) BUCKETS 20
PROPERTIES (
  "file_cache_ttl_seconds" = "0",
  "is_being_synced" = "false",
  "storage_medium" = "hdd",
  "storage_format" = "V2",
  "inverted_index_storage_format" = "V2",
  "light_schema_change" = "true",
  "disable_auto_compaction" = "false",
  "enable_single_replica_compaction" = "false",
  "group_commit_interval_ms" = "10000",
  "group_commit_data_bytes" = "134217728"
);
```
Export the data to OSS.
Run the `EXPORT` command to export the data to OSS.
The following code shows the syntax.
```sql
EXPORT TABLE table_name
[PARTITION (p1[,p2])]
[WHERE]
TO export_path
[opt_properties]
WITH S3 [s3_properties];
```
The following table describes the parameters.
| Parameter | Required | Description |
| --- | --- | --- |
| table_name | Yes | The name of the table to export. |
| partition | No | The partitions of the table to export. |
| where | No | The table data that meets the specified conditions to export. |
| export_path | Yes | The path to the exported file. The value can be a directory or a combination of a directory and a file prefix. Example: `s3://path/to/my_file_`. |
| opt_properties | No | The properties used to specify export parameters. For more information, see opt_properties syntax and parameters. |
| s3_properties | Yes | The properties used to configure S3 protocol attributes. For more information, see s3_properties syntax and parameters. |
The following code shows the syntax of `opt_properties`.
```sql
[PROPERTIES ("key"="value", ...)]
```
The following table describes the parameters.
| Parameter | Required | Description |
| --- | --- | --- |
| label | No | The label of the export job. If you do not specify this parameter, the system randomly generates a label. |
| column_separator | No | The column delimiter for the exported data. Default value: `\t`. Multi-byte delimiters are supported. This parameter is valid only for the CSV file format. |
| line_delimiter | No | The row delimiter for the exported data. Default value: `\n`. Multi-byte delimiters are supported. This parameter is valid only for the CSV file format. |
| columns | No | The columns of the table to export. |
| format | No | The file format for the export job. Supported formats: parquet, orc, csv, csv_with_names, and csv_with_names_and_types. Default value: csv. |
| max_file_size | No | The maximum size of a single file in the export job. If the result exceeds this value, it is split into multiple files. Valid values: 5 MB to 2 GB. Default value: 1 GB. If the export file format is ORC (format is set to orc), the actual split size is `ceil(max_file_size/64) × 64` MB. For example, if max_file_size is 100 MB, files are split at ceil(100/64) × 64 = 128 MB. |
| parallelism | No | The concurrency of the export job. Default value: 1. The export job starts a number of threads equal to the parallelism value to execute `SELECT INTO OUTFILE` statements. If the parallelism value is greater than the number of tablets in the table, the system automatically sets the parallelism to the number of tablets, so that each `SELECT INTO OUTFILE` statement handles one tablet. |
| delete_existing_files | No | Specifies whether to delete all files in the destination path before the export. false (default): Does not delete files in the destination path. true: Deletes all files in the directory specified by export_path, and then exports data to that directory. For example, if "file_path" = "/user/tmp", all files and directories under "/user/" are deleted; if "file_path" = "/user/tmp/", all files and directories under "/user/tmp/" are deleted. Warning: Setting delete_existing_files = true is a risky operation. Use it only in test environments. To use this parameter, submit a ticket to Alibaba Cloud support. The SelectDB technical team adds the enable_delete_existing_files = true configuration to the fe.conf file and restarts the FE. The parameter takes effect only after the restart. |
| with_bom | No | Specifies whether to encode the exported file in UTF-8 with a byte order mark (BOM). Default value: false. This parameter is valid only for CSV-related file formats. |
| timeout | No | The timeout period of the export job. Unit: seconds. Default value: 7200 (2 hours). |
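As an illustration of these properties, the following sketch exports CSV files with custom delimiters. It reuses the sample table, bucket, and credential placeholders from this topic; the `test_csv/` prefix is hypothetical, and the `WITH S3` clause is described in the next section.
```sql
EXPORT TABLE source_tb
TO "s3://test-selectdb/test_csv/"    -- Hypothetical destination prefix in the sample bucket.
PROPERTIES (
    "format" = "csv_with_names",     -- CSV with a header row of column names.
    "column_separator" = ",",        -- Override the default \t column delimiter.
    "line_delimiter" = "\n"          -- State the default row delimiter explicitly.
)
WITH S3 (
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",
    "s3.region" = "oss-cn-hangzhou",
    "s3.access_key" = "LTAI****************",
    "s3.secret_key" = "yourAccessKeySecret"
);
```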
The following code shows the syntax of `s3_properties`.
("key"="value"[,...])The following table describes the parameters.
| Parameter | Required | Description |
| --- | --- | --- |
| s3.endpoint | Yes | The endpoint of the destination that uses the S3 protocol. This topic uses Alibaba Cloud OSS for migration, so this parameter is the endpoint used to access OSS data. For more information about how to obtain the endpoint, see OSS regions and endpoints. Important: Make sure that the OSS bucket and your ApsaraDB for SelectDB instance are in the same region. |
| s3.access_key | Yes | The user identity key for the destination that uses the S3 protocol. In this topic, this parameter is the AccessKey ID of the RAM user used to access OSS. |
| s3.secret_key | Yes | The user authentication string for the destination that uses the S3 protocol. In this topic, this parameter is the AccessKey secret of the RAM user used to access OSS. |
| s3.region | Yes | The region of the destination that uses the S3 protocol. In this topic, this parameter is the region of the Alibaba Cloud OSS bucket. For more information about how to obtain the region, see OSS regions and endpoints. |
| use_path_style | No | Specifies whether to force path-style access. Default value: false. The S3 SDK uses the virtual-hosted style by default. Some object storage systems may not support or enable virtual-hosted style access; in that case, set use_path_style to true to force the path style. Note: Three URI schemas are supported: http://, https://, and s3://. If you use http:// or https://, the system determines whether to use the path style based on the use_path_style parameter. If you use s3://, the system uses the virtual-hosted style. |
The following code provides an example.
```sql
EXPORT TABLE source_tb            -- Export data from the source_tb table.
TO "s3://test-selectdb/test/"     -- Export data to the test folder in the OSS bucket named test-selectdb.
PROPERTIES (
    "label" = "test_export",      -- The job label.
    "format" = "orc",             -- The file storage format is ORC.
    "max_file_size" = "2048MB",   -- The file split size.
    "parallelism" = "10"          -- The export concurrency is 10.
)
WITH s3 (
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",  -- The endpoint used to access the OSS bucket.
    "s3.region" = "oss-cn-hangzhou",                          -- The region of the OSS bucket.
    "s3.access_key" = "LTAI****************",                 -- The AccessKey ID of the RAM user for accessing OSS.
    "s3.secret_key" = "yourAccessKeySecret"                   -- The AccessKey secret of the RAM user for accessing OSS.
);
```
Verify that the data is exported.
You can use one of the following two methods to check the data export status.
Check the status by running a command.
In a client that is connected to Doris, run the following statement to check the export status. To filter the output down to a single job by label, see the sketch after this list.
```sql
SHOW EXPORT;
```
Export successful: If the `State` in the result is `FINISHED`, the data export is successful.
Export failed: If the `State` in the result is `CANCELLED`, the data export failed. You can check the `ErrorMsg` field to troubleshoot the failure.
Check the status in the OSS console.
Log on to the OSS console and check whether the file is generated in the specified export path.
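If many export jobs exist, you can filter the output by label instead of scanning the full list. A minimal sketch, assuming the `test_export` label from the export example above:
```sql
SHOW EXPORT WHERE LABEL = "test_export";
```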
Step 2: Import data to the destination table
Log on to the destination cluster. For more information, see Connect to an ApsaraDB for SelectDB instance using DMS.
Create the destination table.
Optional: Create a database.
If a destination database already exists, you can skip this step.
Run the following statement to create a database.
```sql
CREATE DATABASE aim_db;
```
Create a table.
Run the backed-up `CREATE TABLE` statement from the source table.
The following code shows the `CREATE TABLE` statement used in this topic.
```sql
CREATE TABLE IF NOT EXISTS aim_tb (
    `c1` int(11) NULL,
    `c2` string NULL,
    `c3` bigint NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
DISTRIBUTED BY HASH(c1) BUCKETS 20
PROPERTIES("replication_num" = "1");
```
Import the data.
Use S3 Load to import data from OSS to SelectDB. For more information about the `S3 Load` syntax and additional examples, see OSS Load.
The following code provides an example.
```sql
LOAD LABEL label_1                    -- label_1: a unique identifier of the import job.
(
    DATA INFILE("s3://test-selectdb/test/59ab2e9dc4ec4c04-9e50e45a6fda2c8e_0.orc")  -- The path of the data file exported to OSS in the previous step.
    INTO TABLE aim_tb                 -- The name of the table to import data into.
    FORMAT AS ORC                     -- The format of the imported file, which is the same as the format of the exported data.
)
WITH S3 (
    "AWS_PROVIDER" = "OSS",
    "AWS_REGION" = "oss-cn-hangzhou",                          -- The region of the OSS bucket.
    "AWS_ENDPOINT" = "oss-cn-hangzhou-internal.aliyuncs.com",  -- The endpoint used to access the OSS bucket.
    "AWS_ACCESS_KEY" = "LTAI****************",                 -- The AccessKey ID of the RAM user for accessing OSS.
    "AWS_SECRET_KEY" = "yourAccessKeySecret"                   -- The AccessKey secret of the RAM user for accessing OSS.
)
PROPERTIES (
    "timeout" = "3600"                -- The import timeout period, in seconds.
);
```
Verify the import.
You can use one of the following two methods to check the import result.
Check the status by running a command.
In a client that is connected to SelectDB, run the following statement to check the import status.
```sql
SHOW LOAD;
```
Import successful: If the `State` in the result is `FINISHED`, the data import is successful.
Check the data in the destination table. The following code provides an example statement.
```sql
SELECT * FROM `aim_tb` LIMIT 10;
```
The following result is returned.
```
+------+----------+-----------+
| c1   | c2       | c3        |
+------+----------+-----------+
|    1 | doris    |        18 |
|    3 | pipelibe |     99999 |
|    5 | NULL     |      NULL |
|    2 | nereids  |        20 |
|    4 | Apache   | 122123455 |
+------+----------+-----------+
```
The data is consistent with the source data from Step 1: Export source data. The import is successful.
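For tables too large to verify row by row, a row-count comparison is a quick completeness check. A minimal sketch; run the first statement on the source Doris instance and the second on the SelectDB instance, and confirm that the counts match:
```sql
-- On the source Doris instance:
SELECT COUNT(*) FROM source_db.source_tb;
-- On the destination SelectDB instance:
SELECT COUNT(*) FROM aim_db.aim_tb;
```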
Use a catalog to migrate data
Prerequisites
Ensure that the Doris instance and the SelectDB instance can communicate with each other.
The Doris instance and the SelectDB instance are in the same VPC. If they are not in the same VPC, you must first resolve the network connectivity issue. For more information, see How do I resolve network connectivity issues between a SelectDB instance and a data source?
The IP address of the Doris instance is added to the IP address whitelist of the SelectDB instance. For more information, see Configure a whitelist.
If the Doris instance has a whitelist, the IP address of the network segment where the SelectDB instance resides must be added to the whitelist of the Doris instance.
To obtain the IP address of the SelectDB instance in its VPC, see How do I view the IP addresses in the VPC to which my ApsaraDB SelectDB instance belongs?
To obtain the public IP address of the SelectDB instance, run the ping command against the public endpoint of the instance.
The version of the SelectDB instance cannot be earlier than the version of the Doris instance.
Note: SelectDB is a cloud-native real-time data warehouse built on Doris. For more information about the version relationship between SelectDB and Doris, see Kernel release notes. A quick way to check the version of each instance is sketched after this list.
Familiarize yourself with catalogs and their basic operations. For more information, see Data lakehouse.
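To compare the versions of the two instances, you can run a version query on each of them. A minimal check; the format of the returned value may vary by release:
```sql
-- Run on both the Doris instance and the SelectDB instance, then compare.
SHOW VARIABLES LIKE 'version_comment';
```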
Sample environment
This topic provides an example of how to migrate data from the `doris_t` table in the `doris_db` database of a Doris instance to the `test_doris2SelectDB` table in the `test_db` database of a SelectDB instance. When you perform the migration, modify the parameters based on your requirements. The following list describes the sample environment:
Destination database: `test_db`
Destination table: `test_doris2SelectDB`
Source database: `doris_db`
Source data table: `doris_t`
Source data preparation example
Log on to your source Doris database and run the following commands.
Create a database.
```sql
CREATE DATABASE doris_db;
```
Create a table.
```sql
CREATE TABLE doris_t (
    id int,
    name string,
    age int
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");
```
Insert data.
```sql
INSERT INTO doris_t VALUES (1, 'Alice', 25), (2, 'Bob', 30), (3, 'Charlie', 35), (4, 'David', 40), (5, 'Eve', 45);
```
Procedure
Connect to the SelectDB instance. For more information, see Connect to an ApsaraDB for SelectDB instance using a MySQL client.
Create a Doris JDBC catalog.
```sql
CREATE CATALOG doris_catalog PROPERTIES (
    "type" = "jdbc",
    "user" = "root",
    "password" = "123456",
    "jdbc_url" = "jdbc:mysql://127.0.0.1:9030/doris_db",
    "driver_url" = "mysql-connector-java-8.0.25.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver"
);
```
The following table describes the parameters.

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| user | Yes | None | The account of the Doris database. |
| password | Yes | None | The password of the Doris database account. |
| jdbc_url | Yes | None | The JDBC connection string, which must contain the connection address of the Doris database. Format: `jdbc:mysql://<host>:<port>/<database>`. host is the IP address of the Doris database, port is the port number of the Doris database, and database is the name of the database to access. Example: `jdbc:mysql://127.0.0.1:9030/doris_db`. |
| driver_url | Yes | None | The name of the JDBC driver JAR package. Note: We recommend that you use `mysql-connector-java-8.0.25.jar`. If you want to use other JAR packages, submit a ticket for consultation. |
| driver_class | Yes | None | The class name of the JDBC driver. We recommend that you set it to `com.mysql.cj.jdbc.Driver`. |
| lower_case_table_names | No | "false" | Specifies whether to synchronize the database and table names of the external JDBC data source in lowercase. Note: In version 4.0, this parameter is renamed lower_case_meta_names. true: Maintains a mapping from lowercase names to the actual names in the remote system, which lets you query databases and tables whose names are not all-lowercase; in this case, database, table, and column names are all converted to lowercase. false: You cannot query databases and tables whose names are not all-lowercase. Important: For SelectDB 3.0, if the lower_case_table_names parameter of the FE is set to 1 or 2, the lower_case_table_names parameter of the catalog must be set to true; if the FE parameter is set to 0, the catalog parameter can be true or false. For SelectDB 4.0, if the lower_case_table_names parameter of the FE is 0 or 2, database, table, and column names are not converted; if the FE parameter is 1, table names are converted to lowercase, but database and column names are not. |
| only_specified_database | No | "false" | Specifies whether to synchronize only the specified database. true: Synchronizes only the database specified in the JDBC URL. false: Synchronizes all databases of the JDBC data source. |
| include_database_list | No | "" | Takes effect when only_specified_database is true. Specifies the databases to synchronize, separated by commas (,). Database names are case-sensitive. |
| exclude_database_list | No | "" | Takes effect when only_specified_database is true. Specifies the databases not to synchronize, separated by commas (,). Database names are case-sensitive. |
| meta_names_mapping | No | "" | If the external data source has objects whose names differ only in case, such as DORIS and doris, an error is reported due to ambiguity when you query the catalog. In this case, configure the meta_names_mapping parameter to resolve the conflict. For more information, see Case sensitivity settings. Important: This parameter applies only to SelectDB 4.0. |
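For example, to synchronize only the database named in the JDBC URL instead of all databases of the data source, combine `only_specified_database` with the required connection properties. A sketch reusing the sample connection values above; `doris_catalog_filtered` is a hypothetical catalog name:
```sql
CREATE CATALOG doris_catalog_filtered PROPERTIES (
    "type" = "jdbc",
    "user" = "root",
    "password" = "123456",
    "jdbc_url" = "jdbc:mysql://127.0.0.1:9030/doris_db",  -- The database in the URL is the one to synchronize.
    "driver_url" = "mysql-connector-java-8.0.25.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver",
    "only_specified_database" = "true"
);
```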
View the catalog.
```sql
SHOW CATALOGS; -- Check whether the catalog was created successfully.
```
The following result is returned.
```
+--------------+---------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId    | CatalogName   | Type     | IsCurrent | CreateTime              | LastUpdateTime      | Comment                |
+--------------+---------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | doris_catalog | jdbc     |           | 2024-08-06 17:09:08.058 | 2024-07-19 18:04:37 |                        |
| 0            | internal      | internal | yes       | UNRECORDED              | NULL                | Doris internal catalog |
+--------------+---------------+----------+-----------+-------------------------+---------------------+------------------------+
```
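You can also confirm that the catalog reaches the source by listing the databases it synchronizes. A minimal check, assuming the catalog created above:
```sql
SHOW DATABASES FROM doris_catalog;
```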
Optional: Switch to the `doris_catalog` external catalog.
You can view and access data in the `doris_catalog` external catalog in the same way as you access data in an internal catalog.
Note: Currently, ApsaraDB for SelectDB supports only read operations on data in external catalogs.
```sql
SWITCH doris_catalog;
```
Optional: Switch to the internal catalog.
If you did not perform Step 4, you can skip this step.
```sql
SWITCH internal;
```
Optional: Create a database.
If the destination database already exists, you can skip this step.
```sql
CREATE DATABASE test_db;
```
Switch to the destination database.
```sql
USE test_db;
```
Create a table.
If a destination table already exists, check whether the column data types in the destination table correspond to the column data types in the Doris source table.
If a destination table does not exist, make sure that the column data types in the destination table correspond to the column data types in the Doris source table when you create the table.
For more information about data type mapping, see Type mapping.
```sql
CREATE TABLE test_doris2SelectDB (
    id int,
    name string,
    age int
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");
```
Migrate the data.
```sql
INSERT INTO test_doris2SelectDB SELECT * FROM doris_catalog.doris_db.doris_t;
```
Check the data import status.
```sql
SELECT * FROM test_doris2SelectDB;
```
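To double-check completeness, you can compare row counts across the two catalogs in the same session. A minimal sketch using the sample names in this topic; the two counts should match:
```sql
-- Source table, read through the external catalog.
SELECT COUNT(*) FROM doris_catalog.doris_db.doris_t;
-- Destination table in the internal catalog.
SELECT COUNT(*) FROM internal.test_db.test_doris2SelectDB;
```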
Notes on incremental data migration
In production environments, Doris data consists of offline data and incremental data. Because data is typically migrated from Doris to SelectDB for data warehousing and query acceleration, you can use one of the following two methods for incremental data migration:
When you write data to the Doris instance, write a copy of the data to the SelectDB instance in parallel.
Use periodic jobs to read partitioned data from the Doris instance and write the data to the SelectDB instance.
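For the second method, each periodic run can copy only the newly written partition through the catalog. A minimal sketch, assuming a source table partitioned by a date column named `dt`; the column, predicate, and table names are illustrative, not from the sample tables above:
```sql
-- Copy one day's partition from the Doris source into SelectDB.
-- Schedule this statement with your job scheduler, advancing the date each run.
INSERT INTO test_db.test_doris2SelectDB
SELECT * FROM doris_catalog.doris_db.doris_t
WHERE dt = '2024-08-06';
```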