This topic describes how to migrate offline data from a self-managed Apache Doris database to an ApsaraDB for SelectDB instance. Two methods are available: exporting and importing through Object Storage Service (OSS), and using a JDBC catalog.
Method comparison
Both methods require the Doris database and the SelectDB instance to be in the same region to use internal network traffic and avoid data transfer fees.
| Criteria | Export and import through OSS | Catalog |
|---|---|---|
| Best for | Large data volumes | Small data volumes |
| Scenario | Data stored on the Alibaba Cloud platform | Data stored on the Alibaba Cloud platform, including Alibaba Cloud EMR clusters |
| External components | Requires an OSS bucket | None |
| Advantage | Supports large-volume data migration | Simple setup with no external dependencies |
Migrate data through OSS
This method exports data from the source Doris database to an OSS bucket, then imports the data from OSS into ApsaraDB for SelectDB. The process involves five steps:
1. Activate OSS and create a bucket in the same region as the SelectDB instance.
2. Back up the source table schema.
3. Export source data to OSS.
4. Create the destination table in SelectDB.
5. Import data from OSS into SelectDB.
Prerequisites
OSS:
OSS is activated. For more information, see Activate OSS.
The RAM user that accesses OSS has read and write permissions for OSS objects. For more information, see Permissions and access control.
Cluster:
A database account is created. For more information, see Account management.
OSS bucket:
Create a bucket in the OSS console. For more information, see Create buckets in the console.
The OSS bucket and the ApsaraDB for SelectDB instance must be in the same region.
In the examples in this topic, the bucket is named test-selectdb.
Step 1: Export source data
Run on the self-managed Doris database.
Log on to the source cluster. For more information about how to connect to a self-managed Doris database, see Connecting by MySQL Protocol - Apache Doris.
(Optional) Prepare sample data. Skip this step if you already have data to migrate.

```sql
CREATE DATABASE source_db;
USE source_db;
CREATE TABLE IF NOT EXISTS source_tb (
    `c1` int(11) NULL,
    `c2` string NULL,
    `c3` bigint NULL
) DISTRIBUTED BY HASH(c1) BUCKETS 20
PROPERTIES("replication_num" = "1");

INSERT INTO source_tb VALUES (1, 'doris', 18), (2, 'nereids', 20), (3, 'pipelibe', 99999), (4, 'Apache', 122123455), (5, null, null);
```

Verify the data.

```sql
SELECT * FROM `source_tb` LIMIT 10;
```

Expected result:

```
+------+----------+-----------+
| c1   | c2       | c3        |
+------+----------+-----------+
|    1 | doris    |        18 |
|    3 | pipelibe |     99999 |
|    5 |          |           |
|    2 | nereids  |        20 |
|    4 | Apache   | 122123455 |
+------+----------+-----------+
```

Back up the CREATE TABLE statement for the source table. Run the `SHOW CREATE TABLE` statement to view and save the table schema. Example:

```sql
SHOW CREATE TABLE source_tb;
```

Expected result (the Create Table column contains the statement to save):

```sql
CREATE TABLE `source_tb` (
  `c1` int NULL,
  `c2` text NULL,
  `c3` bigint NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
DISTRIBUTED BY HASH(`c1`) BUCKETS 20
PROPERTIES (
  "file_cache_ttl_seconds" = "0",
  "is_being_synced" = "false",
  "storage_medium" = "hdd",
  "storage_format" = "V2",
  "inverted_index_storage_format" = "V2",
  "light_schema_change" = "true",
  "disable_auto_compaction" = "false",
  "enable_single_replica_compaction" = "false",
  "group_commit_interval_ms" = "10000",
  "group_commit_data_bytes" = "134217728"
);
```

Export data to OSS. Run the `EXPORT` command to export data to OSS.

Syntax:

```sql
EXPORT TABLE table_name
[PARTITION (p1[,p2])]
[WHERE]
TO export_path
[opt_properties]
WITH S3 [s3_properties];
```

Main parameters:

| Parameter | Required | Description |
|---|---|---|
| table_name | Yes | The name of the table to export. |
| partition | No | The partitions to export. |
| where | No | Filter conditions for the data to export. |
| export_path | Yes | The destination file path. Can be a directory or a directory with a file prefix. Example: s3://path/to/my_file_. |
| opt_properties | No | Export parameters. See the opt_properties table below. |
| s3_properties | Yes | S3 protocol configuration. See the s3_properties table below. |

opt_properties parameters. Syntax: `[PROPERTIES ("key"="value", ...)]`

| Parameter | Required | Default | Description |
|---|---|---|---|
| label | No | Randomly generated | The label of the export job. |
| column_separator | No | \t | Column delimiter. Supports multi-byte delimiters. Valid only for CSV format. |
| line_delimiter | No | \n | Row delimiter. Supports multi-byte delimiters. Valid only for CSV format. |
| columns | No | All columns | The columns to export. |
| format | No | csv | File format. Supported values: parquet, orc, csv, csv_with_names, csv_with_names_and_types. |
| max_file_size | No | 1 GB | Maximum size of a single output file. Range: 5 MB to 2 GB. For ORC format, the actual split size is ceil(max_file_size/64) x 64 MB. |
| parallelism | No | 1 | Number of concurrent threads. Each thread runs a SELECT INTO OUTFILE statement. If the value exceeds the tablet count, the system caps it at the tablet count. |
| delete_existing_files | No | false | Whether to delete all files in the destination path before export. **Warning:** This is a destructive operation. Use only in test environments. To enable this parameter, submit a ticket to Alibaba Cloud support. The SelectDB technical team adds `enable_delete_existing_files = true` to fe.conf and restarts FE. |
| with_bom | No | false | Whether to add a byte order mark (BOM) to the exported file. Applies to UTF-8 encoding. Valid only for CSV-related formats. |
| timeout | No | 2 hours | Export timeout in seconds. |

s3_properties parameters. Syntax: `("key"="value"[,...])`

| Parameter | Required | Description |
|---|---|---|
| s3.endpoint | Yes | The endpoint for the OSS bucket. The OSS bucket and the SelectDB instance must be in the same region. For more information, see OSS regions and endpoints. |
| s3.access_key | Yes | The AccessKey ID of the RAM user for accessing OSS. |
| s3.secret_key | Yes | The AccessKey secret of the RAM user for accessing OSS. |
| s3.region | Yes | The region of the OSS bucket. For more information, see OSS regions and endpoints. |
| use_path_style | No | Default: false. The S3 SDK uses virtual-hosted style by default. Set to true to force path-style access. Three URI schemas are supported: http://, https://, and s3://. For http:// or https://, the system determines the access style based on the use_path_style parameter. For s3://, the system always uses virtual-hosted style. |

Example:

```sql
EXPORT TABLE source_tb                 -- Export data from the source_tb table.
TO "s3://test-selectdb/test/"          -- Export data to the test folder in the OSS bucket named test-selectdb.
PROPERTIES (
    "label" = "test_export",           -- The job label.
    "format" = "orc",                  -- The file storage format is ORC.
    "max_file_size" = "2048MB",        -- The file split size.
    "parallelism" = "10"               -- The export concurrency is 10.
)
WITH s3 (
    "s3.endpoint" = "oss-cn-hangzhou-internal.aliyuncs.com",  -- The endpoint for accessing the OSS bucket.
    "s3.region" = "oss-cn-hangzhou",                          -- The region of the OSS bucket.
    "s3.secret_key" = "yourAccessKeySecret",                  -- The AccessKey secret of the RAM user for accessing OSS.
    "s3.access_key" = "LTAI****************"                  -- The AccessKey ID of the RAM user for accessing OSS.
);
```

Verify the export. Check the export status by using one of the following methods:
Run a command: In a client connected to the Doris database, run the following statement:

```sql
SHOW EXPORT;
```

If the `State` column shows `FINISHED`, the export succeeded. If the `State` column shows `CANCELLED`, the export failed. Check the `ErrorMsg` field to troubleshoot.
Check the OSS console: Log on to the OSS console and verify that files exist in the specified export path.
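When scripting the migration, you can poll the export state instead of checking it manually. The following Python sketch is a minimal, hypothetical helper: the polling logic is pure and takes any callable that returns the current `State` value of the export job, for example a function that runs `SHOW EXPORT` over a MySQL-protocol connection to the Doris FE (connection code omitted, since the driver choice is up to you).

```python
import time

def wait_for_export(fetch_state, interval_s=5, max_attempts=60):
    """Poll an export job until it finishes or is cancelled.

    fetch_state: a callable returning the current State value of the
    export job (e.g. read from the State column of SHOW EXPORT).
    Returns "FINISHED" on success; raises RuntimeError on CANCELLED
    and TimeoutError if the job does not finish in time.
    """
    for _ in range(max_attempts):
        state = fetch_state()
        if state == "FINISHED":
            return state
        if state == "CANCELLED":
            # The ErrorMsg column of SHOW EXPORT explains the failure.
            raise RuntimeError("export failed; check ErrorMsg in SHOW EXPORT")
        time.sleep(interval_s)
    raise TimeoutError("export did not finish in time")
```

In a real job, `fetch_state` would execute `SHOW EXPORT` and read the latest row; here the helper is decoupled from the driver so the retry logic can be tested on its own.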
Step 2: Import data to the destination table
Run on ApsaraDB for SelectDB.
Connect to the SelectDB instance. For more information, see Connect to an ApsaraDB for SelectDB instance using DMS.
(Optional) Create a destination database. Skip this step if the destination database already exists.
```sql
CREATE DATABASE aim_db;
```

Create the destination table. Run the backed-up CREATE TABLE statement from the source table. The following example shows the statement used in this topic:

```sql
USE aim_db;
CREATE TABLE IF NOT EXISTS aim_tb (
    `c1` int(11) NULL,
    `c2` string NULL,
    `c3` bigint NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
DISTRIBUTED BY HASH(c1) BUCKETS 20
PROPERTIES("replication_num" = "1");
```

Import the data. Use S3 Load to import data from OSS into SelectDB. For more information about the S3 Load syntax and additional examples, see OSS Load.
```sql
LOAD LABEL label_1                    -- label: a unique identifier for the import job.
(
    DATA INFILE("s3://test-selectdb/test/59ab2e9dc4ec4c04-9e50e45a6fda2c8e_0.orc")  -- The path of the exported data file in OSS from the previous step.
    INTO TABLE aim_tb                 -- The name of the table to import data into.
    FORMAT AS ORC                     -- The format of the imported file, which is the same as the format of the exported data.
)
WITH S3 (
    "AWS_PROVIDER" = "OSS",
    "AWS_REGION" = "oss-cn-hangzhou",                          -- The region of the OSS bucket.
    "AWS_ENDPOINT" = "oss-cn-hangzhou-internal.aliyuncs.com",  -- The endpoint for accessing the OSS bucket.
    "AWS_ACCESS_KEY" = "LTAI****************",                 -- The AccessKey ID of the RAM user for accessing OSS.
    "AWS_SECRET_KEY" = "yourAccessKeySecret"                   -- The AccessKey secret of the RAM user for accessing OSS.
)
PROPERTIES (
    "timeout" = "3600"                -- The import timeout period, in seconds.
);
```

Verify the import. Check the import result by using one of the following methods:
Run a command: In a client connected to SelectDB, run the following statement:

```sql
SHOW LOAD;
```

If the `State` column shows `FINISHED`, the import succeeded.

Query the destination table: Compare the destination data against the source data.

```sql
SELECT * FROM aim_tb LIMIT 10;
```

Expected result:

```
+------+----------+-----------+
| c1   | c2       | c3        |
+------+----------+-----------+
|    1 | doris    |        18 |
|    3 | pipelibe |     99999 |
|    5 |          |           |
|    2 | nereids  |        20 |
|    4 | Apache   | 122123455 |
+------+----------+-----------+
```

The data matches the source data from Step 1. The import is successful.
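For larger tables, eyeballing a `LIMIT 10` result is not enough. A minimal consistency check is to compare an order-insensitive view of the same query run against both sides. The sketch below is pure comparison logic and a hypothetical helper, not part of SelectDB; the rows would come from running the same SELECT against the source Doris table and the destination SelectDB table (connection code omitted).

```python
def rows_match(source_rows, dest_rows):
    """Order-insensitive comparison of two result sets.

    Doris and SelectDB may return rows in different orders, so sort
    both sides by a string key before comparing the actual tuples.
    """
    key = lambda r: tuple("" if v is None else str(v) for v in r)
    return sorted(source_rows, key=key) == sorted(dest_rows, key=key)

source = [(1, "doris", 18), (3, "pipelibe", 99999), (5, None, None)]
dest   = [(5, None, None), (1, "doris", 18), (3, "pipelibe", 99999)]
print(rows_match(source, dest))  # True: same rows, different order
```

For very large tables, comparing `COUNT(*)` and per-column aggregates (for example `SUM` of a numeric column) on both sides is a cheaper first pass before any row-level check.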
Migrate data by using a catalog
This method uses a JDBC catalog to read data from the source Doris database and write it to ApsaraDB for SelectDB. The process involves four steps:
1. Configure network connectivity between the Doris instance and the SelectDB instance.
2. Create a JDBC catalog in SelectDB that points to the source Doris database.
3. Create the destination table in SelectDB.
4. Run an INSERT INTO ... SELECT statement to migrate the data.
Prerequisites
Network connectivity:
The Doris instance and the SelectDB instance are in the same VPC. If they are not in the same VPC, resolve the connectivity issue first. For more information, see How do I resolve network connectivity issues between a SelectDB instance and a data source?
The IP address of the Doris instance is added to the IP address whitelist of the SelectDB instance. For more information, see Configure a whitelist.
If the Doris instance has a whitelist, add the IP address of the network segment where the SelectDB instance resides to the whitelist of the Doris instance.
To get the VPC IP address of the SelectDB instance, see How do I view the IP addresses in the VPC to which my ApsaraDB SelectDB instance belongs?
To get the public IP address of the SelectDB instance, run the `ping` command against the public endpoint of the SelectDB instance.
Version compatibility:
The version of the SelectDB instance must not be earlier than the version of the Doris instance.
SelectDB is a cloud-native real-time data warehouse built on Apache Doris. For more information about version mapping, see Kernel release notes.
Catalog knowledge:
Familiarize yourself with catalogs and their basic operations. For more information, see Data lakehouse.
Sample environment
This topic uses the following environment. Replace the values with your own when performing the migration.
| Role | Database | Table |
|---|---|---|
| Source (Doris) | doris_db | doris_t |
| Destination (SelectDB) | test_db | test_doris2SelectDB |
(Optional) Prepare source data
Run on the self-managed Doris database.
Log on to the source Doris database and run the following commands.
Create a database.
```sql
CREATE DATABASE doris_db;
```

Create a table.

```sql
USE doris_db;
CREATE TABLE doris_t (
    id int,
    name string,
    age int
) DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");
```

Insert data.

```sql
INSERT INTO doris_t VALUES (1, 'Alice', 25), (2, 'Bob', 30), (3, 'Charlie', 35), (4, 'David', 40), (5, 'Eve', 45);
```
Procedure
Run on ApsaraDB for SelectDB.
Connect to the SelectDB instance. For more information, see Connect to an ApsaraDB for SelectDB instance using a MySQL client.
Create a Doris JDBC catalog.
JDBC catalog parameters:
| Parameter | Required | Default | Description |
|---|---|---|---|
| user | Yes | None | The account of the Doris database. |
| password | Yes | None | The password of the Doris database account. |
| jdbc_url | Yes | None | The JDBC connection string. Format: jdbc:mysql://&lt;host&gt;:&lt;port&gt;/&lt;database&gt;. host: the IP address of the Doris database. port: the port number. database: the name of the database. Example: jdbc:mysql://127.0.0.1:9030/doris_db. |
| driver_url | Yes | None | The name of the JDBC driver JAR package. We recommend mysql-connector-java-8.0.25.jar. To use other JAR packages, submit a ticket for consultation. |
| driver_class | Yes | None | The class name of the JDBC driver. We recommend com.mysql.cj.jdbc.Driver. |
| lower_case_table_names | No | "false" | Whether to synchronize database and table names in lowercase. true: converts names to lowercase and maintains a mapping to the actual remote names. false: does not query databases and tables that are not in lowercase. For SelectDB 3.0: if the FE lower_case_table_names parameter is 1 or 2, set this catalog parameter to true; if the FE parameter is 0, either true or false is acceptable. For SelectDB 4.0: this parameter is renamed to lower_case_meta_names. If the FE parameter is 0 or 2, names are not converted. If the FE parameter is 1, only table names are converted to lowercase. |
| only_specified_database | No | "false" | Whether to synchronize only the database specified in the JDBC URL. true: synchronizes only the specified database. false: synchronizes all databases. |
| include_database_list | No | "" | When only_specified_database=true, specifies additional databases to synchronize, separated by commas. Case-sensitive. |
| exclude_database_list | No | "" | When only_specified_database=true, specifies databases to exclude from synchronization, separated by commas. Case-sensitive. |
| meta_names_mapping | No | "" | Resolves conflicts when external data source objects differ only in case (for example, DORIS and doris). For more information, see Case sensitivity settings. Applicable to SelectDB 4.0 only. |
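If you create catalogs for several environments (for example, test and production Doris sources), it can help to generate the CREATE CATALOG statement from a dict of the parameters described above. This is a hypothetical helper, not part of SelectDB; it only assembles the SQL text, which you then run in a SelectDB client.

```python
def build_create_catalog(name, props):
    """Assemble a CREATE CATALOG statement from a dict of the JDBC
    catalog parameters described above (user, password, jdbc_url, ...)."""
    lines = ",\n".join(f'    "{k}" = "{v}"' for k, v in props.items())
    return f"CREATE CATALOG {name} PROPERTIES (\n{lines}\n);"

stmt = build_create_catalog("doris_catalog", {
    "type": "jdbc",
    "user": "root",
    "jdbc_url": "jdbc:mysql://127.0.0.1:9030/doris_db",
    "driver_url": "mysql-connector-java-8.0.25.jar",
    "driver_class": "com.mysql.cj.jdbc.Driver",
})
print(stmt.splitlines()[0])  # CREATE CATALOG doris_catalog PROPERTIES (
```

Keeping the per-environment differences (host, credentials) in one dict per environment reduces copy-paste errors in the connection string.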
Example:

```sql
CREATE CATALOG doris_catalog PROPERTIES (
    "type" = "jdbc",
    "user" = "root",
    "password" = "123456",
    "jdbc_url" = "jdbc:mysql://127.0.0.1:9030/doris_db",
    "driver_url" = "mysql-connector-java-8.0.25.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver"
);
```

Verify the catalog.

```sql
SHOW CATALOGS; -- Check whether the catalog was created successfully.
```

Expected result:

```
+--------------+---------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId    | CatalogName   | Type     | IsCurrent | CreateTime              | LastUpdateTime      | Comment                |
+--------------+---------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | doris_catalog | jdbc     |           | 2024-08-06 17:09:08.058 | 2024-07-19 18:04:37 |                        |
| 0            | internal      | internal | yes       | UNRECORDED              | NULL                | Doris internal catalog |
+--------------+---------------+----------+-----------+-------------------------+---------------------+------------------------+
```

(Optional) Switch to the `doris_catalog` external catalog. You can view and access data in the external catalog the same way as the internal catalog.

Note: ApsaraDB for SelectDB supports only read operations on data in external catalogs.

```sql
SWITCH doris_catalog;
```

(Optional) Switch back to the internal catalog. Skip this step if you did not perform step 4.

```sql
SWITCH internal;
```

(Optional) Create the destination database. Skip this step if the destination database already exists.

```sql
CREATE DATABASE test_db;
```

Switch to the destination database.

```sql
USE test_db;
```

Create the destination table. Make sure the column data types in the destination table match the source Doris table. For more information about data type mapping, see Type mapping.

```sql
CREATE TABLE test_doris2SelectDB (
    id int,
    name string,
    age int
) DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");
```

Migrate the data.

```sql
INSERT INTO test_doris2SelectDB SELECT * FROM doris_catalog.doris_db.doris_t;
```

Verify the migration.

```sql
SELECT * FROM test_doris2SelectDB;
```

Expected result:

```
+------+---------+------+
| id   | name    | age  |
+------+---------+------+
|    1 | Alice   |   25 |
|    2 | Bob     |   30 |
|    3 | Charlie |   35 |
|    4 | David   |   40 |
|    5 | Eve     |   45 |
+------+---------+------+
```
Incremental data migration
In production environments, Doris data consists of offline data and incremental data. Because data is typically migrated from Doris to SelectDB for data warehousing and query acceleration, use one of the following methods for incremental data migration:
Dual write: When writing data to the Doris instance, write a copy of the data to the SelectDB instance in parallel.
Periodic sync: Use periodic jobs to read partitioned data from the Doris instance and write the data to the SelectDB instance.
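The periodic-sync method typically boils down to generating an INSERT INTO ... SELECT over the catalog with a filter that selects only the new partition or time window. The following is a minimal sketch, assuming the source table has a date column named `dt` (hypothetical; the sample `doris_t` table above does not have one) and the same catalog, database, and table names as in this topic.

```python
from datetime import date, timedelta

def incremental_sync_sql(dest_table, src_path, date_col, day):
    """Build the statement a periodic job would run to copy one day's
    data from the Doris catalog into the SelectDB table."""
    return (
        f"INSERT INTO {dest_table} "
        f"SELECT * FROM {src_path} "
        f"WHERE {date_col} = '{day.isoformat()}';"
    )

# A daily job would sync yesterday's partition:
yesterday = date.today() - timedelta(days=1)
print(incremental_sync_sql(
    "test_doris2SelectDB",
    "doris_catalog.doris_db.doris_t",
    "dt",                 # hypothetical partition/date column
    yesterday,
))
```

A scheduler (cron, or your orchestration tool of choice) would run the generated statement against the SelectDB instance; because the catalog is read-only from SelectDB's side, the write always lands in the internal destination table.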