This topic describes how to import data from Object Storage Service (OSS) to ApsaraDB for ClickHouse.
Prerequisites
OSS is activated. For more information about how to activate OSS, see Activate OSS.
An OSS bucket is created in the region where the ApsaraDB for ClickHouse cluster is deployed. For more information about how to create an OSS bucket, see Create buckets.
The Alibaba Cloud account that you use to access the OSS bucket is granted the required permissions to read objects in the bucket. For more information about how to grant permissions, see Overview.
Precautions
Make sure that your OSS bucket and your ApsaraDB for ClickHouse cluster are deployed in the same region, and that you can access OSS over a virtual private cloud (VPC).
Step 1: Upload test data to OSS
Prepare test data. Store the following test data in a CSV file named test.csv.
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
3,xiao,23,shenzhen,http://example.com
4,jess,45,hangzhou,http://example.com
5,jack,14,shanghai,http://example.com
6,tomy,25,hangzhou,http://example.com
7,lucy,45,shanghai,http://example.com
8,tengyin,26,shanghai,http://example.com
9,wangli,27,shenzhen,http://example.com
10,xiaohua,37,shanghai,http://example.com
Log on to the OSS console.
In the left-side navigation pane, click Buckets.
On the Buckets page, find the OSS bucket that you want to manage and click the bucket name.
In the left-side navigation pane, click Files. On the page that appears, click Upload.
On the Upload page, upload the CSV file named test.csv.
Note: For more information about parameter settings, see Upload objects.
Step 2: Create a table in ApsaraDB for ClickHouse
Before you import data from OSS, you must create a table in ApsaraDB for ClickHouse to store the data.
Log on to the ApsaraDB for ClickHouse console.
On the Clusters page, click the Default Instances tab, and click the ID of the cluster that you want to manage.
In the upper-right corner of the cluster details page, click Log On to Database.
In the Log On to Database Instance dialog box, specify the Database Account and Database Password parameters, and then click Login.
Create a local table.
Note: The schema of the ApsaraDB for ClickHouse table must correspond to the schema of the CSV file that contains the data in OSS.
Execute the statement that corresponds to the edition of your cluster. To view the edition, click Cluster Information in the left-side navigation pane and check the Cluster Properties section.
Statement used to create a local table for a cluster of the Single-replica Edition
create table oss_test_tbl_local on cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
engine = MergeTree()
order by id;
Statement used to create a local table for a cluster of the Double-replica Edition
create table oss_test_tbl_local on cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
engine = ReplicatedMergeTree('/clickhouse/tables/{database}/{table}/{shard}', '{replica}')
order by id;
Create a distributed table.
Note: If you want to import the data from OSS only to the local table, skip this step.
create table oss_test_tbl_distributed on cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
engine = Distributed(default, default, oss_test_tbl_local, rand());
Step 3: Import data from OSS to ApsaraDB for ClickHouse
You can use a table engine or a table function to import data from OSS to ApsaraDB for ClickHouse.
Method 1: Use a table engine to import data from OSS
Create an OSS external table.
Note: Execute the statement that corresponds to the kernel version of your cluster. To view the version, click Cluster Information in the left-side navigation pane and check the Cluster Properties section.
Syntax used to create an OSS external table for a cluster of version 21.8 and earlier
CREATE TABLE <table_name> [on cluster default]
(
    <col_name1> <col_type1>,
    <col_name2> <col_type2>,
    ...
)
ENGINE = OSS('<oss-endpoint>', '<access-key-id>', '<access-key-secret>', '<oss-file-path>', '<file-format-name>');
Syntax used to create an OSS external table for a cluster of version 22.8
CREATE TABLE <table_name> [on cluster default]
(
    <col_name1> <col_type1>,
    <col_name2> <col_type2>,
    ...
)
ENGINE = OSS('https://<BucketName>.<oss-endpoint>/<file-name>', '<access-key-id>', '<access-key-secret>', '<file-format-name>');
The following table describes the parameters.
Parameter
Description
table_name
The name of the table.
col_name1, col_name2
The names of the columns.
col_type1, col_type2
The data types of the columns.
Important: The schema of the OSS external table must correspond to the schema of the CSV file that contains the data in OSS.
BucketName
The name of the bucket.
oss-endpoint
The internal endpoint of OSS for the region in which the bucket is deployed. ApsaraDB for ClickHouse accesses OSS over the internal network of the VPC, so do not use the public endpoint. For more information, see Regions and endpoints.
Important: Make sure that the OSS bucket is deployed in the same region as the ApsaraDB for ClickHouse cluster.
file-name
The name of the file.
access-key-id
The AccessKey ID that you want to use to access the data in OSS.
access-key-secret
The AccessKey secret that you want to use to access the data in OSS.
oss-file-path
The storage path of the CSV file that you uploaded. The path is in the oss://<bucket-name>/<path-to-file> format.
Note: You can set the oss-file-path parameter to a value that includes wildcard characters to perform a fuzzy match. For more information, see Use wildcard characters to perform fuzzy match for storage paths in OSS in this topic.
file-format-name
The format of the file. In this topic, the CSV format is used.
Sample statements:
Sample statement used to create an OSS external table for a cluster of version 21.8 and earlier
CREATE TABLE oss_test_tbl on cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
ENGINE = OSS('oss-cn-shanghai-internal.aliyuncs.com', 'LTAI5tDVcUKu2CbcBwhr****', 'WAcroHBcL7IVOK8FIHzNJy91Lc****', 'oss://testBucketName/test.csv', 'CSV');
Sample statement used to create an OSS external table for a cluster of version 22.8
CREATE TABLE oss_test_tbl on cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
ENGINE = OSS('http://testBucketName.oss-cn-shanghai-internal.aliyuncs.com/test.csv', 'LTAI5tDVcUKu2CbcBwhr****', 'WAcroHBcL7IVOK8FIHzNJy91Lc****', 'CSV');
Import data from the OSS external table to the distributed table created in Step 2.
insert into oss_test_tbl_distributed select * from oss_test_tbl;
By default, a comma (,) is used as the column delimiter of a CSV file. If your CSV file uses a different column delimiter, you must use the format_csv_delimiter setting to specify the delimiter. For example, if a vertical bar (|) is used as the column delimiter of your CSV file, execute the following statement:
insert into oss_test_tbl_distributed select * from oss('<oss-endpoint>', '<access-key-id>', '<access-key-secret>', '<oss-file-path>', '<file-format-name>', '<col_name> <col_type>(,...)') settings format_csv_delimiter='|';
Method 2: Use a table function to import data from OSS
Execute the statement that corresponds to the kernel version of your cluster to import data by using the oss table function. To view the version, click Cluster Information in the left-side navigation pane and check the Cluster Properties section.
Syntax used to import data from OSS to a cluster of version 21.8 and earlier
insert into oss_test_tbl_distributed select * from oss('<oss-endpoint>', '<access-key-id>', '<access-key-secret>', '<oss-file-path>', '<file-format-name>', '<col_name> <col_type>(,...)');
Syntax used to import data from OSS to a cluster of version 22.8
insert into oss_test_tbl_distributed select * from oss('https://<BucketName>.<oss-endpoint>/<file-name>', '<access-key-id>', '<access-key-secret>', '<file-format-name>', '<col_name> <col_type>(,...)');
For more information about the parameters, see the parameter description table in Method 1.
Sample statements:
Sample statement used to query the data in OSS for a cluster of version 21.8 and earlier
SELECT * FROM oss('oss-cn-shanghai-internal.aliyuncs.com', 'LTAI5tDVcUKu2CbcBwhr****', 'WAcroHBcL7IVOK8FIHzNJy91Lc****', 'oss://testBucketName/test.csv', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
Sample statement used to query the data in OSS for a cluster of version 22.8
SELECT * FROM oss('http://testBucketName.oss-cn-shanghai-internal.aliyuncs.com/test.csv', 'LTAI5tDVcUKu2CbcBwhr****', 'WAcroHBcL7IVOK8FIHzNJy91Lc****', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
By default, a comma (,) is used as the column delimiter of a CSV file. If your CSV file uses a different column delimiter, you must use the format_csv_delimiter setting to specify the delimiter. For example, if a vertical bar (|) is used as the column delimiter of your CSV file, execute the following statement:
insert into oss_test_tbl_distributed select * from oss('<oss-endpoint>', '<access-key-id>', '<access-key-secret>', '<oss-file-path>', '<file-format-name>', '<col_name> <col_type>(,...)') settings format_csv_delimiter='|';
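For a cluster of version 22.8, the same setting can be combined with the URL-style syntax. The following sketch merely combines the two forms shown in this topic; the placeholders carry the same meanings as in the syntax descriptions:

```sql
insert into oss_test_tbl_distributed
select *
from oss('https://<BucketName>.<oss-endpoint>/<file-name>', '<access-key-id>', '<access-key-secret>',
         '<file-format-name>', '<col_name> <col_type>(,...)')
settings format_csv_delimiter='|';
```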
Step 4: Query the OSS data that is imported to ApsaraDB for ClickHouse
Execute the following statement to query the data. If you imported the data only into the local table, replace the name of the distributed table in the statement with the name of the local table.
select * from oss_test_tbl_distributed;
The following result is returned:
┌─id─┬─user_name─┬─age─┬─city─────┬─access_url─────────┐
│  1 │ yang      │  32 │ shanghai │ http://example.com │
│  2 │ wang      │  22 │ beijing  │ http://example.com │
│  3 │ xiao      │  23 │ shenzhen │ http://example.com │
│  4 │ jess      │  45 │ hangzhou │ http://example.com │
│  5 │ jack      │  14 │ shanghai │ http://example.com │
│  6 │ tomy      │  25 │ hangzhou │ http://example.com │
│  7 │ lucy      │  45 │ shanghai │ http://example.com │
│  8 │ tengyin   │  26 │ shanghai │ http://example.com │
│  9 │ wangli    │  27 │ shenzhen │ http://example.com │
│ 10 │ xiaohua   │  37 │ shanghai │ http://example.com │
└────┴───────────┴─────┴──────────┴────────────────────┘
Use wildcard characters to perform fuzzy match for storage paths in OSS
In most cases, multiple small files in OSS are created based on the same naming conventions. To simplify the analysis of small files, you can use the following wildcard characters when you configure the oss-file-path parameter to perform fuzzy match:
*: matches all files or directories. For example, /dir/* matches all files in the /dir directory.
{x, y, z}: matches one of the values that are enclosed in the braces {}. For example, file_{x, y, z} matches file_x, file_y, or file_z.
{num1..num2}: matches any value in the range from num1 to num2. For example, file_{1..3} matches file_1, file_2, or file_3.
?: matches a single character. For example, file_? matches file_a, file_b, file_c, or other files whose names consist of the file_ prefix followed by one character.
Example
Files can be uploaded to a directory that uses the following structure:
oss://testBucketName/
doc-data/
oss-import/
small_files/
access_log_csv_1.txt
access_log_csv_2.txt
access_log_csv_3.txt
The following list provides sample values of the oss-file-path parameter for reference:
oss://testBucketName/doc-data/oss-import/small_files/*
oss://testBucketName/doc-data/oss-import/small_files/access*
oss://testBucketName/doc-data/oss-import/small_files/access_log_csv_{1,2,3}.txt
oss://testBucketName/doc-data/oss-import/*/access_log_csv_{1,2,3}.txt
oss://testBucketName/doc-data/oss-import/*/*
oss://testBucketName/doc-data/oss-import/*/access_log_csv_{1..3}.txt
oss://testBucketName/doc-data/oss-import/*/access_log_csv_?.txt
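A wildcard path can be passed in place of a single object path to import all matched files in one statement. The following sketch reuses the sample bucket, internal endpoint, and table schema from this topic; replace the AccessKey placeholders with your own credentials, and note that all matched files must share the same schema and format because they are parsed with the column definitions specified in the statement:

```sql
insert into oss_test_tbl_distributed
select *
from oss('oss-cn-shanghai-internal.aliyuncs.com', '<access-key-id>', '<access-key-secret>',
         'oss://testBucketName/doc-data/oss-import/small_files/access_log_csv_{1..3}.txt',
         'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
```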