This topic describes how to import data from Object Storage Service (OSS) to ApsaraDB for ClickHouse.
Prerequisites
OSS is activated. For more information about how to activate OSS, see Activate OSS.
An OSS bucket is created in the region where the ApsaraDB for ClickHouse cluster is deployed. For more information about how to create an OSS bucket, see Create buckets.
The Alibaba Cloud account that you use to access the OSS bucket is granted the required permissions to read objects in the bucket. For more information about how to grant permissions, see Overview.
Precautions
Make sure that your OSS bucket and your ApsaraDB for ClickHouse cluster are deployed in the same region, and that you can access OSS over a virtual private cloud (VPC).
Step 1: Upload test data to OSS
Prepare test data. Store the following test data in a CSV file named test.csv.
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
3,xiao,23,shenzhen,http://example.com
4,jess,45,hangzhou,http://example.com
5,jack,14,shanghai,http://example.com
6,tomy,25,hangzhou,http://example.com
7,lucy,45,shanghai,http://example.com
8,tengyin,26,shanghai,http://example.com
9,wangli,27,shenzhen,http://example.com
10,xiaohua,37,shanghai,http://example.com
Log on to the OSS console.
In the left-side navigation pane, click Buckets.
On the Buckets page, find the OSS bucket that you want to manage and click the bucket name.
In the left-side navigation pane, click Files. On the page that appears, click Upload.
On the Upload page, upload the CSV file named test.csv.
Note: For more information about parameter settings, see Upload objects.
Step 2: Create a table in ApsaraDB for ClickHouse
Before you import data from OSS, you must create a table in ApsaraDB for ClickHouse to store the data.
Log on to the ApsaraDB for ClickHouse console.
On the Clusters page, click the Default Instances tab, and click the ID of the cluster that you want to manage.
In the upper-right corner of the cluster details page, click Log On to Database.
In the Log On to Database Instance dialog box, specify the Database Account and Database Password parameters, and then click Login.
Create a local table.
Note: The schema of the ApsaraDB for ClickHouse table must correspond to the schema of the CSV file that contains the data in OSS.
Execute the statement that corresponds to the edition of your cluster. To view the edition, click Cluster Information in the left-side navigation pane and check the Cluster Properties section.
Statement used to create a local table for a cluster of the Single-replica Edition
create table oss_test_tbl_local on cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
engine = MergeTree()
order by id;
Statement used to create a local table for a cluster of the Double-replica Edition
create table oss_test_tbl_local on cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
engine = ReplicatedMergeTree('/clickhouse/tables/{database}/{table}/{shard}', '{replica}')
order by id;
Create a distributed table.
Note: If you want to import the data from OSS only to the local table, skip this step.
create table oss_test_tbl_distributed on cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
engine = Distributed(default, default, oss_test_tbl_local, rand());
Step 3: Import data from OSS to ApsaraDB for ClickHouse
You can use a table engine or a table function to import data from OSS to ApsaraDB for ClickHouse.
Method 1: Use a table engine to import data from OSS
Create an OSS external table.
Note: Execute the statement that corresponds to the kernel version of your cluster. To view the version, click Cluster Information in the left-side navigation pane and check the Cluster Properties section.
Syntax used to create an OSS external table for a cluster of version 21.8 and earlier
CREATE TABLE <table_name> [on cluster default]
(
    <col_name1> <col_type1>,
    <col_name2> <col_type2>,
    ...
)
ENGINE = OSS('<oss-endpoint>', '<access-key-id>', '<access-key-secret>', '<oss-file-path>', '<file-format-name>');
Syntax used to create an OSS external table for a cluster of version 22.8
CREATE TABLE <table_name> [on cluster default]
(
    <col_name1> <col_type1>,
    <col_name2> <col_type2>,
    ...
)
ENGINE = OSS('https://<BucketName>.<oss-endpoint>/<file-name>', '<access-key-id>', '<access-key-secret>', '<file-format-name>');
The following table describes the parameters.
Parameter
Description
table_name
The name of the table.
col_name1, col_name2
The names of the columns.
col_type1, col_type2
The data types of the columns.
Important: The schema of the OSS external table must correspond to the schema of the CSV file that contains the data in OSS.
BucketName
The name of the bucket.
oss-endpoint
The internal endpoint of OSS for the region in which the bucket is deployed. ApsaraDB for ClickHouse accesses OSS over the internal network of the VPC, so do not use the public endpoint. For more information, see Regions and endpoints.
Important: Make sure that the OSS bucket is deployed in the same region as the ApsaraDB for ClickHouse cluster.
file-name
The name of the file.
access-key-id
The AccessKey ID that you want to use to access the data in OSS.
access-key-secret
The AccessKey secret that you want to use to access the data in OSS.
oss-file-path
The storage path of the CSV file that you uploaded. The path is in the oss://<bucket-name>/<path-to-file> format.
Note: You can set the oss-file-path parameter to a value that includes wildcard characters to perform a fuzzy match. For more information, see Use wildcard characters to perform fuzzy match for storage paths in OSS in this topic.
file-format-name
The format of the file. In this topic, the CSV format is used.
Sample statements:
Sample statement used to create an OSS external table for a cluster of version 21.8 and earlier
CREATE TABLE oss_test_tbl on cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
ENGINE = OSS('oss-cn-shanghai-internal.aliyuncs.com', 'LTAI5tDVcUKu2CbcBwhr****', 'WAcroHBcL7IVOK8FIHzNJy91Lc****', 'oss://testBucketName/test.csv', 'CSV');
Sample statement used to create an OSS external table for a cluster of version 22.8
CREATE TABLE oss_test_tbl on cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
ENGINE = OSS('http://testBucketName.oss-cn-shanghai-internal.aliyuncs.com/test.csv', 'LTAI5tDVcUKu2CbcBwhr****', 'WAcroHBcL7IVOK8FIHzNJy91Lc****', 'CSV');
Import data from the OSS external table to the distributed table created in Step 2.
insert into oss_test_tbl_distributed select * from oss_test_tbl;
By default, a comma (,) is used as the column delimiter of a CSV file. If your CSV file uses a different column delimiter, you must use the format_csv_delimiter setting to specify the delimiter. For example, if a vertical bar (|) is used as the column delimiter of your CSV file, execute the following statement:
insert into oss_test_tbl_distributed select * from oss('<oss-endpoint>', '<access-key-id>', '<access-key-secret>', '<oss-file-path>', '<file-format-name>', '<col_name> <col_type>(,...)') settings format_csv_delimiter='|';
Method 2: Use a table function to import data from OSS
Execute the statement that corresponds to the kernel version of your cluster to import data by using the oss table function. To view the version, click Cluster Information in the left-side navigation pane and check the Cluster Properties section.
Syntax used to import data from OSS to a cluster of version 21.8 and earlier
insert into oss_test_tbl_distributed select * from oss('<oss-endpoint>', '<access-key-id>', '<access-key-secret>', '<oss-file-path>', '<file-format-name>', '<col_name> <col_type>(,...)');
Syntax used to import data from OSS to a cluster of version 22.8
insert into oss_test_tbl_distributed select * from oss('https://<BucketName>.<oss-endpoint>/<file-name>', '<access-key-id>', '<access-key-secret>', '<file-format-name>', '<col_name> <col_type>(,...)');
For more information about the parameters, see the parameter description table in Method 1.
Sample statements:
Sample statement used to query the data in OSS for a cluster of version 21.8 and earlier
SELECT * FROM oss('oss-cn-shanghai-internal.aliyuncs.com', 'LTAI5tDVcUKu2CbcBwhr****', 'WAcroHBcL7IVOK8FIHzNJy91Lc****', 'oss://testBucketName/test.csv', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
Sample statement used to query the data in OSS for a cluster of version 22.8
SELECT * FROM oss('http://testBucketName.oss-cn-shanghai-internal.aliyuncs.com/test.csv', 'LTAI5tDVcUKu2CbcBwhr****', 'WAcroHBcL7IVOK8FIHzNJy91Lc****', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
By default, a comma (,) is used as the column delimiter of a CSV file. If your CSV file uses a different column delimiter, you must use the format_csv_delimiter setting to specify the delimiter. For example, if a vertical bar (|) is used as the column delimiter of your CSV file, execute the following statement:
insert into oss_test_tbl_distributed select * from oss('<oss-endpoint>', '<access-key-id>', '<access-key-secret>', '<oss-file-path>', '<file-format-name>', '<col_name> <col_type>(,...)') settings format_csv_delimiter='|';
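For a cluster of version 22.8, the same setting can be combined with the URL-style syntax. The following sketch merely combines the two forms shown in this topic; the placeholders carry the same meanings as in the syntax descriptions:

```sql
insert into oss_test_tbl_distributed
select *
from oss('https://<BucketName>.<oss-endpoint>/<file-name>', '<access-key-id>', '<access-key-secret>',
         '<file-format-name>', '<col_name> <col_type>(,...)')
settings format_csv_delimiter='|';
```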
Step 4: Query the OSS data that is imported to ApsaraDB for ClickHouse
Execute the following statement to query the data. If you imported the data only into the local table, replace the name of the distributed table in the statement with the name of the local table.
select * from oss_test_tbl_distributed;
The following result is returned:
┌─id─┬─user_name─┬─age─┬─city─────┬─access_url─────────┐
│  1 │ yang      │  32 │ shanghai │ http://example.com │
│  2 │ wang      │  22 │ beijing  │ http://example.com │
│  3 │ xiao      │  23 │ shenzhen │ http://example.com │
│  4 │ jess      │  45 │ hangzhou │ http://example.com │
│  5 │ jack      │  14 │ shanghai │ http://example.com │
│  6 │ tomy      │  25 │ hangzhou │ http://example.com │
│  7 │ lucy      │  45 │ shanghai │ http://example.com │
│  8 │ tengyin   │  26 │ shanghai │ http://example.com │
│  9 │ wangli    │  27 │ shenzhen │ http://example.com │
│ 10 │ xiaohua   │  37 │ shanghai │ http://example.com │
└────┴───────────┴─────┴──────────┴────────────────────┘
Use wildcard characters to perform fuzzy match for storage paths in OSS
In most cases, multiple small files in OSS are created based on the same naming conventions. To simplify the analysis of small files, you can use the following wildcard characters when you configure the oss-file-path parameter to perform fuzzy match:
*: matches all files or directories. For example, /dir/* matches all files in the /dir directory.
{x, y, z}: matches one of the values that are enclosed in the braces {}. For example, file_{x, y, z} matches file_x, file_y, or file_z.
{num1..num2}: matches any value in the range from num1 to num2. For example, file_{1..3} matches file_1, file_2, or file_3.
?: matches a single character. For example, file_? matches file_a, file_b, file_c, or other files whose names consist of the file_ prefix followed by one character.
Example
Files can be uploaded to a directory that uses the following structure:
oss://testBucketName/
doc-data/
oss-import/
small_files/
access_log_csv_1.txt
access_log_csv_2.txt
access_log_csv_3.txt
The following list provides sample values of the oss-file-path parameter for reference:
oss://testBucketName/doc-data/oss-import/small_files/*
oss://testBucketName/doc-data/oss-import/small_files/access*
oss://testBucketName/doc-data/oss-import/small_files/access_log_csv_{1,2,3}.txt
oss://testBucketName/doc-data/oss-import/*/access_log_csv_{1,2,3}.txt
oss://testBucketName/doc-data/oss-import/*/*
oss://testBucketName/doc-data/oss-import/*/access_log_csv_{1..3}.txt
oss://testBucketName/doc-data/oss-import/*/access_log_csv_?.txt
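A wildcard path can be passed in place of a single object path to import all matched files in one statement. The following sketch reuses the sample bucket, internal endpoint, and table schema from this topic; replace the AccessKey placeholders with your own credentials, and note that all matched files must share the same schema and format because they are parsed with the column definitions specified in the statement:

```sql
insert into oss_test_tbl_distributed
select *
from oss('oss-cn-shanghai-internal.aliyuncs.com', '<access-key-id>', '<access-key-secret>',
         'oss://testBucketName/doc-data/oss-import/small_files/access_log_csv_{1..3}.txt',
         'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
```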