Object Storage Service (OSS) is compatible with the Amazon Simple Storage Service (Amazon S3) protocol. In an E-MapReduce (EMR) ClickHouse cluster, use S3 table engines or S3 table functions to read data from and write data to OSS.
Prerequisites
Before you begin, ensure that you have:
-
An OSS bucket. For more information, see Create buckets
-
An EMR ClickHouse cluster. For more information, see Create a ClickHouse cluster
Import data from OSS to a ClickHouse cluster
Step 1: Create a business table
-
Log on to the ClickHouse cluster in SSH mode. For more information, see Log on to a cluster.
-
Start the ClickHouse client:
NoteIn the preceding command,
core-1-1is the name of the core node. If your cluster has multiple core nodes, specify any one of them.clickhouse-client -h core-1-1 -m -
Create a database named
productand a business table namedorders:Note{shard}and{replica}are macros automatically generated by EMR for ClickHouse clusters.CREATE DATABASE IF NOT EXISTS product ON CLUSTER cluster_emr; CREATE TABLE IF NOT EXISTS product.orders ON CLUSTER cluster_emr ( `uid` UInt32, `date` DateTime, `skuId` UInt32, `order_revenue` UInt32 ) Engine = ReplicatedMergeTree('/cluster_emr/product/orders/{shard}', '{replica}') PARTITION BY toYYYYMMDD(date) ORDER BY toYYYYMMDD(date); CREATE TABLE IF NOT EXISTS product.orders_all ON CLUSTER cluster_emr ( `uid` UInt32, `date` DateTime, `skuId` UInt32, `order_revenue` UInt32 ) Engine = Distributed(cluster_emr, product, orders, rand());
Step 2: Import data
Use S3 table engines to import data
S3 table engines create a persistent table backed by an OSS object. After the table is created, query or insert from it like any local table.
Syntax:
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
name1 type1,
name2 type2,
...
)
ENGINE = S3(path, [access_key_id, access_key_secret,] format, [compression]);
| Parameter | Description |
|---|---|
db |
The database name. |
table_name |
The table name. |
name1, name2 |
Column names. |
type1, type2 |
Column data types. |
path |
The OSS path to the object. Use a virtual-hosted URL (recommended) or a path-style URL. For endpoint details, see Access to OSS resources from ECS instances by using an internal endpoint of OSS. The path supports wildcards for SELECT statements: * (any characters except /), ? (single character), {str1,str2,...,strn} (string from list), {N..M} (number range, supports leading zeros such as {001..099}). Important
Wildcards apply only to |
access_key_id |
The AccessKey ID of your Alibaba Cloud account. |
access_key_secret |
The AccessKey secret of your Alibaba Cloud account. |
format |
The object format. CSV and XML are supported. For all supported formats, see Formats for input and output data. |
compression |
(Optional) The compression format. If omitted, the system infers it from the file extension. Supported values by EMR version: V3.X: none, gzip/gz, brotli/b, deflate, auto. V5.X: none, gzip/gz, brotli/b, lzma(xz), auto, zstd/zst. |
Procedure:
-
Download the sample object orders.csv and upload it to the root directory of an OSS bucket named
test. -
Create an OSS table using the S3 table engine:
Notehttp://test.oss-cn-beijing-internal.aliyuncs.com/orders.csvis the internal endpoint path toorders.csvin an OSS bucket namedtestin the China (Beijing) region.CREATE DATABASE IF NOT EXISTS oss ON CLUSTER cluster_emr; CREATE TABLE oss.orders_oss ( uid UInt32, date DateTime, skuId UInt32, order_revenue UInt32 ) ENGINE = S3('http://test.oss-cn-beijing-internal.aliyuncs.com/orders.csv', '<access_key_id>', '<access_key_secret>', 'CSV'); -
Import data into
product.orders_all:INSERT INTO product.orders_all SELECT uid, date, skuId, order_revenue FROM oss.orders_oss; -
Verify data consistency by comparing row counts:
-- Row count in the destination table SELECT count(1) FROM product.orders_all; -- Row count in the source OSS table SELECT count(1) FROM oss.orders_oss;
Use S3 table functions to import data
S3 table functions read data inline without creating a persistent table — useful for one-off imports.
Syntax:
s3(path, [access_key_id, access_key_secret,] format, structure, [compression])
| Parameter | Description |
|---|---|
path |
The OSS path to the object. Supports the same URL styles and wildcards as the S3 table engine. |
access_key_id |
The AccessKey ID of your Alibaba Cloud account. |
access_key_secret |
The AccessKey secret of your Alibaba Cloud account. |
format |
The object format. See Formats for input and output data. |
structure |
Column definitions. Format: 'column1_name column1_type, column2_name column2_type, ...'. Example: 'uid UInt32, date DateTime'. |
compression |
(Optional) The compression format. Supported values follow the same version-specific rules as the S3 table engine. |
Procedure:
-
Import data directly into
product.orders_all:INSERT INTO product.orders_all SELECT uid, date, skuId, order_revenue FROM s3('http://test.oss-cn-beijing-internal.aliyuncs.com/orders.csv', '<access_key_id>', '<access_key_secret>', 'CSV', 'uid UInt32, date DateTime, skuId UInt32, order_revenue UInt32'); -
Verify data consistency by comparing row counts:
-- Row count in the destination table SELECT count(1) FROM product.orders_all; -- Row count in the source OSS object SELECT count(1) FROM oss.orders_oss;
Export data from a ClickHouse cluster to OSS
Step 1: Create a business table
Use the same schema described in Step 1: Create a business table in the import section.
Step 2: Prepare data
-
Insert sample data into
product.orders_all:INSERT INTO product.orders_all VALUES (60333391,'2021-08-04 11:26:01',49358700,89) (38826285,'2021-08-03 10:47:29',25166907,27) (10793515,'2021-07-31 02:10:31',95584454,68) (70246093,'2021-08-01 00:00:08',82355887,97) (70149691,'2021-08-02 12:35:45',68748652,1) (87307646,'2021-08-03 19:45:23',16898681,71) (61694574,'2021-08-04 23:23:32',79494853,35) (61337789,'2021-08-02 07:10:42',23792355,55) (66879038,'2021-08-01 16:13:19',95820038,89); -
(Optional) Configure a conflict strategy to prevent export failures when the target object already exists. This is supported on EMR V5.8.0, V3.45.0, and later minor versions. Incremental export — if the target object already exists, a new object is created in the same directory:
NoteBy default, if the target object already exists in OSS, the export fails. Set one of the following parameters to override this behavior.
SET s3_create_new_file_on_insert=1;Overwrite export — if the target object already exists, its content is replaced. Use with caution:
SET s3_truncate_on_insert=1;
Step 3: Export data
Use S3 table engines to export data
-
Create an S3 table pointing to the destination OSS path:
CREATE TABLE oss.orders_oss ( uid UInt32, date DateTime, skuId UInt32, order_revenue UInt32 ) ENGINE = S3('http://test.oss-cn-beijing-internal.aliyuncs.com/orders.csv', '<access_key_id>', '<access_key_secret>', 'CSV'); -
Write data from the business table to OSS:
INSERT INTO oss.orders_oss SELECT uid, date, skuId, order_revenue FROM product.orders_all; -
Verify the exported object in the OSS console.
Use S3 table functions to export data
-
Export data directly using an S3 table function:
INSERT INTO FUNCTION s3('http://test.oss-cn-beijing-internal.aliyuncs.com/orders.csv', '<access_key_id>', '<access_key_secret>', 'CSV', 'uid UInt32, date DateTime, skuId UInt32, order_revenue UInt32') SELECT uid, date, skuId, order_revenue FROM product.orders_all; -
Verify the exported object in the OSS console.
Configure OSS-related parameters
Profile
Use the s3_min_upload_part_size parameter to set the minimum part size for multipart uploads to OSS. The default is 512 MB. The value must be a UInt64 integer.
Set the parameter at one of the following scopes:
| Scope | Method |
|---|---|
| Query | Add SETTINGS s3_min_upload_part_size=<value> at the end of the INSERT statement. |
| Session | Run SET s3_min_upload_part_size=<value> before the INSERT statement. Applies to all statements in the current session. |
| Table | Add SETTINGS s3_min_upload_part_size=<value> to the CREATE TABLE statement. Applies whenever the table is accessed. |
| User | In the EMR console, go to the ClickHouse service page, click Configure > server-users > Add Configuration Item. Set the key to users.<YourUserName>.s3_min_upload_part_size and the value to the desired integer. Persists across sessions. |
Query-level example:
INSERT INTO OSS_TABLE
SELECT ...
FROM ...
SETTINGS s3_min_upload_part_size=1073741824;
Session-level example:
SET s3_min_upload_part_size=1073741824;
INSERT INTO OSS_TABLE
SELECT ...
FROM ...
;
Table-level example:
CREATE TABLE OSS_TABLE
(
...
) ENGINE = s3(...)
SETTINGS s3_min_upload_part_size=1073741824;
Configuration
To avoid embedding credentials in every SQL statement, configure OSS credentials at the cluster level using the following XML structure:
<s3>
<endpoint-name>
<endpoint>https://oss-cn-beijing-internal.aliyuncs.com/bucket</endpoint>
<access_key_id>ACCESS_KEY_ID</access_key_id>
<secret_access_key>ACCESS_KEY_SECRET</secret_access_key>
</endpoint-name>
</s3>
| Parameter | Description |
|---|---|
endpoint-name |
A label for this endpoint configuration. |
endpoint |
The OSS endpoint. For more information, see OSS domain names. |
access_key_id |
The AccessKey ID of your Alibaba Cloud account. |
secret_access_key |
The AccessKey secret of your Alibaba Cloud account. |
In the EMR console, go to the ClickHouse service page, click the Configure tab, then click server-config. Add the configuration using one of the following methods:
-
Method 1: Add three separate configuration items with keys
oss.<endpoint-name>.endpoint,oss.<endpoint-name>.access_key_id, andoss.<endpoint-name>.secret_access_key. Replace<endpoint-name>with the actual OSS endpoint. -
Method 2: Add a single configuration item with the key
ossand the following XML block as the value. Replace the placeholder values with the actual endpoint, AccessKey ID, and AccessKey secret:<endpoint-name> <endpoint>https://oss-cn-beijing-internal.aliyuncs.com/bucket</endpoint> <access_key_id>ACCESS_KEY_ID</access_key_id> <secret_access_key>ACCESS_KEY_SECRET</secret_access_key> </endpoint-name>
After the cluster-level configuration is complete, omit access_key_id and access_key_secret from S3 engine and function calls:
-
Create an OSS table:
CREATE TABLE OSS_TABLE ( column1 UInt32, column2 String ... ) ENGINE = S3(path, format, [compression]); -
Use an S3 table function:
s3(path, format, structure, [compression]);