All Products
Search
Document Center

E-MapReduce:Import and export data between OSS and ClickHouse

Last Updated:Mar 26, 2026

Object Storage Service (OSS) is compatible with the Amazon Simple Storage Service (Amazon S3) protocol. In an E-MapReduce (EMR) ClickHouse cluster, use S3 table engines or S3 table functions to read data from and write data to OSS.

Prerequisites

Before you begin, ensure that you have:

Import data from OSS to a ClickHouse cluster

Step 1: Create a business table

  1. Log on to the ClickHouse cluster in SSH mode. For more information, see Log on to a cluster.

  2. Start the ClickHouse client:

    Note

    In the preceding command, core-1-1 is the name of the core node. If your cluster has multiple core nodes, specify any one of them.

    clickhouse-client -h core-1-1 -m
  3. Create a database named product and a business table named orders:

    Note

    {shard} and {replica} are macros automatically generated by EMR for ClickHouse clusters.

    CREATE DATABASE IF NOT EXISTS product ON CLUSTER cluster_emr;
    CREATE TABLE IF NOT EXISTS product.orders ON CLUSTER cluster_emr
    (
        `uid` UInt32,
        `date` DateTime,
        `skuId` UInt32,
        `order_revenue` UInt32
    )
    Engine = ReplicatedMergeTree('/cluster_emr/product/orders/{shard}', '{replica}')
    PARTITION BY toYYYYMMDD(date)
    ORDER BY toYYYYMMDD(date);
    CREATE TABLE IF NOT EXISTS product.orders_all ON CLUSTER cluster_emr
    (
        `uid` UInt32,
        `date` DateTime,
        `skuId` UInt32,
        `order_revenue` UInt32
    )
    Engine = Distributed(cluster_emr, product, orders, rand());

Step 2: Import data

Use S3 table engines to import data

S3 table engines create a persistent table backed by an OSS object. After the table is created, query or insert from it like any local table.

Syntax:

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
  name1 type1,
  name2 type2,
  ...
)
ENGINE = S3(path, [access_key_id, access_key_secret,] format, [compression]);
Parameter Description
db The database name.
table_name The table name.
name1, name2 Column names.
type1, type2 Column data types.
path The OSS path to the object. Use a virtual-hosted URL (recommended) or a path-style URL. For endpoint details, see Access to OSS resources from ECS instances by using an internal endpoint of OSS. The path supports wildcards for SELECT statements: * (any characters except /), ? (single character), {str1,str2,...,strn} (string from list), {N..M} (number range, supports leading zeros such as {001..099}).
Important

Wildcards apply only to SELECT — do not use them for write operations.

access_key_id The AccessKey ID of your Alibaba Cloud account.
access_key_secret The AccessKey secret of your Alibaba Cloud account.
format The object format. CSV and XML are supported. For all supported formats, see Formats for input and output data.
compression (Optional) The compression format. If omitted, the system infers it from the file extension. Supported values by EMR version: V3.X: none, gzip/gz, brotli/b, deflate, auto. V5.X: none, gzip/gz, brotli/b, lzma(xz), auto, zstd/zst.

Procedure:

  1. Download the sample object orders.csv and upload it to the root directory of an OSS bucket named test.

  2. Create an OSS table using the S3 table engine:

    Note

    http://test.oss-cn-beijing-internal.aliyuncs.com/orders.csv is the internal endpoint path to orders.csv in an OSS bucket named test in the China (Beijing) region.

    CREATE DATABASE IF NOT EXISTS oss ON CLUSTER cluster_emr;
    CREATE TABLE oss.orders_oss
    (
      uid UInt32,
      date DateTime,
      skuId UInt32,
      order_revenue UInt32
    ) ENGINE = S3('http://test.oss-cn-beijing-internal.aliyuncs.com/orders.csv', '<access_key_id>', '<access_key_secret>', 'CSV');
  3. Import data into product.orders_all:

    INSERT INTO product.orders_all
    SELECT
      uid,
      date,
      skuId,
      order_revenue
    FROM
      oss.orders_oss;
  4. Verify data consistency by comparing row counts:

    -- Row count in the destination table
    SELECT count(1) FROM product.orders_all;
    
    -- Row count in the source OSS table
    SELECT count(1) FROM oss.orders_oss;

Use S3 table functions to import data

S3 table functions read data inline without creating a persistent table — useful for one-off imports.

Syntax:

s3(path, [access_key_id, access_key_secret,] format, structure, [compression])
Parameter Description
path The OSS path to the object. Supports the same URL styles and wildcards as the S3 table engine.
access_key_id The AccessKey ID of your Alibaba Cloud account.
access_key_secret The AccessKey secret of your Alibaba Cloud account.
format The object format. See Formats for input and output data.
structure Column definitions. Format: 'column1_name column1_type, column2_name column2_type, ...'. Example: 'uid UInt32, date DateTime'.
compression (Optional) The compression format. Supported values follow the same version-specific rules as the S3 table engine.

Procedure:

  1. Import data directly into product.orders_all:

    INSERT INTO product.orders_all
    SELECT
      uid,
      date,
      skuId,
      order_revenue
    FROM
      s3('http://test.oss-cn-beijing-internal.aliyuncs.com/orders.csv',
          '<access_key_id>',
          '<access_key_secret>',
          'CSV',
          'uid UInt32, date DateTime, skuId UInt32, order_revenue UInt32');
  2. Verify data consistency by comparing row counts:

    -- Row count in the destination table
    SELECT count(1) FROM product.orders_all;
    
    -- Row count in the source OSS object
    SELECT count(1) FROM oss.orders_oss;

Export data from a ClickHouse cluster to OSS

Step 1: Create a business table

Use the same schema described in Step 1: Create a business table in the import section.

Step 2: Prepare data

  1. Insert sample data into product.orders_all:

    INSERT INTO product.orders_all VALUES
      (60333391,'2021-08-04 11:26:01',49358700,89)
      (38826285,'2021-08-03 10:47:29',25166907,27)
      (10793515,'2021-07-31 02:10:31',95584454,68)
      (70246093,'2021-08-01 00:00:08',82355887,97)
      (70149691,'2021-08-02 12:35:45',68748652,1)
      (87307646,'2021-08-03 19:45:23',16898681,71)
      (61694574,'2021-08-04 23:23:32',79494853,35)
      (61337789,'2021-08-02 07:10:42',23792355,55)
      (66879038,'2021-08-01 16:13:19',95820038,89);
  2. (Optional) Configure a conflict strategy to prevent export failures when the target object already exists. This is supported on EMR V5.8.0, V3.45.0, and later minor versions. Incremental export — if the target object already exists, a new object is created in the same directory:

    Note

    By default, if the target object already exists in OSS, the export fails. Set one of the following parameters to override this behavior.

    SET s3_create_new_file_on_insert=1;

    Overwrite export — if the target object already exists, its content is replaced. Use with caution:

    SET s3_truncate_on_insert=1;

Step 3: Export data

Use S3 table engines to export data

  1. Create an S3 table pointing to the destination OSS path:

    CREATE TABLE oss.orders_oss
    (
      uid UInt32,
      date DateTime,
      skuId UInt32,
      order_revenue UInt32
    ) ENGINE = S3('http://test.oss-cn-beijing-internal.aliyuncs.com/orders.csv', '<access_key_id>', '<access_key_secret>', 'CSV');
  2. Write data from the business table to OSS:

    INSERT INTO oss.orders_oss
    SELECT
      uid,
      date,
      skuId,
      order_revenue
    FROM
      product.orders_all;
  3. Verify the exported object in the OSS console.

Use S3 table functions to export data

  1. Export data directly using an S3 table function:

    INSERT INTO FUNCTION
    s3('http://test.oss-cn-beijing-internal.aliyuncs.com/orders.csv',
          '<access_key_id>',
          '<access_key_secret>',
          'CSV',
          'uid UInt32, date DateTime, skuId UInt32, order_revenue UInt32')
    SELECT
      uid,
      date,
      skuId,
      order_revenue
    FROM
      product.orders_all;
  2. Verify the exported object in the OSS console.

Configure OSS-related parameters

Profile

Use the s3_min_upload_part_size parameter to set the minimum part size for multipart uploads to OSS. The default is 512 MB. The value must be a UInt64 integer.

Set the parameter at one of the following scopes:

Scope Method
Query Add SETTINGS s3_min_upload_part_size=<value> at the end of the INSERT statement.
Session Run SET s3_min_upload_part_size=<value> before the INSERT statement. Applies to all statements in the current session.
Table Add SETTINGS s3_min_upload_part_size=<value> to the CREATE TABLE statement. Applies whenever the table is accessed.
User In the EMR console, go to the ClickHouse service page, click Configure > server-users > Add Configuration Item. Set the key to users.<YourUserName>.s3_min_upload_part_size and the value to the desired integer. Persists across sessions.

Query-level example:

INSERT INTO OSS_TABLE
SELECT ...
FROM ...
SETTINGS s3_min_upload_part_size=1073741824;

Session-level example:

SET s3_min_upload_part_size=1073741824;
INSERT INTO OSS_TABLE
SELECT ...
FROM ...
;

Table-level example:

CREATE TABLE OSS_TABLE
(
  ...
) ENGINE = s3(...)
SETTINGS s3_min_upload_part_size=1073741824;

Configuration

To avoid embedding credentials in every SQL statement, configure OSS credentials at the cluster level using the following XML structure:

<s3>
    <endpoint-name>
        <endpoint>https://oss-cn-beijing-internal.aliyuncs.com/bucket</endpoint>
        <access_key_id>ACCESS_KEY_ID</access_key_id>
        <secret_access_key>ACCESS_KEY_SECRET</secret_access_key>
    </endpoint-name>
</s3>
Parameter Description
endpoint-name A label for this endpoint configuration.
endpoint The OSS endpoint. For more information, see OSS domain names.
access_key_id The AccessKey ID of your Alibaba Cloud account.
secret_access_key The AccessKey secret of your Alibaba Cloud account.

In the EMR console, go to the ClickHouse service page, click the Configure tab, then click server-config. Add the configuration using one of the following methods:

  • Method 1: Add three separate configuration items with keys oss.<endpoint-name>.endpoint, oss.<endpoint-name>.access_key_id, and oss.<endpoint-name>.secret_access_key. Replace <endpoint-name> with the actual OSS endpoint.

  • Method 2: Add a single configuration item with the key oss and the following XML block as the value. Replace the placeholder values with the actual endpoint, AccessKey ID, and AccessKey secret:

    <endpoint-name>
      <endpoint>https://oss-cn-beijing-internal.aliyuncs.com/bucket</endpoint>
      <access_key_id>ACCESS_KEY_ID</access_key_id>
      <secret_access_key>ACCESS_KEY_SECRET</secret_access_key>
    </endpoint-name>

After the cluster-level configuration is complete, omit access_key_id and access_key_secret from S3 engine and function calls:

  • Create an OSS table:

    CREATE TABLE OSS_TABLE
    (
      column1 UInt32,
      column2 String
      ...
    )
    ENGINE = S3(path, format, [compression]);
  • Use an S3 table function:

    s3(path, format, structure, [compression]);