This topic describes how to import Object Storage Service (OSS) data to an ApsaraDB for ClickHouse cluster.

Background information

OSS is a low-cost cloud storage service provided by Alibaba Cloud that is used to store large numbers of logs, videos, and photos. ApsaraDB for ClickHouse allows access to OSS files in a variety of formats except Protobuf and Cap'n Proto.

ApsaraDB for ClickHouse can use the following methods to access OSS data:
  • Table function: The oss('<oss-endpoint>', '<access-key-id>', '<access-key-secret>', '<oss-file-path>', '<file-format-name>', '<column-definitions>') table function is provided to read OSS data.
  • Table engine: The oss('<oss-endpoint>', '<access-key-id>', '<access-key-secret>', '<oss-file-path>', '<file-format-name>') table engine is provided to read OSS data. The table engine can also write data to OSS.
Prerequisites:
  • To ensure normal connection between the ApsaraDB for ClickHouse cluster and OSS, the OSS endpoint specified in the table function or table engine must be an endpoint of the Virtual Private Cloud (VPC) type. The OSS bucket must reside within the same region as the ApsaraDB for ClickHouse cluster, such as both in the China (Shanghai) region.
  • The account that corresponds to the AccessKey ID and AccessKey secret specified in the table function must have read permissions on the specified OSS file path.
  • The account that corresponds to the AccessKey ID and AccessKey secret specified in the table engine must have read permissions on the specified OSS file path. To perform the INSERT operation, the account must have the write permissions on the specified oss-file-path.
  • The format of the specified oss-file-path must meet the OSS path specifications. The general format is oss://<bucket-name/<path-to-file>.
  • The file-format-name and column-definitions values must be in a format consistent with the actual one. Otherwise, an error occurs when the values are parsed. The file-format-name value must be in a format supported by ApsaraDB for ClickHouse. Protobuf and Cap'n Proto are not supported. For more information, see Formats for Input and Output Data.
  • To perform the INSERT operation on existing files on OSS, you must make sure that the files can be uploaded to OSS by calling the AppendObject operation because OSS only supports the write operation on appendable objects.

To import OSS data to an ApsaraDB for ClickHouse cluster, perform the following steps:

Procedure

  1. Prepare the test dataset. Click here to download the sample dataset.

    This dataset is a collection of simulated log data and is stored in the CSV format. Each row includes fields such as id, user_name, age, city, and access_url. Fields are separated by commas (,). Each row ends with a line break. The following example shows the file content:

    1,tick,32,shanghai,http://xdbdsd.com/xgwgwe
    2,wangl,22,beijing,http://ghwbw.com/xgwgwe
  2. Upload the dataset to OSS.
    Transfer the sample dataset to the specified OSS path. You can also use the publicly readable OSS path: oss://clickhouse-release-open-access/doc-data/oss-import/access_log_csv.txt
  3. Connect to the ApsaraDB for ClickHouse cluster. For more information, see Connect to a cluster.
  4. Invoke the table function to read OSS data.
    SELECT *
    FROM oss('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/access_log_csv.txt', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
    
    ┌─id─┬─user_name─┬─age─┬─city─────┬─access_url───────────────┐
    │  1 │ tick      │  32 │ shanghai │ http://xdbdsd.com/xgwgwe │
    │  2 │ wangl     │  22 │ beijing  │ http://ghwbw.com/xgwgwe  │
    │  3 │ xiaoh     │  23 │ shenzhen │ http://holko.com/xgwgwe  │
    │  4 │ jess      │  45 │ hangzhou │ http://jopjop.com/xgwgwe │
    │  5 │ jack      │  14 │ shanghai │ http://wewsd.com/xgwgwe  │
    │  6 │ tomy      │  25 │ hangzhou │ http://sbedr.com/xgwgwe  │
    │  7 │ lucy      │  45 │ shanghai │ http://ghhwed.com/xgwgwe │
    │  8 │ tengyin   │  26 │ shanghai │ http://hewhe.com/xgwgwe  │
    │  9 │ cuos      │  27 │ shenzhen │ http://yoiuj.com/xgwgwe  │
    │ 10 │ wangsh    │  37 │ shanghai │ http://hhou.com/xgwgwe   │
    └────┴───────────┴─────┴──────────┴──────────────────────────┘
    Perform SQL query analysis.
    SELECT *
    FROM oss('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/access_log_csv.txt', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String')
    WHERE id = 9;
    
    ┌─id─┬─user_name─┬─age─┬─city─────┬─access_url──────────────┐
    │  9 │ cuos      │  27 │ shenzhen │ http://yoiuj.com/xgwgwe │
    └────┴───────────┴─────┴──────────┴─────────────────────────┘
    
    
    SELECT
        user_name,
        count(*) AS uv
    FROM oss('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/access_log_csv.txt', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String')
    GROUP BY user_name;
    
    ┌─user_name─┬─uv─┐
    │ tick      │  1 │
    │ lucy      │  1 │
    │ tengyin   │  1 │
    │ wangl     │  1 │
    │ wangsh    │  1 │
    │ jess      │  1 │
    │ tomy      │  1 │
    │ cuos      │  1 │
    │ xiaoh     │  1 │
    │ jack      │  1 │
    └───────────┴────┘
    Note After you invoke the table function to read OSS data, the data remains in OSS and is not synchronized to the ApsaraDB for ClickHouse cluster. If a large amount of data is involved, read performance may degrade due to limited bandwidth for remote reads of OSS data. If high-performance analysis is required, we recommend that you synchronize data to the ApsaraDB for ClickHouse cluster as specified in the following steps.
  5. Invoke the table engine to read OSS data.
    create table oss_test_tbl on cluster default
    (
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
    )
    engine = OSS('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/access_log_csv.txt', 'CSV');
    
    
    SELECT *
    FROM oss_test_tbl
    
    ┌─id─┬─user_name─┬─age─┬─city─────┬─access_url───────────────┐
    │  1 │ tick      │  32 │ shanghai │ http://xdbdsd.com/xgwgwe │
    │  2 │ wangl     │  22 │ beijing  │ http://ghwbw.com/xgwgwe  │
    │  3 │ xiaoh     │  23 │ shenzhen │ http://holko.com/xgwgwe  │
    │  4 │ jess      │  45 │ hangzhou │ http://jopjop.com/xgwgwe │
    │  5 │ jack      │  14 │ shanghai │ http://wewsd.com/xgwgwe  │
    │  6 │ tomy      │  25 │ hangzhou │ http://sbedr.com/xgwgwe  │
    │  7 │ lucy      │  45 │ shanghai │ http://ghhwed.com/xgwgwe │
    │  8 │ tengyin   │  26 │ shanghai │ http://hewhe.com/xgwgwe  │
    │  9 │ cuos      │  27 │ shenzhen │ http://yoiuj.com/xgwgwe  │
    │ 10 │ wangsh    │  37 │ shanghai │ http://hhou.com/xgwgwe   │
    └────┴───────────┴─────┴──────────┴──────────────────────────┘
    
    
    SELECT
        city,
        count(*) AS pv
    FROM oss_test_tbl
    GROUP BY city
    
    ┌─city─────┬─pv─┐
    │ shanghai │  5 │
    │ shenzhen │  2 │
    │ hangzhou │  2 │
    │ beijing  │  1 │
    └──────────┴────┘
    Note After you invoke the table engine to read OSS data, the data remains in OSS and is not synchronized to the ApsaraDB for ClickHouse cluster. If a large amount of data is involved, read performance may degrade due to limited bandwidth for remote reads of OSS data. If high-performance analysis is required, we recommend that you synchronize data to the ApsaraDB for ClickHouse cluster as specified in the following steps.
  6. Invoke the table engine to write data to OSS.

    You must have the write permissions on the OSS bucket which stores the sample dataset. The preceding path of oss://clickhouse-release/doc-data/oss-import/access_log_csv.txt is only publicly readable, and you do not have the write permissions on it. Therefore, you must upload the sample dataset to your own OSS bucket. Replace the oss-endpoint and oss-file-path parameters in the statement with the actual values to create a table.

    Files in OSS can be normal objects and appendable objects. Files uploaded by calling the PutObject operation are normal objects. Normal objects cannot be changed after they are uploaded. Files uploaded by calling the AppendObject operation are appendable objects. Appendable objects can be modified later. If you want to invoke the table engine to write data to an OSS file, call the AppendObject operation to upload the file to OSS. For more information about the installation and parameters of the ossutil tool, see OSS documentation.

    ./ossutil appendfromfile /<your-loocal-path>/access_log_csv.txt oss://<your-oss-bucket>/<your-dir>/access_log_csv.txt
    Execute the following statements to write data to OSS:
    create table oss_test_tbl_appenable on cluster default
    (
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
    )
    engine = OSS('<your-oss-vpc-endpoint>', '<your-access-key-id>', '<your-access-key-secret>', 'oss://<your-oss-bucket>/<your-dir>/access_log_csv.txt', 'CSV');
    
    select * from oss_test_tbl_appenable;
    
    insert into oss_test_tbl_appenable values(11, 'wangw', 25, 'beijing', 'http://asewg.com/jhlue');
    
    select * from oss_test_tbl_appenable;
    Note After you invoke the table engine to read OSS data, the data remains in OSS and is not synchronized to the ApsaraDB for ClickHouse cluster. If a large amount of data is involved, read performance may degrade due to limited bandwidth for remote reads of OSS data. If high-performance analysis is required, we recommend that you synchronize data to the ApsaraDB for ClickHouse cluster as specified in the following steps.
  7. Synchronize data to the ApsaraDB for ClickHouse cluster.
    Create an ApsaraDB for ClickHouse table and execute the INSERT FROM SELECT statement to synchronize data.
    -- For a Standalone Edition cluster, execute the following statement to create a local table:
    create table oss_test_tbl_local on cluster default
    (
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
    )
    engine = MergeTree()
    order by id;
    
    -- For a High-availability Edition cluster, execute the following statement to create a local table:
    create table oss_test_tbl_local on cluster default
    (
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
    )
    engine = ReplicatedMergeTree('/clickhouse/tables/{database}/{table}/{shard}', '{replica}')
    order by id;
    
    -- Create a distributed table.
    create table oss_test_tbl_distributed on cluster default
    (
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
    )
    engine = Distributed(default, default, oss_test_tbl_local, rand());
    
    -- Import data.
    insert into oss_test_tbl_distributed select * from oss_test_tbl;
    
    select * from oss_test_tbl_distributed; 
  8. Wildcards are supported in the oss-file-path value.
    Multiple small files in OSS use the same naming conventions. To simplify the analysis of such files, the oss-file-path parameter supports fuzzy match by using the following characters:
    • *: matches a file name or directory name. For example, /dir/* matches all files in the /dir directory.
    • {x, y, z}: matches a value enclosed in braces. For example, file_{x, y, z} matches file_x, file_y, or file_z.
    • {num1..num2}: matches all numbers in [num1, num2]. For example, file_{1..3} is equivalent to file_1, file_2, and file_3.
    • ?: matches a single character. For example, file_? matches files such as file_a, file_b, and file_c.
    The following examples show how to use such characters. The sample directory structure:
    oss://clickhouse-release-open-access/
                   doc-data/
                        oss-import/
                            small_files/
                                access_log_csv_1.txt
                                access_log_csv_2.txt
                                access_log_csv_3.txt
    Perform SQL queries by using the following methods. Take note of different characters in the oss-file-path parameter.
    
    select * from oss('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/small_files/*', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
    
    select * from oss('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/small_files/access*', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
    
    select * from oss('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/small_files/access_log_csv_{1,2,3}.txt', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
    
    select * from oss('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/*/access_log_csv_{1,2,3}.txt', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
    
    select * from oss('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/*/*', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
    
    select * from oss('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/*/access_log_csv_{1..3}.txt', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');
    
    select * from oss('oss-cn-shanghai-internal.aliyuncs.com', '<your-access-key-id>', '<your-access-key-secret>', 'oss://clickhouse-release-open-access/doc-data/oss-import/*/access_log_csv_?.txt', 'CSV', 'id UInt8, user_name String, age UInt16, city String, access_url String');