OSS Load lets you import hundreds of gigabytes of data at a time from Object Storage Service (OSS) into ApsaraDB for SelectDB over an internal network. The import runs asynchronously — after you submit the job, SelectDB processes the files in the background while you continue working.
When to use OSS Load
OSS Load is the right choice when:
You need to import large batches of files (tens to hundreds of gigabytes)
Your files are stored in an OSS bucket in the same region as your SelectDB instance
You need to load CSV, PARQUET, or ORC files
OSS Load uses the S3 protocol internally, so the syntax contains AWS and S3 keywords. This is expected — ApsaraDB for SelectDB supports any S3-compatible object storage, and OSS is accessed using the S3-compatible endpoint.
Prerequisites
Before you begin, ensure that you have:
An AccessKey pair. See Create an AccessKey pair.
An OSS bucket in the same region as your ApsaraDB for SelectDB instance. See Get started by using the OSS console.
OSS access over a virtual private cloud (VPC). OSS Load uses the internal VPC endpoint, not the public endpoint.
Syntax
LOAD LABEL <load_label>
(
data_desc1[, data_desc2, ...]
)
WITH S3
(
"AWS_PROVIDER" = "OSS",
"AWS_REGION" = "<region>",
"AWS_ENDPOINT" = "<endpoint>",
"AWS_ACCESS_KEY" = "<AccessKey ID>",
"AWS_SECRET_KEY" = "<AccessKey secret>"
)
PROPERTIES
(
"key1" = "value1", ...
);The OSS bucket path must start with s3://.
Parameters
Data description (data_desc)
Each data_desc block describes a group of files to import. The full syntax is:
[MERGE|APPEND|DELETE]
DATA INFILE
(
"file_path1"[, "file_path2", ...]
)
[NEGATIVE]
INTO TABLE `<table_name>`
[PARTITION (p1, p2, ...)]
[COLUMNS TERMINATED BY "<column_separator>"]
[FORMAT AS "<file_type>"]
[(column_list)]
[COLUMNS FROM PATH AS (c1, c2, ...)]
[PRECEDING FILTER predicate]
[SET (column_mapping)]
[WHERE predicate]
[DELETE ON expr]
[ORDER BY source_sequence]
[PROPERTIES ("key1"="value1", ...)]

| Parameter | Description |
|---|---|
| MERGE\|APPEND\|DELETE | Data merge type. Default: APPEND. MERGE and DELETE apply only to tables in the Unique Key model. When set to MERGE, use DELETE ON to specify the delete_flag column. When set to DELETE, the imported rows are deleted from the table. |
| DATA INFILE | Path of the file(s) to import. Wildcards are supported. Must match files, not a directory. |
| NEGATIVE | Imports data using the negative method, which negates integer values in SUM-aggregated columns to offset previously imported erroneous data. Applies only to tables whose value columns use SUM aggregation on integer types. |
| PARTITION (p1, p2, ...) | Partitions to import into. Data that falls outside the specified partitions is ignored. |
| COLUMNS TERMINATED BY | Column delimiter. Valid only for CSV files. Must be a single byte. |
| FORMAT AS | File format. Valid values: CSV, PARQUET, ORC. Default: CSV. |
| column_list | Column order in the source file. See Converting source data. |
| COLUMNS FROM PATH AS | Columns to extract from the file path. |
| PRECEDING FILTER predicate | Pre-filter applied before transformation. Source columns from column_list and COLUMNS FROM PATH AS are assembled into rows, which are then filtered by this predicate before the SET mappings run. |
| SET (column_mapping) | Column transformation functions applied to the source data. |
| WHERE predicate | Filter applied to the transformed data. Rows that do not match the predicate are discarded. |
| DELETE ON expr | Used only with the MERGE type. Specifies the delete_flag column and condition. Applies only to tables in the Unique Key model. |
| ORDER BY | Sequence column for the import. Applies only to tables in the Unique Key model; ensures data order during import. |
| PROPERTIES | Additional format parameters. For JSON files, you can set json_root, jsonpaths, and fuzzy_parse. |
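To illustrate how several of these clauses combine, the following sketch loads CSV files that match a wildcard path, extracts a dt column from the file path, transforms a source column with SET, and filters rows with WHERE. The bucket, table, and column names are hypothetical, chosen for illustration only:

```sql
-- Illustrative data_desc block; all names (bucket, table, columns) are hypothetical.
DATA INFILE("s3://example_bucket/logs/dt=*/events_*.csv")
INTO TABLE `events`
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(user_id, raw_ts, country)
COLUMNS FROM PATH AS (dt)                 -- extract dt from the dt=... path segment
PRECEDING FILTER raw_ts != ""             -- drop rows with empty timestamps before mapping
SET (event_time = from_unixtime(raw_ts))  -- derive a column during import
WHERE country = "CN"                      -- keep only matching rows after mapping
```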
OSS connection parameters (WITH S3)
| Parameter | Description |
|---|---|
| AWS_PROVIDER | Object storage provider. Set to OSS. |
| AWS_REGION | Region where the OSS bucket resides. |
| AWS_ENDPOINT | The endpoint used to access OSS data. See Regions and endpoints. The OSS bucket and SelectDB instance must be in the same region. |
| AWS_ACCESS_KEY | AccessKey ID used to access OSS. |
| AWS_SECRET_KEY | AccessKey secret used to access OSS. |
Job properties (PROPERTIES)
| Parameter | Default | Description |
|---|---|---|
| timeout | 14400 (4 hours) | Timeout for the import job, in seconds. |
| max_filter_ratio | 0 | Maximum proportion of rows that can be filtered out due to data quality issues. Valid values: 0 to 1. |
| exec_mem_limit | 2147483648 (2 GiB) | Maximum memory available to the import job, in bytes. |
| strict_mode | false | Whether to enable strict mode for the import job. |
| timezone | Asia/Shanghai | Time zone for time-related functions such as strftime, alignment_timestamp, and from_unixtime. See List of all time zones. |
| load_parallelism | 1 | Number of concurrent import tasks. Increase this value to speed up large imports. |
| send_batch_parallelism | — | Number of concurrent tasks for sending data in batches. Capped by the BE configuration max_send_batch_parallelism_per_job. |
| load_to_single_tablet | false | Whether to import all data into a single tablet per partition. Applies only to tables in the Duplicate Key model that use random bucketing. |
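As a sketch of how these settings are tuned together for a large import, the PROPERTIES clause below uses illustrative values, not recommended defaults:

```sql
-- Illustrative PROPERTIES clause; values are starting points, not recommendations.
PROPERTIES
(
    "timeout" = "28800",          -- 8 hours, for a job larger than the default allows
    "max_filter_ratio" = "0.01",  -- tolerate up to 1% of rows filtered for quality issues
    "load_parallelism" = "4",     -- run 4 import tasks concurrently
    "strict_mode" = "true"        -- reject rows whose column transformation fails
)
```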
Import data from OSS
This example creates a table, prepares a sample CSV file in OSS, and loads it using OSS Load.
Step 1: Create the target table.
CREATE TABLE test_table
(
id int,
name varchar(50),
age int,
address varchar(50),
url varchar(500)
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");

Step 2: Upload your data file to OSS.
Upload a CSV file named test_file.txt to your OSS bucket. The file should look like:
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
3,xiao,23,shenzhen,http://example.com

Step 3: Submit the import job.
LOAD LABEL test_db.test_label_1
(
DATA INFILE("s3://your_bucket_name/test_file.txt")
INTO TABLE test_table
COLUMNS TERMINATED BY ","
)
WITH S3
(
"AWS_PROVIDER" = "OSS",
"AWS_REGION" = "oss-cn-beijing",
"AWS_ENDPOINT" = "oss-cn-beijing-internal.aliyuncs.com",
"AWS_ACCESS_KEY" = "<your_access_key>",
"AWS_SECRET_KEY" = "<your_secret_key>"
)
PROPERTIES
(
"timeout" = "3600"
);

The LOAD statement has four sections:
LABEL: A unique identifier for this import job, used to query its status and cancel it if needed.
Data declaration: The source file path, format, and destination table.
WITH S3: OSS connection credentials and endpoint.
PROPERTIES: Job-level settings such as timeout.
Monitor and manage import jobs
Check job status
OSS Load is asynchronous. After submitting the job, use SHOW LOAD to track its progress.
SHOW LOAD
[FROM <db_name>]
[
WHERE
[LABEL [ = "your_label" | LIKE "label_matcher"]]
[STATE = ["PENDING"|"ETL"|"LOADING"|"FINISHED"|"CANCELLED"]]
]
[ORDER BY ...]
[LIMIT limit][OFFSET offset];

| Parameter | Default | Description |
|---|---|---|
| db_name | Current database | Database to query. |
| LABEL | — | Filter by label. Supports exact match (=) and pattern match (LIKE). |
| STATE | — | Filter by job state. |
| ORDER BY | — | Sort the results. |
| LIMIT | All records | Maximum number of records to return. |
| OFFSET | 0 | Number of records to skip. |
Examples:
Query jobs in example_db with labels matching 2014_01_02, returning the 10 oldest:
SHOW LOAD FROM example_db WHERE LABEL LIKE "2014_01_02" LIMIT 10;

Query a specific job and sort by start time:
SHOW LOAD FROM example_db WHERE LABEL = "load_example_db_20140102" ORDER BY LoadStartTime DESC;

Query jobs currently in the LOADING state:
SHOW LOAD FROM example_db WHERE LABEL = "load_example_db_20140102" AND STATE = "LOADING";

Query with pagination (skip the first 5, return the next 10):
SHOW LOAD FROM example_db ORDER BY LoadStartTime DESC LIMIT 5,10;
SHOW LOAD FROM example_db ORDER BY LoadStartTime DESC LIMIT 10 OFFSET 5;

Cancel an import job
Cancel a job that is not yet in the FINISHED or CANCELLED state. After cancellation, any data written by the job is rolled back.
CANCEL LOAD
[FROM <db_name>]
WHERE [LABEL = "<load_label>" | LABEL LIKE "<label_pattern>"];

| Parameter | Default | Description |
|---|---|---|
| db_name | Current database | Database containing the import job. |
| load_label | — | Label of the job to cancel. Supports exact match and LABEL LIKE pattern match. |
Examples:
Cancel a specific job:
CANCEL LOAD
FROM example_db
WHERE LABEL = "example_db_test_load_label";

Cancel all jobs whose labels start with example_:
CANCEL LOAD
FROM example_db
WHERE LABEL LIKE "example_";

Troubleshooting
Import times out
The default timeout is 14400 seconds (4 hours). If a job exceeds this limit, avoid simply raising the timeout: long-running jobs that fail late are expensive to retry.
Instead, split large files into smaller ones and import them in multiple jobs.
If your data volume is reasonable but the job still times out, increase load_parallelism to run more import tasks concurrently, and set timeout to match the expected import duration.
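The last point can be made concrete with a back-of-the-envelope calculation. The sketch below estimates a reasonable timeout from the total data size, an assumed per-task throughput (a placeholder figure, not a SelectDB metric), and load_parallelism, with a safety margin on top:

```python
def estimate_timeout_seconds(total_bytes: int,
                             bytes_per_sec_per_task: float,
                             load_parallelism: int = 1,
                             safety_factor: float = 2.0) -> int:
    """Rough timeout estimate: time to move the data at an assumed
    per-task throughput, divided across parallel tasks, with headroom."""
    expected = total_bytes / (bytes_per_sec_per_task * load_parallelism)
    return int(expected * safety_factor)

# 100 GiB at an assumed 50 MiB/s per task, with 4 parallel tasks:
timeout = estimate_timeout_seconds(
    total_bytes=100 * 1024**3,
    bytes_per_sec_per_task=50 * 1024**2,
    load_parallelism=4,
)
print(timeout)  # 1024 seconds with the 2x safety factor
```

If the estimate exceeds the 14400-second default even with higher parallelism, that is the signal to split the files into multiple smaller jobs rather than stretch the timeout.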