ApsaraDB for SelectDB provides table-valued functions (TVFs) that map files in remote storage—such as Amazon Simple Storage Service (Amazon S3) and Hadoop Distributed File System (HDFS)—directly to queryable tables. This lets you run SQL against external files without loading them first.
Two TVFs are available: s3() for S3-compatible object storage and hdfs() for HDFS.
Amazon S3 TVF
s3() reads files from any S3-compatible object storage system. Supported formats: CSV, csv_with_names, csv_with_names_and_types, JSON, Parquet, and ORC.
Syntax
s3(
"uri" = "<uri>",
"s3.access_key" = "<access-key>",
"s3.secret_key" = "<secret-key>",
"s3.region" = "<region>",
"format" = "<format>"
[, "s3.session_token" = "<session-token>"]
[, "use_path_style" = "true|false"]
[, "keyn" = "valuen" ...]
)
Required parameters are listed without brackets. Optional parameters are enclosed in [...].
Parameters
Each parameter is a key-value pair in "key" = "value" format.
Required parameters
| Parameter | Description |
|---|---|
| uri | The URI for accessing Amazon S3. Supported schemas: http://, https://, and s3://. If no file matches the URI or all matched files are empty, the TVF returns an empty result set. |
| s3.access_key | The access key ID. |
| s3.secret_key | The secret access key. |
| s3.region | The Amazon S3 region. Default: us-east-1. |
| format | The file format. Valid values: csv, csv_with_names, csv_with_names_and_types, json, parquet, orc. |
Optional parameters
| Parameter | Default | Description |
|---|---|---|
| s3.session_token | — | The temporary session token. Required when temporary session authentication is enabled. |
| use_path_style | false | Controls whether to use path style instead of virtual-hosted style. Set to true for storage systems that do not support virtual-hosted style (for example, MinIO). Note: If the URI uses the s3:// schema, virtual-hosted style is always used and this parameter is ignored. |
| column_separator | , | The column delimiter. |
| line_delimiter | \n | The row delimiter. |
| compress_type | unknown | The compression type. unknown auto-infers the type from the URI suffix. Other valid values: plain, gz, lzo, bz2, lz4frame, deflate. |
| read_json_by_line | true | Reads JSON data row by row. |
| num_as_string | false | Processes numeric values as strings. |
| fuzzy_parse | false | Accelerates JSON import performance. |
| jsonpaths | — | The fields to extract from JSON data. Format: jsonpaths: ["$.k2", "$.k1"]. |
| strip_outer_array | false | Treats a top-level JSON array as multiple rows, one element per row. Format: strip_outer_array: true. |
| json_root | (empty) | The root node for JSON parsing. ApsaraDB for SelectDB extracts and parses only the elements under this node. Format: json_root: $.RECORDS. |
| path_partition_keys | — | Comma-separated partition key column names embedded in the file path. For example, for the path /path/to/city=beijing/date=2023-07-09, set this to city,date. ApsaraDB for SelectDB reads the corresponding column names and column values from the path during data import. |
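The JSON and path options above can be combined in a single call. The following is a minimal sketch, assuming a hypothetical bucket, credentials, and file layout: the first query reads a file whose content is a top-level JSON array, and the second extracts partition columns from a Hive-style path.

```sql
-- Read a JSON file whose content is a top-level array, e.g. [{"k1": 1, "k2": "a"}, ...].
-- strip_outer_array emits one row per array element; jsonpaths picks the fields to keep.
SELECT * FROM s3(
    "uri" = "https://example-bucket.s3.example.com/data/records.json",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "json",
    "strip_outer_array" = "true",
    "jsonpaths" = "[\"$.k1\", \"$.k2\"]");

-- Read a CSV file under a Hive-style partitioned path; city and date become columns
-- whose values are taken from the path segments city=beijing and date=2023-07-09.
SELECT * FROM s3(
    "uri" = "https://example-bucket.s3.example.com/path/to/city=beijing/date=2023-07-09/data.csv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "path_partition_keys" = "city,date");
```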
Examples
Read a CSV file from a MinIO-compatible storage system
MinIO uses path style by default, so set use_path_style to true:
SELECT * FROM s3(
"uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key" = "minioadmin",
"s3.secret_key" = "minioadmin",
"format" = "csv",
"use_path_style" = "true")
ORDER BY c1;
Read a Parquet file from Object Storage Service (OSS)
OSS requires virtual-hosted style. Set use_path_style to false:
SELECT * FROM s3(
"uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"format" = "parquet",
"use_path_style" = "false");
Read a CSV file using path style
When use_path_style is true, the bucket name is part of the URI path:
SELECT * FROM s3(
"uri" = "https://endpoint/bucket/file/student.csv",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style" = "true");
Read a CSV file using virtual-hosted style
When use_path_style is false, the bucket name is part of the hostname:
SELECT * FROM s3(
"uri" = "https://bucket.endpoint/file/student.csv",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style" = "false");
HDFS TVF
hdfs() reads files from HDFS in the same way as s3() reads from object storage. Supported formats: CSV, csv_with_names, csv_with_names_and_types, JSON, Parquet, and ORC.
Syntax
hdfs(
"uri" = "<uri>",
"fs.defaultFS" = "<hostname:port>",
"hadoop.username" = "<username>",
"format" = "<format>"
[, "hadoop.security.authentication" = "Simple|Kerberos"]
[, "keyn" = "valuen" ...]
)
Parameters
Required parameters
| Parameter | Description |
|---|---|
| uri | The URI for accessing HDFS. If no file matches the URI or all matched files are empty, the TVF returns an empty result set. |
| fs.defaultFS | The hostname and port of the HDFS NameNode. |
| hadoop.username | The username for HDFS access. Cannot be empty. |
| format | The file format. Valid values: csv, csv_with_names, csv_with_names_and_types, json, parquet, orc. |
Optional parameters
| Parameter | Default | Description |
|---|---|---|
| hadoop.security.authentication | — | The authentication method. Valid values: Simple, Kerberos. |
| hadoop.kerberos.principal | — | The Kerberos principal. Required when Kerberos authentication is enabled. |
| hadoop.kerberos.keytab | — | The path to the Kerberos keytab file. Required when Kerberos authentication is enabled. |
| dfs.client.read.shortcircuit | — | Enables short-circuit reads for local HDFS data (BOOLEAN). |
| dfs.domain.socket.path | — | The UNIX domain socket path for DataNode-to-client communication. The string _PORT in the path is replaced by the DataNode TCP port. |
| dfs.nameservices | — | The logical names of the nameservices. Corresponds to dfs.nameservices in hdfs-site.xml. |
| dfs.ha.namenodes.<nameservice> | — | The logical names of the NameNodes. Required for Hadoop high availability (HA) deployments. |
| dfs.namenode.rpc-address.<nameservice>.<namenode> | — | The RPC address on which the NameNode listens. Required for Hadoop HA deployments. |
| dfs.client.failover.proxy.provider.<nameservice> | — | The failover proxy provider implementation class for HA client connections. Required for Hadoop HA deployments. |
| read_json_by_line | true | Reads JSON data row by row. |
| num_as_string | false | Processes numeric values as strings. |
| fuzzy_parse | false | Accelerates JSON import performance. |
| jsonpaths | — | The fields to extract from JSON data. Format: jsonpaths: ["$.k2", "$.k1"]. |
| strip_outer_array | false | Treats a top-level JSON array as multiple rows, one element per row. Format: strip_outer_array: true. |
| json_root | (empty) | The root node for JSON parsing. Format: json_root: $.RECORDS. |
| trim_double_quotes | false | Trims the outermost double quotation marks from each field in CSV files. |
| skip_lines | 0 | The number of leading rows to skip in CSV files. Range: [0, Integer.MAX_VALUE]. Has no effect when format is csv_with_names or csv_with_names_and_types. |
| path_partition_keys | — | Comma-separated partition key column names embedded in the file path. Same behavior as the S3 TVF. |
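For a Kerberos-secured cluster, the Kerberos parameters above are passed together with the authentication method. The following is a minimal sketch, assuming a hypothetical principal, keytab path, and file location (the keytab file must be readable on the cluster nodes that execute the query):

```sql
SELECT * FROM hdfs(
    "uri" = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student.csv",
    "fs.defaultFS" = "hdfs://127.0.0.1:8424",
    "hadoop.username" = "doris",
    "format" = "csv",
    "hadoop.security.authentication" = "kerberos",
    "hadoop.kerberos.principal" = "doris@EXAMPLE.COM",       -- hypothetical principal
    "hadoop.kerberos.keytab" = "/etc/security/doris.keytab"); -- hypothetical keytab path
```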
Examples
Read a CSV file from HDFS
SELECT * FROM hdfs(
"uri" = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student.csv",
"fs.defaultFS" = "hdfs://127.0.0.1:8424",
"hadoop.username" = "doris",
"format" = "csv");
-- Sample response
+------+---------+------+
| c1 | c2 | c3 |
+------+---------+------+
| 1 | alice | 18 |
| 2 | bob | 20 |
| 3 | jack | 24 |
| 4 | jackson | 19 |
| 5 | liming | 18 |
+------+---------+------+
Read a CSV file from HDFS in HA mode
For Hadoop HA deployments, add the three HA-specific parameters:
SELECT * FROM hdfs(
"uri" = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student.csv",
"fs.defaultFS" = "hdfs://127.0.0.1:8424",
"hadoop.username" = "doris",
"format" = "csv",
"dfs.nameservices" = "my_hdfs",
"dfs.ha.namenodes.my_hdfs" = "nn1,nn2",
"dfs.namenode.rpc-address.my_hdfs.nn1" = "namenode01:8020",
"dfs.namenode.rpc-address.my_hdfs.nn2" = "namenode02:8020",
"dfs.client.failover.proxy.provider.my_hdfs" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
-- Sample response
+------+---------+------+
| c1 | c2 | c3 |
+------+---------+------+
| 1 | alice | 18 |
| 2 | bob | 20 |
| 3 | jack | 24 |
| 4 | jackson | 19 |
| 5 | liming | 18 |
+------+---------+------+
Query and analyze files
All examples in this section use the Amazon S3 TVF. The same patterns apply to the HDFS TVF.
Inspect the file schema
Use DESC FUNCTION to inspect a file's schema before querying it. ApsaraDB for SelectDB automatically infers column types for Parquet, ORC, CSV, and JSON files.
DESC FUNCTION s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"format" = "parquet",
"use_path_style" = "true");
-- Sample response
+---------------+--------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey | INT | Yes | false | NULL | NONE |
| p_name | TEXT | Yes | false | NULL | NONE |
| p_mfgr | TEXT | Yes | false | NULL | NONE |
| p_brand | TEXT | Yes | false | NULL | NONE |
| p_type | TEXT | Yes | false | NULL | NONE |
| p_size | INT | Yes | false | NULL | NONE |
| p_container | TEXT | Yes | false | NULL | NONE |
| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
| p_comment | TEXT | Yes | false | NULL | NONE |
+---------------+--------------+------+-------+---------+-------+
For CSV files, all columns are inferred as STRING by default. To specify column names and types explicitly, use the csv_schema parameter with the format name1:type1;name2:type2;...:
SELECT * FROM s3(
"uri" = "https://bucket1/inventory.dat",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"column_separator" = "|",
"csv_schema" = "k1:int;k2:int;k3:int;k4:decimal(38,10)",
"use_path_style" = "true");
If a specified type does not match the actual data, or if you specify more columns than the file contains, ApsaraDB for SelectDB returns NULL for those columns.
The following column types are supported in csv_schema:
| Specified type | Mapped type |
|---|---|
| tinyint | tinyint |
| smallint | smallint |
| int | int |
| bigint | bigint |
| largeint | largeint |
| float | float |
| double | double |
| decimal(p,s) | decimalv3(p,s) |
| date | datev2 |
| datetime | datetimev2 |
| char | string |
| varchar | string |
| string | string |
| boolean | boolean |
Run SQL queries
Use the TVF anywhere a table name is valid in SQL—including FROM clauses and common table expressions (CTEs):
SELECT * FROM s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"format" = "parquet",
"use_path_style" = "true")
LIMIT 5;
-- Sample response
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
Create a view
Create a view over a TVF to share access and manage permissions without exposing credentials in each query:
CREATE VIEW v1 AS
SELECT * FROM s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"format" = "parquet",
"use_path_style" = "true");
DESC v1;
SELECT * FROM v1;
GRANT SELECT_PRIV ON db1.v1 TO user1;
Import file data into a table
Use INSERT INTO SELECT with a TVF to load file data into an ApsaraDB for SelectDB table:
-- Step 1: Create a target table.
CREATE TABLE IF NOT EXISTS test_table
(
id INT,
name VARCHAR(50),
age INT
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");
-- Step 2: Insert data from the S3 file.
INSERT INTO test_table (id, name, age)
SELECT CAST(id AS INT) AS id, name, CAST(age AS INT) AS age
FROM s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"format" = "parquet",
"use_path_style" = "true");
Usage notes
- Empty URI or no matching files: If the URI does not exist or all matched files are empty, the TVF returns an empty result set. Running DESC FUNCTION in this case returns a dummy column __dummy_col, which can be ignored.
- Empty first row in CSV files: If the file format is CSV and the file is non-empty but the first row is empty, the following error is returned: The first line is empty, can not parse column numbers. Make sure the first row of your CSV file is not empty.
- Type mismatches in CSV schema: When you specify column types with csv_schema and a value does not match the declared type (for example, a string value in a column declared as INT), ApsaraDB for SelectDB returns NULL for that value rather than raising an error.