ApsaraDB for SelectDB:File analysis

Last Updated: Mar 28, 2026

ApsaraDB for SelectDB provides table-valued functions (TVFs) that map files in remote storage—such as Amazon Simple Storage Service (Amazon S3) and Hadoop Distributed File System (HDFS)—directly to queryable tables. This lets you run SQL against external files without loading them first.

Two TVFs are available: s3() for S3-compatible object storage and hdfs() for HDFS.

Amazon S3 TVF

s3() reads files from any S3-compatible object storage system. Supported formats: CSV, csv_with_names, csv_with_names_and_types, JSON, Parquet, and ORC.

Syntax

s3(
  "uri"           = "<uri>",
  "s3.access_key" = "<access-key>",
  "s3.secret_key" = "<secret-key>",
  "s3.region"     = "<region>",
  "format"        = "<format>"
  [, "s3.session_token" = "<session-token>"]
  [, "use_path_style"   = "true|false"]
  [, "keyn"             = "valuen" ...]
)

Required parameters are listed without brackets. Optional parameters are enclosed in [...].

Parameters

Each parameter is a key-value pair in "key" = "value" format.

Required parameters

| Parameter | Description |
| --- | --- |
| uri | The URI for accessing Amazon S3. Supported schemes: http://, https://, and s3://. If no file matches the URI, or all matched files are empty, the TVF returns an empty result set. |
| s3.access_key | The access key ID. |
| s3.secret_key | The secret access key. |
| s3.region | The Amazon S3 region. Default: us-east-1. |
| format | The file format. Valid values: csv, csv_with_names, csv_with_names_and_types, json, parquet, orc. |

Optional parameters

| Parameter | Default | Description |
| --- | --- | --- |
| s3.session_token | (empty) | The temporary session token. Required when temporary session authentication is enabled. |
| use_path_style | false | Specifies whether to use path style instead of virtual-hosted style. Set this parameter to true for storage systems that do not support virtual-hosted style (for example, MinIO). Note: if the URI uses the s3:// scheme, virtual-hosted style is always used regardless of this setting. |
| column_separator | , | The column delimiter. |
| line_delimiter | \n | The row delimiter. |
| compress_type | unknown | The compression type. unknown infers the type from the URI suffix. Other valid values: plain, gz, lzo, bz2, lz4frame, deflate. |
| read_json_by_line | true | Specifies whether to read JSON data row by row. |
| num_as_string | false | Specifies whether to process numeric values as strings. |
| fuzzy_parse | false | Specifies whether to enable fuzzy parsing, which accelerates JSON imports. |
| jsonpaths | (empty) | The fields to extract from JSON data. Format: jsonpaths: ["$.k2", "$.k1"]. |
| strip_outer_array | false | Specifies whether to treat a top-level JSON array as multiple rows, one element per row. Format: strip_outer_array: true. |
| json_root | (empty) | The root node for JSON parsing. ApsaraDB for SelectDB extracts and parses only the elements under this node. Format: json_root: $.RECORDS. |
| path_partition_keys | (empty) | Comma-separated partition key column names embedded in the file path. For example, for the path /path/to/city=beijing/date=2023-07-09, set this parameter to city,date. ApsaraDB for SelectDB reads the corresponding column names and column values from the path during data import. |
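
The path_partition_keys mechanism can be sketched in Python. The function below is illustrative only and is not part of the SelectDB API; it mirrors how Hive-style key=value directory segments in a file path are recovered as column values:

```python
# Illustrative sketch: recover partition key/value pairs embedded in a
# file path as Hive-style "key=value" directory segments. SelectDB does
# this internally when path_partition_keys is set.
def parse_partition_keys(path: str, keys: list[str]) -> dict[str, str]:
    found = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            if key in keys:
                found[key] = value
    return found

# Equivalent of path_partition_keys = "city,date":
print(parse_partition_keys("/path/to/city=beijing/date=2023-07-09/f.parquet",
                           ["city", "date"]))
# {'city': 'beijing', 'date': '2023-07-09'}
```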

Examples

Read a CSV file from a MinIO-compatible storage system

MinIO uses path style by default, so set use_path_style to true:

SELECT * FROM s3(
    "uri"           = "http://127.0.0.1:9312/test2/student1.csv",
    "s3.access_key" = "minioadmin",
    "s3.secret_key" = "minioadmin",
    "format"        = "csv",
    "use_path_style"= "true")
ORDER BY c1;

Read a Parquet file from Object Storage Service (OSS)

OSS requires virtual-hosted style. Set use_path_style to false:

SELECT * FROM s3(
    "uri"           = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format"        = "parquet",
    "use_path_style"= "false");

Read a CSV file using path style

When use_path_style is true, the bucket name is part of the URI path:

SELECT * FROM s3(
    "uri"           = "https://endpoint/bucket/file/student.csv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format"        = "csv",
    "use_path_style"= "true");

Read a CSV file using virtual-hosted style

When use_path_style is false, the bucket name is part of the hostname:

SELECT * FROM s3(
    "uri"           = "https://bucket.endpoint/file/student.csv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format"        = "csv",
    "use_path_style"= "false");
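
The difference between the two styles comes down to where the bucket name goes in the URL. A minimal sketch, with an illustrative helper that is not part of the SelectDB API:

```python
# Illustrative only: shows where the bucket name lands in each S3
# addressing style. SelectDB derives the request URL from the uri and
# the use_path_style setting.
def s3_url(endpoint: str, bucket: str, key: str, use_path_style: bool) -> str:
    if use_path_style:
        # Path style: the bucket is the first path segment.
        return f"https://{endpoint}/{bucket}/{key}"
    # Virtual-hosted style: the bucket is a hostname prefix.
    return f"https://{bucket}.{endpoint}/{key}"

print(s3_url("endpoint", "bucket", "file/student.csv", True))
# https://endpoint/bucket/file/student.csv
print(s3_url("endpoint", "bucket", "file/student.csv", False))
# https://bucket.endpoint/file/student.csv
```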

HDFS TVF

hdfs() reads files from HDFS in the same way as s3() reads from object storage. Supported formats: CSV, csv_with_names, csv_with_names_and_types, JSON, Parquet, and ORC.

Syntax

hdfs(
  "uri"              = "<uri>",
  "fs.defaultFS"     = "<hostname:port>",
  "hadoop.username"  = "<username>",
  "format"           = "<format>"
  [, "hadoop.security.authentication" = "Simple|Kerberos"]
  [, "keyn"                           = "valuen" ...]
)

Parameters

Required parameters

| Parameter | Description |
| --- | --- |
| uri | The URI for accessing HDFS. If no file matches the URI, or all matched files are empty, the TVF returns an empty result set. |
| fs.defaultFS | The hostname and port of the HDFS NameNode. |
| hadoop.username | The username for HDFS access. Cannot be empty. |
| format | The file format. Valid values: csv, csv_with_names, csv_with_names_and_types, json, parquet, orc. |

Optional parameters

| Parameter | Default | Description |
| --- | --- | --- |
| hadoop.security.authentication | (empty) | The authentication method. Valid values: Simple, Kerberos. |
| hadoop.kerberos.principal | (empty) | The Kerberos principal. Required when Kerberos authentication is enabled. |
| hadoop.kerberos.keytab | (empty) | The path to the Kerberos keytab file. Required when Kerberos authentication is enabled. |
| dfs.client.read.shortcircuit | (empty) | Specifies whether to enable short-circuit reads for local HDFS data (BOOLEAN). |
| dfs.domain.socket.path | (empty) | The UNIX domain socket path for DataNode-to-client communication. The string _PORT in the path is replaced by the TCP port of the DataNode. |
| dfs.nameservices | (empty) | The logical names of the nameservices. Corresponds to dfs.nameservices in hdfs-site.xml. |
| dfs.ha.namenodes.<nameservice> | (empty) | The logical names of the NameNodes in the nameservice. Required for Hadoop high availability (HA) deployments. |
| dfs.namenode.rpc-address.<nameservice>.<namenode> | (empty) | The RPC address on which the NameNode listens. Required for Hadoop HA deployments. |
| dfs.client.failover.proxy.provider.<nameservice> | (empty) | The failover proxy provider implementation class that HA clients use to connect to the active NameNode. Required for Hadoop HA deployments. |
| read_json_by_line | true | Specifies whether to read JSON data row by row. |
| num_as_string | false | Specifies whether to process numeric values as strings. |
| fuzzy_parse | false | Specifies whether to enable fuzzy parsing, which accelerates JSON imports. |
| jsonpaths | (empty) | The fields to extract from JSON data. Format: jsonpaths: ["$.k2", "$.k1"]. |
| strip_outer_array | false | Specifies whether to treat a top-level JSON array as multiple rows, one element per row. Format: strip_outer_array: true. |
| json_root | (empty) | The root node for JSON parsing. Format: json_root: $.RECORDS. |
| trim_double_quotes | false | Specifies whether to trim the outermost double quotation marks from each field in CSV files. |
| skip_lines | 0 | The number of leading rows to skip in CSV files. Valid values: 0 to Integer.MAX_VALUE. Has no effect when format is csv_with_names or csv_with_names_and_types. |
| path_partition_keys | (empty) | Comma-separated partition key column names embedded in the file path. Same behavior as in the S3 TVF. |
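
The CSV-specific options trim_double_quotes and skip_lines can be mirrored in a short Python sketch. The function below is illustrative only, not part of SelectDB; it reproduces the documented semantics on an in-memory list of lines:

```python
# Illustrative sketch of the skip_lines and trim_double_quotes CSV
# options: skip N leading rows, then optionally strip the outermost
# double quotation marks from each field.
def preprocess_csv(lines, skip_lines=0, trim_double_quotes=False,
                   column_separator=","):
    rows = []
    for line in lines[skip_lines:]:
        fields = line.rstrip("\n").split(column_separator)
        if trim_double_quotes:
            fields = [f[1:-1] if f.startswith('"') and f.endswith('"') else f
                      for f in fields]
        rows.append(fields)
    return rows

raw = ['# generated 2023-07-09', '"1","alice","18"']
print(preprocess_csv(raw, skip_lines=1, trim_double_quotes=True))
# [['1', 'alice', '18']]
```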

Examples

Read a CSV file from HDFS

SELECT * FROM hdfs(
    "uri"              = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student.csv",
    "fs.defaultFS"     = "hdfs://127.0.0.1:8424",
    "hadoop.username"  = "doris",
    "format"           = "csv");

-- Sample response
+------+---------+------+
| c1   | c2      | c3   |
+------+---------+------+
| 1    | alice   | 18   |
| 2    | bob     | 20   |
| 3    | jack    | 24   |
| 4    | jackson | 19   |
| 5    | liming  | 18   |
+------+---------+------+

Read a CSV file from HDFS in HA mode

For Hadoop HA deployments, add the HA-specific parameters:

SELECT * FROM hdfs(
    "uri"                                        = "hdfs://127.0.0.1:8424/user/doris/csv_format_test/student.csv",
    "fs.defaultFS"                               = "hdfs://127.0.0.1:8424",
    "hadoop.username"                            = "doris",
    "format"                                     = "csv",
    "dfs.nameservices"                           = "my_hdfs",
    "dfs.ha.namenodes.my_hdfs"                   = "nn1,nn2",
    "dfs.namenode.rpc-address.my_hdfs.nn1"       = "namenode01:8020",
    "dfs.namenode.rpc-address.my_hdfs.nn2"       = "namenode02:8020",
    "dfs.client.failover.proxy.provider.my_hdfs" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

-- Sample response
+------+---------+------+
| c1   | c2      | c3   |
+------+---------+------+
| 1    | alice   | 18   |
| 2    | bob     | 20   |
| 3    | jack    | 24   |
| 4    | jackson | 19   |
| 5    | liming  | 18   |
+------+---------+------+

Query and analyze files

All examples in this section use the Amazon S3 TVF. The same patterns apply to the HDFS TVF.

Inspect the file schema

Use DESC FUNCTION to inspect a file's schema before querying it. ApsaraDB for SelectDB automatically infers column types for Parquet, ORC, CSV, and JSON files.

DESC FUNCTION s3(
    "uri"            = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"  = "ak",
    "s3.secret_key"  = "sk",
    "format"         = "parquet",
    "use_path_style" = "true");

-- Sample response
+---------------+--------------+------+-------+---------+-------+
| Field         | Type         | Null | Key   | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey     | INT          | Yes  | false | NULL    | NONE  |
| p_name        | TEXT         | Yes  | false | NULL    | NONE  |
| p_mfgr        | TEXT         | Yes  | false | NULL    | NONE  |
| p_brand       | TEXT         | Yes  | false | NULL    | NONE  |
| p_type        | TEXT         | Yes  | false | NULL    | NONE  |
| p_size        | INT          | Yes  | false | NULL    | NONE  |
| p_container   | TEXT         | Yes  | false | NULL    | NONE  |
| p_retailprice | DECIMAL(9,0) | Yes  | false | NULL    | NONE  |
| p_comment     | TEXT         | Yes  | false | NULL    | NONE  |
+---------------+--------------+------+-------+---------+-------+

For CSV files, all columns are inferred as STRING by default. To specify column names and types explicitly, use the csv_schema parameter with the format name1:type1;name2:type2;...:

SELECT * FROM s3(
    "uri"              = "https://bucket1/inventory.dat",
    "s3.access_key"    = "ak",
    "s3.secret_key"    = "sk",
    "format"           = "csv",
    "column_separator" = "|",
    "csv_schema"       = "k1:int;k2:int;k3:int;k4:decimal(38,10)",
    "use_path_style"   = "true");

If a specified type does not match the actual data, or if you specify more columns than the file contains, ApsaraDB for SelectDB returns NULL for those columns.

The following column types are supported in csv_schema:

| Specified type | Mapped type |
| --- | --- |
| tinyint | tinyint |
| smallint | smallint |
| int | int |
| bigint | bigint |
| largeint | largeint |
| float | float |
| double | double |
| decimal(p,s) | decimalv3(p,s) |
| date | datev2 |
| datetime | datetimev2 |
| char | string |
| varchar | string |
| string | string |
| boolean | boolean |
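
A csv_schema value is a simple semicolon-separated list of name:type pairs, which can be sketched in Python. The helper below is illustrative only; SelectDB parses the string internally:

```python
# Illustrative: split a csv_schema value of the form
# "name1:type1;name2:type2;..." into (name, type) pairs. Splitting on
# the first colon keeps parameterized types such as decimal(38,10) intact.
def parse_csv_schema(spec: str) -> list[tuple[str, str]]:
    columns = []
    for item in spec.split(";"):
        name, _, col_type = item.partition(":")
        columns.append((name, col_type))
    return columns

print(parse_csv_schema("k1:int;k2:int;k3:int;k4:decimal(38,10)"))
# [('k1', 'int'), ('k2', 'int'), ('k3', 'int'), ('k4', 'decimal(38,10)')]
```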

Run SQL queries

Use the TVF anywhere a table name is valid in SQL—including FROM clauses and common table expressions (CTEs):

SELECT * FROM s3(
    "uri"            = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"  = "ak",
    "s3.secret_key"  = "sk",
    "format"         = "parquet",
    "use_path_style" = "true")
LIMIT 5;

-- Sample response
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name                                   | p_mfgr         | p_brand  | p_type                  | p_size | p_container | p_retailprice | p_comment           |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
|         1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER  |      7 | JUMBO PKG   |           901 | ly. slyly ironi     |
|         2 | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS     |      1 | LG CASE     |           902 | lar accounts amo    |
|         3 | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS |     21 | WRAP CASE   |           903 | egular deposits hag |
|         4 | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS      |     14 | MED DRUM    |           904 | p furiously r       |
|         5 | forest brown coral puff cream            | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN   |     15 | SM PKG      |           905 |  wake carefully     |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+

Create a view

Create a view over a TVF to share access and manage permissions without exposing credentials in each query:

CREATE VIEW v1 AS
SELECT * FROM s3(
    "uri"            = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"  = "ak",
    "s3.secret_key"  = "sk",
    "format"         = "parquet",
    "use_path_style" = "true");

DESC v1;

SELECT * FROM v1;

GRANT SELECT_PRIV ON db1.v1 TO user1;

Import file data into a table

Use INSERT INTO SELECT with a TVF to load file data into an ApsaraDB for SelectDB table:

-- Step 1: Create a target table.
CREATE TABLE IF NOT EXISTS test_table
(
    id   INT,
    name VARCHAR(50),
    age  INT
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");

-- Step 2: Insert data from the S3 file.
INSERT INTO test_table (id, name, age)
SELECT CAST(id AS INT) AS id, name, CAST(age AS INT) AS age
FROM s3(
    "uri"            = "http://127.0.0.1:9312/test2/test.snappy.parquet",
    "s3.access_key"  = "ak",
    "s3.secret_key"  = "sk",
    "format"         = "parquet",
    "use_path_style" = "true");

Usage notes

  • Empty URI or no matching files: If the URI does not exist or all matched files are empty, the TVF returns an empty result set. Running DESC FUNCTION in this case returns a dummy column __dummy_col, which can be ignored.

  • Empty first row in CSV files: If the file format is CSV and the file is non-empty but the first row is empty, the following error is returned: The first line is empty, can not parse column numbers. Make sure the first row of your CSV file is not empty.

  • Type mismatches in CSV schema: When you specify column types with csv_schema and a value does not match the declared type—for example, a string value in a column declared as INT—ApsaraDB for SelectDB returns NULL for that value rather than raising an error.