Starting with Hologres V4.1, you can use the EXTERNAL_FILES function to directly query, import, and export structured data files stored in Object Storage Service (OSS) without creating foreign tables. This topic describes prerequisites, syntax, parameters, examples, and limits.
Overview
EXTERNAL_FILES lets you run SQL queries directly against CSV, Parquet, and ORC files in OSS — no foreign table required. Use it for three scenarios:
-
Query data: Run SELECT statements against CSV, Parquet, and ORC files in OSS.
-
Import data: Load data from OSS files into a Hologres internal table.
-
Export data: Write data from a Hologres table to OSS as CSV files.
Prerequisites
Before you begin, ensure that you have:
-
A Hologres instance running V4.1 or later. To upgrade, see Instance upgrades.
-
An Alibaba Cloud account or RAM user. Custom accounts are not supported.
Permission configuration
EXTERNAL_FILES accesses OSS through a RAM role. Complete the following steps before running EXTERNAL_FILES.
Step 1: Create a RAM role
-
Log on to the RAM console and go to RAM Roles.
-
Click Create Role.
-
In the dialog box, set Trusted Entity Type to Cloud Service and Trusted Entity to Real-time Data Warehouse Hologres.
-
Enter a role name and complete the creation.
Step 2: Grant OSS permissions to the role
Attach an OSS policy to the RAM role based on the operations you need:
-
Read data: Attach
AliyunOSSReadOnlyAccess. -
Write data: Attach
AliyunOSSFullAccess.
Step 3: Grant the GrantAssumeRole permission to RAM users
Skip this step if the user running the SQL is an Alibaba Cloud account. For RAM users, grant them the GrantAssumeRole permission as follows:
-
In the RAM console, go to Policies and click Create Policy.
-
Switch to Script Editor and paste the following JSON. Replace
<RoleARN>with the ARN of the role you created. To find the ARN, see FAQ about RAM roles and STS tokens.{ "Version": "1", "Statement": [ { "Effect": "Allow", "Action": "hologram:GrantAssumeRole", "Resource": "<RoleARN>" } ] } -
Attach this policy to the user:
-
For a RAM user: go to Users > Permissions > Add Permissions.
-
For a RAM role: go to Roles > Permissions > Add Permissions.
-
Syntax
Query data
SELECT * FROM EXTERNAL_FILES(
path = 'oss://bucket/path',
format = 'csv|parquet|orc',
oss_endpoint = 'oss_endpoint',
role_arn = 'acs:ram::xxx:role/xxx'
[, other parameters...]
) [AS (col1 type1, col2 type2, ...)]
Import data
INSERT INTO target_table
SELECT * FROM EXTERNAL_FILES(
path = 'oss://bucket/path',
format = 'csv|parquet|orc',
oss_endpoint = 'oss_endpoint',
role_arn = 'acs:ram::xxx:role/xxx'
[, other parameters...]
) [AS (col1 type1, col2 type2, ...)]
Export data
INSERT INTO EXTERNAL_FILES(
path = 'oss://bucket/path',
format = 'csv',
oss_endpoint = 'oss_endpoint',
role_arn = 'acs:ram::xxx:role/xxx'
[, other parameters...]
) SELECT * FROM source_table;
Parameters
Common parameters
| Parameter | Type | Description | Required | Default | Example |
|---|---|---|---|---|---|
path |
STRING | The OSS file path. Supports directories, individual files, and multiple comma-separated paths. Supports wildcards — see table below. | Yes | — | oss://bucket/dir/, oss://bucket/dir/*.csv |
format |
STRING | The file format. Supported formats for queries: csv, parquet, orc. Exports support csv only. |
Yes | — | csv |
oss_endpoint |
STRING | The OSS classic network endpoint. For endpoint values, see Regions and endpoints. | No | — | oss-cn-hangzhou-internal.aliyuncs.com |
role_arn |
STRING | The ARN of the RAM role used to access OSS. | No | — | acs:ram::xxx:role/xxx |
Wildcards supported in `path`
| Wildcard | Description | Example |
|---|---|---|
* |
Matches zero or more characters | oss://bucket/dir/*.csv |
? |
Matches any single character | oss://bucket/dir/file?.csv |
Read parameters
| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
schema_deduce_file_num |
INT | Maximum number of files to scan when inferring the schema. | No | 5 |
schema_deduce_file_order |
STRING | Order in which files are scanned for schema inference. Valid values: latest_first, earliest_first, random. |
No | latest_first |
csv_skip_header |
BOOLEAN | Specifies whether to treat the first row of a CSV file as a header and skip it during reads. | No | false |
csv_delimiter |
STRING | The column delimiter for CSV files. | No | , (comma) |
Write parameters
| Parameter | Type | Description | Required | Default |
|---|---|---|---|---|
target_file_size_mb |
INT | The target size of each output file, in MB. | No | 10 |
single_file |
BOOLEAN | Specifies whether to write all exported data into a single file. | No | false |
Examples
Query a CSV file
Query a CSV file with automatic schema inference:
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/data/',
format = 'csv',
csv_skip_header = 'true',
csv_delimiter = ',',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Query a CSV file with a manually specified schema:
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/data/',
format = 'csv',
csv_skip_header = 'true',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
) AS (id int, name text, amount decimal(10,2));
Query a Parquet file
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/parquet_data/',
format = 'parquet',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Query an ORC file
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/orc_data/',
format = 'orc',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Import data into a Hologres table
-- Create the target table
CREATE TABLE orders (
order_id int,
customer_name text,
amount decimal(10,2),
PRIMARY KEY(order_id)
);
-- Import data from OSS
INSERT INTO orders
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/orders/',
format = 'csv',
csv_skip_header = 'true',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
) AS (order_id int, customer_name text, amount decimal(10,2));
Export data to OSS
Export to multiple files with a size limit per file:
INSERT INTO EXTERNAL_FILES(
path = 'oss://mybucket/export/',
format = 'csv',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role',
target_file_size_mb = '100'
) SELECT * FROM orders;
Export all data to a single file:
INSERT INTO EXTERNAL_FILES(
path = 'oss://mybucket/export/',
format = 'csv',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role',
single_file = 'true'
) SELECT * FROM orders;
Use a serverless resource group
SET hg_computing_resource = 'serverless';
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/data/',
format = 'csv',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Schema inference
Automatic inference
If you omit the AS clause, Hologres infers the schema automatically:
-
Parquet and ORC: Schema is inferred from file metadata.
-
CSV with a header row: Schema is inferred from the header and data content. Set
csv_skip_header = 'true'so the header row is not read as data. -
CSV without a header row: All files in the path must have identical column structures.
Hologres scans up to schema_deduce_file_num files (default: 5) and takes the union of all discovered schemas.
Schema mismatch behavior
When the inferred or specified schema does not exactly match the file contents, Hologres handles mismatches as follows:
| Scenario | Behavior |
|---|---|
| Column exists in schema but not in the file | Filled with NULL |
| Column exists in the file but not in schema | Ignored |
| Column types differ but are convertible | Automatic type conversion |
| Column types differ and are not convertible | Returns NULL |
Example: querying files with mismatched columns
Suppose you have two CSV files in the same path:
orders_jan.csv — 3 columns: order_id, customer_name, amount
orders_feb.csv — 4 columns: order_id, customer_name, amount, region
When you query the path without specifying a schema, Hologres takes the union of both schemas. Rows from orders_jan.csv return NULL in the region column:
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/orders/',
format = 'csv',
csv_skip_header = 'true',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
| order_id | customer_name | amount | region |
|---|---|---|---|
| 1001 | Alice | 99.00 | NULL |
| 1002 | Bob | 149.00 | NULL |
| 2001 | Carol | 79.00 | us-west |
| 2002 | Dave | 199.00 | eu-central |
To avoid unexpected NULLs, specify the schema explicitly using the AS clause, or ensure all files share the same column structure.
Type mapping
ORC type mapping
| ORC type | PostgreSQL type |
|---|---|
| BOOLEAN | BOOLEAN |
| TINYINT / SMALLINT | SMALLINT |
| INT | INTEGER |
| BIGINT | BIGINT |
| FLOAT | REAL |
| DOUBLE | DOUBLE PRECISION |
| DECIMAL(p, s) | DECIMAL(p, s) |
| STRING | TEXT |
| VARCHAR(n) | VARCHAR(n) |
| CHAR(n) | CHAR(n) |
| BINARY | BYTEA |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP WITHOUT TIME ZONE |
| TIMESTAMP WITH LOCAL TIMEZONE | TIMESTAMP WITH TIME ZONE |
| LIST | Array (pg_type[]) |
ORC typesUNION,STRUCT, andMAPare not supported.
Parquet type mapping
| Parquet physical type | Parquet logical type | PostgreSQL type |
|---|---|---|
| BOOLEAN | — | BOOLEAN |
| INT32 | — | INTEGER |
| INT32 | DATE | DATE |
| INT32 | DECIMAL(p, s) | DECIMAL(p, s) |
| INT64 | — | BIGINT |
| INT64 | TIMESTAMP_MILLIS / MICROS | TIMESTAMP / TIMESTAMPTZ |
| INT64 | DECIMAL(p, s) | DECIMAL(p, s) |
| FLOAT | — | REAL |
| DOUBLE | — | DOUBLE PRECISION |
| BYTE_ARRAY | — | BYTEA |
| BYTE_ARRAY | STRING | TEXT |
| BYTE_ARRAY | JSON / BSON | JSONB |
| BYTE_ARRAY | ENUM | TEXT |
| FIXED_LEN_BYTE_ARRAY | DECIMAL(p, s) | DECIMAL(p, s) |
| FIXED_LEN_BYTE_ARRAY | UUID | UUID |
| LIST | LIST | Array (pg_type[]) |
Parquet typesSTRUCTandMAPare not supported.
Limitations
-
Export format: Only CSV is supported for exports.
-
Recursive directory search: Subdirectories are not searched recursively. Use wildcards in the
pathparameter to match files across a flat directory. -
Unsupported ORC types:
UNION,STRUCT, andMAP. -
Unsupported Parquet types:
STRUCTandMAP. -
Supported accounts: Only RAM users and Alibaba Cloud accounts. Custom accounts are not supported.
FAQ
I get a permission error when exporting data. What should I do?
Make sure the RAM role has the AliyunOSSFullAccess permission. Read operations only require AliyunOSSReadOnlyAccess, but write operations (exports) require full access.
How do I control the number and size of exported files?
Use target_file_size_mb to set the target size for each output file (default: 10 MB). For finer-grained control over the number of rows per file, adjust the hg_experimental_query_batch_size parameter.