Starting with Hologres V4.1, you can use the EXTERNAL_FILES function to directly query, import, and export structured data files stored in Object Storage Service (OSS) without creating foreign tables. This topic describes prerequisites, syntax, parameters, examples, and limits.
Overview
The EXTERNAL_FILES function enables you to query, import, and export structured data files in OSS without creating foreign tables. It supports the following scenarios:
- Query data: Query CSV, Parquet, and ORC files in OSS.
- Import data: Import data from OSS into a Hologres internal table.
- Export data: Export data from a Hologres table to OSS.
Prerequisites
Version requirement
Your Hologres instance must run version V4.1 or later. If your instance runs an earlier version, upgrade it. For more information, see Instance upgrades.
Permission configuration
Only RAM users or Alibaba Cloud accounts can use EXTERNAL_FILES; custom accounts are not supported. EXTERNAL_FILES accesses OSS through a RAM role, so complete the following steps before you use it: first create a RAM role that trusts the Hologres service, and then grant OSS permissions to that role.
Create a role
1. Log on to the RAM console and go to the Roles page. Click Create Role.
2. In the dialog box, set Trusted Entity Type to Cloud Service and Trusted Entity to Real-time Data Warehouse Hologres.
3. Enter a role name and complete the creation.
Configure OSS access permissions
Grant OSS permissions to the RAM role:
- Read data: Grant the AliyunOSSReadOnlyAccess permission.
- Write data: Grant the AliyunOSSFullAccess permission.
Grant the GrantAssumeRole permission to the SQL user
If the user who runs the SQL is not an Alibaba Cloud account, first grant them the GrantAssumeRole permission:
In the RAM console, create a permission policy: go to Policies > Create Policy > Script Editor and paste the following JSON. Replace <RoleARN> with the same role_arn that you use in EXTERNAL_FILES. To find the role ARN, see FAQ about RAM roles and STS tokens.
{
  "Version": "1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "hologram:GrantAssumeRole",
      "Resource": "<RoleARN>"
    }
  ]
}
Then attach this policy to the user who runs the query. For a RAM user, go to Users > Permissions > Add Permissions. For a RAM role, go to Roles > Permissions > Add Permissions.
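The policy document is plain JSON, so it can be generated programmatically before you paste it into the Script Editor. The sketch below is an illustrative helper, not part of any Alibaba Cloud SDK; it simply substitutes a role ARN into the policy shown above.

```python
import json

def build_assume_role_policy(role_arn):
    """Render the GrantAssumeRole policy JSON for a given role ARN.

    Mirrors the policy document shown above; the helper itself is an
    illustrative sketch, not an Alibaba Cloud API.
    """
    return json.dumps({
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "hologram:GrantAssumeRole",
                "Resource": role_arn,
            }
        ],
    }, indent=2)

# The ARN below is a placeholder; use your own role's ARN.
policy = build_assume_role_policy("acs:ram::123456789:role/hologres-role")
```

Paste the resulting string into the Script Editor when creating the policy.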
Syntax
Query data
SELECT * FROM EXTERNAL_FILES(
path = 'oss://bucket/path',
format = 'csv|parquet|orc',
oss_endpoint = 'oss_endpoint',
role_arn = 'acs:ram::xxx:role/xxx'
[, other parameters...]
) [AS (col1 type1, col2 type2, ...)]
Import data
INSERT INTO target_table
SELECT * FROM EXTERNAL_FILES(
path = 'oss://bucket/path',
format = 'csv|parquet|orc',
oss_endpoint = 'oss_endpoint',
role_arn = 'acs:ram::xxx:role/xxx'
[, other parameters...]
) [AS (col1 type1, col2 type2, ...)]
Export data
INSERT INTO EXTERNAL_FILES(
path = 'oss://bucket/path',
format = 'csv',
oss_endpoint = 'oss_endpoint',
role_arn = 'acs:ram::xxx:role/xxx'
[, other parameters...]
) SELECT * FROM source_table;
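The three statement shapes above differ only in where the EXTERNAL_FILES call appears. If you generate these statements from application code, a small client-side builder keeps the `key = 'value'` argument list consistent. This helper is a sketch of that idea, not a Hologres API; parameter quoting here is naive and assumes values contain no single quotes.

```python
def external_files_args(path, fmt, oss_endpoint, role_arn, **extra):
    """Render the key = 'value' argument list used by EXTERNAL_FILES."""
    params = {"path": path, "format": fmt,
              "oss_endpoint": oss_endpoint, "role_arn": role_arn, **extra}
    return ",\n  ".join(f"{key} = '{value}'" for key, value in params.items())

def query_sql(path, fmt, oss_endpoint, role_arn, **extra):
    """Build the query-data shape: SELECT * FROM EXTERNAL_FILES(...)."""
    return ("SELECT * FROM EXTERNAL_FILES(\n  "
            + external_files_args(path, fmt, oss_endpoint, role_arn, **extra)
            + "\n);")

# Values below are placeholders taken from the examples in this topic.
sql = query_sql("oss://mybucket/data/", "csv",
                "oss-cn-hangzhou-internal.aliyuncs.com",
                "acs:ram::123456789:role/hologres-role",
                csv_skip_header="true")
```

The import and export shapes are the same argument list wrapped in `INSERT INTO target_table SELECT * FROM EXTERNAL_FILES(...)` and `INSERT INTO EXTERNAL_FILES(...) SELECT ...` respectively.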
Parameters
Common parameters
| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| path | The file path. Supports directories and files. You can specify multiple paths separated by commas. | Yes | `oss://mybucket/data/` |
| format | The file format. For queries, supports csv, parquet, and orc. For exports, only csv. | Yes | `csv` |
| oss_endpoint | The OSS classic network endpoint. For more information, see Regions and endpoints. | No | `oss-cn-hangzhou-internal.aliyuncs.com` |
| role_arn | The ARN of the Alibaba Cloud RAM role. | No | `acs:ram::123456789:role/hologres-role` |
Read parameters
| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| schema_deduce_file_num | The maximum number of files to scan when inferring the schema. Default: 5. | No | `5` |
|  | The order in which files are scanned for schema inference. Default: latest_first. | No | `latest_first` |
| csv_skip_header | Whether to skip the first row of a CSV file as the table header. | No | `true` |
| csv_delimiter | The column delimiter for CSV files. Default: comma (`,`). | No | `,` |
Write parameters
| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| target_file_size_mb | The size of each output file in MB. Default: 10. | No | `100` |
| single_file | Whether to write all data to a single file. | No | `true` |
Examples
Example 1: Query a CSV file
Query a CSV file in OSS and infer the schema automatically:
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/data/',
format = 'csv',
csv_skip_header = 'true',
csv_delimiter = ',',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Query a CSV file and specify the schema manually:
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/data/',
format = 'csv',
csv_skip_header = 'true',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
) AS (id int, name text, amount decimal(10,2));
Example 2: Query a Parquet file
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/parquet_data/',
format = 'parquet',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Example 3: Query an ORC file
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/orc_data/',
format = 'orc',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Example 4: Import data into a Hologres table
-- Create the target table
CREATE TABLE orders (
order_id int,
customer_name text,
amount decimal(10,2),
PRIMARY KEY(order_id)
);
-- Import data from OSS
INSERT INTO orders
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/orders/',
format = 'csv',
csv_skip_header = 'true',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
) AS (order_id int, customer_name text, amount decimal(10,2));
Example 5: Export data to OSS
Export data to multiple files:
INSERT INTO EXTERNAL_FILES(
path = 'oss://mybucket/export/',
format = 'csv',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role',
target_file_size_mb = '100'
) SELECT * FROM orders;
Export data to a single file:
INSERT INTO EXTERNAL_FILES(
path = 'oss://mybucket/export/',
format = 'csv',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role',
single_file = 'true'
) SELECT * FROM orders;
Example 6: Use a serverless resource group
SET hg_computing_resource = 'serverless';
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/data/',
format = 'csv',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Schema inference rules
Automatic schema inference
If you do not specify a schema using the AS clause, Hologres infers the schema automatically:
- Parquet/ORC: Infer from file metadata.
- CSV with headers: Infer from the first row (header) and data content.
- CSV without headers: All files must have identical schemas.
Hologres scans the number of files specified by schema_deduce_file_num and takes the union of all schemas.
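The union step can be pictured with a small sketch: each scanned file contributes its columns, and any column seen in any file ends up in the inferred schema. This is illustrative pseudologic, not Hologres internals; in particular, how conflicting types for the same column are resolved is simplified here to "first type seen wins".

```python
def infer_union_schema(file_schemas, limit=5):
    """Union the column schemas of the first `limit` scanned files.

    file_schemas: list of {column_name: type_name} dicts, one per file,
    in scan order. `limit` plays the role of schema_deduce_file_num
    (default 5). Illustrative sketch only.
    """
    union = {}
    for schema in file_schemas[:limit]:
        for column, type_name in schema.items():
            # Simplification: keep the type from the first file that has
            # this column; Hologres' actual conflict resolution may differ.
            union.setdefault(column, type_name)
    return union

# Two files with partially overlapping columns:
schemas = [{"id": "int", "name": "text"},
           {"id": "int", "amount": "decimal(10,2)"}]
union = infer_union_schema(schemas)
```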
Schema inference behavior
| Scenario | Action |
| --- | --- |
| A column in the schema does not exist in the file | Fill with NULL |
| A column in the file does not exist in the schema | Ignore the column |
| Column types do not match but are convertible | Convert the type automatically |
| Column types do not match and are not convertible | Return NULL |
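The four rules above can be sketched as a per-row reconciliation step. The code below is an illustration of the documented behavior, not the Hologres implementation; `convert` stands in for whatever type conversion the engine applies.

```python
def reconcile_row(row, schema, convert):
    """Apply the four documented schema rules to one file row.

    row: {column: value} read from the file.
    schema: {column: type_name} expected by the query.
    convert(value, type_name): returns the converted value, or raises
    ValueError/TypeError when the conversion is impossible.
    Illustrative sketch only.
    """
    out = {}
    for column, type_name in schema.items():
        if column not in row:
            out[column] = None                 # column missing in file -> NULL
            continue
        try:
            out[column] = convert(row[column], type_name)  # convertible
        except (TypeError, ValueError):
            out[column] = None                 # not convertible -> NULL
    # Columns present in the file but absent from the schema are ignored:
    # they never appear in `out`.
    return out

def to_int_or_str(value, type_name):
    """Toy converter: int columns get int(), everything else str()."""
    return int(value) if type_name == "int" else str(value)

row = {"id": "7", "extra": "ignored", "amount": "oops"}
schema = {"id": "int", "name": "text", "amount": "int"}
result = reconcile_row(row, schema, to_int_or_str)
```

Here `id` converts to 7, `name` (missing from the file) becomes NULL, `amount` fails conversion and becomes NULL, and `extra` is dropped.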
Type mapping
When EXTERNAL_FILES reads ORC or Parquet files, it maps file types to PostgreSQL types. The mappings are listed below. Unsupported types are noted under each table.
ORC type mapping
| ORC type | PostgreSQL type |
| --- | --- |
| BOOLEAN | BOOLEAN |
| TINYINT / SMALLINT | SMALLINT |
| INT | INTEGER |
| BIGINT | BIGINT |
| FLOAT | REAL |
| DOUBLE | DOUBLE PRECISION |
| DECIMAL(p, s) | DECIMAL(p, s) |
| STRING | TEXT |
| VARCHAR(n) | VARCHAR(n) |
| CHAR(n) | CHAR(n) |
| BINARY | BYTEA |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP WITHOUT TIME ZONE |
| TIMESTAMP WITH LOCAL TIMEZONE | TIMESTAMP WITH TIME ZONE |
| LIST | Array (pg_type[]) |
ORC types UNION, STRUCT, and MAP are not supported.
Parquet type mapping
| Parquet physical type | Parquet logical type | PostgreSQL type |
| --- | --- | --- |
| BOOLEAN | — | BOOLEAN |
| INT32 | — | INTEGER |
| INT32 | DATE | DATE |
| INT32 | DECIMAL(p, s) | DECIMAL(p, s) |
| INT64 | — | BIGINT |
| INT64 | TIMESTAMP_MILLIS / TIMESTAMP_MICROS | TIMESTAMP / TIMESTAMPTZ |
| INT64 | DECIMAL(p, s) | DECIMAL(p, s) |
| FLOAT | — | REAL |
| DOUBLE | — | DOUBLE PRECISION |
| BYTE_ARRAY | — | BYTEA |
| BYTE_ARRAY | STRING | TEXT |
| BYTE_ARRAY | JSON / BSON | JSONB |
| BYTE_ARRAY | ENUM | TEXT |
| FIXED_LEN_BYTE_ARRAY | DECIMAL(p, s) | DECIMAL(p, s) |
| FIXED_LEN_BYTE_ARRAY | UUID | UUID |
| LIST | LIST | Array (pg_type[]) |
Parquet types STRUCT and MAP are not supported.
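If you validate files before querying them, the scalar part of the ORC table above fits naturally in a lookup table. The sketch below covers only the simple ORC cases (parameterized types carry their arguments through; LIST is omitted because its mapping depends on the element type) and is illustrative, not a Hologres API.

```python
# Scalar ORC -> PostgreSQL mappings from the table above (sketch only).
ORC_TO_PG = {
    "BOOLEAN": "BOOLEAN",
    "TINYINT": "SMALLINT",
    "SMALLINT": "SMALLINT",
    "INT": "INTEGER",
    "BIGINT": "BIGINT",
    "FLOAT": "REAL",
    "DOUBLE": "DOUBLE PRECISION",
    "STRING": "TEXT",
    "BINARY": "BYTEA",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP WITHOUT TIME ZONE",
    "TIMESTAMP WITH LOCAL TIMEZONE": "TIMESTAMP WITH TIME ZONE",
}

def map_orc_type(orc_type):
    """Map a scalar ORC type name to its PostgreSQL type.

    Raises ValueError for the documented unsupported types
    (UNION, STRUCT, MAP); parameterized types map to themselves.
    """
    base = orc_type.split("(")[0].strip()
    if base in ("UNION", "STRUCT", "MAP"):
        raise ValueError(f"ORC type {base} is not supported")
    if base in ("DECIMAL", "VARCHAR", "CHAR"):
        return orc_type  # DECIMAL(p, s), VARCHAR(n), CHAR(n) keep their arguments
    return ORC_TO_PG[base]
```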
Limits
- Export format: Only CSV is supported for exports.
- Recursive search: Subdirectories are not searched recursively.
- Unsupported types: ORC types UNION, STRUCT, and MAP; Parquet types STRUCT and MAP.
FAQ
Q: I get a permission error when exporting data. What should I do?
A: Make sure the RAM role has the AliyunOSSFullAccess permission.
Q: How do I control the size of exported files?
A: Use the target_file_size_mb parameter to set the size of each file. For finer control, adjust the hg_experimental_query_batch_size parameter.
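As a back-of-the-envelope check on target_file_size_mb, the number of exported files is roughly the total data size divided by the target size. The estimate below is an approximation only; actual file boundaries also depend on batching and compression.

```python
import math

def estimated_output_files(total_bytes, target_file_size_mb=10):
    """Rough estimate of the number of files an export produces.

    target_file_size_mb defaults to 10 per the write parameters above.
    Treat the result as an approximation, not a guarantee.
    """
    target_bytes = target_file_size_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes / target_bytes))

# A 1 GiB table with target_file_size_mb = 100 yields about 11 files.
files = estimated_output_files(1024 ** 3, target_file_size_mb=100)
```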