Starting with Hologres V4.1, you can use the EXTERNAL_FILES function to directly query, import, and export structured data files stored in Object Storage Service (OSS) without creating foreign tables. This topic describes prerequisites, syntax, parameters, examples, and limits.
Overview
The EXTERNAL_FILES function enables you to query, import, and export structured data files in OSS without creating foreign tables. It supports the following scenarios:
- Query data: Query CSV, Parquet, and ORC files in OSS.
- Import data: Import data from OSS into a Hologres internal table.
- Export data: Export data from a Hologres table to OSS.
Prerequisites
Version requirement
Your Hologres instance must run version V4.1 or later. If your instance runs an earlier version, upgrade it. For more information, see Instance upgrades.
Permission configuration
Only RAM users or Alibaba Cloud accounts can use EXTERNAL_FILES; custom accounts are not supported. EXTERNAL_FILES accesses OSS through a RAM role, so complete the following steps before you use it: first create a RAM role that trusts the Hologres service, and then grant OSS permissions to that role.
Create a role
1. Log on to the RAM console and go to the Roles page. Click Create Role.
2. In the dialog box, set Trusted Entity Type to Cloud Service and Trusted Entity to Real-time Data Warehouse Hologres.
3. Enter a role name and complete the creation.
Configure OSS access permissions
Grant OSS permissions to the RAM role:
- Read data: Grant the AliyunOSSReadOnlyAccess permission.
- Write data: Grant the AliyunOSSFullAccess permission.
Grant the GrantAssumeRole permission to the SQL user
If the user who runs the SQL is not an Alibaba Cloud account, first grant them the GrantAssumeRole permission:
In the RAM console, create a permission policy: go to Policies > Create Policy > Script Editor and paste the following JSON. Replace <RoleARN> with the same role_arn that you use in EXTERNAL_FILES. To find the role ARN, see FAQ about RAM roles and STS tokens.
{
  "Version": "1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "hologram:GrantAssumeRole",
      "Resource": "<RoleARN>"
    }
  ]
}
Then attach this policy to the user who runs the query. For a RAM user, go to Users > Permissions > Add Permissions. For a RAM role, go to Roles > Permissions > Add Permissions.
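The policy document is plain JSON, so it can be generated programmatically before you paste it into the Script Editor. The sketch below is an illustrative helper, not part of any Alibaba Cloud SDK; it simply substitutes a role ARN into the policy shown above.

```python
import json

def build_assume_role_policy(role_arn):
    """Render the GrantAssumeRole policy JSON for a given role ARN.

    Mirrors the policy document shown above; the helper itself is an
    illustrative sketch, not an Alibaba Cloud API.
    """
    return json.dumps({
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "hologram:GrantAssumeRole",
                "Resource": role_arn,
            }
        ],
    }, indent=2)

# The ARN below is a placeholder; use your own role's ARN.
policy = build_assume_role_policy("acs:ram::123456789:role/hologres-role")
```

Paste the resulting string into the Script Editor when creating the policy.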
Syntax
Query data
SELECT * FROM EXTERNAL_FILES(
path = 'oss://bucket/path',
format = 'csv|parquet|orc',
oss_endpoint = 'oss_endpoint',
role_arn = 'acs:ram::xxx:role/xxx'
[, other parameters...]
) [AS (col1 type1, col2 type2, ...)]
Import data
INSERT INTO target_table
SELECT * FROM EXTERNAL_FILES(
path = 'oss://bucket/path',
format = 'csv|parquet|orc',
oss_endpoint = 'oss_endpoint',
role_arn = 'acs:ram::xxx:role/xxx'
[, other parameters...]
) [AS (col1 type1, col2 type2, ...)]
Export data
INSERT INTO EXTERNAL_FILES(
path = 'oss://bucket/path',
format = 'csv',
oss_endpoint = 'oss_endpoint',
role_arn = 'acs:ram::xxx:role/xxx'
[, other parameters...]
) SELECT * FROM source_table;
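The three statement shapes above differ only in where the EXTERNAL_FILES call appears. If you generate these statements from application code, a small client-side builder keeps the `key = 'value'` argument list consistent. This helper is a sketch of that idea, not a Hologres API; parameter quoting here is naive and assumes values contain no single quotes.

```python
def external_files_args(path, fmt, oss_endpoint, role_arn, **extra):
    """Render the key = 'value' argument list used by EXTERNAL_FILES."""
    params = {"path": path, "format": fmt,
              "oss_endpoint": oss_endpoint, "role_arn": role_arn, **extra}
    return ",\n  ".join(f"{key} = '{value}'" for key, value in params.items())

def query_sql(path, fmt, oss_endpoint, role_arn, **extra):
    """Build the query-data shape: SELECT * FROM EXTERNAL_FILES(...)."""
    return ("SELECT * FROM EXTERNAL_FILES(\n  "
            + external_files_args(path, fmt, oss_endpoint, role_arn, **extra)
            + "\n);")

# Values below are placeholders taken from the examples in this topic.
sql = query_sql("oss://mybucket/data/", "csv",
                "oss-cn-hangzhou-internal.aliyuncs.com",
                "acs:ram::123456789:role/hologres-role",
                csv_skip_header="true")
```

The import and export shapes are the same argument list wrapped in `INSERT INTO target_table SELECT * FROM EXTERNAL_FILES(...)` and `INSERT INTO EXTERNAL_FILES(...) SELECT ...` respectively.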
Parameters
Common parameters
| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| path | The file path. Supports directories and files. You can specify multiple paths separated by commas. | Yes | `oss://mybucket/data/` |
| format | The file format. For queries, supports csv, parquet, and orc. For exports, only csv. | Yes | `csv` |
| oss_endpoint | The OSS classic network endpoint. For more information, see Regions and endpoints. | No | `oss-cn-hangzhou-internal.aliyuncs.com` |
| role_arn | The ARN of the Alibaba Cloud RAM role. | No | `acs:ram::123456789:role/hologres-role` |
Read parameters
| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| schema_deduce_file_num | The maximum number of files to scan when inferring the schema. Default: 5. | No | `5` |
|  | The order in which files are scanned for schema inference. Default: latest_first. | No | `latest_first` |
| csv_skip_header | Whether to skip the first row of a CSV file as the table header. | No | `true` |
| csv_delimiter | The column delimiter for CSV files. Default: comma (`,`). | No | `,` |
Write parameters
| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| target_file_size_mb | The size of each output file in MB. Default: 10. | No | `100` |
| single_file | Whether to write all data to a single file. | No | `true` |
Examples
Example 1: Query a CSV file
Query a CSV file in OSS and infer the schema automatically:
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/data/',
format = 'csv',
csv_skip_header = 'true',
csv_delimiter = ',',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Query a CSV file and specify the schema manually:
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/data/',
format = 'csv',
csv_skip_header = 'true',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
) AS (id int, name text, amount decimal(10,2));
Example 2: Query a Parquet file
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/parquet_data/',
format = 'parquet',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Example 3: Query an ORC file
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/orc_data/',
format = 'orc',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Example 4: Import data into a Hologres table
-- Create the target table
CREATE TABLE orders (
order_id int,
customer_name text,
amount decimal(10,2),
PRIMARY KEY(order_id)
);
-- Import data from OSS
INSERT INTO orders
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/orders/',
format = 'csv',
csv_skip_header = 'true',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
) AS (order_id int, customer_name text, amount decimal(10,2));
Example 5: Export data to OSS
Export data to multiple files:
INSERT INTO EXTERNAL_FILES(
path = 'oss://mybucket/export/',
format = 'csv',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role',
target_file_size_mb = '100'
) SELECT * FROM orders;
Export data to a single file:
INSERT INTO EXTERNAL_FILES(
path = 'oss://mybucket/export/',
format = 'csv',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role',
single_file = 'true'
) SELECT * FROM orders;
Example 6: Use a serverless resource group
SET hg_computing_resource = 'serverless';
SELECT * FROM EXTERNAL_FILES(
path = 'oss://mybucket/data/',
format = 'csv',
oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',
role_arn = 'acs:ram::123456789:role/hologres-role'
);
Schema inference rules
Automatic schema inference
If you do not specify a schema using the AS clause, Hologres infers the schema automatically:
- Parquet/ORC: Infer from file metadata.
- CSV with headers: Infer from the first row (header) and data content.
- CSV without headers: All files must have identical schemas.
Hologres scans the number of files specified by schema_deduce_file_num and takes the union of all schemas.
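The union step can be pictured with a small sketch: each scanned file contributes its columns, and any column seen in any file ends up in the inferred schema. This is illustrative pseudologic, not Hologres internals; in particular, how conflicting types for the same column are resolved is simplified here to "first type seen wins".

```python
def infer_union_schema(file_schemas, limit=5):
    """Union the column schemas of the first `limit` scanned files.

    file_schemas: list of {column_name: type_name} dicts, one per file,
    in scan order. `limit` plays the role of schema_deduce_file_num
    (default 5). Illustrative sketch only.
    """
    union = {}
    for schema in file_schemas[:limit]:
        for column, type_name in schema.items():
            # Simplification: keep the type from the first file that has
            # this column; Hologres' actual conflict resolution may differ.
            union.setdefault(column, type_name)
    return union

# Two files with partially overlapping columns:
schemas = [{"id": "int", "name": "text"},
           {"id": "int", "amount": "decimal(10,2)"}]
union = infer_union_schema(schemas)
```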
Schema inference behavior
| Scenario | Action |
| --- | --- |
| A column in the schema does not exist in the file | Fill with NULL |
| A column in the file does not exist in the schema | Ignore the column |
| Column types do not match but are convertible | Convert the type automatically |
| Column types do not match and are not convertible | Return NULL |
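The four rules above can be sketched as a per-row reconciliation step. The code below is an illustration of the documented behavior, not the Hologres implementation; `convert` stands in for whatever type conversion the engine applies.

```python
def reconcile_row(row, schema, convert):
    """Apply the four documented schema rules to one file row.

    row: {column: value} read from the file.
    schema: {column: type_name} expected by the query.
    convert(value, type_name): returns the converted value, or raises
    ValueError/TypeError when the conversion is impossible.
    Illustrative sketch only.
    """
    out = {}
    for column, type_name in schema.items():
        if column not in row:
            out[column] = None                 # column missing in file -> NULL
            continue
        try:
            out[column] = convert(row[column], type_name)  # convertible
        except (TypeError, ValueError):
            out[column] = None                 # not convertible -> NULL
    # Columns present in the file but absent from the schema are ignored:
    # they never appear in `out`.
    return out

def to_int_or_str(value, type_name):
    """Toy converter: int columns get int(), everything else str()."""
    return int(value) if type_name == "int" else str(value)

row = {"id": "7", "extra": "ignored", "amount": "oops"}
schema = {"id": "int", "name": "text", "amount": "int"}
result = reconcile_row(row, schema, to_int_or_str)
```

Here `id` converts to 7, `name` (missing from the file) becomes NULL, `amount` fails conversion and becomes NULL, and `extra` is dropped.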
Type mapping
When EXTERNAL_FILES reads ORC or Parquet files, it maps file types to PostgreSQL types. The mappings are listed below. Unsupported types are noted under each table.
ORC type mapping
| ORC type | PostgreSQL type |
| --- | --- |
| BOOLEAN | BOOLEAN |
| TINYINT / SMALLINT | SMALLINT |
| INT | INTEGER |
| BIGINT | BIGINT |
| FLOAT | REAL |
| DOUBLE | DOUBLE PRECISION |
| DECIMAL(p, s) | DECIMAL(p, s) |
| STRING | TEXT |
| VARCHAR(n) | VARCHAR(n) |
| CHAR(n) | CHAR(n) |
| BINARY | BYTEA |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP WITHOUT TIME ZONE |
| TIMESTAMP WITH LOCAL TIMEZONE | TIMESTAMP WITH TIME ZONE |
| LIST | Array (pg_type[]) |
ORC types UNION, STRUCT, and MAP are not supported.
Parquet type mapping
| Parquet physical type | Parquet logical type | PostgreSQL type |
| --- | --- | --- |
| BOOLEAN | — | BOOLEAN |
| INT32 | — | INTEGER |
| INT32 | DATE | DATE |
| INT32 | DECIMAL(p, s) | DECIMAL(p, s) |
| INT64 | — | BIGINT |
| INT64 | TIMESTAMP_MILLIS / TIMESTAMP_MICROS | TIMESTAMP / TIMESTAMPTZ |
| INT64 | DECIMAL(p, s) | DECIMAL(p, s) |
| FLOAT | — | REAL |
| DOUBLE | — | DOUBLE PRECISION |
| BYTE_ARRAY | — | BYTEA |
| BYTE_ARRAY | STRING | TEXT |
| BYTE_ARRAY | JSON / BSON | JSONB |
| BYTE_ARRAY | ENUM | TEXT |
| FIXED_LEN_BYTE_ARRAY | DECIMAL(p, s) | DECIMAL(p, s) |
| FIXED_LEN_BYTE_ARRAY | UUID | UUID |
| LIST | LIST | Array (pg_type[]) |
Parquet types STRUCT and MAP are not supported.
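If you validate files before querying them, the scalar part of the ORC table above fits naturally in a lookup table. The sketch below covers only the simple ORC cases (parameterized types carry their arguments through; LIST is omitted because its mapping depends on the element type) and is illustrative, not a Hologres API.

```python
# Scalar ORC -> PostgreSQL mappings from the table above (sketch only).
ORC_TO_PG = {
    "BOOLEAN": "BOOLEAN",
    "TINYINT": "SMALLINT",
    "SMALLINT": "SMALLINT",
    "INT": "INTEGER",
    "BIGINT": "BIGINT",
    "FLOAT": "REAL",
    "DOUBLE": "DOUBLE PRECISION",
    "STRING": "TEXT",
    "BINARY": "BYTEA",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP WITHOUT TIME ZONE",
    "TIMESTAMP WITH LOCAL TIMEZONE": "TIMESTAMP WITH TIME ZONE",
}

def map_orc_type(orc_type):
    """Map a scalar ORC type name to its PostgreSQL type.

    Raises ValueError for the documented unsupported types
    (UNION, STRUCT, MAP); parameterized types map to themselves.
    """
    base = orc_type.split("(")[0].strip()
    if base in ("UNION", "STRUCT", "MAP"):
        raise ValueError(f"ORC type {base} is not supported")
    if base in ("DECIMAL", "VARCHAR", "CHAR"):
        return orc_type  # DECIMAL(p, s), VARCHAR(n), CHAR(n) keep their arguments
    return ORC_TO_PG[base]
```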
Limits
- Export format: Only CSV is supported for exports.
- Recursive search: Subdirectories are not searched recursively.
- Unsupported types: ORC types UNION, STRUCT, and MAP; Parquet types STRUCT and MAP.
FAQ
Q: I get a permission error when exporting data. What should I do?
A: Make sure the RAM role has the AliyunOSSFullAccess permission.
Q: How do I control the size of exported files?
A: Use the target_file_size_mb parameter to set the size of each file. For finer control, adjust the hg_experimental_query_batch_size parameter.
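As a back-of-the-envelope check on target_file_size_mb, the number of exported files is roughly the total data size divided by the target size. The estimate below is an approximation only; actual file boundaries also depend on batching and compression.

```python
import math

def estimated_output_files(total_bytes, target_file_size_mb=10):
    """Rough estimate of the number of files an export produces.

    target_file_size_mb defaults to 10 per the write parameters above.
    Treat the result as an approximation, not a guarantee.
    """
    target_bytes = target_file_size_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes / target_bytes))

# A 1 GiB table with target_file_size_mb = 100 yields about 11 files.
files = estimated_output_files(1024 ** 3, target_file_size_mb=100)
```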