EMR Serverless Spark integrates DuckDB for lightweight, efficient data analysis. You can use DuckDB SQL tasks to explore data, develop prototypes, and run ad-hoc queries.
Prerequisites
-
You have created an EMR Serverless Spark workspace. For more information, see Create a workspace.
-
You have created and started a DuckDB session. For more information, see DuckDB sessions.
Limitations
-
You can only read data from OSS files and a DLF Catalog.
-
DuckDB access to a DLF Catalog has the following limitations:
-
Operation type: You can only read Paimon append-only tables. Reading primary key tables and performing write operations are not supported.
-
Table format: Only Paimon tables with
file.format = 'parquet'are supported. -
Catalog type: You cannot access a DLF Catalog created with data sharing.
-
Create a DuckDB SQL task
-
Navigate to the data development page.
-
Log on to the EMR console.
-
In the left-side navigation pane, choose .
-
On the Spark page, click the target workspace.
-
In the left-side navigation pane, click Data Development.
-
-
Create a task.
-
On the Data Development page, click Create Job.
-
In the Create Job dialog box, select .
-
Enter a name for the task and click OK.
-
In the upper-right corner, select a DuckDB session.
-
Enter SQL statements in the task editor.
-
Read data from OSS files
Before you read from or write to OSS files, ensure that you have activated OSS, created a bucket, and granted the execution role of your EMR Serverless Spark workspace permission to access the bucket.
-
View the content of Parquet, CSV, and JSON files.
-- View the content of a Parquet file SELECT * FROM read_parquet('oss://bucket/path/to/test.parquet'); -- View the content of a CSV file SELECT * FROM read_csv('oss://bucket/path/to/test.csv'); -- View the content of a JSON file SELECT * FROM read_json('oss://bucket/path/to/test.json'); -
View the schema of a Parquet file.
SELECT * FROM parquet_schema('oss://your-bucket/path/to/test.parquet'); -
Export query results to a CSV file.
COPY (SELECT * FROM read_parquet('oss://your-bucket/path/to/test.parquet')) TO 'oss://your-bucket/path/to/output.csv' (HEADER, DELIMITER ',');
Access a DLF Catalog
DuckDB access to a DLF Catalog has the following limitations:
-
Operation type: You can only read Paimon append-only tables. Reading primary key tables and performing write operations are not supported.
-
Table format: Only Paimon tables with
file.format = 'parquet'are supported. -
Catalog type: You cannot access a DLF Catalog created with data sharing.
-
View the DLF Catalogs bound to the current workspace. For more information, see Manage data catalogs.
SHOW DATABASES; -
Query table data in a DLF Catalog.
USE <catalogName>.<databaseName>; SELECT * FROM <tableName> LIMIT 10;