EMR Serverless Spark integrates DuckDB to provide lightweight and efficient data analytics. With DuckDB SQL jobs, you can quickly explore data, develop prototypes, and run ad hoc queries.
Prerequisites
You have created an EMR Serverless Spark workspace. For more information, see Create a workspace.
You have created and started a DuckDB session. For more information, see Manage DuckDB sessions.
Limitations
You can only read data from OSS files and DLF Catalogs.
The following limitations apply when you use DuckDB to read data from a DLF Catalog:
Operation type: You can only read data from Paimon append-only tables. Reading from primary key tables and write operations are not supported.
Table format: Only Paimon tables with file.format = 'parquet' are supported.
Catalog type: You cannot access DLF Catalogs created using data sharing.
Create a DuckDB SQL job
Go to the data development page.
Log on to the EMR console.
In the left-side navigation pane, choose EMR Serverless > Spark.
On the Spark page, click the target workspace name.
In the left-side navigation pane, click Data development.
Create a job.
On the Data development page, click Create job.
In the Create job dialog box, select DuckDB SQL as the job type.
Enter a job name and click OK.
In the upper-right corner, select a DuckDB session.
In the new job editor, enter SQL statements.
Read OSS files
Before you read from or write to files in OSS, ensure you have activated OSS and created a bucket. Also, confirm that the execution role for your EMR Serverless Spark workspace has the required permissions to access the bucket.
View the contents of Parquet, CSV, and JSON files.

-- View the contents of a Parquet file.
SELECT * FROM 'oss://bucket/path/to/test.parquet';
-- View the contents of a CSV file.
SELECT * FROM 'oss://bucket/path/to/test.csv';
-- View the contents of a JSON file.
SELECT * FROM 'oss://bucket/path/to/test.json';

View the schema of a Parquet file.

SELECT * FROM parquet_schema('oss://your-bucket/path/to/test.parquet');

Export query results to a CSV file.

COPY (SELECT * FROM 'oss://your-bucket/path/to/test.parquet') TO 'oss://your-bucket/path/to/output.csv' (HEADER, DELIMITER ',');
Access a DLF Catalog
View the DLF Catalog that is bound to the current workspace. For more information about DLF data catalogs, see Manage data catalogs.
SHOW DATABASES;

Query table data in a DLF Catalog.

USE <catalogName>.<databaseName>;
SELECT * FROM <tableName> LIMIT 10;