DuckDB SQL development - E-MapReduce - Alibaba Cloud Documentation Center

EMR Serverless Spark integrates DuckDB to provide lightweight and efficient data analytics. With DuckDB SQL jobs, you can quickly explore data, develop prototypes, and run ad hoc queries.

Prerequisites

You have created an EMR Serverless Spark workspace. For more information, see Create a workspace.
You have created and started a DuckDB session. For more information, see Manage DuckDB sessions.

Limitations

You can only read data from OSS files and DLF Catalogs.
The following limitations apply when you use DuckDB to read data from a DLF Catalog:
- Operation type: You can only read data from Paimon append-only tables. Reading from primary key tables and write operations are not supported.
- Table format: Only Paimon tables with file.format = 'parquet' are supported.
- Catalog type: You cannot access DLF Catalogs created using data sharing.

Create a DuckDB SQL job

Go to the data development page.
1. Log on to the EMR console.
2. In the left-side navigation pane, select EMR Serverless > Spark.
3. On the Spark page, click the target workspace name.
4. In the left-side navigation pane, click Data development.
Create a job.
1. On the Data development page, click Create job.
2. In the Create job dialog box, select Interactive development > DuckDB SQL.
3. Enter a job name and click OK.
4. In the upper-right corner, select a DuckDB session.
5. In the new job editor, enter SQL statements.

Read OSS files

Note

Before you read from or write to files in OSS, ensure you have activated OSS and created a bucket. Also, confirm that the execution role for your EMR Serverless Spark workspace has the required permissions to access the bucket.

View the contents of Parquet, CSV, and JSON files.

-- View the contents of a Parquet file.
SELECT * FROM 'oss://bucket/path/to/test.parquet';
-- View the contents of a CSV file.
SELECT * FROM 'oss://bucket/path/to/test.csv';
-- View the contents of a JSON file.
SELECT * FROM 'oss://bucket/path/to/test.json';

View the schema of a Parquet file.

SELECT * FROM parquet_schema('oss://your-bucket/path/to/test.parquet');

Export query results to a CSV file.

COPY (SELECT * FROM 'oss://your-bucket/path/to/test.parquet')
TO 'oss://your-bucket/path/to/output.csv'
(HEADER, DELIMITER ',');

Access a DLF Catalog

Note

The following limitations apply when you use DuckDB to read data from a DLF Catalog:

Operation type: You can only read data from Paimon append-only tables. Reading from primary key tables and write operations are not supported.
Table format: Only Paimon tables with file.format = 'parquet' are supported.
Catalog type: You cannot access DLF Catalogs created using data sharing.

View the DLF Catalog that is bound to the current workspace. For more information about DLF data catalogs, see Manage data catalogs.
```
SHOW DATABASES;
```

Query table data in a DLF Catalog.

USE <catalogName>.<databaseName>;
SELECT * FROM <tableName> LIMIT 10;

Prerequisites

Limitations

Create a DuckDB SQL job

Read OSS files

Access a DLF Catalog

Related topics