E-MapReduce: Manage DuckDB sessions

Last Updated: Jun 04, 2026

Create and manage DuckDB sessions for interactive SQL-based data analysis in EMR Serverless Spark.

What is DuckDB?

DuckDB is a lightweight, high-performance embedded analytical database engine optimized for Online Analytical Processing (OLAP) use cases.

  • Features

    • Embedded architecture: Runs as an in-process library (like SQLite) with in-memory and on-disk modes. No separate server required.

    • Columnar storage: Stores data by column to optimize aggregate queries and scans.

    • Vectorized execution: Processes data in batches with SIMD instructions to reduce CPU overhead.

    • Standard SQL: Supports SQL-92 and SQL:2011, including CTEs, window functions, JOINs (such as ASOF JOIN), and subqueries.

    • Multi-format reads: Query CSV, Parquet, and JSON files directly without importing.

    • Zero-copy integration: Converts data to and from Pandas and Arrow without migration overhead.

    • Federated queries: Access remote files on S3 via the httpfs extension, or query external databases like PostgreSQL.

  • Use cases

    • Interactive analysis: Process GB- to TB-scale datasets as an alternative to Pandas or Excel.

    • Edge computing: Deploy on edge devices for local data analysis.

    • Data science: Integrates with Python and R ecosystems as a preprocessing engine for ML.

    • Real-time OLAP: Handles workloads requiring frequent updates and complex queries.

Limitations

  • DuckDB sessions require engine version esr-3.7.0 or later in the 3.x line, or esr-4.8.0 or later in the 4.x line.
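
The version requirement above can be checked before you pick an engine version. The following is a minimal sketch, not part of the product. It assumes that each minimum applies within its own major line (esr-3.x and esr-4.x), that version strings start with the form esr-X.Y.Z, and that major lines not named in the limitation are treated as unsupported. The function name supports_duckdb is hypothetical.

```python
import re

# Minimum DuckDB-capable release per major line, taken from the limitation above.
# Assumption: each minimum applies within its own major line (3.x and 4.x).
MIN_VERSIONS = {3: (3, 7, 0), 4: (4, 8, 0)}

# Matches the leading "esr-X.Y.Z" part; any trailing text is ignored.
_VERSION_RE = re.compile(r"esr-(\d+)\.(\d+)\.(\d+)")


def supports_duckdb(engine_version: str) -> bool:
    """Return True if the engine version meets the documented minimum."""
    match = _VERSION_RE.match(engine_version.strip())
    if match is None:
        raise ValueError(f"unrecognized engine version: {engine_version!r}")
    version = tuple(int(part) for part in match.groups())
    minimum = MIN_VERSIONS.get(version[0])
    # Major lines not named in the limitation are conservatively rejected.
    return minimum is not None and version >= minimum


if __name__ == "__main__":
    for candidate in ("esr-3.6.0", "esr-3.7.0", "esr-4.7.0", "esr-4.8.0"):
        print(candidate, supports_duckdb(candidate))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically, where "4.10.0" would sort before "4.8.0".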

Create a DuckDB session

  1. Log on to the EMR Console.

    1. In the left-side navigation pane, choose EMR Serverless > Spark.

    2. On the Spark page, click the name of the target workspace.

    3. On the EMR Serverless Spark page, in the left-side navigation pane, click Session Management.

    4. On the Session Management page, click the DuckDB Session tab.

  2. On the DuckDB session list page, click Create DuckDB Session.

  3. In the Create DuckDB Session dialog box, set the following parameters.

    Important

    Set the maximum concurrency of the deployment queue to at least the resource size required by the DuckDB session. The required value is displayed on the console.

    • Name: The session name. Must be 1 to 64 characters, consisting of Chinese characters, letters, digits, underscores (_), or hyphens (-). Must be unique within the workspace.

    • Deployment Queue: The resource queue for the DuckDB session. Only queues in the ready state are available. To create a queue, go to the Resource Queue Management page.

    • Engine Version: The engine version for the DuckDB session. Supported versions are listed in the release notes.

    • Auto Stop: The idle duration before the session stops automatically. Enabled by default with a 45-minute timeout.

    • Network Connection: Required if your DuckDB job accesses data sources or services in a VPC. Select an existing network connection from the drop-down list.

    • Cores: The number of CPU cores. Default: 2.

    • Memory: The memory size. Default: 8 GB.

    • MemoryOverhead: The overhead memory size. Default: max(384 MB, 10% × memory).

    • Spark Configurations: Spark configuration entries, separated by spaces. Example: spark.sql.catalog.paimon.metastore dlf.

  4. Click Create.

The session appears on the DuckDB session list. When the status changes to Running, you can start DuckDB SQL development.
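
Several of the parameter rules above can be checked locally before you fill in the dialog box. The following is a minimal sketch, not an official tool, and all function names are hypothetical. It assumes that "Chinese characters" means the basic CJK Unified Ideographs block, that 1 GB equals 1024 MB, and that each Spark configuration entry is a key and a value separated by whitespace, one entry per line.

```python
import re

# Assumption: "Chinese characters" is approximated by the basic CJK Unified
# Ideographs block (U+4E00 to U+9FFF); the console may accept a wider range.
_NAME_RE = re.compile(r"[\u4e00-\u9fffA-Za-z0-9_-]{1,64}")


def is_valid_session_name(name: str) -> bool:
    """Check length (1 to 64) and allowed characters; uniqueness is not checked."""
    return _NAME_RE.fullmatch(name) is not None


def default_memory_overhead_mb(memory_gb: float) -> float:
    """Return max(384 MB, 10% x memory), assuming 1 GB = 1024 MB."""
    return max(384.0, 0.10 * memory_gb * 1024)


def parse_spark_conf(text: str) -> dict:
    """Parse 'key value' entries, one per line (the line layout is an assumption)."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            raise ValueError(f"expected 'key value', got: {line!r}")
        key, value = parts
        conf[key] = value
    return conf


if __name__ == "__main__":
    print(is_valid_session_name("duckdb_session-01"))
    print(default_memory_overhead_mb(8))
    print(parse_spark_conf("spark.sql.catalog.paimon.metastore dlf"))
```

Under these assumptions, the default overhead stays at the 384 MB floor until memory exceeds 3.75 GB, so the default 8 GB session gets roughly 819 MB of overhead.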

View run records

  1. After a job completes, view its run records on the Session Management page.

  2. On the session list page, click the session name.

  3. Click the Run Records tab.

    This page shows run details such as the run ID, start time, and Spark UI.