All Products
Search
Document Center

E-MapReduce:Use DuckDB in a Notebook

Last Updated:Jan 30, 2026

Alibaba Cloud EMR Serverless Spark Notebooks integrate DuckDB, providing enhanced integration with cloud data sources. This allows for passwordless access to OSS and OSS-HDFS to efficiently operate on various data file formats. You can also directly query metadata tables in Data Lake Formation (DLF) using SQL, enabling a streamlined and lightweight process for data lake exploration and analysis.

Background information

DuckDB is a lightweight, high-performance, embedded analytical database engine optimized for Online Analytical Processing (OLAP) scenarios.

  • Features

    • Embedded architecture: Embeds directly into an application as a library, similar to SQLite, without requiring a separate server. It supports both in-memory and on-disk modes.

    • Columnar storage: Stores data by column to optimize the performance of aggregate queries and scans.

    • Vectorized execution: Uses SIMD instructions to process data batches in parallel, which reduces CPU overhead.

    • Standard compatibility: Supports SQL-92 and SQL:2011 standards, including Common Table Expressions (CTEs), window functions, JOINs (including ASOF JOIN), and subqueries.

    • Direct read for multiple formats: Query CSV, Parquet, and JSON files directly without importing them.

    • Zero-copy integration: Seamlessly integrates with in-memory data structures, such as Pandas and Arrow, to avoid data migration overhead.

    • Federated query: Uses the httpfs extension to access remote files, such as files on S3, or connect to external databases, such as PostgreSQL, for federated queries.

  • Scenarios

    • Interactive analytics: Processes datasets that range from gigabytes to terabytes. It can be used as a replacement for Pandas or Excel when handling big data.

    • Edge computing: Can be deployed on edge devices for local data analytics.

    • Data science: Seamlessly integrates with the Python and R ecosystems to serve as a pre-processing engine for machine learning (ML).

    • Real-time OLAP: Supports analytical needs that require both frequent updates and complex queries.

Limits

  • Version requirements:

    • esr-4.x: esr-4.7.0 and later.

    • esr-3.x: esr-3.6.0 and later.

  • The following limits apply when you read Paimon tables using DuckDB:

    • Operations: You can only read Paimon append-only tables. Reading primary key tables or writing data is not supported.

    • Table format: Only Paimon tables with file.format = 'parquet' are supported.

Procedure

  1. On the EMR Serverless Spark page, click Data Development in the left navigation pane.

  2. Create a Notebook.

    1. On the Development tab, click the image icon.

    2. In the dialog box that appears, enter a name, set Type to Interactive Development > Notebook, and then click OK.

  3. In the upper-right corner, select a Notebook session instance that has been created and started.

    You can also select Create Notebook Session from the drop-down list to create a new Notebook session instance. For more information, see Manage Notebook sessions.

    Note

    DuckDB is a single-node SQL engine that uses only the resources of the Notebook Spark driver. Therefore, you may need to increase the resource configurations for the Spark driver's spark.driver.cores and spark.driver.memory parameters.

  4. To use DuckDB in the Python code of a Notebook, run import duckdb. The following examples demonstrate common use cases.

    Read files

    • View the content of Parquet, CSV, and JSON files.

      import duckdb
      
      # View the content of a Parquet file
      duckdb.sql("SELECT * FROM 'oss://bucket/path/to/test.parquet'")
      # View the content of a CSV file
      duckdb.sql("SELECT * FROM 'oss://bucket/path/to/test.csv'")
      # View the content of a JSON file
      duckdb.sql("SELECT * FROM 'oss://bucket/path/to/test.json'")

      image

    • View the schema of a Parquet file.

      import duckdb
      
      # View the schema of a Parquet file
      duckdb.sql("SELECT * FROM parquet_schema('oss://bucket/path/to/test.parquet')")

      image

    • Export the content of a Parquet file to a CSV file.

      import duckdb
      
      # Export the content of a Parquet file to a CSV file
      duckdb.sql("COPY (SELECT * FROM 'oss://bucket/path/to/test.parquet') TO 'oss://bucket/path/to/output.csv' (HEADER, DELIMITER ',');")
    Note

    Before you read from or write to OSS files, you must activate OSS, create a bucket, and confirm that you have the required access permissions for the bucket.

    Access a DLF catalog

    Note

    DLF catalogs created by data sharing are not supported.

    • View the DLF catalogs that are mounted in the current workspace. For more information about DLF data catalogs, see Manage data catalogs.

      import duckdb
      duckdb.sql("show databases")

      image

    • Query table data in a DLF catalog.

      import duckdb
      duckdb.sql("use <catalogName>.<databaseName>")
      duckdb.sql("select * from <tableName> limit 10")

      image

For more information, see the official DuckDB documentation.