Paimon Blob tables store large binary objects in separate files optimized for random access, and they support row-level tracking and schema evolution. This design suits multi-modal data storage and processing.

The Paimon Blob format stores large binary objects, such as images and videos. Unlike inline formats, it uses separate files with a layout optimized for random access.

Supported versions

This feature requires engine version esr-4.7.0 or later.

Paimon Blob tables

To create a Paimon Blob table, you must define its schema and properties. The following template shows a standard creation statement and explains the key parameters.

CREATE TABLE blob_tbl (
    fileName STRING,
    picture BINARY
)
USING paimon
TBLPROPERTIES (
    'row-tracking.enabled' = 'true',
    'data-evolution.enabled' = 'true',
    'blob-field' = 'picture'
);

Key parameters:

'blob-field' = 'picture': Declares the picture field as the Blob field. This field must be of the BINARY type.
'row-tracking.enabled' = 'true': Enables row-level tracking, which is required for MERGE INTO update and delete operations.
'data-evolution.enabled' = 'true': Allows schema evolution.
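
If you create several Blob tables, it can help to generate the statement from one place so that the three properties always appear together. The following is a minimal sketch, not part of any Paimon or EMR API: build_blob_table_ddl is a hypothetical helper that only assembles the SQL text from the template above, and it assumes the Blob column is declared as BINARY.

```python
def build_blob_table_ddl(table, columns, blob_field):
    """Return a CREATE TABLE statement for a Paimon Blob table.

    Hypothetical helper: it only formats SQL text from the template above.
    columns is a list of (name, type) pairs.
    """
    types = {name: col_type.upper() for name, col_type in columns}
    # The Blob field must exist and must be a BINARY column.
    if types.get(blob_field) != "BINARY":
        raise ValueError(f"blob field {blob_field!r} must be a BINARY column")
    column_sql = ",\n".join(f"    {name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE TABLE {table} (\n"
        f"{column_sql}\n"
        ")\n"
        "USING paimon\n"
        "TBLPROPERTIES (\n"
        "    'row-tracking.enabled' = 'true',\n"
        "    'data-evolution.enabled' = 'true',\n"
        f"    'blob-field' = '{blob_field}'\n"
        ")"
    )


ddl = build_blob_table_ddl(
    "blob_tbl", [("fileName", "STRING"), ("picture", "BINARY")], "picture"
)
print(ddl)
```

In a notebook session, the resulting string could be passed to spark.sql(ddl). The helper does no quoting or escaping, so it is only suitable for trusted table and column names.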

Usage example

This example demonstrates how to read image files from Object Storage Service (OSS) and write them to a Paimon Blob table.

Prepare sample images

Download the sample images cat.png and dog.jpg to use in this example.

Upload sample images

Upload the images to your bucket in the Object Storage Service (OSS) console. For details, see Simple Upload.

In this example, upload the two sample images, cat.png and dog.jpg, to the pictures/ directory in your bucket.
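
The notebook code in the next section reads from an oss:// URI built from your bucket name and this directory. As a small sketch of how that URI is formed, the hypothetical helper below (oss_image_path is not an OSS or Spark function, and my-bucket is a made-up bucket name) normalizes stray slashes and rejects placeholders that were not replaced.

```python
def oss_image_path(bucket, prefix=""):
    """Build the oss:// directory URI for the uploaded images.

    Hypothetical helper for illustration only.
    """
    bucket = bucket.strip().strip("/")
    prefix = prefix.strip().strip("/")
    # Catch template placeholders such as <yourBucketName> left in place.
    for value in (bucket, prefix):
        if "<" in value or ">" in value:
            raise ValueError(f"placeholder not replaced: {value!r}")
    if not bucket:
        raise ValueError("bucket name must not be empty")
    return f"oss://{bucket}/{prefix}/" if prefix else f"oss://{bucket}/"


# my-bucket is a made-up bucket name.
print(oss_image_path("my-bucket", "/pictures/"))  # oss://my-bucket/pictures/
```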

Develop and run

On the EMR Serverless Spark page, click Development in the left-side navigation pane.

This opens the development page. The main panel contains the Development and Data Directory tabs. The directory tree contains the Development and Git Directory nodes.

Create a notebook.
1. On the Development tab, click the icon.
2. In the dialog box that appears, enter a name, select Interactive Development > Notebook as the type, and then click OK.

In the upper-right corner, select a running notebook session instance.

You can also select Create Notebook Session from the drop-down list to create a notebook session instance. For more information about notebook sessions, see Manage notebook sessions.

Copy the following code into a Python cell in the new notebook. Replace <yourBucketName> with your bucket name and <yourPicturePath> with the path to your images, such as pictures.

import io
from IPython.display import display
from PIL import Image
from pyspark.sql.functions import col, regexp_extract
# 1. Recursively read image files from OSS. The binaryFile source returns
#    one row per file with path, modificationTime, length, and content columns.
df = (
    spark.read.format("binaryFile")
    .option("recursiveFileLookup", "true")
    .load("oss://<yourBucketName>/<yourPicturePath>/")
)
# 2. Extract the file name (last path segment) and the image binary data.
images_df = df.select(
    regexp_extract(col("path"), r".*/(.+)$", 1).alias("fileName"),
    col("content").alias("picture"),
)
# 3. Create a temporary view.
images_df.createOrReplaceTempView("temp_images")
# 4. Preview the image metadata.
print("Preview of image metadata (first few rows):")
spark.sql("SELECT fileName, length(picture) AS size_bytes FROM temp_images LIMIT 5").show(truncate=False)
# 5. Create the blob table.
spark.sql("DROP TABLE IF EXISTS blob_tbl")
spark.sql("""
    CREATE TABLE blob_tbl (
        fileName STRING,
        picture BINARY
    )
    USING paimon
    TBLPROPERTIES (
        'row-tracking.enabled' = 'true',
        'data-evolution.enabled' = 'true',
        'blob-field' = 'picture'
    )
""")
# 6. Write data to the table.
print("Writing images to 'blob_tbl'...")
spark.sql("INSERT INTO blob_tbl SELECT fileName, picture FROM temp_images")
# 7. Read and display the first two images.
print("\nFetching and displaying the first 2 images from the table...\n")
result_df = spark.sql("SELECT fileName, picture FROM blob_tbl LIMIT 2")
rows = result_df.collect()
if not rows:
    print("No images found in the table.")
else:
    for i, row in enumerate(rows, start=1):
        print(f"[{i}] Displaying: {row.fileName}")
        try:
            img = Image.open(io.BytesIO(row.picture))
            display(img)
        except Exception as e:
            print(f"  ⚠️  Failed to load image: {e}")

Click Execute All Cells and view the results below.

Preview of image metadata (first few rows):
+--------+----------+
|fileName|size_bytes|
+--------+----------+
|cat.png |354076    |
|dog.jpg |59207     |
+--------+----------+
Writing images to 'blob_tbl'...
Fetching and displaying the first 2 images from the table...
[1] Displaying: cat.png
[2] Displaying: dog.jpg
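
Step 2 of the notebook code derives fileName with the regular expression .*/(.+)$, which keeps only the text after the last slash in the file path. The sketch below reproduces that behavior with Python's re module so you can try it without a Spark session. The paths and the bucket name my-bucket are made up for illustration, and file_name_from_path is a hypothetical helper, not a Spark function.

```python
import re

# Same pattern as the regexp_extract call in the notebook code.
FILE_NAME_PATTERN = re.compile(r".*/(.+)$")


def file_name_from_path(path):
    """Return the last path segment, or "" when the pattern does not match.

    regexp_extract also yields an empty string when there is no match.
    """
    match = FILE_NAME_PATTERN.match(path)
    return match.group(1) if match else ""


# Made-up example paths.
for path in [
    "oss://my-bucket/pictures/cat.png",
    "oss://my-bucket/pictures/2024/dog.jpg",
]:
    print(path, "->", file_name_from_path(path))
```

Because only the last segment is kept, files with the same name in different subdirectories produce the same fileName value when recursiveFileLookup is enabled.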
