
E-MapReduce:Overview

Last Updated:Feb 04, 2024

Iceberg is an open table format for data lakes. You can use Iceberg to quickly build your own data lake storage service on Hadoop Distributed File System (HDFS) or Alibaba Cloud Object Storage Service (OSS). You can then use a compute engine from the open source big data ecosystem, such as Apache Flink, Apache Spark, Apache Hive, or Presto, to analyze the data in your data lake.

Features

Apache Iceberg was designed to migrate Hive data to the cloud. After multiple updates were released, Apache Iceberg became a standard table format for data lakes that are deployed on the cloud. For more information about Apache Iceberg, visit the Apache Iceberg official website.

Apache Iceberg provides the following features:

  • Builds a low-cost lightweight data lake storage service based on HDFS or an object storage system.

  • Connects to mainstream open source compute engines for data ingestion and analysis.

  • Provides comprehensive atomicity, consistency, isolation, and durability (ACID) semantics.

  • Supports row-level data changes.

  • Supports historical version backtracking.

  • Supports efficient data filtering.

  • Supports schema changes.

  • Supports partition changes.

  • Supports hidden partitioning.
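Several of these features, such as hidden partitioning and historical version backtracking, can be exercised directly from Spark SQL. The following sketch assumes a Spark session with an Iceberg catalog already configured; the catalog, database, table, and snapshot names are illustrative, and the exact time-travel syntax depends on your Spark and Iceberg versions.

```sql
-- Create an Iceberg table whose partitioning is derived from the ts column.
-- Queries never reference a partition column directly: Iceberg applies the
-- days() transform automatically (hidden partitioning).
CREATE TABLE iceberg_catalog.db.logs (
    id   BIGINT,
    data STRING,
    ts   TIMESTAMP
) USING iceberg
PARTITIONED BY (days(ts));

-- Historical version backtracking (time travel): read the table as of an
-- earlier point in time or as of a specific snapshot ID.
SELECT * FROM iceberg_catalog.db.logs TIMESTAMP AS OF '2021-09-01 00:00:00';
SELECT * FROM iceberg_catalog.db.logs VERSION AS OF 10963874102873;
```

Because partitioning is hidden, a filter such as `WHERE ts >= '2021-09-01'` is pruned to the matching daily partitions without the query needing to mention the partition layout.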

The following table compares open source ClickHouse (real-time data warehouse), open source Hive (offline data warehouse), and Alibaba Cloud E-MapReduce (EMR) Iceberg (data lake) from the dimensions of system architecture, business value, and maintenance costs.

| Item | Subitem | Open source ClickHouse | Open source Hive | Alibaba Cloud EMR Iceberg |
| --- | --- | --- | --- | --- |
| System architecture | Architecture | Integrated computing and storage | Decoupled computing and storage | Decoupled computing and storage |
| | Multiple compute engines | Not supported | Supported | Supported |
| | Data storage in an object storage system | Not supported | Not fully supported | Supported |
| | Data storage in HDFS | Not supported | Supported | Supported |
| | Storage format openness | No | Yes | Yes |
| Business value | Timeliness | Accurate to the second | Accurate to the hour or day | Accurate to the minute |
| | Computing flexibility | Low | High | High |
| | Transaction | Not supported | Not fully supported | Supported |
| | Table-level semantic generality | Poor | Poor | Excellent |
| | Row-level data change | Not supported | Limited support | Supported |
| | Data quality | Excellent | Good | Good |
| Maintenance costs | Query performance | High | Very high | Very high |
| | Storage costs | High | Medium | Low |
| | Self-service | Not supported | Not supported | Supported |
| | Resource scalability | Medium | Medium | Excellent |

Comparison between Alibaba Cloud EMR Iceberg and Apache Iceberg

The following table compares Alibaba Cloud EMR Iceberg and Apache Iceberg from the dimensions of basic features, data changes, compute engines, programming languages, and advanced features.

Note

The check mark (✓) indicates that the related item is supported, and the cross mark (x) indicates that the related item is not supported.

| Category | Item | Subitem | Apache Iceberg | EMR Iceberg |
| --- | --- | --- | --- | --- |
| Basic features | ACID | None | ✓ | ✓ |
| | Historical version backtracking | None | ✓ | ✓ |
| | Source and sink integration | Batch | ✓ | ✓ |
| | | Streaming | ✓ | ✓ |
| | Efficient data filtering | None | ✓ | ✓ |
| Data changes | Schema evolution | None | ✓ | ✓ |
| | Partition evolution | None | ✓ | ✓ |
| | Copy-on-write update | None | ✓ | ✓ |
| | Merge-on-read update | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | | Compaction | x | x |
| Compute engines | Apache Spark | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | Apache Hive | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | Apache Flink | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | PrestoDB or Trino | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| Programming languages | Java | None | ✓ | ✓ |
| | Python | None | ✓ | ✓ |
| Advanced features | Native connection to Alibaba Cloud OSS | None | x | ✓ |
| | Native connection to Alibaba Cloud Data Lake Formation (DLF) | None | x | ✓ |
| | Data access acceleration based on data caching in local disks | None | x | ✓ |
| | Automatic merging of small files | None | x | ✓ |

Note

The information in this table is based on an objective analysis of the status of Apache Iceberg and Alibaba Cloud EMR Iceberg as of the end of September 2021. This information may change as Apache Iceberg and EMR Iceberg are updated.

Scenarios

Iceberg is one of the core components of a general-purpose data lake service. You can use Iceberg in the following scenarios.


Write and read data in real time

Upstream data is ingested into an Iceberg-based data lake in real time and queried immediately. You can run a Flink or Spark streaming job to write log data to an Iceberg table in real time, and then use a compute engine such as Hive, Spark, Flink, or Presto to read the data in real time. For more information, see Apache Iceberg connector, Run a Spark streaming job to write data to an Iceberg table, and Use Spark to read data. Iceberg supports ACID transactions, which isolate data write operations from data read operations and prevent dirty reads.
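The streaming-ingestion side of this scenario can be sketched in Flink SQL. The catalog options, endpoints, and table names below are illustrative assumptions; consult the Apache Iceberg connector documentation for the values your deployment requires.

```sql
-- Register an Iceberg catalog backed by a Hive Metastore, with table data
-- stored in OSS (all option values here are placeholders).
CREATE CATALOG iceberg_catalog WITH (
    'type' = 'iceberg',
    'catalog-type' = 'hive',
    'uri' = 'thrift://metastore-host:9083',
    'warehouse' = 'oss://my-bucket/warehouse'
);

-- Continuously ingest rows from an upstream source table into an
-- Iceberg table; each checkpoint commits a new snapshot.
INSERT INTO iceberg_catalog.db.logs
SELECT id, data, ts FROM upstream_log_source;

-- Downstream, another engine (or a Flink streaming read) can query the
-- same table while the ingestion job keeps committing snapshots.
SELECT COUNT(*) FROM iceberg_catalog.db.logs;
```

Because commits are atomic snapshots, readers only ever see fully committed data, which is how the write and read paths stay isolated.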

Delete or update data

Most data warehouses do not support row-level data deletion or updates. In those systems, you typically run an offline job that reads all data from a source table, applies the changes, and writes the result back to the source table. With Iceberg, changes are applied to individual data files instead of entire tables, which narrows the scope of a change operation. This allows you to update or delete your business data at a much smaller cost.

In an Iceberg-based data lake, you can run a command that is similar to DELETE FROM test_table WHERE id > 10 to change data in a table.
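Beyond DELETE, Iceberg's Spark SQL integration also supports UPDATE and MERGE INTO for row-level changes. The following is a hedged sketch; the table and column names are illustrative.

```sql
-- Delete rows that match a predicate; only the affected data files are
-- rewritten, not the whole table.
DELETE FROM test_table WHERE id > 10;

-- Update matching rows in place.
UPDATE test_table
SET data = 'archived'
WHERE ts < TIMESTAMP '2021-01-01 00:00:00';

-- Upsert changes from a staging table into the target table.
MERGE INTO test_table t
USING updates_table u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.data = u.data
WHEN NOT MATCHED THEN INSERT (id, data, ts) VALUES (u.id, u.data, u.ts);
```

Each statement commits as a single ACID transaction, so concurrent readers see either the old snapshot or the new one, never a partial change.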

Control data quality

You can use an Iceberg schema to validate data as it is being written, and then delete the abnormal records or route them for further processing.

Change the schema of a table

You can use the DDL statements supported by Spark SQL to change the schema of an Iceberg table.

When you change the schema of an Iceberg table, you do not need to rewrite all historical data in the table based on the new schema. Therefore, schema changes are fast. Iceberg supports ACID transactions, which prevent schema changes from affecting in-flight read operations. This way, the data that you read stays consistent with the data that you write.
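A few representative schema-evolution statements, as exposed through Spark SQL, are sketched below; the table and column names are illustrative, and some statements require the Iceberg Spark SQL extensions to be enabled.

```sql
-- Add a new column; existing data files are not rewritten, and the new
-- column reads as NULL for historical rows.
ALTER TABLE test_table ADD COLUMN category STRING;

-- Rename a column; Iceberg tracks columns by ID, so the rename does not
-- invalidate existing data files.
ALTER TABLE test_table RENAME COLUMN data TO payload;

-- Widen a column type (for example, INT to BIGINT) without rewriting data.
ALTER TABLE test_table ALTER COLUMN id TYPE BIGINT;

-- Drop a column that is no longer needed.
ALTER TABLE test_table DROP COLUMN category;
```

Because these operations only update table metadata, they complete in seconds regardless of how much data the table holds.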

Real-time machine learning

In machine learning scenarios, a long period of time may be required to process data, such as cleansing, converting, and extracting features from data. You may also need to handle both historical data and real-time data. Iceberg simplifies these workflows: it provides a complete and reliable real-time stream in which data is cleansed, converted, and featurized, so you do not need to maintain separate pipelines for historical data and real-time data. Iceberg also provides a native SDK for Python, which meets the requirements of developers who use machine learning algorithms.