DataServing cluster

Alibaba Cloud E-MapReduce (EMR) provides DataServing clusters based on Apache HBase. This topic describes the features, use scenarios, and technical architecture of DataServing clusters.

Features

Apache HBase is an open source, distributed, column-oriented NoSQL database that provides high reliability, high performance, and high scalability, and supports real-time reads and writes. It is suitable for scenarios that require random, real-time access to large amounts of data.

Apache HBase seamlessly integrates with Apache Hadoop and works with Apache Phoenix to enable SQL-like queries on HBase tables.

Apache HBase uses Hadoop Distributed File System (HDFS) as the underlying storage. In cloud-based scenarios, you can also store data in Alibaba Cloud Object Storage Service (OSS), which improves flexibility and reduces storage costs.

Apache HBase can store large volumes of data and provides scalable storage and computing capabilities together with high read and write performance. It also lets you store dynamic columns, keep multiple versions of data, and manage the data lifecycle.
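
The following Python sketch illustrates what dynamic columns, multiple versions, and lifecycle management mean for a single table. It is a toy in-memory model, not the HBase client API: the class name ToyTable, its parameters max_versions and ttl_seconds, and the sample row keys and column names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
import time


class ToyTable:
    """Toy in-memory model of HBase-style cells. NOT the HBase API."""

    def __init__(self, max_versions=3, ttl_seconds=None):
        self.max_versions = max_versions  # versions kept per cell (assumed setting)
        self.ttl_seconds = ttl_seconds    # cell lifetime; None means never expire
        # row key -> column name -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        """Write a value. Columns are created on first use (dynamic columns)."""
        ts = time.time() if ts is None else ts
        cells = self.rows[row][column]
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)
        del cells[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column, versions=1, now=None):
        """Return up to `versions` unexpired values, newest first."""
        now = time.time() if now is None else now
        cells = self.rows.get(row, {}).get(column, [])
        live = [
            cell for cell in cells
            if self.ttl_seconds is None or now - cell[0] < self.ttl_seconds
        ]
        return [value for _, value in live[:versions]]


table = ToyTable(max_versions=2, ttl_seconds=60)
# Each row can carry its own set of columns.
table.put("user1", "profile:city", "Hangzhou", ts=100)
table.put("user1", "profile:city", "Beijing", ts=110)
table.put("user1", "profile:city", "Shanghai", ts=120)
table.put("user2", "risk:score", "0.7", ts=120)

latest_two = table.get("user1", "profile:city", versions=5, now=130)
after_expiry = table.get("user1", "profile:city", versions=5, now=200)
```

In this sketch, latest_two holds only the two newest values because max_versions is 2, and after_expiry is empty because every cell is older than the 60-second lifetime. In a real HBase table, the equivalent limits are set per column family rather than per table object.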

Scenarios

DataServing clusters are suitable for the following scenarios:

Analysis of data that is stored in dynamic columns, such as data related to risk control and user profiling
Analysis of objects such as images, videos, and web pages, and of large volumes of data that require high write performance, such as time series data from Internet of Things (IoT) and spatial data from Internet of Vehicles (IoV)
Analysis of feed streams, which requires low read and write latency and high concurrency

Architecture

You can deploy EMR HBase by using one of the following architectures: an integrated computing and storage architecture based on HDFS, or a storage-computing separation architecture based on OSS-HDFS. For more information, see Overview and Overview of the OSS-HDFS service.

In the preceding figure, EMR HBase is deployed by using the storage-computing separation architecture based on OSS-HDFS, and the EMR cluster is deployed with services such as HBase and HDFS.

EMR HBase stores HFiles and table metadata in OSS-HDFS and accesses the data that is stored in OSS-HDFS by using JindoData.
The HBase and JindoData processes are deployed on core nodes. The HDFS process is also deployed on core nodes to store WAL data of HBase. Core nodes do not support the auto scaling feature.
The HBase and JindoData processes are deployed on task nodes. The HDFS process is not deployed on task nodes. Task nodes support the auto scaling feature.
The HBase RegionServer on core and task nodes caches data blocks in memory or on the local disks of the nodes. To cache data on local disks, you must enable the BucketCache feature.
If EMR HBase is deployed by using the storage-computing separation architecture, you can use the BlockCache feature of HBase to cache data. You can also use JindoFSx to accelerate data access from local disks.
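
As a rough illustration of the BucketCache requirement above, the following Python sketch renders the two open source HBase properties that configure a file-backed BucketCache, hbase.bucketcache.ioengine and hbase.bucketcache.size, as an hbase-site.xml fragment. The helper names, the cache file path, and the 8192 MB size are assumptions made for this example. How EMR exposes these settings is not covered in this topic, so treat the output as a sketch rather than a recommended configuration.

```python
from xml.sax.saxutils import escape


def bucketcache_properties(cache_file, size_mb):
    """Build properties for a file-backed BucketCache (hypothetical helper)."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return {
        # "file:" tells HBase to keep the cache in a file on a local disk.
        "hbase.bucketcache.ioengine": f"file:{cache_file}",
        # Cache capacity in megabytes.
        "hbase.bucketcache.size": str(size_mb),
    }


def to_hbase_site_xml(properties):
    """Render a dict of properties as an hbase-site.xml fragment."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(value)}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines)


# The path and size below are placeholders, not EMR defaults.
props = bucketcache_properties("/mnt/disk1/hbase/bucketcache.data", 8192)
xml_fragment = to_hbase_site_xml(props)
print(xml_fragment)
```

The size of the cache file should fit within the free space of the local disk that holds it, and changes to these properties generally require a RegionServer restart to take effect.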

The storage-computing separation architecture provides the following advantages:

Low storage costs: OSS is used as storage, which reduces storage costs.
Low O&M costs: OSS is fully managed on the cloud, which reduces O&M costs.
Support for auto scaling: Computing resources can be scaled based on your business requirements.
Easy updates: The metadata and data of HBase tables are stored in OSS, so you can update HBase without migrating data.