Alibaba Cloud E-MapReduce (EMR) is suitable for various scenarios. This topic describes the main use scenarios of EMR.
Data lake scenario
DataLake clusters provide the services and the related data lake formats that are required for data lake analytics. The services include Hadoop, OSS-HDFS, Hive, Spark, and Presto. If you use OSS-HDFS, the YARN service of your cluster no longer depends on local Hadoop Distributed File System (HDFS) storage and no longer requires a core node group, which makes the cluster more elastic and flexible. You can also use Data Lake Formation (DLF) as a data catalog service. DLF provides unified metadata management and helps you manage the data in your data lakes, which simplifies and accelerates data lake governance.
In data lake scenarios, you can use a collection program to write event-tracking logs to OSS-HDFS in near real time and use Sqoop to periodically synchronize data from business databases to OSS-HDFS. In an EMR cluster, you can use Hive and Spark to clean and process the raw data and extract the metrics that your business requires, such as the number of daily active users, the user retention rate, and the number of new orders for a specific stock keeping unit (SKU). During the day, you can use the auto scaling mechanism to retain only the nodes you need and start an environment in which Trino or Presto is deployed to meet the daytime query requirements of data analysts and operations teams.
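The following PySpark sketch shows what the cleaning step described above might look like. The OSS-HDFS bucket, path, and log schema are hypothetical placeholders; substitute the values that apply to your own cluster.

```python
# A minimal sketch of the daily cleaning job, assuming hypothetical paths and schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-active-users").getOrCreate()

# Raw event-tracking logs that the collection program wrote to OSS-HDFS.
raw = spark.read.json(
    "oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/logs/events/dt=2024-06-01/"
)

# Clean the data: keep records that carry a user ID, then count distinct users.
dau = (
    raw.filter(F.col("user_id").isNotNull())
       .select("user_id")
       .distinct()
       .count()
)
print(f"Daily active users: {dau}")

spark.stop()
```

The same pattern extends to the other metrics: compute each one as a DataFrame transformation and write the results back to OSS-HDFS or to a serving store.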
Data analytics scenario
Online analytical processing (OLAP) clusters provide services such as StarRocks, Doris, and ClickHouse. All of these services offer efficient data compression, column-oriented storage, and parallel query execution, which ensure high performance in big data analytics. OLAP clusters are suitable for business scenarios such as user profiling, audience selection, BI reports, and business analytics.
Real-time data analytics solution
Procedure:
Real-time ingestion: Data is read directly from Kafka, and a Flink connector writes the Flink data streams to StarRocks with exactly-once semantics. In addition, the Flink change data capture (CDC) connector captures changes to transactional data and stores the results in StarRocks in real time. For a minimal sketch of this step, see the code after this list.
Data analytics: The data that is generated during real-time analysis can also be used for serving, which integrates real-time and offline data.
Real-time data modeling: Aggregate tables for real-time data modeling support real-time aggregation. A powerful engine and optimizer ensure high efficiency in real-time data modeling.
Real-time update: The delete-and-insert update policy is used, so data that shares the same primary key does not need to be merged at read time. The delete-and-insert policy delivers 3 to 15 times the performance of the merge-on-read (unique) policy.
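The following PyFlink sketch outlines the ingestion step above. The topic, table, and connection settings are hypothetical placeholders, and the sketch assumes that the Kafka and StarRocks Flink connector JARs are on the Flink classpath of your EMR cluster and that checkpointing is enabled for exactly-once delivery.

```python
# A minimal sketch of Kafka-to-StarRocks ingestion, assuming hypothetical endpoints.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: JSON order events from an EMR Kafka topic.
t_env.execute_sql("""
    CREATE TABLE kafka_orders (
        order_id BIGINT,
        sku_id   STRING,
        amount   DOUBLE,
        ts       TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'emr-kafka:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Sink: a StarRocks table reached through the Flink StarRocks connector.
t_env.execute_sql("""
    CREATE TABLE starrocks_orders (
        order_id BIGINT,
        sku_id   STRING,
        amount   DOUBLE,
        ts       TIMESTAMP(3)
    ) WITH (
        'connector' = 'starrocks',
        'jdbc-url' = 'jdbc:mysql://fe-host:9030',
        'load-url' = 'fe-host:8030',
        'database-name' = 'demo',
        'table-name' = 'orders',
        'username' = 'root',
        'password' = '',
        'sink.semantic' = 'exactly-once'
    )
""")

# The streaming INSERT keeps running and continuously delivers records.
t_env.execute_sql("INSERT INTO starrocks_orders SELECT * FROM kafka_orders")
```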
Data lakehouse analytics solution
Query layer: The cost-based optimization (CBO) feature and the query engine capabilities of StarRocks deliver query and computing performance that is 3 to 5 times higher than that of Trino.
Metadata management:
Supports multi-catalog management, seamless integration with the Hive metastore service (HMS), and custom catalogs that facilitate connection with the data lake formation services of cloud vendors. For an illustrative example, see the sketch after this list.
Supports standard file formats such as Parquet, ORC, and CSV, and implements late materialization and small-file merging for reads and writes.
Supports various data lake formats, such as Hudi, Iceberg, Delta Lake, and Paimon.
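As an illustration of multi-catalog management, the following sketch registers an existing Hive metastore as an external catalog. StarRocks is compatible with the MySQL protocol, so a standard client such as pymysql can be used; the host names and the metastore URI are placeholders for your own environment.

```python
# A hedged sketch: register a Hive metastore as a StarRocks external catalog.
import pymysql

conn = pymysql.connect(host="fe-host", port=9030, user="root", password="")
with conn.cursor() as cur:
    # Point StarRocks at the Hive metastore; existing Hive tables become queryable.
    cur.execute("""
        CREATE EXTERNAL CATALOG hive_catalog
        PROPERTIES (
            "type" = "hive",
            "hive.metastore.uris" = "thrift://metastore-host:9083"
        )
    """)
    cur.execute("SHOW CATALOGS")
    print(cur.fetchall())
conn.close()
```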
Procedure:
Federated analysis: The details of the underlying data sources are abstracted away. Joint analysis of data from heterogeneous data sources is supported, as is joint analysis of real-time and offline data.
Query acceleration: Policies that push computation close to the data, such as expression pushdown and aggregation pushdown, are used, together with optimizations of the distributed read mode and the data sources. Technologies such as vectorized parsing of ORC and Parquet data, dictionary-based filtering, and late materialization are supported (see the query sketch after this list).
Test results: In TPC-H and Hive query tests under the same conditions, performance is 3 to 5 times that of Presto, and the same performance can be achieved with only one third of the Presto resources.
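Continuing the catalog sketch above, the following hypothetical query reads a Parquet or ORC table in the lake through the registered catalog; the database and table names are assumptions.

```python
# A hedged sketch: query a lake table through the external catalog registered above.
import pymysql

conn = pymysql.connect(host="fe-host", port=9030, user="root", password="")
with conn.cursor() as cur:
    cur.execute("""
        SELECT sku_id, SUM(amount) AS revenue
        FROM hive_catalog.sales_db.orders
        WHERE dt = '2024-06-01'
        GROUP BY sku_id
        ORDER BY revenue DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```

Pushdown and late materialization are applied by the engine; the client only issues standard SQL.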
Real-time data streaming scenario
Dataflow clusters provide services such as Flink, Kafka, and Paimon. These clusters address various end-to-end (E2E) real-time computing requirements and are widely used in real-time data extract, transform, and load (ETL) scenarios and in log collection and analysis scenarios.
You can use a collection program to deliver business data, logs, and event-tracking data to EMR Kafka. Then, you can use the real-time computing capabilities of Flink to write the data to different analytics systems, such as EMR StarRocks, EMR HBase, and Alibaba Cloud Hologres, for real-time analysis, point queries, and BI report analysis.
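A minimal sketch of the collection side follows, assuming the kafka-python package and placeholder broker and topic names; downstream Flink jobs then fan the data out to the analytics systems mentioned above.

```python
# A hedged sketch: deliver one event-tracking record to an EMR Kafka topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="emr-kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": "u-1001", "action": "page_view", "page": "/home", "ts": 1717200000}
producer.send("tracking-events", event)
producer.flush()  # ensure the record is delivered before the program exits
```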
Data serving scenario
DataServing clusters provide services such as HBase, Phoenix, and OSS-HDFS. You can use the HBase and OSS-HDFS services together to store HBase data in a data lake and write HBase write-ahead logs (WALs) to the local HDFS or the OSS-HDFS of your cluster. This compute-storage separation architecture mitigates the storage pressure on your cluster. If your data is stored in a data lake, you can also restore an HBase cluster with ease.
In data warehousing scenarios, ETL computing can produce tag information about users in addition to their basic information. The tags include hobbies, topics of interest, and search keywords. You can write a program that writes the user information that is added or modified each day to an EMR HBase cluster. Based on the user profile data in the EMR HBase cluster, you can build a selection service and deliver advertisements to a specific range of users as your business promotions require. If the data is stored in OSS-HDFS, you can create another EMR HBase cluster whose HFile storage path is the same as that of the existing cluster and use the new cluster as a read-only cluster to reduce the read and write load on the existing cluster.
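The following sketch illustrates the serving path, assuming that the HBase Thrift server is enabled and using hypothetical table and column names; happybase is one possible client.

```python
# A hedged sketch: write daily tag updates to EMR HBase and read back one profile.
import happybase

conn = happybase.Connection(host="emr-hbase-thrift", port=9090)
table = conn.table("user_profile")

# Write the tags that were added or modified today for one user.
table.put(b"u-1001", {
    b"tags:hobby": b"photography",
    b"tags:topic": b"3c-digital",
    b"tags:search_kw": b"mirrorless camera",
})

# Point query issued by the selection service when an ad campaign targets users.
print(table.row(b"u-1001"))
conn.close()
```

A read-only cluster that shares the same HFile storage path would serve such point queries without adding load to the cluster that receives the daily writes.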