Community Blog Using Apache Spark for Data Processing and Analysis

Using Apache Spark for Data Processing and Analysis

In this article, you will learn to accelerate your data processing and analysis across Apache Spark Relational Cache, Mesos, Akka, Cassandra, and Kafka.

Use Relational Cache to Accelerate EMR Spark in Data Analysis

This article looks into what cache and relational cache is and how you can use it to accelerate EMR spark in data analysis operations.

Cache can broadly be defined as hardware or software of some kind that is used in various fields and directions of data processing. For many computing devices, I/O access speeds are highly dependent on the storage medium used for caching. As such, typically, the I/O access speeds of HDDs slower than SDDs, which are slower than NVMes, which comes after Mem, and L3-L2-L1 Cache, and Register, and the CPU. Generally speaking, the closer the storage device is to the CPU, the lower the speed gap between computing and I/O access, and the faster the data processing speed. However, the costs increase as speeds increase, with capacity typically decreasing as well. Cache pushes the data to be processed closer to the computation at the cost of more resource consumption, accelerating data processing operations, and filling the gap between computing and I/O access speeds.

For Spark, specifically, file systems, such as HDFS cache and Alluxio, provide file-level cache services. By caching files to the memory, the data processing speed is accelerated, and it is completely transparent to computing frameworks, such as Spark.

In addition, another method for caching is also available. If the same data needs to be processed multiple times, and the processing logic is similar, we can cache the intermediate results, so that each time we process the data, we start processing from the intermediate results, eliminating the computation from the original data to the intermediate results. Cached data is closer to the computing result. Compared with the original data, the results can be obtained through less computation, and the processing speed can also be accelerated. Materialized views in data warehouses are typical applications of this cache type.

Spark also provides dataset-level cache. You can use SQL DDLs or Dataset APIs to cache relational data (not files) with schema information located in the memory. Subsequent data processing based on the dataset can save time for computing the dataset by directly reading the cached data in the memory. Unlike the materialized view in data warehouses, the current Spark dataset cache still has many shortcomings:

  1. Spark cached dataset can only be reused in the same Spark Context, and cannot be shared across Spark contexts. When Spark Context exits, the cached data is also deleted.
  2. The dataset cache only supports precise matching and reuse of execution plans. In other words, only when the execution plans of subsequent queries accurately matches the execution plans of cached datasets, can cache be used to optimize queries, which greatly reduces the optimization scope of cache.
  3. The cached dataset data can only be stored in memory or local disk. Larger data volumes require more memory. The persistent data is serialized binary data with no data schema information, which is costly to deserialize and SQL optimization processing, such as project filter push-down, cannot be supported.

How Relational Cache Works

Because of the shortcomings mentioned above, Spark dataset cache is not widely used in practical applications, and it often cannot meet the requirements of interactive analysis scenarios, such as Star model-based multi-dimensional data analysis. Generally, the cube is built in advance and the SQL execution plan is rewritten, so to meet the sub-second level interactive analysis requirements. Relational Cache, conversely, aims to take into account the usability of Spark dataset cache and the optimization efficiency of materialized views. Its main objectives include:

  1. Users can cache any relational data, including tables, views or datasets. The cache support for any relational data can greatly extend the scope of use of the Relational Cache. Any use case that includes repeated computing or pre-determined computing logic, such as multi-dimensional data analysis, reports, dashboards, and ETL, may benefit from Relational Cache.
  2. The cached data can be stored in memory, local disk, or any datasource supported by Spark. Temporarily cached data stored in memory can be accessed quickly, but cross-Spark Context sharing is not supported. For cache with a large amount of data, for example, the materialized view or cube built by many enterprises, may reach the PB level. In this case, Relational Cache is obviously more suitable to be stored in persistent distributed file systems, such as HDFS and OSS.
  3. Cached data can be used to optimize any subsequent user queries that can be optimized.

Related Blogs

Use EMR Spark Relational Cache to Synchronize Data Across Clusters

This article looks at EMR Spark Relational Cache, how it can be useful in a number of scenarios, and how use it to synchronize Data Across two clusters.

Much like how you can use relational cache to accelerate EMR spark in data analysis, you can also use EMR Spark relational cache to synchronize data across clusters. This article shows you how.

Before we get too deep into things, let's first discuss a few things. First of all, let's look into what Relational Cache actually is. Well, it works as an important feature in EMR Spark that accelerates data analysis mainly by pre-organizing and pre-computing data and provides functionality similar to the materialized view in traditional data warehousing. In addition to speeding up data processing, Relational Cache can also be applied in many other scenarios. This article will mainly describe how to use Relational Cache to synchronize data tables.

Managing all data through a unified data lake is a goal for many companies. However, the existence of different data centers, network regions, and departments in reality inevitably leads to different big data clusters, and data synchronization across different clusters is a common need. In addition, synchronization of older and newer data after cluster migration and site migration is also a common problem. Data synchronization is very complicated and a laborious task. Lots of customized development and manual intervention are required for tasks. This includes migration tool development, incremental data processing, synchronous read and write, and subsequent data comparison. With all of these problems and needs considered, Relational Cache can simplify data synchronization and allow you to implement cross-cluster data synchronization at a relatively low cost.

In the following section, a specific example is given to show how to use EMR Spark Relational Cache to implement cross-cluster data synchronization.

Use Relational Cache to Synchronize Data

Assume that we have two clusters (for convenience purposes, let's call them Cluster A and B) and you need to synchronize the data in the activity_log table from Cluster A to Cluster B. Also, another thing is that, during the migration process, new data is continuously inserted into the activity_log table. Create an activity_log table in Cluster A:

CREATE TABLE activity_log (
  user_id STRING,
  act_type STRING,
  module_id INT,
  d_year INT)

Big Data Storage and Spark on Kubernetes

In this article, let's discuss the most important topic in the big data field - storage. Big data is increasingly impacting our everyday life. Data analysis, data recommendation and data decisions based on big data can be found in almost all scenarios, from traveling, buying houses to ordering food delivery and calling rideshare services.

Higher requirements on data multidimensionality and data integrity must be met so that better and more accurate decisions can be made based on big data. In the near future, data volumes are expected to grow bigger and bigger. Especially with 5G on the way, data throughput will increase exponentially, and data dimensions and sources will also increase. Data will also have increasingly heterogeneous types. All these trends bring new challenges to big data platforms. The industry wants low-cost and high-capacity storage options with fast read/write speeds. This article discusses demands, their underlying challenges, and how Alibaba Cloud container services including Spark on Kubernetes can meet these challenges in different business scenarios.

Computing and Storage of Containerized Big Data

The separation of computing and storage is an issue in the big data field that is frequently discussed from the following angles:

  1. Hardware Limitations: Bandwidth on machines is increasing exponentially, while disk throughput often remains unchanged, making local data read/write less advantageous.
  2. Computing Costs: The gap between the magnitude of computing and storage results in a significant waste in the computing power.
  3. Storage Costs: Centralized storage can reduce the storage cost and ensure higher SLAs at the same time, and building data warehouses less competitive.

These three problems have become increasingly prominent as we move to the era of containers on the cloud. With Kubernetes, Pods run in the underlying resource pool, and storage needed for Pods are dynamically assigned or mounted by using PVs or PVCs. The architecture of containers at some level features the separation of computing and storage. But, one common question is what changes and advantages may be brought by big data container clusters that adopt the separation of storage and computing?

Cost-Efficiency and Bringing Down Costs

Generally a D-series machine is used when we create a Spark big data platform on Alibaba Cloud. Then a series of basic components like HDFS and Hadoop are built on it and Spark tasks are scheduled through Yarm and run on this cluster. A D-series machine has Intranet bandwidth ranging from 3 Gbit/s to 20 Gbit/s and can be bound to between four to twenty-eight 5.5 TB local disks by default. Because disk I/O on the cloud and network I/O are shared and the I/O of local disks remains independent, the combination of D-series and local disks shows better I/O performance than the combination of cloud disks and traditional machines with the same specification.

However, in actual production, data stored increases over time. Because data usually features a certain level of timeliness, the computing power in a time unit does not always match the increase in data storage, resulting in wasted costs. There one issue is what changes may happen if we follow the "separation of computing and storage" principle and use external storage products like OSS, NAS or DFS (Alibaba Cloud HDFS)?

To avoid the impact of storage I/O differences, use a remote DFS as the file storage system. We choose two popular machines and compare their performance: ecs.ebmhfg5.2xlarge (8-core, 32 GB, 6 Gbit/s) and ecs.d1ne. 2xlarge (8-core, 32 GB, 6 Gbit/s), which are designed for computing scenarios and big data scenarios respectively and have the same specifications.

Data Processing with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka

This article introduces the SMACK (Spark, Mesos, Akka, Cassandra, and Kafka) stack and illustrates how you can use it to build scalable data processin.

Data Processing with SMACK: Spark, Mesos, Akka, Cassandra, and Kafka

This article introduces the SMACK (Spark, Mesos, Akka, Cassandra, and Kafka) stack and illustrates how you can use it to build scalable data processing platforms While the SMACK stack is really concise and consists of only several components, it is possible to implement different system designs within it which list not only purely batch or stream processing, but also contain more complex Lambda and Kappa architectures as well.

What is the SMACK stack?

First, let’s talk a little bit about what SMACK is. Here’s a quick rundown of the technologies that are included in it:

SMACK stack

Spark - a fast and general engine for distributed large-scale data processing.

Mesos - a cluster resource management system that provides efficient resource isolation and sharing across distributed applications.

Akka - a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.

Cassandra - a distributed highly available database designed to handle large amounts of data across multiple datacenters.

Kafka - a high-throughput, low-latency distributed messaging system/commit log designed for handling real-time data feeds.

Storage layer: Cassandra

Although not in alphabetical order, first let’s start with the C in SMACK. Cassandra is well known for its high-availability and high-throughput characteristics and is able to handle enormous write loads and survive cluster nodes failures. In terms of CAP theorem, Cassandra provides tunable consistency/availability for operations.

What is most interesting here is that when it comes to data processing, Cassandra is linearly scalable (increased loads could be addressed by just adding more nodes to a cluster) and it provides cross-datacenter replication (XDCR) capabilities.

Related Courses

How to Use Spark on Cloud Series - Setup

The basic concepts and architecture of Spark will be covered in this course, we will also explain how to setup and configure Linux operating system and Spark running environment on Alibaba Clouod ECS server through demo. Spark will be running on a single instance in local mode in this course.

How to Use Spark on Cloud Series 2 - Spark Python

In this course we will show you how to write basic data analysis programs on Spark in Python. Through these programs we will introduce some concepts like Spark RDD, Dataframe, Streaming, etc. You'll also learn how to use Python's generators, anonymous functions.

Riding Analysis of Shared Bicycles

This course explains the related knowledge of sharing bicycles, large data application scenarios, commonly used data analysis methods, algorithms, and data visualization. We also introduce MaxCompute, DataWorks and Quick BI in detail. Finally, an experiment was conducted to lead the trainees to use the data analysis method to solve the analysis requirements of sharing bicycle scheduling scenarios and the core indicators that enterprises concerned. Students can refer to this experiment and combine their own business and needs to apply their learning to practice.

Related Market Products

Delivery Business Data Analysis

Understand the basic data and business related to the current take-out industry, and analyze relevant industry data using the Alibaba Cloud Big Data Platform.

Alibaba Cloud Analysis Architecture and Case Studies

The objective of this course is to introduce the core services of Alibaba Cloud Analysis Architecture (E-MapReduce, MaxCompute, Table Store) and to show you some classic use cases.

Related Documentation

Perform association query and analysis on logs and the data from databases

The scenario where data is distributed across different regions is common in log analysis. In this scenario, you need to perform hierarchical analysis on user data based on both the logs and the data from databases. The result is written to databases and can be queried through report systems. Association query of Logstores and databases is required.

Background information

  1. User log data: Taking game logs as an example, a classic game log contains properties such as the operation, target, blood, mana, network, payment method, click location, status code, user ID.
  2. User metadata: Logs record incremental events. However, static user information, such as the gender, registration time, and region, is fixed and hard to be obtained on the client. The static user information cannot be recorded in logs. The static user information is known as user metadata.
  3. Association analysis of Logstores and ApsaraDB RDS for MySQL instances: The query and analysis engine of Log Service can perform association query and analysis of Logstores and ExternalStores. You can use the SQL JOIN syntax to associate logs and metadata to analyze metrics related with user properties. In addition to referencing ExternalStores for association query and analysis, Log Service also supports writing results to ExternalStores such as ApsaraDB RDS for MySQL instances.

This allows you to further process the results.

  1. Logstore: allows you to collect, store, query, and analyze logs.
  2. ExternalStore: maps data to ApsaraDB for RDS tables. Developers can store the user information in the tables.

Multidimensional analysis of data from IoT sensors

This topic provides a use case to describe how to use Realtime Compute to analyze data from IoT sensors in multiple dimensions.


With the economic tidal wave of globalization sweeping over the world, industrial manufacturers are facing increasingly fierce competition. To increase competitiveness, manufacturers in the automotive, aviation, high-tech, food and beverage, textile, and pharmaceutical industries must innovate and replace the existing infrastructure. These industries have to address many challenges during the innovation process. For example, the existing traditional devices and systems have been used for decades, which results in high maintenance costs. However, replacing these systems and devices may slow down the production process and compromise the product quality.

These industries face two additional challenges, which are high security risks and the urgent need for complex process automation. The manufacturing industry has prepared to replace the existing traditional devices and systems. In this industry, high reliability and availability systems are needed to ensure the safety and stability of real-time operations. A manufacturing process involves a wide range of components, such as robotic arms, assembly lines, and packaging machines. This requires remote applications that can seamlessly integrate each stage of the manufacturing process, including the deployment, update, and end-of-life management of devices. The remote applications also need to handle failover issues.

Another requirement for these next-generation systems and applications is that they be able to capture and analyze the large amounts of data generated by devices, and respond appropriately in a timely manner. To increase competitiveness and accelerate development, manufacturers need to optimize and upgrade their existing systems and devices. The application of Realtime Compute and Alibaba Cloud IoT solutions allows you to analyze device running information, detect faults, and predict yield rates in real time. This topic describes a use case as an example. In this use case, a manufacturer uses Realtime Compute to analyze the large amounts of data collected from sensors in real time. Realtime Compute is also used to cleanse and aggregate data in real time, write data into an online analytical processing (OLAP) system in real time, and monitor the key metrics of devices in real time.

Related Products

Alibaba Cloud Elasticsearch

Alibaba Cloud Elasticsearch is based on the open-source Elasticsearch engine and provides commercial features. Designed for scenarios such as search and analytics, Alibaba Cloud Elasticsearch features enterprise-level access control, security monitoring, and automatic updates.

Realtime Compute

Realtime Compute for Apache Flink offers a highly integrated platform for real-time data processing, which optimizes the computing of Apache Flink. With Realtime Compute, we are striving to deliver new solutions to help you upgrade your big data capabilities in your digital transformations.

0 0 0
Share on

Alibaba Clouder

2,600 posts | 754 followers

You may also like