This article looks into what cache and relational cache is and how you can use it to accelerate EMR spark in data analysis operations.
Cache can broadly be defined as hardware or software of some kind that is used in various fields and directions of data processing. For many computing devices, I/O access speeds are highly dependent on the storage medium used for caching. As such, typically, the I/O access speeds of HDDs slower than SDDs, which are slower than NVMes, which comes after Mem, and L3-L2-L1 Cache, and Register, and the CPU. Generally speaking, the closer the storage device is to the CPU, the lower the speed gap between computing and I/O access, and the faster the data processing speed. However, the costs increase as speeds increase, with capacity typically decreasing as well. Cache pushes the data to be processed closer to the computation at the cost of more resource consumption, accelerating data processing operations, and filling the gap between computing and I/O access speeds.
For Spark, specifically, file systems, such as HDFS cache and Alluxio, provide file-level cache services. By caching files to the memory, the data processing speed is accelerated, and it is completely transparent to computing frameworks, such as Spark.
In addition, another method for caching is also available. If the same data needs to be processed multiple times, and the processing logic is similar, we can cache the intermediate results, so that each time we process the data, we start processing from the intermediate results, eliminating the computation from the original data to the intermediate results. Cached data is closer to the computing result. Compared with the original data, the results can be obtained through less computation, and the processing speed can also be accelerated. Materialized views in data warehouses are typical applications of this cache type.
Spark also provides dataset-level cache. You can use SQL DDLs or Dataset APIs to cache relational data (not files) with schema information located in the memory. Subsequent data processing based on the dataset can save time for computing the dataset by directly reading the cached data in the memory. Unlike the materialized view in data warehouses, the current Spark dataset cache still has many shortcomings:
Because of the shortcomings mentioned above, Spark dataset cache is not widely used in practical applications, and it often cannot meet the requirements of interactive analysis scenarios, such as Star model-based multi-dimensional data analysis. Generally, the cube is built in advance and the SQL execution plan is rewritten, so to meet the sub-second level interactive analysis requirements. Relational Cache, conversely, aims to take into account the usability of Spark dataset cache and the optimization efficiency of materialized views. Its main objectives include:
This article looks at EMR Spark Relational Cache, how it can be useful in a number of scenarios, and how use it to synchronize Data Across two clusters.
Much like how you can use relational cache to accelerate EMR spark in data analysis, you can also use EMR Spark relational cache to synchronize data across clusters. This article shows you how.
Before we get too deep into things, let's first discuss a few things. First of all, let's look into what Relational Cache actually is. Well, it works as an important feature in EMR Spark that accelerates data analysis mainly by pre-organizing and pre-computing data and provides functionality similar to the materialized view in traditional data warehousing. In addition to speeding up data processing, Relational Cache can also be applied in many other scenarios. This article will mainly describe how to use Relational Cache to synchronize data tables.
Managing all data through a unified data lake is a goal for many companies. However, the existence of different data centers, network regions, and departments in reality inevitably leads to different big data clusters, and data synchronization across different clusters is a common need. In addition, synchronization of older and newer data after cluster migration and site migration is also a common problem. Data synchronization is very complicated and a laborious task. Lots of customized development and manual intervention are required for tasks. This includes migration tool development, incremental data processing, synchronous read and write, and subsequent data comparison. With all of these problems and needs considered, Relational Cache can simplify data synchronization and allow you to implement cross-cluster data synchronization at a relatively low cost.
In the following section, a specific example is given to show how to use EMR Spark Relational Cache to implement cross-cluster data synchronization.
Assume that we have two clusters (for convenience purposes, let's call them Cluster A and B) and you need to synchronize the data in the activity_log table from Cluster A to Cluster B. Also, another thing is that, during the migration process, new data is continuously inserted into the activity_log table. Create an activity_log table in Cluster A:
CREATE TABLE activity_log ( user_id STRING, act_type STRING, module_id INT, d_year INT) USING JSON PARTITIONED BY (d_year)
In this article, let's discuss the most important topic in the big data field - storage. Big data is increasingly impacting our everyday life. Data analysis, data recommendation and data decisions based on big data can be found in almost all scenarios, from traveling, buying houses to ordering food delivery and calling rideshare services.
Higher requirements on data multidimensionality and data integrity must be met so that better and more accurate decisions can be made based on big data. In the near future, data volumes are expected to grow bigger and bigger. Especially with 5G on the way, data throughput will increase exponentially, and data dimensions and sources will also increase. Data will also have increasingly heterogeneous types. All these trends bring new challenges to big data platforms. The industry wants low-cost and high-capacity storage options with fast read/write speeds. This article discusses demands, their underlying challenges, and how Alibaba Cloud container services including Spark on Kubernetes can meet these challenges in different business scenarios.
The separation of computing and storage is an issue in the big data field that is frequently discussed from the following angles:
These three problems have become increasingly prominent as we move to the era of containers on the cloud. With Kubernetes, Pods run in the underlying resource pool, and storage needed for Pods are dynamically assigned or mounted by using PVs or PVCs. The architecture of containers at some level features the separation of computing and storage. But, one common question is what changes and advantages may be brought by big data container clusters that adopt the separation of storage and computing?
Generally a D-series machine is used when we create a Spark big data platform on Alibaba Cloud. Then a series of basic components like HDFS and Hadoop are built on it and Spark tasks are scheduled through Yarm and run on this cluster. A D-series machine has Intranet bandwidth ranging from 3 Gbit/s to 20 Gbit/s and can be bound to between four to twenty-eight 5.5 TB local disks by default. Because disk I/O on the cloud and network I/O are shared and the I/O of local disks remains independent, the combination of D-series and local disks shows better I/O performance than the combination of cloud disks and traditional machines with the same specification.
However, in actual production, data stored increases over time. Because data usually features a certain level of timeliness, the computing power in a time unit does not always match the increase in data storage, resulting in wasted costs. There one issue is what changes may happen if we follow the "separation of computing and storage" principle and use external storage products like OSS, NAS or DFS (Alibaba Cloud HDFS)?
To avoid the impact of storage I/O differences, use a remote DFS as the file storage system. We choose two popular machines and compare their performance: ecs.ebmhfg5.2xlarge (8-core, 32 GB, 6 Gbit/s) and ecs.d1ne. 2xlarge (8-core, 32 GB, 6 Gbit/s), which are designed for computing scenarios and big data scenarios respectively and have the same specifications.
This article introduces the SMACK (Spark, Mesos, Akka, Cassandra, and Kafka) stack and illustrates how you can use it to build scalable data processin.
This article introduces the SMACK (Spark, Mesos, Akka, Cassandra, and Kafka) stack and illustrates how you can use it to build scalable data processing platforms While the SMACK stack is really concise and consists of only several components, it is possible to implement different system designs within it which list not only purely batch or stream processing, but also contain more complex Lambda and Kappa architectures as well.
First, let’s talk a little bit about what SMACK is. Here’s a quick rundown of the technologies that are included in it:
Spark - a fast and general engine for distributed large-scale data processing.
Mesos - a cluster resource management system that provides efficient resource isolation and sharing across distributed applications.
Akka - a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.
Cassandra - a distributed highly available database designed to handle large amounts of data across multiple datacenters.
Kafka - a high-throughput, low-latency distributed messaging system/commit log designed for handling real-time data feeds.
Although not in alphabetical order, first let’s start with the C in SMACK. Cassandra is well known for its high-availability and high-throughput characteristics and is able to handle enormous write loads and survive cluster nodes failures. In terms of CAP theorem, Cassandra provides tunable consistency/availability for operations.
What is most interesting here is that when it comes to data processing, Cassandra is linearly scalable (increased loads could be addressed by just adding more nodes to a cluster) and it provides cross-datacenter replication (XDCR) capabilities.
The basic concepts and architecture of Spark will be covered in this course, we will also explain how to setup and configure Linux operating system and Spark running environment on Alibaba Clouod ECS server through demo. Spark will be running on a single instance in local mode in this course.
In this course we will show you how to write basic data analysis programs on Spark in Python. Through these programs we will introduce some concepts like Spark RDD, Dataframe, Streaming, etc. You'll also learn how to use Python's generators, anonymous functions.
This course explains the related knowledge of sharing bicycles, large data application scenarios, commonly used data analysis methods, algorithms, and data visualization. We also introduce MaxCompute, DataWorks and Quick BI in detail. Finally, an experiment was conducted to lead the trainees to use the data analysis method to solve the analysis requirements of sharing bicycle scheduling scenarios and the core indicators that enterprises concerned. Students can refer to this experiment and combine their own business and needs to apply their learning to practice.
Understand the basic data and business related to the current take-out industry, and analyze relevant industry data using the Alibaba Cloud Big Data Platform.
The objective of this course is to introduce the core services of Alibaba Cloud Analysis Architecture (E-MapReduce, MaxCompute, Table Store) and to show you some classic use cases.
The scenario where data is distributed across different regions is common in log analysis. In this scenario, you need to perform hierarchical analysis on user data based on both the logs and the data from databases. The result is written to databases and can be queried through report systems. Association query of Logstores and databases is required.
This allows you to further process the results.
This topic provides a use case to describe how to use Realtime Compute to analyze data from IoT sensors in multiple dimensions.
With the economic tidal wave of globalization sweeping over the world, industrial manufacturers are facing increasingly fierce competition. To increase competitiveness, manufacturers in the automotive, aviation, high-tech, food and beverage, textile, and pharmaceutical industries must innovate and replace the existing infrastructure. These industries have to address many challenges during the innovation process. For example, the existing traditional devices and systems have been used for decades, which results in high maintenance costs. However, replacing these systems and devices may slow down the production process and compromise the product quality.
These industries face two additional challenges, which are high security risks and the urgent need for complex process automation. The manufacturing industry has prepared to replace the existing traditional devices and systems. In this industry, high reliability and availability systems are needed to ensure the safety and stability of real-time operations. A manufacturing process involves a wide range of components, such as robotic arms, assembly lines, and packaging machines. This requires remote applications that can seamlessly integrate each stage of the manufacturing process, including the deployment, update, and end-of-life management of devices. The remote applications also need to handle failover issues.
Another requirement for these next-generation systems and applications is that they be able to capture and analyze the large amounts of data generated by devices, and respond appropriately in a timely manner. To increase competitiveness and accelerate development, manufacturers need to optimize and upgrade their existing systems and devices. The application of Realtime Compute and Alibaba Cloud IoT solutions allows you to analyze device running information, detect faults, and predict yield rates in real time. This topic describes a use case as an example. In this use case, a manufacturer uses Realtime Compute to analyze the large amounts of data collected from sensors in real time. Realtime Compute is also used to cleanse and aggregate data in real time, write data into an online analytical processing (OLAP) system in real time, and monitor the key metrics of devices in real time.
Alibaba Cloud Elasticsearch is based on the open-source Elasticsearch engine and provides commercial features. Designed for scenarios such as search and analytics, Alibaba Cloud Elasticsearch features enterprise-level access control, security monitoring, and automatic updates.
Realtime Compute for Apache Flink offers a highly integrated platform for real-time data processing, which optimizes the computing of Apache Flink. With Realtime Compute, we are striving to deliver new solutions to help you upgrade your big data capabilities in your digital transformations.
Alibaba Clouder - April 9, 2019
Alibaba Cloud Security - July 31, 2018
Alibaba EMR - May 7, 2020
Alibaba EMR - August 28, 2019
Apache Flink Community China - December 25, 2019
Alibaba Clouder - July 26, 2019
More Posts by Alibaba Clouder