Best Practices for Big Data Processing in Spark

Lecturer: Jianfeng, Head of Alibaba Cloud EMR Platform

Outline:

Big Data Overview
How to Move Beyond the Technical Novice Stage
Spark SQL Learning Framework
Best Practices for Big Data on EMR Studio

I. Big Data Overview

Big Data Processing: ETL (Data → Data)
Big Data Analysis: BI (Data → Dashboard)
Machine Learning: AI (Data → Model)

II. How to Move Beyond the Technical Novice Stage

What is a Technical Novice?

A technical novice understands only the surface of a technology, not the ideas behind it.

For example, you copy other people's Spark code without understanding Spark's internal mechanisms, so you do not know how to tune Spark jobs.

Moving Beyond the Novice Stage

Understand the operating mechanism
Learn to configure the Spark App
Learn to read the log

Understanding the Operating Mechanism: Spark SQL Architecture

Learn to Configure the Spark App

1. Configure Driver

spark.driver.memory
spark.driver.cores

2. Configure Executor

spark.executor.memory
spark.executor.cores

3. Configure Runtime

spark.files
spark.jars

4. Configure DAE

Learn more: https://spark.apache.org/docs/latest/configuration.html
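As a rough sketch of how these settings fit together, the snippet below assembles a spark-submit command from a dictionary of properties. The property names are the ones listed above, and --conf key=value is standard spark-submit syntax. The values (4g, 2, and so on), the file names, and the helper build_spark_submit are made up for illustration and are not tuning recommendations.

```python
import shlex

# Hypothetical values for illustration only; real values depend on the job and the cluster.
SPARK_CONF = {
    "spark.driver.memory": "4g",      # heap size of the driver process
    "spark.driver.cores": "2",        # cores for the driver (cluster mode)
    "spark.executor.memory": "8g",    # heap size of each executor
    "spark.executor.cores": "4",      # cores per executor
    "spark.files": "lookup.csv",      # hypothetical file shipped to every executor
    "spark.jars": "udfs.jar",         # hypothetical jar added to the classpath
}


def build_spark_submit(app_path, conf):
    """Return the spark-submit argument list for an application and its properties."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without Spark installed.
    print(shlex.join(build_spark_submit("etl_job.py", SPARK_CONF)))
```

Keeping the properties in one dictionary makes it easy to see, and to version, exactly which settings a job was submitted with.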

Learn to Read the Log: Spark Log

III. Spark SQL Learning Framework

Spark SQL Learning Framework (Illustrated with Graphics/Geometry)

1. Select Rows

2. Select Columns

3. Transform Columns

4. Group By / Aggregation

5. Join
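The five operations above can be sketched as one SQL query each. The queries use only standard SQL of the kind Spark SQL also accepts, but to keep the example runnable without a cluster they are executed here with Python's built-in sqlite3 module rather than with Spark. The tables, columns, and data are invented for illustration.

```python
import sqlite3

# In-memory database with two made-up tables (not from the original talk).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 30.0), (3, 2, 5.0);
""")

QUERIES = {
    # 1. Select rows: keep only the rows that satisfy a condition.
    "select_rows": "SELECT id, customer_id, amount FROM orders WHERE amount > 8 ORDER BY id",
    # 2. Select columns: keep only some of the columns.
    "select_columns": "SELECT id, amount FROM orders ORDER BY id",
    # 3. Transform columns: derive a new column from existing ones.
    "transform_columns": "SELECT id, amount * 2 AS doubled FROM orders ORDER BY id",
    # 4. Group by / aggregation: collapse many rows into one row per key.
    "group_by": "SELECT customer_id, SUM(amount) AS total FROM orders "
                "GROUP BY customer_id ORDER BY customer_id",
    # 5. Join: combine rows from two tables on a matching key.
    "join": "SELECT c.name, o.amount FROM orders o "
            "JOIN customers c ON o.customer_id = c.id ORDER BY o.id",
}


def run(name):
    """Execute one of the named queries and return all result rows."""
    return conn.execute(QUERIES[name]).fetchall()


if __name__ == "__main__":
    for name in QUERIES:
        print(name, run(name))
```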

Spark SQL Execution Plan

1. Spark SQL – Where

2. Spark SQL – Group By

3. Spark SQL – Order By
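As a rough mental model, and assuming the plans Spark typically produces for these statements: a WHERE filter runs independently inside each partition with no shuffle; a GROUP BY aggregates partially inside each partition, shuffles the partial results by key, and then merges them; an ORDER BY redistributes rows into value ranges and sorts inside each range. The pure-Python sketch below imitates these three cases on in-memory lists. It is a toy model, not Spark code, all function names are invented for illustration, and the range boundaries are passed in by hand, whereas Spark derives them by sampling the data.

```python
import bisect
from collections import defaultdict


def where(partitions, predicate):
    """WHERE: filter each partition on its own; no data moves between partitions."""
    return [[row for row in part if predicate(row)] for part in partitions]


def partial_sum(partition):
    """GROUP BY, step 1: sum the amounts per key inside a single partition."""
    sums = defaultdict(float)
    for key, amount in partition:
        sums[key] += amount
    return dict(sums)


def shuffle_by_key(partials, num_reducers):
    """GROUP BY, step 2: send every key to exactly one reducer, chosen by hash."""
    buckets = [[] for _ in range(num_reducers)]
    for partial in partials:
        for key, subtotal in partial.items():
            buckets[hash(key) % num_reducers].append((key, subtotal))
    return buckets


def group_by_sum(partitions, num_reducers=2):
    """GROUP BY, step 3: merge the partial sums that arrived at each reducer."""
    buckets = shuffle_by_key([partial_sum(p) for p in partitions], num_reducers)
    totals = {}
    for bucket in buckets:
        for key, subtotal in bucket:
            totals[key] = totals.get(key, 0.0) + subtotal
    return totals


def order_by(partitions, boundaries):
    """ORDER BY: route values into ranges, then sort inside each range."""
    ranges = [[] for _ in range(len(boundaries) + 1)]
    for part in partitions:
        for value in part:
            ranges[bisect.bisect_right(boundaries, value)].append(value)
    return [sorted(r) for r in ranges]
```

Reading the output ranges of order_by in sequence gives a globally sorted result, which is why a range shuffle followed by a local sort is enough.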

IV. EMR Studio Practices

EMR Studio Features

Compatible with open-source components
Supports connecting to multiple clusters
Adapts to multiple computing engines
Interactive development + seamless connection of job scheduling
Applicable to a variety of big data application scenarios
Separated computing and storage

1. Compatible with Open-Source Components

EMR Studio builds on the open-source software Apache Zeppelin, Jupyter Notebook, and Apache Airflow, and adds optimizations and enhancements to them.

2. Supports Connecting to Multiple Clusters

One EMR Studio can connect to multiple EMR computing clusters. You can switch computing clusters and submit jobs to different computing clusters easily.

3. Adapts to Multiple Computing Engines

Adapts to multiple computing engines automatically, such as Hive, Spark, Flink, Presto, Impala, and Shell, without complex configuration
Multiple computing engines work together

4. Interactive Development + Seamless Connection of Job Scheduling

Notebook + Airflow: Connects development and production scheduling seamlessly

The interactive development mode can verify the correctness of the job quickly.
Scheduling notebook jobs in Airflow keeps the development and production environments as consistent as possible, which prevents problems caused by differences between the two.

5. Applicable to a Variety of Big Data Application Scenarios

Big Data Processing: ETL
Interactive Data Analysis
Machine Learning: AI
Real-Time Computing

6. Separated Computing and Storage

1. All data is stored on OSS, including:

User Notebook Code
Scheduling Job Logs

2. Even if a cluster is destroyed, it can be rebuilt and its data restored easily.

EMR Studio Demo

E-MapReduce: https://www.alibabacloud.com/product/emapreduce

EMR Studio (Beta): https://help.aliyun.com/document_detail/208107.html (Article in Chinese)

For a detailed product introduction and demonstration, watch the recording: https://developer.aliyun.com/live/247072 (Video in Chinese)
