This article summarizes a lecture on best practices for big data processing with Spark.
Lecturer: Jianfeng, Head of Alibaba Cloud EMR Platform
- Big Data Overview
- How to Move Beyond Being a Technical Novice
- Spark SQL Learning Framework
- Best Practices for Big Data on EMR Studio
I. Big Data Overview
- Big Data Processing ETL (Data → Data)
- Big Data Analysis BI (Data → Dashboard)
- Machine Learning Platform for AI (Data → Model)
II. How to Move Beyond Being a Technical Novice
What Is a Technical Novice?
- A technical novice understands only surface-level information, not the underlying ideas.
For example, a novice reuses other people's Spark code without understanding Spark's internal mechanisms or knowing how to tune Spark jobs.
Moving Beyond Being a Technical Novice
- Understand the operating mechanism
- Learn to configure the Spark App
- Learn to read the log
Understanding the Operating Mechanism: Spark SQL Architecture
Learn How to Configure the Spark App
1. Configure Driver
2. Configure Executor
3. Configure Runtime
4. Configure DAE
Learn more: https://spark.apache.org/docs/latest/configuration.html
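The four configuration areas above map onto Spark configuration properties. The following is a minimal sketch, not the lecturer's setup: it assembles a `spark-submit` command line from a dictionary of settings. The property names are documented Spark configuration keys, but every value, the helper name `build_submit_command`, and the script name `etl_job.py` are illustrative assumptions. It also assumes that "DAE" refers to dynamic allocation of executors, which the source does not spell out.

```python
# Sketch: assemble a spark-submit command from grouped settings.
# All values are placeholders; tune them for the actual cluster and workload.

SETTINGS = {
    # 1. Driver
    "spark.driver.memory": "4g",
    "spark.driver.cores": "2",
    # 2. Executor
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    # 3. Runtime (here: number of shuffle partitions used by Spark SQL)
    "spark.sql.shuffle.partitions": "200",
    # 4. Dynamic allocation of executors (assumed meaning of "DAE")
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
}


def build_submit_command(app_path, settings):
    """Return the argument list for a spark-submit call (hypothetical helper)."""
    command = ["spark-submit"]
    for key, value in settings.items():
        command += ["--conf", f"{key}={value}"]
    command.append(app_path)
    return command


if __name__ == "__main__":
    # etl_job.py is a hypothetical application script.
    print(" ".join(build_submit_command("etl_job.py", SETTINGS)))
```

Keeping the settings in one dictionary makes it easy to see which knob belongs to which of the four areas, and to change one value at a time while tuning.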
Learn to Read the Log: Spark Log
III. Spark SQL Learning Framework
Spark SQL Learning Framework (Combined with Graphics/Geometry)
1. Select Rows
2. Select Columns
3. Transform Columns
4. Group By / Aggregation
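The four operations above can all appear in a single SQL statement. The sketch below is only an illustration: the `orders` table, its columns, and the sample rows are hypothetical, and Python's built-in `sqlite3` module stands in for Spark so that the example runs without a cluster. The query uses only standard SQL that Spark SQL also accepts.

```python
import sqlite3

# Hypothetical sample data: (category, price, quantity).
ROWS = [
    ("book", 10.0, 2),
    ("book", 5.0, 0),
    ("toy", 4.0, 3),
    ("toy", 6.0, 1),
]

QUERY = """
SELECT category,                        -- 2. select columns
       SUM(price * quantity) AS revenue -- 3. transform columns, 4. aggregate
FROM orders
WHERE quantity > 0                      -- 1. select rows
GROUP BY category                       -- 4. group by
ORDER BY category
"""


def run_query(rows):
    """Load rows into an in-memory table and run QUERY against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (category TEXT, price REAL, quantity INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for category, revenue in run_query(ROWS):
        print(category, revenue)
```

With a SparkSession available and `orders` registered as a temporary view, the same query text could be passed to `spark.sql(QUERY)` instead.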
Spark SQL Execution Plan
1. Spark SQL – Where
2. Spark SQL – Group By
3. Spark SQL – Order By
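As a rough intuition for how these three plans differ: a WHERE clause is a filter that runs inside each partition, while GROUP BY and ORDER BY normally need an exchange (shuffle) to bring related rows together, with hash partitioning by key for aggregation and range partitioning for a global sort. The sketch below imitates that behavior on plain Python lists. It is a toy model with hypothetical function names and data, not Spark's implementation; in particular, Spark picks range boundaries by sampling, whereas here the boundary is passed in by hand.

```python
import zlib
from collections import defaultdict

# Two input partitions of (key, value) rows; hypothetical data.
PARTITIONS = [[("a", 1), ("b", 5), ("a", 3)], [("b", 2), ("c", 4), ("a", 6)]]


def where(partitions, predicate):
    """WHERE: filter each partition independently; no rows move."""
    return [[row for row in part if predicate(row)] for part in partitions]


def group_by_sum(partitions, num_out=2):
    """GROUP BY: partial sums per partition, hash exchange, then final sums."""
    partials = []
    for part in partitions:
        sums = defaultdict(int)
        for key, value in part:
            sums[key] += value
        partials.append(sums)
    # Exchange: route each key to exactly one output partition by its hash.
    shuffled = [defaultdict(int) for _ in range(num_out)]
    for sums in partials:
        for key, value in sums.items():
            target = zlib.crc32(key.encode()) % num_out
            shuffled[target][key] += value
    return [dict(s) for s in shuffled]


def order_by_value(partitions, boundary):
    """ORDER BY: range exchange on the sort value, then sort each range."""
    low, high = [], []
    for part in partitions:
        for row in part:
            (low if row[1] <= boundary else high).append(row)
    return [sorted(low, key=lambda r: r[1]), sorted(high, key=lambda r: r[1])]


if __name__ == "__main__":
    print(where(PARTITIONS, lambda row: row[1] > 2))
    print(group_by_sum(PARTITIONS))
    print(order_by_value(PARTITIONS, boundary=3))
```

In a real session, prefixing a query with `EXPLAIN` shows the actual physical plan, which is the authoritative view of what Spark will execute.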
IV. EMR Studio Practices
EMR Studio Features
- Compatible with open-source components
- Supports connecting to multiple clusters
- Adapts to multiple computing engines
- Interactive development + seamless connection of job scheduling
- Applicable to a variety of big data application scenarios
- Separated computing and storage
1. Compatible with Open-Source Components
- EMR Studio builds on the open-source projects Apache Zeppelin, Jupyter Notebook, and Apache Airflow, with additional optimizations and enhancements.
2. Supports Connecting to Multiple Clusters
- One EMR Studio can connect to multiple EMR computing clusters. You can switch computing clusters and submit jobs to different computing clusters easily.
3. Adapts to Multiple Computing Engines
- Automatically adapts to multiple computing engines, such as Hive, Spark, Flink, Presto, Impala, and Shell, without complex configuration.
- Multiple computing engines work together
4. Interactive Development + Seamless Connection of Job Scheduling
Notebook + Airflow: Connects development and production scheduling seamlessly
- The interactive development mode can verify the correctness of the job quickly.
- Scheduling notebook jobs in Airflow keeps the development and production environments as consistent as possible, which prevents problems caused by environment differences between the two phases.
5. Applicable to a Variety of Big Data Application Scenarios
6. Separated Computing and Storage
1. All data is stored on OSS, including:
- User Notebook Code
- Scheduling Job Logs
2. Even if a cluster is destroyed, it can be rebuilt and its data restored easily.
EMR Studio Demo
EMR Studio (Beta): https://help.aliyun.com/document_detail/208107.html (Article in Chinese)
For a detailed product introduction and demonstration, watch the recording: https://developer.aliyun.com/live/247072 (Video in Chinese)