Best Practices for Spark Big Data Processing
Open source big data community: Alibaba Cloud EMR livestream series, episode 11
Subject: Spark Big Data Processing Best Practices
Lecturer: Jian Feng, head of Alibaba Cloud EMR data development platform
Content framework:
• Big data overview
• How to stop being a technical novice
• Spark SQL learning framework
• Big data best practices on EMR Studio
Live replay: scan the QR code at the bottom of the article to join the DingTalk group and watch the replay, or visit https://developer.aliyun.com/live/247072
1. Big data overview
• Big data processing ETL (Data → Data)
• Big data analysis BI (Data → Dashboard)
• Machine learning AI (Data → Model)
2. How to stop being a technical novice
What is a technical novice?
• Someone who understands only the surface, not the essence
For example, they can copy other people's Spark code, but they do not understand Spark's internal mechanisms or know how to tune a Spark job.
The prescription for moving beyond the novice stage:
• Understand the runtime mechanism
• Learn to configure
• Learn to read logs
Understand the runtime mechanism: the Spark SQL architecture
Learn to configure: how to configure a Spark application
• Configure Driver
• spark.driver.memory
• spark.driver.cores
• Configure Executor
• spark.executor.memory
• spark.executor.cores
• Configure Runtime
• spark.files
• spark.jars
• Configure DAEs
• …
Reference: https://spark.apache.org/docs/latest/configuration.html
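As a sketch, the settings above can be passed on the command line with spark-submit. The values, file paths, and application class below are illustrative placeholders, not recommendations; tune them to your cluster and workload.

```shell
# Illustrative spark-submit invocation showing the driver, executor,
# and runtime settings mentioned above (all values are examples).
spark-submit \
  --conf spark.driver.memory=4g \
  --conf spark.driver.cores=2 \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=4 \
  --conf spark.files=/path/to/lookup.csv \
  --conf spark.jars=/path/to/udfs.jar \
  --class com.example.MyApp \
  my-app.jar
```

The same keys can also be set in spark-defaults.conf or programmatically on a SparkConf; the official configuration page linked above documents each property's default and precedence.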
Learn to read logs: Spark logs
3. Spark SQL learning framework
The Spark SQL learning framework (explained with the help of geometric diagrams):
1. Select Rows
2. Select Columns
3. Transform Column
4. Group By / Aggregation
5. Join
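The five operations above map directly onto SQL clauses. A minimal sketch, using hypothetical orders and customers tables (all table and column names are illustrative):

```sql
-- 1. Select rows:      WHERE filters rows
-- 2. Select columns:   the SELECT list projects columns
-- 3. Transform column: expressions/functions applied to columns
-- 4. Group by:         GROUP BY with aggregate functions
-- 5. Join:             JOIN ... ON
SELECT c.region,
       ROUND(SUM(o.amount), 2) AS total_amount   -- transform + aggregate
FROM orders o
JOIN customers c
  ON o.customer_id = c.id                        -- join
WHERE o.status = 'PAID'                          -- select rows
GROUP BY c.region;                               -- group by / aggregation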
Spark SQL execution plans
1. Spark SQL - Where
2. Spark SQL - Group By
3. Spark SQL - Order By
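To study how these clauses surface in the physical plan, prefix a query with EXPLAIN. In Spark's physical plans a WHERE typically appears as a Filter operator, a GROUP BY as HashAggregate stages around an Exchange (shuffle), and an ORDER BY as a Sort preceded by a range-partitioning Exchange. A hedged example, with an illustrative table:

```sql
-- Inspect the physical plan (table and columns are hypothetical)
EXPLAIN
SELECT status, COUNT(*) AS cnt
FROM orders
WHERE amount > 100        -- expect a Filter operator
GROUP BY status           -- expect HashAggregate + Exchange (shuffle)
ORDER BY cnt DESC;        -- expect Sort + range-partitioning Exchange
```

Reading plans this way connects each SQL clause to the work Spark actually performs, which is the foundation for tuning.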
4. EMR Studio practice
EMR Studio features:
• Compatible with open source components
• Support connecting multiple clusters
• Adapt to multiple computing engines
• Interactive development with seamless job-scheduling integration
• Applicable to a variety of big data application scenarios
• Separation of compute and storage
1. Compatible with open source components
• EMR Studio builds on the open source projects Apache Zeppelin, Jupyter Notebook, and Apache Airflow, adding optimizations and enhancements on top of them.
2. Support connecting multiple clusters
• One EMR Studio instance can connect to multiple EMR compute clusters, so you can easily switch between clusters and submit jobs to different clusters for execution.
3. Adapt to multiple computing engines
• EMR Studio automatically adapts to multiple compute engines, such as Hive, Spark, Flink, Presto, and Impala, as well as Shell jobs, without complex configuration, and lets these engines work together.
4. Interactive development with seamless job-scheduling integration
Notebook + Airflow: a seamless path from development to production scheduling
• Interactive development lets you quickly verify that a job runs correctly
• Scheduling the same notebook jobs in Airflow keeps the development and production environments as consistent as possible, preventing problems caused by environment drift.
5. Applicable to a variety of big data application scenarios
• Big data processing ETL
• Interactive data analysis
• Machine learning
• Real-time computing
6. Separation of compute and storage
• All data is stored on OSS, including:
• User notebook code
• Scheduling job logs
• Even if a cluster is destroyed, it can be rebuilt and the data recovered easily.