Best Practices for Spark Big Data Processing

Open Source Big Data Community: Alibaba Cloud EMR live broadcast series, session 11

Subject: Spark Big Data Processing Best Practices

Lecturer: Jian Feng, head of Alibaba Cloud EMR data development platform

Content framework:

• Big data overview

• How to stop being a technical novice ("Xiaobai")

• Spark SQL learning framework

• Big data best practices on EMR Studio


1、 Big data overview

• Big data processing (ETL): Data → Data

• Big data analysis (BI): Data → Dashboard

• Machine learning (AI): Data → Model

2、 How to stop being a technical novice ("Xiaobai")

What is a technical novice?

• Understands only the surface, not the underlying principles

For example, a novice can copy other people's Spark code but does not know how Spark works internally or how to tune a Spark job.

How to move past the novice stage

• Understand how Spark runs

• Learn to configure it

• Learn to read the logs

Understand how Spark runs: Spark SQL architecture

Learn to configure: how to configure a Spark application

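• Configure Driver

  • spark.driver.memory: memory allocated to the driver process

  • spark.driver.cores: number of cores used by the driver process

• Configure Executor

  • spark.executor.memory: memory allocated to each executor process

  • spark.executor.cores: number of cores used by each executor

• Configure Runtime

  • spark.files: comma-separated list of files placed in the working directory of each executor

  • spark.jars: comma-separated list of jars added to the driver and executor classpaths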

• Configure DAEs

• …..........
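
One way to apply settings like these is to pass each property to spark-submit with a --conf flag. The sketch below assembles such a command line in Python. The values (4g, 2, and so on), the file and jar names, and the script name etl_job.py are placeholders invented for illustration, not recommendations from the talk.

```python
import shlex

# Illustrative values only; size these to your own cluster and workload.
spark_conf = {
    "spark.driver.memory": "4g",
    "spark.driver.cores": "2",
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.files": "lookup.csv",  # hypothetical file shipped to each executor
    "spark.jars": "udfs.jar",     # hypothetical jar added to the classpath
}


def build_spark_submit(app, conf):
    """Return a spark-submit argument list with one --conf flag per property."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # the application script goes last
    return args


command = build_spark_submit("etl_job.py", spark_conf)
print(shlex.join(command))
```

Keeping the properties in one dictionary makes it easy to review them together and to reuse the same settings across jobs.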

Reference website:

Learn to read the logs: Spark logs

3、 Spark SQL Learning Framework

Spark SQL learning framework (illustrated with diagrams and shapes)

1. Select Rows

2. Select Columns

3. Transform Column

4. Group By / Aggregation

5. Join
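
The five operations above map directly onto SQL clauses. The sketch below runs one query per operation using Python's built-in sqlite3 module, which is only a stand-in here so that the example runs without a cluster. The users and orders tables and their rows are invented for illustration. The statements are standard SQL, so the same text should also work when passed to spark.sql(...) once the tables are registered as views.

```python
import sqlite3

# In-memory database standing in for a Spark session (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 30.0), (11, 1, 5.0), (12, 2, 20.0);
""")

queries = {
    # 1. Select rows: WHERE keeps only the rows that match a condition.
    "select_rows": "SELECT * FROM orders WHERE amount >= 20 ORDER BY order_id",
    # 2. Select columns: the SELECT list keeps only the named columns.
    "select_columns": "SELECT order_id, amount FROM orders ORDER BY order_id",
    # 3. Transform column: an expression derives a new column from an old one.
    "transform_column": "SELECT order_id, amount * 2 AS doubled "
                        "FROM orders ORDER BY order_id",
    # 4. Group by / aggregation: one output row per group.
    "group_by": "SELECT user_id, SUM(amount) AS total "
                "FROM orders GROUP BY user_id ORDER BY user_id",
    # 5. Join: combine rows from two tables on a shared key.
    "join": "SELECT u.name, o.amount FROM orders o "
            "JOIN users u ON o.user_id = u.user_id ORDER BY o.order_id",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```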

Spark SQL Execution Plan

1. Spark SQL - Where

2. Spark SQL - Group By

3. Spark SQL - Order By

4、 EMR Studio Practice

EMR Studio features:

• Compatible with open source components

• Support connecting multiple clusters

• Adapt to multiple computing engines

• Interactive development + seamless handoff to job scheduling

• Applicable to a variety of big data application scenarios

• Separation of compute and storage

1. Compatible with open source components

• EMR Studio is built on the open source projects Apache Zeppelin, Jupyter Notebook, and Apache Airflow, with optimizations and enhancements added on top.

2. Support connecting multiple clusters

• One EMR Studio instance can connect to multiple EMR compute clusters. You can switch between clusters easily and submit jobs to whichever cluster should run them.

3. Adapt to multiple computing engines

• Automatically works with multiple compute engines, such as Hive, Spark, Flink, Presto, Impala, and Shell, without complex configuration, and lets those engines work together.

4. Interactive development + seamless handoff to job scheduling

Notebook + Airflow: a seamless path from development to production scheduling

• Interactive development lets you quickly verify that a job is correct.

• Scheduling the same Notebook jobs in Airflow keeps the development and production environments as consistent as possible, which prevents problems caused by differences between the two.

5. Applicable to a variety of big data application scenarios

• Big data processing ETL

• Interactive data analysis

• Machine learning

• Real-time computing

6. Separation of compute and storage

• All data is stored on OSS (Object Storage Service), including:

  • User Notebook code

  • Scheduling job logs

• Even if the cluster is destroyed, it can be rebuilt and the data recovered easily.
