Best Practices for Big Data Processing in Spark

This article is an overview of the best practices for big data processing in Spark taken from a lecture.

Lecturer: Jianfeng, Head of Alibaba Cloud EMR Platform

Content Framework:

  • Big Data Overview
  • How to Move Beyond the Technical Novice Stage
  • Spark SQL Learning Framework
  • Best Practices for Big Data on EMR Studio

I. Big Data Overview

  • Big Data Processing ETL (Data → Data)
  • Big Data Analysis BI (Data → Dashboard)
  • Machine Learning Platform for AI (Data → Model)


II. How to Move Beyond the Technical Novice Stage

What Is a Technical Novice?

  • A technical novice understands only the surface of a technology, not the ideas behind it.

For example, you copy other people's Spark code without understanding Spark's internal mechanisms, so you do not know how to tune Spark jobs.

Moving Beyond the Novice Stage

  • Understand the operating mechanism
  • Learn to configure the Spark App
  • Learn to read the log

Understanding the Operating Mechanism: Spark SQL Architecture


Learn How to Configure the Spark App

1.  Configure Driver

  • spark.driver.memory
  • spark.driver.cores

2.  Configure Executor

  • spark.executor.memory
  • spark.executor.cores

3.  Configure Runtime

  • spark.files
  • spark.jars

4.  Configure DAE

Learn more: https://spark.apache.org/docs/latest/configuration.html
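
As a minimal sketch of how the driver, executor, and runtime properties listed above can be passed to a job, the following Python snippet assembles a spark-submit command with one --conf flag per property. The memory and core values, the file and JAR paths, and the application name etl_job.py are hypothetical placeholders, not recommendations from the lecture.

```python
import shlex

# Hypothetical values for illustration only; size them for your own cluster.
spark_conf = {
    "spark.driver.memory": "4g",
    "spark.driver.cores": "2",
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.files": "conf/app.properties",  # hypothetical file placed in each executor's working directory
    "spark.jars": "libs/udfs.jar",         # hypothetical JAR added to driver and executor classpaths
}


def build_spark_submit(app, conf):
    """Return a spark-submit argument list with one --conf flag per property."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


command = build_spark_submit("etl_job.py", spark_conf)
print(shlex.join(command))
```

Passing the properties at submit time, rather than setting them inside the application, matters for settings such as spark.driver.memory: in client mode the driver JVM has already started by the time application code runs.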

Learn to Read the Log: Spark Log


III. Spark SQL Learning Framework

Spark SQL Learning Framework (Combined with Graphics/Geometry)

1. Select Rows


2. Select Columns


3. Transform Columns


4. Group By / Aggregation


5. Join
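
The five operations above can be sketched as five short SQL statements. So that the example runs without a cluster, Python's built-in sqlite3 module stands in for a SparkSession here; with Spark, each statement would be passed to spark.sql(...) instead. The orders and customers tables and all of their values are hypothetical.

```python
import sqlite3

# sqlite3 is a stand-in for Spark so the sketch runs anywhere.
# The tables and rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, status TEXT);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10, 20.0, 'paid'), (2, 10, 35.0, 'paid'),
                              (3, 11, 15.0, 'cancelled'), (4, 12, 50.0, 'paid');
    INSERT INTO customers VALUES (10, 'Ann'), (11, 'Bob'), (12, 'Cy');
""")

queries = {
    # 1. Select rows: keep only the rows that match a predicate.
    "select_rows": "SELECT * FROM orders WHERE status = 'paid' ORDER BY order_id",
    # 2. Select columns: keep only some columns.
    "select_columns": "SELECT order_id, amount FROM orders ORDER BY order_id",
    # 3. Transform columns: derive new columns from existing ones.
    "transform_columns": "SELECT order_id, amount * 2 AS doubled, UPPER(status) AS status_uc "
                         "FROM orders ORDER BY order_id",
    # 4. Group by / aggregation: collapse rows that share a key.
    "group_by": "SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total "
                "FROM orders GROUP BY customer_id ORDER BY customer_id",
    # 5. Join: combine two tables on a shared key.
    "join": "SELECT c.name, o.amount FROM orders o "
            "JOIN customers c ON o.customer_id = c.customer_id ORDER BY o.order_id",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

Seen as shapes, the first two operations cut the table horizontally and vertically, the third reshapes columns in place, the fourth collapses rows, and the fifth places two tables side by side.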


Spark SQL Execution Plan

1. Spark SQL – Where


2. Spark SQL – Group By


3. Spark SQL – Order By
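
One way to inspect the plans for these three query shapes yourself is Spark SQL's EXPLAIN statement. The sketch below only builds the statements as strings; running them needs a live SparkSession, which is assumed and not created here. The orders table and its columns are hypothetical.

```python
# Hypothetical table and column names, one query per plan discussed above.
queries = {
    "where": "SELECT * FROM orders WHERE amount > 30",
    "group_by": "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id",
    "order_by": "SELECT * FROM orders ORDER BY amount DESC",
}

# Keywords accepted by Spark SQL's EXPLAIN statement (empty string = plain EXPLAIN).
EXPLAIN_MODES = {"", "EXTENDED", "CODEGEN", "COST", "FORMATTED"}


def explain(sql, mode="FORMATTED"):
    """Prefix a query with EXPLAIN and an optional mode keyword."""
    if mode not in EXPLAIN_MODES:
        raise ValueError(f"unsupported EXPLAIN mode: {mode!r}")
    return " ".join(part for part in ("EXPLAIN", mode, sql) if part)


explain_statements = {name: explain(sql) for name, sql in queries.items()}
for name, statement in explain_statements.items():
    print(f"{name}: {statement}")
    # With a live session (assumed): spark.sql(statement).show(truncate=False)
```

When reading the output, a WHERE clause typically appears as a Filter step, while GROUP BY and ORDER BY typically add an Exchange step, which marks a shuffle between stages.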


IV. EMR Studio Practices

EMR Studio Features

  • Compatible with open-source components
  • Supports connecting to multiple clusters
  • Adapts to multiple computing engines
  • Interactive development + seamless connection of job scheduling
  • Applicable to a variety of big data application scenarios
  • Separated computing and storage

1. Compatible with Open-Source Components

  • EMR Studio is optimized and enhanced based on the open-source software Apache Zeppelin, Jupyter Notebook, and Apache Airflow.


2. Supports Connecting to Multiple Clusters

  • One EMR Studio can connect to multiple EMR computing clusters. You can switch computing clusters and submit jobs to different computing clusters easily.


3. Adapts to Multiple Computing Engines

  • Adapts automatically to multiple computing engines, such as Hive, Spark, Flink, Presto, Impala, and Shell, without complex configuration.
  • Multiple computing engines can work together.


4. Interactive Development + Seamless Connection of Job Scheduling

Notebook + Airflow: Connects development and production scheduling seamlessly

  • The interactive development mode can verify the correctness of the job quickly.
  • Scheduling notebook jobs in Airflow keeps the development and production environments as consistent as possible, which prevents problems caused by differences between the two.


5. Applicable to a Variety of Big Data Application Scenarios

6. Separated Computing and Storage

1.  All data is stored on OSS, including:

  • User Notebook Code
  • Scheduling Job Logs

2.  Even if a cluster is destroyed, it can be rebuilt and its data restored easily.


EMR Studio Demonstration

E-MapReduce: https://www.alibabacloud.com/product/emapreduce

EMR Studio (Beta): https://help.aliyun.com/document_detail/208107.html (Article in Chinese)

For a detailed product introduction and demonstration, watch the recording: https://developer.aliyun.com/live/247072 (Video in Chinese)

Alibaba EMR
