Diving into Big Data: Visual Stories through Zeppelin

By Priyankaa Arunachalam, Alibaba Cloud Community Blog author.

The main aim of Big Data is to analyze data and reveal the business insights hidden in it, so that the business can be transformed accordingly. Big data analytics combines various tools and technologies. Used together, these technologies help companies examine large quantities of data to uncover hidden patterns and arrive at useful findings. In practical terms, by gathering and analyzing data about customers and transactions, organizations can make better-informed judgements about how they run their business.

Visual Stories Using Zeppelin

As we continue to walk through the big data cycle, it's time to collect and make full use of business intelligence with various BI tools. Having transformed data into information using Spark, we are now ready to convert this information into insights, and visualisation tools can help with that. In this article, we will focus on building visual stories using Apache Zeppelin, an open-source notebook that covers needs including:

Data Ingestion
Data Discovery
Data Analytics
Data Visualization & Collaboration

Apache Zeppelin is a web-based, multi-purpose notebook that comes built in with the set of services provided by Alibaba Cloud E-MapReduce. This makes it easy to process and visualise large datasets with familiar code, without moving to a separate set of Business Intelligence (BI) tools.

Zeppelin makes interactive data analytics available to a wide range of users. It brings data exploration and visualization, as well as sharing and collaboration features, to Spark. It supports Python along with a growing list of language backends such as Scala, Hive, SparkSQL and Markdown. It has built-in visualizations such as graphs and tables, which can be used to present newly found data in a more appealing format. In this article, we will access the Zeppelin interface and explore different ways in which we can present data.

Accessing Zeppelin

Alibaba Cloud E-MapReduce supports Apache Zeppelin. Once the cluster is created, we can see the set of services that have been started and are running. To access Zeppelin, the first step is to open its port. When an E-MapReduce cluster is created, port 22 is accessible by default. This is the default port of Secure Shell (SSH), which we used to access the cluster in our previous articles. We can create security groups and assign ECS instances to them. The best approach is to divide ECS instances by function and put them into different security groups, since each security group has its own access control rules.

Linking Security Group to a Cluster

To link a security group with a cluster that has already been created, follow the steps below:

Log on to the Alibaba Cloud ECS console and navigate to Security Group on the left to find the security groups.
Click Manage Instances in a security group and view the names of the ECS instances that were created when the E-MapReduce cluster was created.
Select the instances that you want to move, click Move to security group, and then select the security group to move the E-MapReduce cluster to.

Add an E-MapReduce cluster to the existing Security Group

Log on to the Alibaba Cloud E-MapReduce console and navigate to the Cluster Management tab at the top of the page. On the cluster overview page, you can see the Security Group ID under Networks.

Click the Security Group ID quick link. On the page that opens, click Instances in Security Group in the left-side menu. This shows the security group names of all the ECS instances that have been created. If needed, we can also add the required instances to the security group as described above.

Navigate to the Security Groups tab and click Add to Security Group.

In the wizard that follows, select the security group to which the cluster should be linked and click OK.

Now let's add security rules to the group. Navigate to the Security Group tab, where the rules that have already been created are displayed. As mentioned earlier, port 22 is accessible by default. To access Zeppelin, we have to open port 8080. Let's add a new rule by clicking Add Security Group Rule.

Clicking Add Security Group Rule opens a dialog. Specify the required details and the port range.

The new rule is added, and the port is now open. We are all set to access Zeppelin. There are two ways to do so.

One way is to open Zeppelin directly in the browser. To do this, enter the cluster's public IP followed by :8080 in the browser's address bar.

For example, 47.88.171.243:8080 leads to the Zeppelin homepage.
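Before opening the browser, it can help to confirm that the new security group rule really exposes port 8080. The following is a minimal sketch using only the Python standard library. The helper names zeppelin_url and is_port_open are hypothetical and are not part of any Alibaba Cloud or Zeppelin tooling, and the http scheme is an assumption.

```python
import socket


def zeppelin_url(public_ip, port=8080):
    """Build the browser address for Zeppelin (http scheme is an assumption)."""
    return f"http://{public_ip}:{port}"


def is_port_open(host, port=8080, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat all of these as "closed".
        return False


# Example usage with the sample address from the article (replace with your own IP):
# if is_port_open("47.88.171.243"):
#     print("Open", zeppelin_url("47.88.171.243"))
# else:
#     print("Port 8080 is not reachable; review the security group rule")
```

If the check fails even though the rule exists, the likely causes are that the rule was added to a different security group than the one the cluster's instances belong to, or that the authorized source range does not include your own IP address.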

Another way is to click the quick link provided in the E-MapReduce console. To do this, follow the steps below:

Under the Cluster Management tab, navigate to Connect Strings, where the quick links for the various services are listed.
To access Zeppelin, click the quick link next to Zeppelin.

This leads to the Zeppelin homepage. The homepage lists the notebooks that have already been created, and it also offers options to import a note from local disk or from a remote location by providing its web address.

Once Zeppelin is open, you are ready to create notebooks and build visual stories in the languages you are comfortable with. Let's create a new notebook.

Create a New Note

Creating a new note opens a prompt. Give a name to the notebook you are creating. The default interpreter here is spark, and you can choose another interpreter type from the list of options, such as python or hive.

Let's choose spark as the interpreter, since we processed data using Spark in our previous article, Drilling into Big Data-Data Preparation. This is an easy place to start, as we can carry on with the same code that we used in the Spark shell.

Start with %spark.pyspark, which selects the PySpark interpreter and provides a Python environment. Then read the file as a DataFrame and create a view using Spark. Once the code is complete, click the Run icon on the right, which runs the code and moves on to the next step. If the code succeeds, the status will be "Finished" and the corresponding output will be displayed. If the code fails, the corresponding error message will be displayed.
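As a rough sketch of what such a paragraph could contain, the snippet below reads a file as a DataFrame and registers it as a view. The article does not state the file format, path, or view name, so the CSV format, the path /data/museums.csv, and the view name museums are all assumptions made for illustration. In Zeppelin the paragraph would begin with the %spark.pyspark line, and the interpreter supplies the spark session object. Here it is passed in as a parameter so the helper can also be read on its own.

```python
# In Zeppelin, the first line of the paragraph would be: %spark.pyspark
# The interpreter then provides a ready-made SparkSession named `spark`.

MUSEUMS_PATH = "/data/museums.csv"  # hypothetical file location
MUSEUMS_VIEW = "museums"            # hypothetical view name


def load_as_view(spark, path=MUSEUMS_PATH, view=MUSEUMS_VIEW):
    """Read a CSV file as a DataFrame and register it as a temporary view."""
    df = (
        spark.read
        .option("header", "true")       # first row holds the column names
        .option("inferSchema", "true")  # let Spark infer the column types
        .csv(path)
    )
    df.createOrReplaceTempView(view)    # makes the data queryable by name
    return df


# Inside the Zeppelin paragraph you would then run, for example:
# df = load_as_view(spark)
# df.printSchema()
```

Registering the DataFrame as a temporary view is what lets later paragraphs query it by name, regardless of which Spark language backend they use.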

Visualize DataFrame/Dataset

There are two approaches to visualizing a DataFrame or Dataset in Zeppelin:

The SparkSQL interpreter, by using %spark.sql
The Zeppelin context, by using z.show

To use SparkSQL, type %spark.sql and query the view that we created. As soon as you run the query, the available chart types are displayed below the code. Here we view the Top Museums by Family Count in table format.

Let's choose the pie chart from the chart types to make the visual more engaging. From this chart, the end user can identify the top museums at a glance.
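The second approach, z.show, can be used from a %spark.pyspark paragraph to get the same table and chart toolbar. The sketch below is illustrative only: the column names museum_name and family_count and the view name museums are assumptions, since the article does not show the dataset's schema. The spark and z objects are the ones Zeppelin provides inside the paragraph, passed in here as parameters.

```python
def top_museums_query(view="museums", limit=10):
    """Build a query for the museums with the highest family count.

    The column names are hypothetical; adjust them to the real schema.
    """
    return (
        "SELECT museum_name, family_count "
        f"FROM {view} "
        "ORDER BY family_count DESC "
        f"LIMIT {int(limit)}"
    )


def show_top_museums(spark, z, view="museums", limit=10):
    """Run the query and hand the result to Zeppelin's display system."""
    result = spark.sql(top_museums_query(view, limit))
    z.show(result)  # renders the result with the chart-type toolbar
    return result


# Inside a %spark.pyspark paragraph you would call:
# show_top_museums(spark, z)
```

The SQL text built by top_museums_query could equally be typed directly into a %spark.sql paragraph. Going through z.show is mainly useful when the DataFrame is produced by Python code rather than by a plain query.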

Similarly, we can use Pig, Hive, Python, and so on as language backends, along with a variety of charts, to make big data analytics more appealing. In the top-right corner of each paragraph there are icons to run the code, hide and show the output, and so on.

Each paragraph in a Zeppelin notebook can be configured as the user wishes by clicking the gear icon. This offers options such as moving the paragraph up or down, giving the paragraph a suitable title, showing line numbers, and exporting the current paragraph as an iframe. This way you can keep a clean notebook that can be saved for future reference.

Interpreters

The Apache Zeppelin interpreter concept allows any data-processing backend to be plugged into Zeppelin. For example, to use Python code in Zeppelin, you need a Python interpreter. Every interpreter belongs to an InterpreterGroup, and interpreters in the same group can reference each other. For example, SparkSqlInterpreter can refer to SparkInterpreter and get the SparkContext from it, as they are in the same group. Besides Spark, there are various interpreters such as Python, BigQuery, Hive, Pig, and so on, which let people work in the environment they are most comfortable with.
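The grouping is visible in the directive at the top of each paragraph: %spark.pyspark and %spark.sql both name the spark group, followed by an interpreter within it. The toy model below only illustrates that naming convention. It is not Zeppelin's own parsing code, and the function names are hypothetical.

```python
def parse_directive(first_line):
    """Split a directive such as '%spark.pyspark' into (group, interpreter).

    Returns None when the line carries no directive. The interpreter part
    is None when only a group is named, as in '%md'.
    """
    tokens = first_line.strip().split()
    if not tokens or not tokens[0].startswith("%") or len(tokens[0]) == 1:
        return None
    group, _, interpreter = tokens[0][1:].partition(".")
    return group, (interpreter or None)


def same_group(directive_a, directive_b):
    """Return True if both directives name the same interpreter group."""
    a, b = parse_directive(directive_a), parse_directive(directive_b)
    return a is not None and b is not None and a[0] == b[0]


# Example usage:
# parse_directive("%spark.pyspark")
# same_group("%spark.pyspark", "%spark.sql")
```

Under this model, a view registered in a %spark.pyspark paragraph and a query in a %spark.sql paragraph belong to the same group, which matches the article's point that interpreters in one group can share the SparkContext.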

Create a New Interpreter

Let's go ahead and create a Pig interpreter. From the dropdown on the top left, choose Interpreter.

On the Interpreters page, click Create. Enter a name and choose the interpreter group. This automatically includes the libraries needed. The interpreter is now created, but it is not yet bound to the notebook in which we are working. That is where interpreter binding comes in.

Interpreter binding

To use an interpreter in a notebook, go to that notebook and click the Interpreter Binding icon. This shows the list of interpreters. Click to bind or unbind the required interpreter, and then click Save.

You can now use the newly created interpreter in the notebook.

Summary

From identifying and gathering big data from a variety of sources to visualising stories that uncover insights and hidden patterns, we have arrived at Big Data analytics. I hope that you enjoyed walking through this batch processing pipeline cycle in Big Data with its various tools and technologies. In future articles, we will focus on EMR cluster management and on Big Data with another set of Alibaba Cloud products.
