Community

Blog
×
Community Blog Deciphering Data to Uncover Hidden Insights – Data Wrangling and Ingestion

Deciphering Data to Uncover Hidden Insights – Data Wrangling and Ingestion

In this article series, we will be exploring data analytics for businesses using Alibaba Cloud QuickBI and sample data from banking and financial services.

By Ranjith Udayakumar, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud's incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

This multi-part article talks about how to collect data, wrangle the data, ingest the data, model the data, and visualize the data from three viewpoints (conceptual, practical, and best practice).

In the first article of this series, we have seen how to understand data conceptually through an example from the Banking, Financial services and Insurance (BFSI) domain.

In this article, we will learn how to wrangle the data (i.e. cleaning the data) according to your business scenario to Alibaba Cloud Quick BI. We may need Quick BI in the upcoming process of deciphering data so please ensure that you have registered for an Alibaba Cloud account. If you haven't, sign up for a free account through this link.

What Is Data Wrangling

Data wrangling, sometimes referred as data munging, is the process of transforming data from one format to other with the intent of making it more appropriate and valuable for analytics. With the rapid rise of Big Data and IoT applications, the number of data types and formats are increasing each day. This makes data wrangling an indispensable element for big data processing, especially for larger applications.

What Is Quick BI

Alibaba Cloud Quick BI is a flexible and lightweight business analytics platform built on cloud. The basic components of Quick BI are as follows:

  1. Data source
  2. Dataset
  3. Worksheet (Quick BI Basic)
  4. Workbook (Quick BI Pro and Professional edition)
  5. Dashboard
  6. Portal (Quick BI Pro and Professional edition)

Wrangle the Data (Conceptual)

Data wrangling includes:

  1. Data Cleaning
  2. Data Editing

Data Cleaning

Data cleansing or data cleaning is the process of detecting, analyzing, removing the inaccurate records from a data set, and replacing the inaccurate parts of the data by appropriate data. Data cleansing could be done with help of data wrangling tools, or through scripts.

For example, let's assume the dataset has some erroneous value in date column like "s29-05-2018" instead of "29-05-2018". To a person, this may seem like a trivial error, but to a system, this entry is unreadable. The system may not read this entry as a date but instead it will take it as a string.

Note: We can use Excel itself to clean the data but may we miss something. It's better to do it via scripts or tools. If the data source is Database, then we can do it in BI tool.

Data Editing and Preparing

Data Editing and Preparing is nothing but a manual process of changing the data, data types for the reporting purpose.

For example, let's assume dataset has date values in "MM-DD-YYYY" format but we need in "DD-MM-YYYY" format. This means we need to change it manually.

Note: Data Editing and Preparing is usually carried out in Excel itself or if the data source is Database then we can do it in BI tool.

Wrangle the Data (Practical)

As we discussed earlier, we are going to use Excel to wrangle the data. In addition to that, I also like to share the python script I wrote to wrangle the data automatically.

UseCase-1: ATM Analytics

Here, we will look at our first use case ATM Dataset. In this use case example we will see how to wrangle the data in Excel.

1

Our dataset is clean, but for demonstration we will make some part of this data erroneous or inaccurate first.

2

Now our data contains some erroneous value like:

  1. Transaction Date has some erroneous data, so it recognizes the columns as String instead of Date.
  2. No of Withdraws must contain whole number but it contains decimal number.
  3. Working Day contains one small h which makes it has new categorical value.
  4. Working Day, Holiday Sequence also contains erroneous data.

We need to handle this because if not handled properly it will lead to inaccurate results which in turn leads to false insights.

Let me show how to handle this in excel. Please have a look at the following table

Type Result
Number 1
Text 2
Logical Value 4
Error Value 16
Array Value 64

Using Type Function, we can easily find erroneous data. Please follow the following Steps:

  1. Insert a new column for validating the data

    3

  2. Write the formula "=TYPE(Range)"

    4

  3. Apply Filter to the column to validate

    5


From this we can easily find whether date column has erroneous value or not. Because it contains text, we can differentiate text vs number but we couldn`t able to find whether it has float instead of integer. We may also use other functions in this process say like ISNUMBER(), ISTEXT, Etc.,

We may not able to analyze each cell to find the incorrect data. So, it is essential to use any wrangling tool or scripts.

UseCase-2: Customer 360

Here we go our next use case Customer360.

We are going to use a Python Script to wrangle the data automatically. I will write a separate article in detail about this programmatic approach to "Decipher the data to uncover the hidden insights out of it".

Python Code

# =============================================================================
#                      #Importing the necessary modules
# =============================================================================

import pandas as pd
import numpy as np

# =============================================================================
#                    #Data Wrangling or Data munging
# =============================================================================

def datawrangler(data):
    
       
    data=data.drop_duplicates()
    data.rename(columns=lambda x: x.replace(' ', ''), inplace=True)
    
    # Getting Column Names
    columns=data.columns.values
    columns=pd.DataFrame(columns,columns=['ColumnFields'])
    
    # Getting Column Types
    columndatatype=pd.DataFrame(data.dtypes)
    columndatatype.index=range(len(columns))
    
    #Adding Column types with their respective columns
    columns['ColumnTypes']=columndatatype
      
    for x in range(len(columns)):
        
      if (columns['ColumnTypes'].iloc[x]) is np.dtype(np.int64) or (columns['ColumnTypes'].iloc[x]) is  np.dtype(np.float64) :
             
           data[columns['ColumnFields'].iloc[x]]=data[columns['ColumnFields'].iloc[x]].fillna(0)
      
      if (columns['ColumnTypes'].iloc[x]) is not np.dtype(np.int64):
             
           data[columns['ColumnFields'].iloc[x]]=data[columns['ColumnFields'].iloc[x]].fillna(method='ffill')
     
      if 'date' in (columns['ColumnFields'].apply(np.str.lower).iloc[x]):
           
          data[columns['ColumnFields'].iloc[x]]=data[columns['ColumnFields'].iloc[x]].apply(pd.to_datetime)

      if (columns['ColumnTypes'].iloc[x]) is np.dtype(np.float64):
          
          for y in range(len(data)):
             
              if (np.modf(data[columns['ColumnFields'].iloc[x]].iloc[y])[0]) in ['0.0',0.0] :     
       
                 data[columns['ColumnFields'].iloc[x]]= data[columns['ColumnFields'].iloc[x]].astype(np.dtype(np.int64))  
                 break
                       
      if (columns['ColumnTypes'].iloc[x]) is not np.dtype(np.int64) and (columns['ColumnTypes'].iloc[x]) is not np.dtype(np.float64) and ('date' not in (columns['ColumnFields'].apply(np.str.lower).iloc[x])):
          
          for y in range(len(data)):
              
              if ((data[columns['ColumnFields'].iloc[x]].iloc[y]).isnumeric()) :
                  
                  data[columns['ColumnFields'].iloc[x]]=data[columns['ColumnFields'].iloc[x]].astype(np.dtype(np.int64))
                  break
            
    
    return data
    
# =============================================================================
#                            #Getting the Data
# =============================================================================

location=input("\n\nEnter the File Location\n\n\t")
data=pd.read_csv(location,index_col=0)
savelocation=input("\n\nEnter the File Save Location\n\n\t")

# =============================================================================
#                            #Passing the Data to the Module
# =============================================================================

data=datawrangler(data)
data.to_csv(savelocation, index=False, header=None)

Wrangle the Data (Best Practices)


  1. It's always a best practice to check whether the data is clear or need to be wrangled.
  2. Check the data whether it has anomalies, and outliers.
  3. Convert the categorical values to lower or upper case and then convert it to sentence case to avoid repetition of same values
  4. Find null values and NaN values which leads to false results.
  5. Use Data Wrangler or Preparation tools to wrangle the data to avoid human error.

Ingest the Data (Conceptual)

Data ingestion is a critical success factor for analytics and business intelligence. We need to know about OLTP and OLAP.

OLTP is an Online Transaction Processing system. The focus of OLTP system is to record the current Update, Insertion and Deletion while transaction. The OLTP queries are simpler and short and hence require less time in processing and requires less space. A common example of an OLTP system is an ATM database, in which using short transactions we modify the status of our account

OLAP, on the other hand, is an Online Analytical Processing system. OLAP database stores historical data that has been inputted by OLTP. Using OLAP, you can extract information from a large database and analyze it for decision making. A classic example of an OLAP system is a data warehouse, which accumulates data from multiple OLTP data sources for querying and analyzing the data.

For both OLTP and OLAP, we may have files or database as data sources. Typical file formats include .CSV and .XLS. We also need to consider different types of databases for our application. Connectors will be available to easily connect the databases.

Ingest the Data (Practical)

Let's now see how to ingest data into Quick BI.

But first, you'll need to have Quick BI set up for this example. Follow these steps to apply for Quick BI Pro Trial. You'll need to have an Alibaba Cloud account. If you don't have one already, please sign up here.

  1. You'll need to have an Alibaba Cloud account. If you don't have one already, please sign up here. Log in to your Alibaba Cloud account.
  2. Enter your Console and navigate to the Quick BI product console.
  3. Select the Region and Select Quick BI Pro.

Your Quick BI instance has been created.

6

How to Ingest Files into Quick BI Pro

  1. Select Personal Workspace

    7

  2. Click the Data Source, Click Upload, Upload the file, Name the file, and Click Ok.

    8

  3. Data is successfully ingested into Quick BI.

    9


Note: Files are only supported in personal workspace.

How to Ingest Data from Database to Quick BI Pro


  1. Select Default Workspace

    10

  2. Click the Data Source, Click Create Data Source, Select the Data Source

    11

    Note: You can select either Cloud Databases or External Databases.

  3. Fill the Connection Details, Click Test Connection, and Save

    12

  4. Data source connection is established.

    13


Note: Please choose workspace other than the personal workspace so that you can able to use workbooks.

Ingest the Data (Best Practices)


  1. Always Ingest the Wrangled data if you are using files.
  2. Don't connect OLTP Database as source as doing so may degrade the performance of source systems.
  3. Connect the OLAP Databases or Use RDS Services offered by Alibaba Cloud Platform.
  4. Always use custom query to ingest the data instead of ingesting all tables.

If you followed the steps correctly, you should have successfully ingested data into Quick BI, which is a great milestone in the process of deciphering your data to uncover the hidden insights.

Please ensure that you have registered for an Alibaba Cloud account to enjoy a free trial of Quick BI Pro. In the next article of this multi-part tutorial, we will see how to model the data with Quick BI. See you all in next part of this article series.

0 0 0
Share on

Alibaba Clouder

701 posts | 106 followers

You may also like

Comments