By Ranjith Udayakumar, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud's incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.
"Where there is data smoke, there is business fire." — Thomas Redman
If you are reading this article, I am pretty sure that you are already familiar with cloud computing. Cloud computing has made a huge impact by giving businesses access to virtually unlimited computing resources anywhere, anytime. Over the last few months, we have seen industries and businesses moving towards "serverless computing". So, it is not surprising to see the footprints of serverless computing in Business Intelligence and Analytics (BIA) architectures.
Since serverless computing is a rather novel term, in this article, I will walk you through the concepts of serverless computing and its underlying benefits. I will then talk about Alibaba Cloud Data Lake Analytics (DLA), and discuss how efficient it is compared to traditional methods of analytics. We will then finish up with typical scenarios of DLA, using different use cases as examples.
This article is meant for everyone! This includes students and newcomers who just want to familiarize themselves with the general concepts of serverless computing and big data analytics, as well as professional data engineers and analysts who want to leverage serverless analytics to optimize cost and time.
This article covers what serverless computing is, why it matters, how serverless architecture is deepening its roots in Business Intelligence and Analytics (BIA), and how to leverage serverless analytics with the help of Alibaba Cloud Data Lake Analytics. We will analyze and visualize data from different data sources, such as Alibaba Cloud Object Storage Service (OSS), Table Store, and ApsaraDB for RDS, using Alibaba Cloud DLA and Alibaba Cloud Quick BI. To make effective use of this article, you need to activate at least OSS, DLA, and Quick BI.
"Focus on your application development, not on how to deploy and maintain the infrastructure"
Serverless computing doesn't mean that there is no server; it is a software development approach that aims to eliminate the need to manage servers on our own. In general, serverless computing is a cloud computing model that lets you build more and manage less by avoiding virtual resources that run for long periods of time.
In serverless computing, the code runs in stateless compute clusters that are ephemeral: the clusters are automatically provisioned and invoked for specific tasks, and the resources are released once those tasks complete. It all happens in a matter of seconds, which significantly optimizes resource usage and reduces cost.
For illustration, just imagine a machine that starts up to complete a task and stops automatically once the task is done. Serverless computing is often referred to as FaaS because the resources "just run for the function".
Alibaba Cloud provides Function as a Service (FaaS) in the form of Alibaba Cloud Function Compute, a fully managed, event-driven compute service. It allows you to focus on writing and uploading code without the need to manage infrastructure such as servers.
The above figure illustrates the key differences between serverless computing and the Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) cloud computing models.
Serverless computing provisions virtual resources in an instant for a function (a specific task), allowing you to run code more flexibly and reliably. This leads to the following benefits:
Unlike PaaS models, which run nonstop to serve requests, serverless computing is event-driven (it runs to complete a task or function): resources are allocated on the fly and invoked to serve specific tasks, so you only pay for the computing time you actually use (pay-per-execution).
With serverless approaches, we need not worry about provisioning and managing instances. Serverless applications scale autonomously with demand. There is no scaling or tuning to do, but the operations team still has to monitor the applications.
Due to its layers of abstraction, deployment in a serverless environment is less complex: deploy the code in the environment, and you are go-to-market ready.
In the pipeline, Business Intelligence and Analytics (BIA) architecture is divided into two important conceptual components for deriving business value from the data:
From this illustration, we can see that the analytics architecture is a concatenation of storage and transformation processes. The only difference in serverless computing is that some of these processes happen as stateless functions in the cloud.
"Data Lake refers to storage where we have data in its natural state."
Alibaba Cloud provides Data Lake Analytics, which does not require any ETL tools. It allows you to use standard SQL and business intelligence (BI) tools to efficiently analyze the data stored in the cloud at extremely low cost.
Benefits of Data Lake Analytics:
We are going to process, transform, analyze, and visualize the cold data stored in OSS using DLA and Quick BI, with an example use case of turning website log data into consumable insights.
The dataset we are going to analyze is NASA's Apache web server access log. Before getting into the analysis in detail, I would like to give an overview of what the Apache access.log is and why it is essential to analyze log data.
You can download the dataset from here.
What is Apache access.log?
A log is a record of events that occurred in a system. The Apache access.log is a file that captures information about the events that occurred in the Apache web server.
For instance, when someone visits your website, a log entry is recorded and stored in the access.log file. It contains information such as the client IP, resource path, status code, browser used, etc.
Apache log format: "%h %l %u %t \"%r\" %>s %b"
Let's break down the log format:
%h – Client IP
%l – Identity of Client (will return hyphen if the information is not available)
%u – User-ID of Client (will return hyphen if the information is not available)
%t – Timestamp
"%r" – Request Line (it includes http method, resource path, and http protocol used)
%>s – Status Code
%b – Size of the object returned to the client, in bytes
Finally, a log entry looks like this:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
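To map this sample entry back to the format: 199.72.81.55 is the client IP (%h), the two hyphens are the client identity (%l) and user ID (%u), [01/Jul/1995:00:00:01 -0400] is the timestamp (%t), "GET /history/apollo/ HTTP/1.0" is the request line (%r), 200 is the status code (%>s), and 6245 is the size of the returned object in bytes (%b).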
Why is it essential to analyze log data?
Log analysis is the process of making sense of system-generated logs. It helps businesses understand their customers' behavior, comply with security policies, and troubleshoot their systems.
We need to upload the data into Object Storage Service (OSS) so that we can process, transform, and analyze the website log data into consumable insights using Data Lake Analytics (DLA) and Quick BI.
We will use the OSS command-line utility, ossutil (you can also use the console to create a bucket and upload the data into it). Please download and install the tool from the official website. For more details, you can have a look at how to Download and Install ossutil.
After downloading, installing, and configuring the ossutil tool, follow the steps below to ingest the data into Object Storage Service (OSS).
Create a bucket in OSS
ossutil mb oss://bucket [--acl=acl] [--storage-class sc] [-c file]
For instance,
ossutil64 mb oss://apachelogs
You can see that the bucket has been created successfully. Now we need to upload the file into that bucket.
Upload the file into a bucket
ossutil cp src-url dest-url
For instance,
ossutil64 cp accesslog.txt oss://apachelogs/
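If you want to double-check the upload before moving on, ossutil also provides a listing command (shown here with the example bucket name used above):
ossutil64 ls oss://apachelogs/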
You can see that the file has been ingested into the bucket successfully. Now we need to process it using Data Lake Analytics (DLA).
Once the data has been uploaded to OSS, we can use Data Lake Analytics to process it.
Unlike traditional systems such as Hadoop and Spark, Data Lake Analytics uses a serverless architecture, as discussed earlier in this article. Users therefore do not need to worry about provisioning infrastructure, which is taken care of by the vendor itself. You only pay for the volume of data scanned to produce the result during query execution, i.e., pay-per-execution.
So, we only need to use DDL statements to create tables that describe to DLA the structure of the data stored in OSS. Currently, you need to apply for DLA to use it; for more details, have a look at Alibaba Cloud Data Lake Analytics.
Assuming you have access to DLA and have configured the DLA connection in SQL Workbench, we are going to use the Apache web log imported into OSS as an example to understand how to use Data Lake Analytics. For more details on how to create a connection in SQL Workbench, a shell, or any other database tool, please have a look at the official documentation.
Creating a Schema
CREATE SCHEMA my_test_schema with DBPROPERTIES (
LOCATION = 'oss://xxx/xxx/',
catalog='oss');
Note: Currently, a DLA schema name must be globally unique within its region (i.e., unique across all DLA users in that region). If the schema name already exists, an error message will be returned.
For instance,
CREATE SCHEMA apacheloganalyticsschema with DBPROPERTIES (
LOCATION = 'oss://apachelogs/',
catalog='oss' );
Note: Your OSS LOCATION path must end with "/" to indicate that it is a path.
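Because DLA is accessed through the MySQL protocol, most SQL clients let you switch to the new schema and inspect it right away. As a quick sanity check (assuming your client passes these MySQL-style statements through; SHOW TABLES will return an empty list until we create a table in the next step):
USE apacheloganalyticsschema;
SHOW TABLES;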
Creating a table
CREATE EXTERNAL TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
[STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
LOCATION oss_path
For instance,
CREATE EXTERNAL TABLE apacheloganalyticstable (
host STRING,
identity STRING,
user_id STRING,
time_stamp STRING,
request STRING,
status STRING,
size STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s" )
STORED AS TEXTFILE LOCATION 'oss://apachelogs/accesslog.txt';
You can see that the table has been created successfully. Now we can execute ad-hoc queries against the table and see the results.
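Note that the seven capture groups in input.regex map, in order, to the seven columns declared in the DDL above. As an optional sanity check before querying (assuming your SQL client passes DESCRIBE through to DLA), you can inspect the column definitions:
DESCRIBE apacheloganalyticstable;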
Querying the database
Data Lake Analytics complies with standard SQL and supports a variety of functions. Therefore, you can perform ad-hoc queries against the data stored in OSS just as you would in a common database.
select * from apacheloganalyticstable limit 5;
select count(distinct host) as "Unique Host" from apacheloganalyticstable where status = '200';
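Building on these, here are a couple more illustrative ad-hoc queries against the same table; they are simple sketches that use only the columns defined in the DDL above:
select status, count(*) as hits from apacheloganalyticstable group by status order by hits desc;
select request, count(*) as hits from apacheloganalyticstable group by request order by hits desc limit 10;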
As you can see, DLA processes the data in a matter of milliseconds, which is blazingly fast, and its performance is continuously improving.
I hope that this article has given you a better understanding of serverless computing and why you should consider adopting it. We also discussed how serverless architecture can be applied to Business Intelligence and Analytics (BIA) and introduced Alibaba Cloud Data Lake Analytics.
We then walked through a scenario of analyzing the cold data stored in OSS using DLA and Quick BI. If you followed the steps correctly, you should have successfully ingested the data into OSS, created a data table in DLA, and been able to query the data just as you would in a database.
In the next article of this series, we will analyze and visualize the data using Quick BI to transform the logs into consumable insights. Stay tuned.