What is Data Lake Analytics? Simplify Big Data Processing

The idea behind a data lake is to store all of an enterprise's data in one place, from the original raw data through to the data used for reporting, visualization, analysis, and machine learning.

Alibaba Cloud Data Lake Analytics is an interactive analytics service that utilizes serverless architecture for big data applications.

What is Data Lake Analytics?

With advances in cloud computing, issues such as data availability and accessibility can be addressed easily. However, the growth in data volume also means that systems need to be able to process more kinds of raw data. This raw data may include structured data, unstructured data, images, videos, files, blobs, and other file formats.

So what makes a data lake a necessity for organizations? A data lake allows raw data to be stored in its native format while still allowing it to be processed at very high speeds, for example for reporting and visualization through business intelligence tools.

Alibaba Cloud Data Lake Analytics does exactly this. It is a repository of information where all the raw data of the company sits, and it can perform ETL (Extract, Transform, and Load) activity right in the data lake. All the extracted and transformed data is then available for consumption at the same location. Currently, organizations typically use Hadoop for this purpose, but Hadoop requires a batch compute mechanism, which is yet another dependency for the enterprise and might not be easy to work with.

Data Lake vs. Data Warehouse

Many people mistake data lakes for data warehouses. In fact, they are quite different. A product such as Alibaba Cloud Table Store is a data warehouse, while Data Lake Analytics, as its name suggests, is a data lake. The two types of data storage differ in the following ways:

Data Persistence

A data warehouse removes data that is not part of the decision-making process. This helps reduce disk space usage, and it may also follow from the narrowly scoped process of the enterprise data warehouse (EDW). Data Lake Analytics, on the other hand, keeps track of all the data, because of the different actions we may perform on it and the wide range of data types it supports. A majority of Data Lake Analytics users connect processed data to their own BI tools to automate reporting. A smaller group of users may analyze old data using new techniques. And another group might bring in new data and create new datasets from it.

Data Types

Data Lake Analytics supports almost all types of data, whereas a data warehouse contains data generated by transactional systems. Warehouse data is well structured, with attributes that describe it. Data in the data lake, by contrast, might or might not carry much descriptive information.

The data in the data lake might be generated from web application logs, JSON files, CSV files, images, videos, PDFs, sensor data, and any other files. In other words, it also includes schema-less data, which is analogous to a NoSQL database compared with a SQL database. In addition, data in an enterprise data warehouse (EDW) is read-only, while a data lake supports both reads and writes.
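To make the idea of schema-less data more concrete, here is a minimal Python sketch of applying a schema only when the data is read. It illustrates the general concept, not how Data Lake Analytics works internally. The sample records, the field names, and the read_with_schema helper are all made up for this example.

```python
import json

# Hypothetical raw records as they might sit in a data lake. Each record
# carries whatever fields its producer emitted; no schema is enforced.
RAW_LINES = [
    '{"user": "alice", "action": "click", "page": "/home"}',
    '{"user": "bob", "action": "view"}',
    '{"sensor_id": 7, "temperature": 21.5}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time.

    Keeps only the requested fields and fills in None where a record
    does not have one. The raw lines themselves are left untouched.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# The same raw data can be read with whichever schema the analysis needs.
rows = list(read_with_schema(RAW_LINES, ["user", "action"]))
for row in rows:
    print(row)
```

A data warehouse would typically reject or reshape the third record when it is loaded. Here it simply yields empty values for the fields it does not have, and a different analysis could read the same lines with a sensor-oriented schema instead.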

Ability to Adapt

Implementing a data warehouse carries overhead, because there is a lot to do: cleaning up the data, assigning a schema, and bringing the data into a proper shape for the business intelligence team to consume. This takes a significant amount of development time. Data Lake Analytics does not pose these problems. Since the raw data is available, we can experiment and discard the result if we do not want it, or persist it and provide the information.

Data Lake Analytics can be much faster thanks to parallel computing, and the data is readily available. This comes at a cost: the user who accesses the data is responsible for cleaning it up and checking whether it is a good fit for the business's users.

Is Data Lake Analytics really for me? You can refer to Why I Love Data Lake Analytics.

Related Blogs

Processing OSS Data Files in Different Formats Using Data Lake Analytics

This article describes how to use Data Lake Analytics (DLA) to analyze files stored in Object Storage Service (OSS) instances based on the file format.

Alibaba Cloud Data Lake Analytics (DLA) is a serverless interactive query and analysis service in Alibaba Cloud. You can query and analyze data stored in Object Storage Service (OSS) and Table Store instances simply by running standard SQL statements, without the necessity of moving the data.

DLA has now been officially launched on Alibaba Cloud. You can apply for a trial of this out-of-the-box data analysis service.

Visit the official documentation page to apply for activation of the DLA service.

In addition to plain text files such as CSV and TSV files, DLA can also query and analyze data files in other formats, such as ORC, Parquet, JSON, RCFile, and Avro. DLA can even query geographical JSON data in line with the ESRI standard and files matching the specified regular expressions.

DLA provides various built-in Serializers/Deserializers (SerDes) for file processing. Instead of writing programs yourself, you can choose one or more SerDes to match the formats of the data files in your OSS instances. Contact us if the SerDes do not meet your needs for processing special file formats.
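The idea of choosing a deserializer to match a file's format can be sketched in a few lines of Python. This is only an illustration of the concept using the standard library; it is not DLA's SerDe mechanism or syntax. The DESERIALIZERS registry and the read_file helper are hypothetical names introduced for this example.

```python
import csv
import io
import json


def deserialize_csv(text):
    """Parse comma-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def deserialize_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def deserialize_json_lines(text):
    """Parse text holding one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical registry: one deserializer per supported format name.
DESERIALIZERS = {
    "csv": deserialize_csv,
    "tsv": deserialize_tsv,
    "json": deserialize_json_lines,
}


def read_file(text, fmt):
    """Turn raw file content into rows using the deserializer for fmt."""
    try:
        deserializer = DESERIALIZERS[fmt]
    except KeyError:
        raise ValueError(f"no deserializer registered for format {fmt!r}") from None
    return deserializer(text)


csv_rows = read_file("id,name\n1,alice\n2,bob\n", "csv")
json_rows = read_file('{"id": 1, "name": "alice"}\n', "json")
print(csv_rows)
print(json_rows)
```

The caller only names the format; the parsing logic stays behind the registry. Note that the CSV reader returns every value as a string, while the JSON reader preserves numeric types, so a real pipeline would still need to declare column types.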

Exploring Blockchain and Big Data with Alibaba Cloud Data Lake Analytics

I started my own company, which uses big data for promotion. A major difficulty that I had to overcome was processing offline data and cold backup data.

We needed to collect hundreds of millions of search results from Apple's App Store. To reduce costs, we only kept data from the past month in the online database and backed up historical data in Alibaba Cloud Object Storage Service (OSS). When we needed the historical data, we had to import it into the database or a Hadoop cluster. Companies with limited budgets use similar methods for data such as users' click logs. This method is inconvenient because it requires a large amount of ETL import and export work. As data operations grow more complex, frequent requests such as historical data backtracking and analysis consume a considerable part of R&D engineers' time and energy.

Alibaba Cloud's Data Lake Analytics was like a ray of hope. It allows us to query and analyze data in OSS using simple database methods that support MySQL. It simplifies big data analysis and delivers satisfying performance. To a certain extent, it can directly support some of the online services.

Here I'd like to share my experience, using an example of big data analysis for blockchain logs. I picked blockchain data because the data volume is huge, which can fully demonstrate the features of Data Lake Analytics. Another reason is that all the blockchain data is open.

Experimental Data Set

The data set used in this article contains all the data of Ethereum as of June 2018. More than 80% of today's DApps are deployed on Ethereum, so analyzing and mining this data is highly valuable.

Ethereum data is organized as a chain of blocks, and each block contains detailed transaction data. The transaction data is similar to common access logs and includes the From (source user address), To (destination user address), and Data (sent data) fields. For example, ETH transfers between users and invocations of Ethereum programs (smart contracts) by users are both completed through transactions. Our experimental data set contains about 5 million blocks and 170 million transaction records.
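Because the transaction records resemble access logs, a typical first question is which destination addresses receive the most transactions. The following Python sketch shows that kind of aggregation on a handful of made-up records. The addresses, the payloads, and the lower-case field names are invented for illustration, and the real data set would be queried through the analytics service rather than held in memory like this.

```python
from collections import Counter

# Made-up sample records with the three fields described above.
TRANSACTIONS = [
    {"from": "0xaaa", "to": "0xccc", "data": "0x"},
    {"from": "0xbbb", "to": "0xccc", "data": "0xa9059cbb"},
    {"from": "0xaaa", "to": "0xddd", "data": "0x"},
]


def count_by_destination(transactions):
    """Count how many transactions were sent to each destination address."""
    return Counter(tx["to"] for tx in transactions)


counts = count_by_destination(TRANSACTIONS)

# The busiest destination, comparable to a GROUP BY with a descending sort.
busiest = counts.most_common(1)
print(busiest)
```

The same grouping and counting is what a SQL query with GROUP BY on the To field would express over the full 170 million records.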

Analyzing Data on Tableau with Data Lake Analytics

In this tutorial, we will analyze raw data with Tableau using Data Lake Analytics. We will analyze the files that are available in Object Storage Service (OSS).

Activate Data Lake Analytics

First, you have to activate Alibaba Cloud Data Lake Analytics using the console or through the product page. After successful activation, you will receive an email containing the credentials for accessing Data Lake Analytics.

Upload Data Files to Object Storage Service (OSS)

You have to create a bucket in the OSS console or OSS Browser in the same region as Data Lake Analytics. For example, if you are using Data Lake Analytics outside China, you can use the Singapore region, which means that the data has to be stored in an OSS bucket in Singapore. The choice of availability zone does not matter.

Grant Data Lake Analytics permission to access OSS in the RAM console, or by using the link below:

https://ram.console.aliyun.com/

Related Courses

Quick Start Guide of Data Lake Analytics

Data Lake Analytics does not require any ETL tools. This service allows you to use standard SQL syntax and business intelligence (BI) tools to efficiently analyze your data stored in the cloud with extremely low costs. Through the introduction and usage of the product, this course enables learners to quickly get started with Data Lake Analytics.

Alibaba Cloud Big Data Products Overview

This course briefly explains the basics of the Alibaba Cloud big data product system and several products used in big data applications, such as MaxCompute, DataWorks, RDS, DRDS, QuickBI, TS, Analytic DB, OSS, and Data Integration. Students can refer to the application scenarios explained, relate them to their enterprise's own business and demands, and put what they have learned into practice.

Related Market Products

A Quick Guide to Process Structured Data with Python

The data set for this course comes from a virtual blog site. We are going to use the data to solve business problems, for example: which countries your customers come from, which day of the week gets the most online traffic, and which region contributes the most clickstream data.

XpoLog - Log Management and Analytics

The XpoLog Log Analysis platform collects and indexes any machine-generated data from any device, server, or application in the environment. Immediately start searching and analyzing your log data, run complex correlations, and create mission-critical, real-time visualizations of the data.

Related Products

Data Lake Analytics

MaxCompute

MaxCompute (previously known as ODPS) is a general purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.

Related Documentation

What is Data Lake Analytics?

Data Lake Analytics Service Level Agreement

Alibaba Clouder
