According to a forecast by the International Data Corporation (IDC), almost 163 ZB (zettabytes) of data will be generated worldwide by 2025. This is almost a 10x increase over what we have today, and almost all of this data will be managed by enterprises. This astronomical growth in data can help businesses create better products and opportunities through big data analytics.
With advancements in cloud computing, issues such as data availability and accessibility can be easily addressed. However, the growth in data volume also means that systems need to be able to process more kinds of raw data. This raw data may include structured data, unstructured data, images, videos, files, blobs, and other file formats.
So what makes a data lake a necessity for organizations? A data lake allows raw data to be stored in its native format while still allowing it to be processed at very high speeds, for example for reporting and visualization through business intelligence tools.
Alibaba Cloud Data Lake Analytics does exactly this. It is a repository of information where all of a company's raw data sits. It can perform ETL (Extract, Transform, and Load) activities right in the data lake, and all the extracted and transformed data is available for consumption at the same location. Currently, organizations typically use Hadoop for this purpose, but Hadoop requires a batch compute mechanism, which is one more dependency for enterprises and might not be easy to work with.
People often mistake data lakes for data warehouses. In fact, they are quite different. A product such as Alibaba Cloud Table Store is a data warehouse, while Data Lake Analytics, as its name suggests, is a data lake. The two types of data storage differ in several ways:
Data warehouses remove data that is not part of the decision-making process. This helps reduce disk space usage, and it may also follow from the narrowly scoped process of an Enterprise Data Warehouse. Data Lake Analytics, on the other hand, keeps track of all data, because of the different actions users are going to perform on it and the wide range of data types it supports. A majority of Data Lake Analytics users connect processed data to their own BI tools to automate reporting. A smaller group of users may analyze old data using new techniques. Another group may bring in new data and create new datasets out of it.
Data Lake Analytics supports almost all types of data, whereas data warehouses contain data generated by transactional systems. Warehouse data is well structured, with all the attributes needed to describe it, whereas data in a data lake may or may not carry much descriptive information about itself.
The data in a data lake might come from web application logs, JSON files, CSVs, images, videos, PDFs, sensor data, and any other files. In other words, it also includes schema-less data, which is analogous to a NoSQL database compared with a SQL database. In addition, data in an enterprise data warehouse (EDW) is read-only, while a data lake supports both reads and writes.
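To make the idea of schema-less data more concrete, here is a minimal Python sketch of applying a schema only at read time. It is an illustration under stated assumptions, not how Data Lake Analytics is implemented: the sample records, the `project` helper, and the field names are all hypothetical.

```python
import csv
import io
import json

# Hypothetical raw records, kept exactly as they arrived (native formats).
RAW_JSON_LOG = (
    '{"user": "alice", "action": "login", "ms": 120}\n'
    '{"user": "bob", "action": "view"}'
)
RAW_CSV = "user,amount\nalice,10.5\nbob,3\n"


def read_json_lines(text):
    """Parse newline-delimited JSON; each record keeps whatever fields it has."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def project(records, schema):
    """Apply a schema at read time: select fields, cast them, and
    fill fields that a record does not have with None."""
    rows = []
    for record in records:
        row = {}
        for name, cast in schema.items():
            value = record.get(name)
            row[name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# The same raw data can be read with different schemas by different users.
events = project(read_json_lines(RAW_JSON_LOG), {"user": str, "ms": int})
payments = project(read_csv(RAW_CSV), {"user": str, "amount": float})
```

Nothing is cleaned or rejected when the data is stored. The reader decides which fields matter and how to interpret them, which is why the second log record, which has no `ms` field, still loads.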
Implementing a data warehouse carries overhead, because there is a lot to do: cleaning up the data, assigning a schema, and bringing the data into a proper shape for the business intelligence team to consume. This takes a significant amount of development time. Data Lake Analytics does not pose these problems. Since the raw data is available, we can experiment and discard the result if we do not want it, or persist it and provide the information to others.
Data Lake Analytics is also much faster thanks to the power of parallel computing, and the data is readily available. This comes at a cost: the user who accesses the data is responsible for cleaning it up and checking whether it is a good fit for the business's users.
Whether you need Data Lake Analytics depends on your current situation. If you have a well-established system using a data warehouse and it works for you, then Data Lake Analytics may not be necessary. However, if you are fairly new and are just starting to analyze large amounts of data, then you can consider Data Lake Analytics for your use case. If you have a data warehouse but are struggling with the issues mentioned above, then you can start implementing Data Lake Analytics in parallel. When you are fully comfortable, you can move everything into Data Lake Analytics.
The Alibaba Cloud team has only recently announced the Data Lake Analytics product, currently in Version 1.0.0. While it may not exactly serve your business needs at this point, the product packs many features that may help your business.
Alibaba Cloud Data Lake Analytics (DLA) is an interactive analytics service that uses a serverless architecture. As a ready-to-use service, DLA does not require any prior setup of infrastructure or upfront management costs. You do not need to maintain instances in DLA, and the service is billed based on actual use and needs. DLA uses SQL interfaces to interact with user service clients, which means it complies with standard SQL syntax and provides a variety of similar functions. DLA allows you to retrieve and analyze data from multiple data sources or locations, such as OSS and Table Store, for optimal data processing, analytics, and visualization, giving better insights and ultimately guiding better decision making.
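As a rough illustration of what a standard SQL interface over raw files looks like, the following Python sketch runs an aggregate query over CSV text. It uses Python's built-in `sqlite3` module purely as a stand-in; it does not use or represent the DLA API. The CSV content, the `orders` table name, and the `load_csv_into_table` helper are hypothetical. Note also that this sketch copies the data into an in-memory table for simplicity, whereas the article describes DLA as querying data where it is stored.

```python
import csv
import io
import sqlite3

# Hypothetical raw file content, as it might sit in object storage.
RAW_ORDERS_CSV = "region,amount\neast,10\nwest,5\neast,7\n"


def load_csv_into_table(conn, table, text):
    """Create a table whose columns come from the CSV header row
    and insert every remaining row as untyped text."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


conn = sqlite3.connect(":memory:")
load_csv_into_table(conn, "orders", RAW_ORDERS_CSV)

# Standard SQL: cast the raw text at query time, then aggregate.
totals = conn.execute(
    "SELECT region, SUM(CAST(amount AS REAL)) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
```

The point of the sketch is the division of labor: the raw file carries no types, and the query itself decides how to interpret each column.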
To learn more about Alibaba Cloud Data Lake Analytics, visit www.alibabacloud.com/products/data-lake-analytics