By Pablo Puig and Eneko Perez
A data lake is a centralized repository that stores large quantities of structured, semi-structured, and unstructured data from various sources in its native, raw format. Companies that store big data this way can make better decisions, offer a seamless customer experience, and pursue new growth opportunities. A data lake has a flat architecture and stores data via extract, transform, and load methods.
The biggest advantage of a data lake is that it lets you store and access information in less time. It also helps users collaborate and analyze data, resulting in better decision-making and faster performance.
It can be challenging to operate a data lake without governance. The tooling needs defined mechanisms, access controls, and semantic consistency to catalog and secure data.
Data lakes provide solutions to businesses across every industry. The most significant benefits of data lakes are listed below:
Alibaba Cloud offers a wide range of cloud computing services and data products that enable you to build a data lake, including Object Storage Service (OSS) for storage, Data Lake Analytics (DLA) for data processing and analytics, and Quick BI for business intelligence.
Alibaba Cloud Object Storage Service (OSS) is a cloud-based object storage service. You can use OSS to store data from multiple sources in a centralized manner and access it at any time. OSS offers 99.9999999999% (12 9s) data durability and 99.995% service availability, making it a reliable cloud storage service. OSS stores copies of your data in multiple zones within the same region. OSS also scales consistently as read and write operations increase. In addition, the OSS Append Object functionality lets you read an object while new data is still being appended to it, which improves the efficiency of analytics workloads.
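To put the durability and availability figures above into perspective, the arithmetic can be sketched as follows. This is only an illustration: it assumes both percentages are measured over a year, and the object count of one billion is a made-up example, not a figure from OSS documentation.

```python
from fractions import Fraction

# Figures quoted in the article; treating them as annual rates is an assumption.
DURABILITY = Fraction("99.9999999999") / 100    # 12 nines
AVAILABILITY = Fraction("99.995") / 100


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year for a given object count."""
    return object_count * (1 - DURABILITY)


def max_downtime_minutes(period_days: int = 365) -> Fraction:
    """Downtime in minutes that still meets the availability target."""
    return (1 - AVAILABILITY) * period_days * 24 * 60


if __name__ == "__main__":
    # Hypothetical bucket holding one billion objects.
    print(float(expected_annual_losses(10**9)))
    print(float(max_downtime_minutes()))
```

Under these assumptions, a bucket with one billion objects would expect to lose about one object every thousand years, and 99.995% availability allows roughly 26.28 minutes of downtime per year.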
Alibaba Cloud Data Lake Analytics (DLA) is a cloud-native big data solution that is highly flexible and easy to use. DLA provides data processing capabilities and supports various file formats and Hadoop-oriented object storage. DLA also integrates well with the most widely adopted databases, including relational databases, PolarDB, and NoSQL databases. In addition, it supports the Spark and Presto engines.
Alibaba Cloud Quick BI is a business intelligence (BI) service that allows you to analyze large amounts of data in real time. With it, you can visualize, analyze, and interact with your data and create reports.
The following step-by-step guide explains how to create a basic data lake on Alibaba Cloud using the components described above:
We used a sample CSV file named supplier_with_header.csv in this example.
You can find the settings we specified for our CSV file in the screenshot below. Pay special attention to the field delimiter and the automatic detection of table headers. In the OSS directory location, enter the file's path defined in step 1:
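Because the discovery task depends on the field delimiter and the header row being set correctly, it can help to check the file locally before uploading it. The following is a minimal sketch of such a check, not part of DLA itself. The sample rows, the column names other than s_acctbal, and the `|` delimiter are assumptions made for illustration, and `preflight` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical rows shaped like supplier_with_header.csv.
# Only s_acctbal is named in the article; the rest is assumed.
SAMPLE = (
    "s_suppkey|s_name|s_nationkey|s_acctbal\n"
    "1|Supplier#000000001|17|5755.94\n"
    "2|Supplier#000000002|5|4032.68\n"
    "3|Supplier#000000003|1|4192.40\n"
)


def preflight(text: str, delimiter: str) -> list[str]:
    """Return the header row if every data row has the same field count."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        raise ValueError("file is empty")
    header = rows[0]
    if len(header) < 2:
        # A single column usually means the delimiter is wrong.
        raise ValueError(f"only one column found with delimiter {delimiter!r}")
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
    return header


if __name__ == "__main__":
    print(preflight(SAMPLE, "|"))
```

Calling `preflight` with the wrong delimiter (for example `,` on this sample) raises a `ValueError`, which is the kind of mismatch that would otherwise only surface after the discovery task has run.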
This will create a metadata information discovery task like the following example:
DLA will analyze the content and structure of the CSV file. If it is successful, it will create a database that we will use later. In our case, DLA has created the following database and structure directly from the CSV file:
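To give a feel for what analyzing the content and structure of a CSV file involves, here is a rough sketch of column type inference. It is not DLA's actual algorithm: the type names (BIGINT, DOUBLE, STRING), the sample rows, the `|` delimiter, and all column names except s_acctbal are assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical rows shaped like supplier_with_header.csv (assumed content).
SAMPLE = (
    "s_suppkey|s_name|s_nationkey|s_acctbal\n"
    "1|Supplier#000000001|17|5755.94\n"
    "2|Supplier#000000002|5|4032.68\n"
    "3|Supplier#000000003|1|4192.40\n"
)


def infer_type(values) -> str:
    """Pick the narrowest of BIGINT, DOUBLE, STRING that fits every value."""
    for sql_type, cast in (("BIGINT", int), ("DOUBLE", float)):
        try:
            for value in values:
                cast(value)
            return sql_type
        except ValueError:
            continue  # at least one value did not fit; try the next type
    return "STRING"


def infer_schema(text: str, delimiter: str) -> dict[str, str]:
    """Map each header name to an inferred type, using all data rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    columns = list(zip(*reader))  # transpose the data rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


if __name__ == "__main__":
    for name, sql_type in infer_schema(SAMPLE, "|").items():
        print(name, sql_type)
```

On this sample, the sketch classifies s_acctbal as DOUBLE and s_name as STRING. A real discovery service may sample rows, handle nulls and dates, and choose different types, so compare the result against the structure DLA actually reports.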
If you use multiple accounts in your DLA instance, you need to grant them permissions on this newly created database:
From here, a new dataset needs to be created as well. Once the dataset has been created, we need to inspect its data and decide which columns will be Dimensions and which ones will be Measures. In our case, we will use "s_acctbal" (Account balance) as a measure and will leave the rest as dimensions:
Once we have defined a field as a Measure, double-check its format to make sure it matches the contents (string, number, etc.).
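The distinction between Dimensions and Measures can be illustrated outside of Quick BI with a small sketch: a measure is a numeric field that gets aggregated, and a dimension is a field the aggregation is grouped by. The rows below are invented, and grouping by s_nationkey is an assumption; only s_acctbal is taken from the article. `is_numeric` and `sum_measure` are hypothetical helper names.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation

# Hypothetical dataset rows; values are strings, as read from a CSV file.
ROWS = [
    {"s_name": "Supplier#000000001", "s_nationkey": "17", "s_acctbal": "5755.94"},
    {"s_name": "Supplier#000000002", "s_nationkey": "5", "s_acctbal": "4032.68"},
    {"s_name": "Supplier#000000003", "s_nationkey": "17", "s_acctbal": "4192.40"},
]


def is_numeric(rows, field: str) -> bool:
    """Check that every value of a field parses as a number (measure format check)."""
    try:
        for row in rows:
            Decimal(row[field])
    except InvalidOperation:
        return False
    return True


def sum_measure(rows, dimension: str, measure: str) -> dict:
    """Sum a measure for each distinct value of a dimension."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        totals[row[dimension]] += Decimal(row[measure])
    return dict(totals)


if __name__ == "__main__":
    print(is_numeric(ROWS, "s_acctbal"))
    print(sum_measure(ROWS, "s_nationkey", "s_acctbal"))
```

Here `is_numeric` plays the role of the format check: s_acctbal passes, while a text field such as s_name does not and would make a poor measure. `Decimal` is used instead of `float` so that account balances add up without rounding noise.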
Data lakes help manipulate and process data through multiple transformations and data operations, which helps businesses generate better insights. Data lakes are highly flexible and offer multiple ways to query data, gain insights, eliminate data silos, and improve performance. They are a good place to bring together all data sources and formats and to support organizational reporting and advanced analysis.
There are various benefits to building a data lake on Alibaba Cloud: there is no infrastructure to set up, it complies with standard SQL, it can analyze data across multiple data sources, and it provides high scalability and performance.