Building Your Data Lake on Alibaba Cloud

This article explains data lakes and how to build data lakes on Alibaba Cloud.

By Pablo Puig and Eneko Perez

What Is a Data Lake?

A data lake is a centralized repository that stores large quantities of structured, semi-structured, and unstructured data from various sources in its native, raw format. Companies that store big data this way can make better decisions, offer a seamless customer experience, and pursue new growth opportunities. The repository has a flat architecture, and data is loaded into it through extract, transform, and load (ETL) methods.

The biggest advantage of a data lake is that it helps you store and access information in less time. It also aids collaboration and data analysis, resulting in better decision-making and faster performance.

Without governance, a data lake is difficult to operate. It needs defined mechanisms, access controls, and semantic consistency to catalog and secure data.

Benefits of a Data Lake

Data lakes provide solutions to businesses across every industry. The most significant benefits of data lakes are listed below:

  • Data Storage and Security: Data lakes allow you to store large amounts of data in a highly durable, available, and secure way.
  • Flexibility: Data lakes offer you the capability of storing, ingesting, transforming, and analyzing data from diverse sources and multiple schemas.
  • Performance: You can ingest a high volume of new data and consume a high volume of stored data. Performance is critical for data lakes.
  • Advanced Analytics: Data lakes allow you to run analytics without moving your data to a separate analytics system since multiple analytic services are supported, including open-source tools such as Presto and Spark.

Building Your Data Lake on Alibaba Cloud

Alibaba Cloud offers a wide range of cloud computing services and data products that enable you to build a data lake, including Object Storage Service (OSS) for storage, Data Lake Analytics (DLA) for data processing and analytics, and Quick BI for visualization and reporting.

What Is Object Storage Service (OSS)?

Alibaba Cloud Object Storage Service (OSS) is a cloud-based object storage service. You can use OSS to store data from multiple sources in a centralized manner and access it anytime. OSS offers 99.9999999999% (12 9s) data durability and 99.995% service availability, making it a reliable cloud storage service. OSS stores copies of your data in multiple zones within the same region. OSS also delivers consistent performance as read and write operations increase. In addition, the OSS Append Object functionality lets you read an object in real time while new data is appended to it, which makes analytics workloads more efficient.

What Is Data Lake Analytics (DLA)?

Alibaba Cloud Data Lake Analytics (DLA) is a cloud-native big data solution that is highly flexible and easy to use. DLA provides data processing capabilities and supports various file formats and Hadoop-oriented object storage. DLA also integrates well with widely adopted databases, including relational databases, PolarDB, and NoSQL databases. In addition, it supports the Spark and Presto engines.

What Is Quick BI?

Alibaba Cloud Quick BI is a business intelligence (BI) service that allows you to analyze large amounts of data in real time. With it, you can visualize, analyze, and interact with your data and create reports from it.

Creating a Data Lake on Alibaba Cloud

The following step-by-step guide explains how to create a basic data lake on Alibaba Cloud using the components described above:

  • Step 1: Create an OSS bucket with a non-Archive storage class, then create a directory structure in the new bucket like the following example:

[Screenshot: directory structure in the OSS bucket]

We used a sample CSV file named supplier_with_header.csv in this example.
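If you prefer to script Step 1 rather than use the console, the following is a minimal sketch, not the procedure the authors followed. The folder layout (`datalake/supplier/`), the bucket name, the endpoint, and the credential parameters are hypothetical placeholders; only the file name supplier_with_header.csv comes from this example. The upload helper assumes the `oss2` Python SDK is installed and is defined here but not called.

```python
import posixpath


def build_object_key(prefix: str, table: str, filename: str) -> str:
    """Join path parts into an OSS object key ('/' acts as the folder separator)."""
    return posixpath.join(prefix.strip("/"), table.strip("/"), filename)


def upload_csv(local_path, key, *, access_key_id, access_key_secret,
               endpoint, bucket_name):
    """Upload a local file to OSS. Requires the oss2 SDK (pip install oss2)."""
    import oss2  # imported lazily so the key helper works without the SDK

    auth = oss2.Auth(access_key_id, access_key_secret)
    bucket = oss2.Bucket(auth, endpoint, bucket_name)
    return bucket.put_object_from_file(key, local_path)


# Hypothetical layout: one folder per table under a common prefix.
key = build_object_key("datalake", "supplier", "supplier_with_header.csv")
print(key)
```

Keeping one folder per table is only one possible convention; what matters for the next step is that the folder path you create is the one you enter as the OSS directory location.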

  • Step 2: Use Data Lake Analytics to discover metadata and create a database. Using the OSS wizard, you can create a metadata discovery job that builds a database schema from the information in the CSV file.

You can find the settings we specified for our CSV file in the screenshot below. Pay special attention to the field delimiter and the automatic detection of table headers. In the OSS directory location, enter the file's path defined in step 1:

[Screenshots: metadata discovery settings for the CSV file]

This will create a metadata information discovery task like the following example:

[Screenshot: metadata information discovery task]

DLA will analyze the content and structure of the CSV file. If it is successful, it will create a database that we will use later. In our case, DLA has created the following database and structure directly from the CSV file:

[Screenshot: database and table structure created by DLA]
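To give an intuition for what a metadata discovery job does with the field delimiter and the header row, here is a small, self-contained sketch. It is not DLA's implementation. The `|` delimiter, the `s_suppkey` column, the sample values, and the type names (`bigint`, `double`, `string`) are assumptions made for illustration; only `s_name` and `s_acctbal` appear in this article.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for name, cast in (("bigint", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text, delimiter=","):
    """Treat the first row as the header and infer one type per column."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # transpose rows into columns
    return [(name, infer_type(col)) for name, col in zip(header, columns)]


# Illustrative sample; the real file's delimiter and columns may differ.
sample = (
    "s_suppkey|s_name|s_acctbal\n"
    "1|Supplier#000000001|5755.94\n"
    "2|Supplier#000000002|4032.68\n"
)
schema = infer_schema(sample, delimiter="|")
for name, type_name in schema:
    print(name, type_name)
```

Passing the wrong delimiter makes the whole line parse as a single string column, which is why the delimiter setting deserves attention in the wizard.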

If you use multiple accounts in your DLA instance, you need to grant them permissions on the newly created database:

[Screenshot: granting permissions on the database]

  • Step 3: Use Quick BI to visualize and analyze the data on the database. First, we need to configure the data source, which will be the recently created database on DLA. Create a new MySQL data source and use the connection details provided on the DLA dashboard (SQL access point under Serverless Presto):

[Screenshot: MySQL data source configuration in Quick BI]

From here, a new dataset needs to be created as well. Once the dataset has been created, we need to inspect its data and decide which columns will be Dimensions and which ones will be Measures. In our case, we will use "s_acctbal" (Account balance) as a measure and will leave the rest as dimensions:

[Screenshot: dimensions and measures of the dataset]

Once we have defined a field as a Measure, double-check its format to make sure it matches the contents (string, number, etc.).
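One way to run this format check outside Quick BI is to confirm that every value in the measure column parses as a number. The sketch below is only an illustration under assumptions: the rows are made-up sample data, and `non_numeric_values` is a hypothetical helper, not a Quick BI feature.

```python
from decimal import Decimal, InvalidOperation


def non_numeric_values(rows, column):
    """Return the values in `column` that cannot be parsed as decimal numbers."""
    bad = []
    for row in rows:
        try:
            Decimal(row[column])
        except (InvalidOperation, TypeError):
            bad.append(row[column])
    return bad


# Made-up sample rows; the second balance is deliberately malformed.
rows = [
    {"s_name": "Supplier#000000001", "s_acctbal": "5755.94"},
    {"s_name": "Supplier#000000002", "s_acctbal": "n/a"},
]
bad_balances = non_numeric_values(rows, "s_acctbal")
print(bad_balances)
```

An empty result suggests the column is safe to treat as a number; any returned values point to rows that would need cleaning before the field can be aggregated.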

  • Step 4: We are ready to create our first Dashboard. Go to Dashboards and select the dataset we created from the top left selection field. Select "s_name" and "s_acctbal" and click Update. Quick BI will create a Line Chart for us by default, but you can also choose from a wide range of options to visualize the data.

[Screenshot: dashboard chart of "s_acctbal" by "s_name"]
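Conceptually, a chart of a measure against a dimension is a grouped aggregation. The sketch below reproduces that idea in plain Python, assuming the measure is summed per dimension value; the rows are made-up sample data and the function name is hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def sum_measure_by_dimension(rows, dimension, measure):
    """Group rows by the dimension value and sum the measure for each group."""
    totals = defaultdict(Decimal)  # Decimal() starts each group at 0
    for row in rows:
        totals[row[dimension]] += Decimal(row[measure])
    return dict(totals)


# Made-up sample rows; the first supplier appears twice on purpose.
rows = [
    {"s_name": "Supplier#000000001", "s_acctbal": "5755.94"},
    {"s_name": "Supplier#000000002", "s_acctbal": "4032.68"},
    {"s_name": "Supplier#000000001", "s_acctbal": "100.06"},
]
totals = sum_measure_by_dimension(rows, "s_name", "s_acctbal")
for name, balance in totals.items():
    print(name, balance)
```

`Decimal` is used instead of `float` so that account balances add up exactly, which is usually preferable for monetary values.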

Conclusion

Data lakes help you manipulate and process data through multiple transformations and data operations, which helps businesses generate better insights. They are highly flexible and offer multiple ways to query data, gain insights, eliminate data silos, and improve performance. A data lake is a good place to bring together all data sources and formats and to support organizational reporting and advanced analysis.

There are various benefits to building a data lake on Alibaba Cloud: there is no infrastructure to set up, it supports standard SQL, it can analyze data across multiple data sources, and it provides high scalability and performance.
