EB-level Data Lake Based on OSS

This article briefly discusses data lake systems and their key features, and describes how to build data lake storage based on Alibaba Cloud OSS.


Explosive data growth has made digital transformation a hot topic in the IT industry, and in-depth analysis of data value has become a pressing need. To meet ever-changing future requirements, the original information retained in the data must be preserved. Traditional database products such as Oracle struggle to adapt to this trend, so new computing engines keep emerging to cope with the data age. Recently, many companies have been discussing the concept of a data lake. They expect a system that retains the original data while connecting to a variety of computing platforms, so that they can stay ahead of the competition.


What is a Data Lake?

A data lake provides centralized storage for various types of data, including structured, semi-structured, and unstructured data. It requires no predefined schema. Instead, it can store data in the original format while covering various types of data input sources. A data lake seamlessly connects to various computing and analysis platforms and provides good support for the Hadoop ecosystem. You can directly use the data available in a data lake for data analysis, processing, and querying. Thus, you can explore the value of data through in-depth data mining and analysis.
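Storing data without a predefined schema is often called schema-on-read: the bytes are kept as they arrive and an interpretation is applied only when the data is read. The following is a minimal Python sketch of that idea. The in-memory dictionary `lake`, the helper names, and the sample keys and records are all hypothetical stand-ins and are not part of any data lake product.

```python
import csv
import io
import json

# Hypothetical in-memory stand-in for data lake storage: key -> raw bytes.
# Nothing is parsed or validated at write time.
lake = {}

def put_raw(key, data):
    """Store bytes exactly as received, with no schema applied."""
    lake[key] = data

def read_json_lines(key):
    """Apply a JSON-lines interpretation only when the data is read."""
    text = lake[key].decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def read_csv(key):
    """Apply a CSV interpretation only when the data is read."""
    text = lake[key].decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

# Semi-structured, structured, and unstructured data share one store.
put_raw("logs/app.jsonl", b'{"user": "a", "ms": 12}\n{"user": "b", "ms": 30}\n')
put_raw("sales/2020.csv", b"region,amount\neast,100\nwest,250\n")
put_raw("images/logo.bin", bytes([0x89, 0x50, 0x4E, 0x47]))  # opaque binary

events = read_json_lines("logs/app.jsonl")
sales = read_csv("sales/2020.csv")
total_sales = sum(int(row["amount"]) for row in sales)
```

Because the raw bytes are never rewritten, a different engine could later read the same keys with a different interpretation.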


Data Lake: Key Features and Values

  • Massive data storage: A data lake stores massive amounts of data. It is independent of the computing framework and allows direct data access without needing additional mount operations. Also, it is flexible and elastic enough to cope with explosive data growth. Moreover, a data lake supports redundancy in multiple layers for high data reliability and availability.
  • Efficient data computing: A data lake provides various data storage types and sharing capabilities, supporting storage of structured, semi-structured, and unstructured data. It also adapts to different computing platforms, which avoids problems such as data silos and unnecessary data copies.
  • Secure data management: A data lake supports a data catalog. It can intelligently manage a large number of data assets and ensure data security through fine-grained access control.

OSS-based Data Lake Storage

OSS Introduction

Object Storage Service (OSS) is a secure, cost-effective, and highly reliable cloud storage service provided by Alibaba Cloud. It enables users to store large amounts of data in the cloud. OSS offers data durability of at least 99.9999999999% (twelve 9s) and service availability (or business continuity) of at least 99.995%. OSS provides platform-independent RESTful APIs, so you can store and access any type of data anytime, anywhere, and from any application.
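To get a feel for what these percentages mean, the arithmetic can be sketched in Python. This is only an illustration: it assumes the durability figure can be read as a per-object annual survival probability, which is an interpretation made for this sketch, and the function names and the one-billion-object example are hypothetical.

```python
from fractions import Fraction

# Figures quoted in the article, held as exact fractions to avoid float rounding.
DURABILITY = Fraction("0.999999999999")   # 99.9999999999%
AVAILABILITY = Fraction("0.99995")        # 99.995%

def expected_annual_losses(object_count):
    """Expected objects lost per year, assuming the durability figure is a
    per-object annual survival probability (an assumption of this sketch)."""
    return object_count * (1 - DURABILITY)

def max_downtime_minutes(days=365):
    """Downtime budget implied by the availability figure over `days` days."""
    return (1 - AVAILABILITY) * days * 24 * 60

losses = expected_annual_losses(10**9)   # one billion stored objects
downtime = max_downtime_minutes()
print(float(losses), float(downtime))
```

Under these assumptions, one billion objects imply about 0.001 expected lost objects per year, and 99.995% availability corresponds to a budget of about 26.28 minutes of unavailability per year.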

Building a Data Lake Storage Based on OSS


As the storage component of a data lake, OSS can fully meet the key requirements of a data lake.

Massive Data Storage

1) OSS adopts a distributed system architecture and flat namespace design, which supports unrestricted storage. In addition, performance and capacity of OSS can increase linearly with system expansion.
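In a flat namespace, every object is addressed by its full key, and directory-like views are derived at listing time from a prefix and a delimiter. The sketch below emulates that behavior over a plain set of keys. The function `list_objects` and the sample keys are hypothetical; this is not the OSS SDK, only an illustration of the concept.

```python
def list_objects(keys, prefix="", delimiter="/"):
    """Emulate a directory listing over a flat set of object keys.

    Returns (objects, common_prefixes): keys directly under `prefix`, and
    the distinct "subdirectory" prefixes found one level below it.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The key continues past a delimiter: report only the folder.
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)

# Hypothetical keys; there are no real directories, only strings.
keys = {
    "raw/2020/01/events.json",
    "raw/2020/02/events.json",
    "raw/readme.txt",
    "curated/sales.parquet",
}
objects, prefixes = list_objects(keys, prefix="raw/")
top_objects, top_prefixes = list_objects(keys)
```

Since no directory tree has to be maintained, adding keys never requires restructuring existing ones, which is one reason a flat design scales well.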

2) OSS supports elastic scaling. Capacity expands automatically as data grows, with no size limit on the storage space. You do not need to provision capacity in advance, and you pay only for what you actually use.

3) OSS supports high data availability.

  • Multi-AZ (Availability Zone) redundancy within a region and cross-region replication between regions prevent data loss or access failure caused by a single point of failure.
  • Periodic data verification guards against silent data corruption.
  • Strong consistency is supported for object operations. Data written to an object is readable immediately after the write request returns successfully.
  • Versioning is supported to prevent accidental data deletion.

Overall, OSS delivers data durability of 99.9999999999% and service availability of 99.995%.
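The idea behind periodic data verification can be sketched as a checksum "scrub": record a digest when data is written, then recompute it later and flag any mismatch. The class below is a hypothetical in-memory illustration. SHA-256 is chosen only for this sketch, since the article does not state which checksum OSS uses internally.

```python
import hashlib

class ChecksummedStore:
    """Hypothetical in-memory store that records a digest at write time."""

    def __init__(self):
        self._data = {}
        self._digests = {}

    def put(self, key, data):
        self._data[key] = data
        self._digests[key] = hashlib.sha256(data).hexdigest()

    def scrub(self):
        """Recompute every digest and return keys whose content changed."""
        return sorted(
            key for key, data in self._data.items()
            if hashlib.sha256(data).hexdigest() != self._digests[key]
        )

store = ChecksummedStore()
store.put("a.bin", b"hello")
store.put("b.bin", b"world")
clean = store.scrub()

# Simulate silent corruption by changing stored bytes behind the store's back.
store._data["b.bin"] = b"worle"
corrupted = store.scrub()
```

In a redundant system, a key flagged by the scrub would typically be repaired from a healthy replica. That repair step is omitted here.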

Efficient Data Computing

  • OSS provides RESTful APIs that you can access over the internet. You can store and access data anytime and anywhere, with no mapping or mounting operations required in advance.
  • OSS is compatible with the open-source Hadoop ecosystem and works seamlessly on different computing platforms of Alibaba Cloud. This enables computing platforms to share data without replication. In addition, OSS optimizes specific operations for some computing platforms to improve data processing performance.
  • OSS supports operator offloading (pushdown). It currently supports SELECT statements, which let you read only the required data from a single object and so improve data retrieval efficiency.
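The benefit of pushing a SELECT down to storage is that filtering and column projection happen next to the data, so only matching fields travel to the compute engine. The following local simulation illustrates the principle with CSV bytes. The function `select_rows` and the sample data are hypothetical and do not represent the OSS API.

```python
import csv
import io

def select_rows(raw_csv, columns, predicate):
    """Scan CSV bytes on the "storage side", returning only the requested
    columns of the rows that satisfy `predicate`."""
    reader = csv.DictReader(io.StringIO(raw_csv.decode("utf-8")))
    return [
        {name: row[name] for name in columns}
        for row in reader
        if predicate(row)
    ]

# Hypothetical object content.
raw = b"id,region,amount\n1,east,100\n2,west,250\n3,east,40\n"

# Roughly equivalent to: SELECT id, amount FROM object WHERE region = 'east'
result = select_rows(raw, ["id", "amount"], lambda r: r["region"] == "east")
```

Here the caller receives two narrow rows instead of the whole object. With large objects, that difference is what reduces transfer and parsing cost.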

Secure Data Management

  • OSS supports data lifecycle management. You can configure lifecycle rules to automatically delete data that meets the rule conditions or transition it to lower-cost storage classes.
  • OSS supports both client-side and server-side encryption. You can choose the encryption solution that fits your situation to prevent data leaks.
  • OSS supports a Write Once Read Many (WORM) retention policy that prevents data from being deleted or overwritten. This capability complies with the regulations of the U.S. Securities and Exchange Commission (SEC) and the Financial Industry Regulatory Authority, Inc. (FINRA), and OSS has obtained the corresponding compliance certification.
  • OSS supports access security policies for multiple types of data. It also grants long-term or temporary access permissions on buckets, objects, and roles. This ensures secure data sharing with minimal permissions.
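As an illustration of the lifecycle rules mentioned in the list above, the sketch below evaluates a single rule against an object's key and age. The `LifecycleRule` fields, the `evaluate` function, and the thresholds are hypothetical and simplified. Real lifecycle configurations have more options than shown here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class LifecycleRule:
    prefix: str                      # applies to keys starting with this
    transition_days: Optional[int]   # move to cheaper storage after N days
    expiration_days: Optional[int]   # delete after N days

def evaluate(rule, key, last_modified, today):
    """Return 'expire', 'transition', or 'keep' for one object."""
    if not key.startswith(rule.prefix):
        return "keep"
    age = (today - last_modified).days
    # Expiration takes precedence over transition in this sketch.
    if rule.expiration_days is not None and age >= rule.expiration_days:
        return "expire"
    if rule.transition_days is not None and age >= rule.transition_days:
        return "transition"
    return "keep"

# Hypothetical rule: archive logs after 30 days, delete them after a year.
rule = LifecycleRule(prefix="logs/", transition_days=30, expiration_days=365)
today = date(2021, 1, 1)

recent = evaluate(rule, "logs/a.log", date(2020, 12, 25), today)
aging = evaluate(rule, "logs/b.log", date(2020, 11, 1), today)
old = evaluate(rule, "logs/c.log", date(2019, 12, 1), today)
other = evaluate(rule, "data/d.csv", date(2019, 12, 1), today)
```

Scoping the rule by prefix keeps retention decisions local to one dataset, so unrelated data under other prefixes is never touched.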


In conclusion, OSS is a suitable foundation for enterprises building large, efficient, and secure data lakes, especially in scenarios that require analyzing massive amounts of data.

Alibaba EMR