A data lake is a centralized repository that can store structured, semi-structured, and unstructured data at any scale in its original form. By using various analytics engines, such as big data processing frameworks, real-time analytics tools, and machine learning platforms, you can easily extract hidden value from your data stored in a data lake.
Architecture diagram
This architecture diagram illustrates a robust, end-to-end data management and analytics platform that encompasses the entire data lifecycle, from ingestion to analytics. The platform has the following benefits:
It supports ingestion of data in various formats, including but not limited to Parquet, CSV, JSON, multimedia files, database data, and application data.
This platform supports data ingestion from a wide range of environments, including public clouds, Apsara Stack, hybrid cloud deployments, and edge devices, ensuring seamless integration across multi-cloud and distributed architectures.
In this architecture, data for big data and AI-driven applications is organized into bucket groups, which are managed using Quality of Service (QoS) settings within a resource pool. This enables dynamic bandwidth allocation for each bucket group, ensuring efficient data access and optimized data management.
This system provides a comprehensive set of tools for data access and processing, including software development kits (SDKs), a POSIX-compliant file system interface, and an HDFS-compatible layer, enabling multi-protocol data access and processing for diverse application scenarios.
The integration of data analytics and AI-powered capabilities enables end-to-end workflows for advanced analytics, machine learning (ML) model training, and real-time stream processing. By combining these tools with intuitive data visualizations, users can derive actionable insights, optimize decision-making, and unlock the full value of their data assets.
Advantages of building data lakes on OSS
Alibaba Cloud Object Storage Service (OSS) provides nearly unlimited, cost-effective, elastic storage, making it the best data storage service for building data lake solutions on Alibaba Cloud. OSS features powerful data management capabilities that can efficiently process and organize massive amounts of data. Comprehensive OSS client implementations, including SDKs and multi-protocol support, facilitate seamless integration with compute engines.
Building a data lake on OSS offers the following core advantages:
Cost-effective storage. OSS employs a pay-as-you-go billing model and offers storage tiering that automatically transitions data between storage classes (Standard, Infrequent Access, Archive, Cold Archive, and Deep Cold Archive) based on lifecycle rules, reducing data storage costs.
Elastic scaling. OSS enables EB-level data storage without the need for capacity provisioning, allowing you to effortlessly adapt to data growth.
Ecosystem integration. OSS supports seamless integration with Alibaba Cloud computing services, such as MaxCompute, E-MapReduce (EMR), and Platform for AI (PAI), as well as open-source analytics frameworks, including Hadoop, Spark, Ray, and PyTorch.
Security compliance. OSS offers a comprehensive set of data protection features, including server-side encryption and access control, to effectively address your data security needs.
High availability and disaster recovery. OSS provides cross-zone redundant storage and cross-region replication to ensure high data reliability.
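As a sketch of the storage tiering described above, a lifecycle rule can transition objects under a prefix to colder storage classes over time. The following XML is a minimal example of the configuration used by the PutBucketLifecycle operation; the `logs/` prefix and the day thresholds are illustrative values, not defaults:

```xml
<LifecycleConfiguration>
  <Rule>
    <ID>tier-cold-logs</ID>
    <Prefix>logs/</Prefix>
    <Status>Enabled</Status>
    <!-- Transition objects to Infrequent Access 30 days after creation -->
    <Transition>
      <Days>30</Days>
      <StorageClass>IA</StorageClass>
    </Transition>
    <!-- Transition objects to Archive 180 days after creation -->
    <Transition>
      <Days>180</Days>
      <StorageClass>Archive</StorageClass>
    </Transition>
  </Rule>
</LifecycleConfiguration>
```

Once applied to a bucket, this rule runs automatically; no application changes are needed for the cost transitions to take effect.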
Capabilities to consider when building data lakes
When organizations plan to build data lakes and analytics platforms, they must consider several key capabilities:
Data collection and importing
Data lakes allow you to import any amount of real-time data. Data can be collected from multiple sources and stored in a data lake in its original format. This process allows you to scale to any amount of data while saving time in defining data structures, schemas, and transformations. OSS allows you to import data by using the following methods:
Upload data directly to OSS over the internal network
Upload data from your data centers to OSS by using Express Connect
Migrate PB-level data to OSS by using Data Online Migration or Data Transport
Upload data directly to OSS over the public network. Because data transferred over the public network is exposed to potential security threats, we strongly recommend that you configure domain name management and access control before using this method.
Cost-effective, secure data storage
Data lakes allow you to store massive amounts of unstructured data from various sources such as mobile apps, Internet of Things (IoT) devices, social media, and Internet of vehicles (IoV). There must be automated cost optimization to reduce data storage costs, and robust security features to protect data assets. OSS provides the following capabilities that can securely store data at an optimal cost:
Five storage classes for storing both hot and cold data
Automatic storage tiering for cold data using lifecycle rules
Versioning that protects data from accidental deletions or overwrites
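For example, versioning, which protects data from accidental deletions and overwrites as noted above, is enabled by passing a small XML configuration to the PutBucketVersioning operation:

```xml
<VersioningConfiguration>
  <Status>Enabled</Status>
</VersioningConfiguration>
```

After versioning is enabled, overwriting or deleting an object creates a new version or a delete marker instead of destroying the previous data, so earlier versions remain recoverable.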
Efficient data management
As a use case for a data lake, different departments within an organization may store data in separate prefixes within a single bucket, while some departments may require their own dedicated buckets for data storage. This scenario requires that data in the same bucket can be managed separately, and data can flow between different buckets. OSS provides multiple capabilities to address such data management and movement requirements:
Access points that can be used to configure separate access permissions for individual departments
Bucket inventory that provides an overview of bucket storage usage by different departments
Data replication that synchronizes data between buckets in the same region or across regions
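The prefix-per-department layout described above can be illustrated with a short sketch. The object keys and department names below are hypothetical; the grouping logic mirrors how prefix-scoped tools such as access points and bucket inventory partition a shared bucket:

```python
from collections import defaultdict

def group_keys_by_department(keys):
    """Group object keys by their top-level prefix (the 'department' folder)."""
    groups = defaultdict(list)
    for key in keys:
        prefix, _, _ = key.partition("/")
        groups[prefix].append(key)
    return dict(groups)

# Hypothetical object keys in a shared data lake bucket.
keys = [
    "finance/reports/2024-q1.parquet",
    "finance/reports/2024-q2.parquet",
    "marketing/events/clicks.json",
]

groups = group_keys_by_department(keys)
print(sorted(groups))  # one entry per department prefix
```

In practice, each department's access point or RAM policy would be scoped to its own prefix, so the departments share one bucket while remaining administratively separate.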
Cross-application performance management and optimization
In data lake operations, the parallel execution of data ingestion, preprocessing, AI model training, and debugging tasks can lead to inefficient resource allocation and contention between storage buckets and Resource Access Management (RAM) users. These issues can be addressed by using the resource pool QoS feature of OSS, which dynamically enforces throttling policies for specific buckets and RAM users. This ensures that high-priority services and compute-intensive workloads receive guaranteed resource access during peak workloads, maintaining operational stability without compromising system performance or introducing latency for critical business processes.
Data lakes are designed to support low-latency, high-query-per-second (QPS) performance for retrieval engines, seamless application data access, and efficient data retrieval for AI inference models. To meet these demands, OSS introduces OSS accelerators, which reduce data read latency and increase query throughput by caching frequently accessed (hot) data on NVMe SSDs, significantly optimizing the performance of real-time computing workloads.
Data analytics and AI framework integration
Various analytic tools and computing frameworks run on top of data lakes to access and process stored data. In end-to-end data workflows, organizations often employ multiple computing frameworks, each with unique data access interfaces and protocols. To streamline integration with these heterogeneous ecosystems and reduce operational complexity and integration costs, OSS offers a comprehensive suite of clients, tools, and APIs:
OSS offers SDKs for major programming languages, enabling developers to access and process stored data programmatically. For users with programming expertise, leveraging OSS SDKs is recommended to achieve optimized performance and streamlined data access. For Python developers specifically, adopting multi-threading techniques can further enhance bandwidth utilization and throughput, ensuring efficient high-performance workflows. For detailed guidance on Python best practices, refer to Python multi-threading for increased bandwidth.
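As a sketch of the multi-threading technique mentioned above: a large object can be split into byte ranges that are fetched in parallel and reassembled in order. The `fetch_range` function below is a stand-in for an SDK ranged GET (a request carrying a `Range: bytes=start-end` header); the part size is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_range(source: bytes, start: int, end: int) -> bytes:
    """Stand-in for a ranged GET request; a real client would send
    a 'Range: bytes=start-end' header to OSS."""
    return source[start:end + 1]

def parallel_download(source: bytes, part_size: int = 4, workers: int = 4) -> bytes:
    """Fetch fixed-size byte ranges concurrently and reassemble them
    in their original order."""
    size = len(source)
    ranges = [(start, min(start + part_size, size) - 1)
              for start in range(0, size, part_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: fetch_range(source, *r), ranges)
    return b"".join(parts)

data = b"example object payload stored in a data lake"
assert parallel_download(data) == data
```

Because each range is an independent request, concurrent fetches can saturate more of the available bandwidth than a single sequential read; the OSS SDKs expose the same idea through resumable and multipart transfer helpers.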
If you are familiar with components of the Hadoop ecosystem, we strongly recommend that you use OSS Connector for Hadoop to read and write data in OSS. This approach enables you to efficiently utilize the unlimited storage capacity and various features of OSS.
If your organization heavily depends on open-source Hadoop Distributed File System (HDFS) and cannot migrate your applications quickly, we recommend using OSS-HDFS, a service that provides HDFS-compatible interfaces while delivering superior performance, scalability, and cloud-native advantages over traditional HDFS. OSS-HDFS is seamlessly integrated with Alibaba Cloud E-MapReduce (EMR) and open-source ecosystems like Hadoop and Spark. This service features strong compatibility with HDFS, allowing you to smoothly migrate HDFS-based applications from your data centers to the cloud without requiring code modifications to existing applications. However, be aware that some advanced data management capabilities native to OSS may not be available due to differences in functional definitions between open-source HDFS and OSS. For a list of features supported by OSS-HDFS, see Supported features. To fully leverage the high performance and data management capabilities of OSS in cloud-native scenarios, we recommend gradually adapting to and optimizing your applications by using OSS Connector after your migration to the cloud.
If your application includes workloads that require POSIX-compliant file system access, the ossfs tool provides a POSIX-compatible interface to mount OSS buckets as local file systems:
For modern applications like AI training, AI inference, and autonomous driving simulation, which typically have less strict POSIX semantics requirements, ossfs 2.0 is recommended for optimal performance. If the access patterns of your application are unclear, we recommend testing with ossfs 2.0 first to evaluate performance gains. If compatibility issues arise, you can revert to ossfs 1.0.
For traditional applications, ossfs 1.0 enables read and write operations on data stored in OSS. However, due to the inherent differences between OSS and NAS systems, and the need for certain applications to meet stricter POSIX compliance and performance requirements, we advise against using ossfs 1.0 as a direct replacement for NAS. In such cases, to ensure optimal compatibility and performance, we recommend using File Storage NAS.
If you’re experienced with PyTorch for dataset loading in AI training but less familiar with OSS SDKs, we recommend OSS Connector for AI/ML, which offers optimal performance for reading datasets from OSS without requiring SDK expertise.
OSS provides the following tools for administrators and developers to upload and download data: