OSS-HDFS (JindoFS) is a cloud-native data lake storage feature that is fully compatible with the Hadoop Distributed File System (HDFS) interface. It provides unified metadata management to support data lake computing scenarios for big data and AI.
Usage notes
After OSS-HDFS is enabled for a bucket, data written by using OSS-HDFS is stored in the .dlsdata/ directory. To keep OSS-HDFS available and prevent data loss, do not perform write operations on the .dlsdata/ directory or the objects in it by using methods that OSS-HDFS does not support. For example, do not rename the directory, delete the directory, or delete objects in the directory.
Performing such write operations on the .dlsdata/ directory by using other Object Storage Service (OSS) features can cause data loss, data contamination, or data access failures.
Billing rules
Metadata management fees
When you use OSS-HDFS, metadata management for objects is a billable item. However, you are not currently charged for this billable item.
Data storage fees
When you use OSS-HDFS, data blocks are stored in Object Storage Service (OSS). Therefore, the billing methods of OSS apply to data blocks in OSS-HDFS. For more information, see Billing overview.
Benefits
You can use OSS-HDFS without modifying your existing Hadoop and Spark big data analytics applications. After a simple configuration process, you can manage and access data in the same way as you would with native HDFS. You can also benefit from OSS features, including unlimited capacity, elastic scaling, and high security, reliability, and availability.
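For example, the following is a minimal sketch of this access pattern, assuming JindoSDK is deployed and configured on the cluster; the bucket name is illustrative, and the exact oss:// path style depends on your configuration. The same Hadoop FileSystem API that works with hdfs:// paths works here unchanged.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OssHdfsQuickStart {
    public static void main(String[] args) throws Exception {
        // Illustrative bucket; assumes JindoSDK is on the classpath and the
        // oss:// scheme is configured in core-site.xml.
        Configuration conf = new Configuration();
        Path dir = new Path("oss://example-bucket/user/demo");

        FileSystem fs = dir.getFileSystem(conf);
        fs.mkdirs(dir);

        // Write and list files exactly as you would against native HDFS.
        try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
            out.writeBytes("hello OSS-HDFS\n");
        }
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + " " + status.getLen());
        }
    }
}
```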
As the foundation for cloud-native data lakes, OSS-HDFS can manage exabytes of data and hundreds of millions of files and deliver terabyte-level throughput. It fully integrates with the big data storage ecosystem. In addition to the flat namespace of object storage, OSS-HDFS provides a hierarchical namespace service that lets you organize objects into a directory hierarchy; unified metadata management automatically converts between the two namespaces internally. Compared with the active-standby redundancy of NameNodes in traditional HDFS, OSS-HDFS uses a multi-node, active-active redundancy mechanism for metadata management, which provides higher availability and stronger redundancy. For Hadoop users, this means you can access data as efficiently as you would access a local HDFS, without data replication or format conversion. This greatly improves overall job performance and reduces maintenance costs.
Features
| Feature | Description | References |
| --- | --- | --- |
| RootPolicy | You can use RootPolicy to set a custom access prefix for OSS-HDFS, which allows jobs to run directly on OSS-HDFS without modifying the original access prefix. | |
| ProxyUser | The ProxyUser feature authorizes a user to perform file system operations on behalf of other users. For example, for sensitive data, only specific authorized users can access and operate on the data on behalf of other users. (See the sketch after this table.) | |
| UserGroupsMapping | UserGroupsMapping configures mappings between users and user groups. | |
Scenarios
OSS-HDFS provides comprehensive support for big data and AI ecosystems. Its main application scenarios are as follows:
Offline data warehousing with Hive and Spark
OSS-HDFS natively supports file and directory semantics and operations. It lets you set file and directory permissions. It also supports directory atomicity, millisecond-level rename operations, the setTimes operation, extended attributes (XAttrs), access control lists (ACLs), and local read cache acceleration. These features make it suitable for open source Hive and Spark offline data warehouses. In extract, transform, and load (ETL) scenarios, OSS-HDFS offers significant performance advantages over Standard OSS buckets.
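The following is a minimal sketch of these semantics through the standard Hadoop FileSystem API, assuming an OSS-HDFS-enabled bucket; the bucket, paths, and user name are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class DirectorySemanticsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path staging = new Path("oss://example-bucket/warehouse/.staging");
        Path committed = new Path("oss://example-bucket/warehouse/dt=2024-01-01");
        FileSystem fs = staging.getFileSystem(conf);

        fs.mkdirs(staging, new FsPermission((short) 0755));

        // Atomic, millisecond-level rename: the commit pattern ETL jobs rely on.
        fs.rename(staging, committed);

        // setTimes, extended attributes, and ACLs work as they do on HDFS.
        fs.setTimes(committed, System.currentTimeMillis(), -1);
        fs.setXAttr(committed, "user.etl-job",
                "nightly".getBytes(StandardCharsets.UTF_8));
        fs.modifyAclEntries(committed, Collections.singletonList(
                new AclEntry.Builder()
                        .setScope(AclEntryScope.ACCESS)
                        .setType(AclEntryType.USER)
                        .setName("analyst")
                        .setPermission(FsAction.READ_EXECUTE)
                        .build()));
    }
}
```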
OLAP
OSS-HDFS provides basic file operations, such as append, truncate, flush, sync, and pwrite. It fully supports POSIX through JindoFuse. In online analytical processing (OLAP) scenarios such as ClickHouse, you can replace local disks with OSS-HDFS to implement a storage-compute separation solution. The caching system provides acceleration for improved cost-effectiveness.
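The following is a minimal sketch of the append, truncate, flush, and sync semantics, under the same assumptions as above (illustrative bucket and path).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendTruncateSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path log = new Path("oss://example-bucket/olap/events.log");
        FileSystem fs = log.getFileSystem(conf);

        // flush/sync: make buffered data visible and durable mid-stream.
        try (FSDataOutputStream out = fs.create(log)) {
            out.writeBytes("batch-1\n");
            out.hflush();
            out.hsync();
        }

        // append: continue writing to an existing file.
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("batch-2\n");
        }

        // truncate: cut the file back to a given length.
        fs.truncate(log, "batch-1\n".length());
    }
}
```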
Decoupling of storage from computing for HBase
OSS-HDFS natively supports file and directory semantics and operations, including the flush operation. You can use it to replace HDFS in a storage-compute separation solution for HBase. Compared with a solution that combines HBase with Standard OSS buckets, a solution that combines HBase with OSS-HDFS can store write-ahead logs (WALs) by using the HDFS API, which greatly simplifies the overall architecture. For more information, see Use OSS-HDFS as the underlying storage for HBase.
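The following is a minimal sketch of the configuration shape only, assuming JindoSDK is deployed on every HBase node; in a real deployment these keys belong in hbase-site.xml, and the oss:// paths are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HBaseOnOssHdfsSketch {
    public static void main(String[] args) {
        // Standard HBase keys; the oss:// paths are illustrative.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "oss://example-bucket/hbase");
        // Because OSS-HDFS supports flush, WALs can live on it as well.
        conf.set("hbase.wal.dir", "oss://example-bucket/hbase-wal");
        System.out.println("hbase.rootdir = " + conf.get("hbase.rootdir"));
    }
}
```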
Real-time computing
OSS-HDFS efficiently supports flush and truncate operations. It can seamlessly replace HDFS as a storage solution for sinks and checkpoints in Flink real-time computing scenarios.
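The following is a minimal sketch for Flink, assuming that JindoSDK (or another OSS filesystem plugin) is available to the Flink runtime; the checkpoint path is illustrative.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds, storing state on OSS-HDFS instead of HDFS.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig()
                .setCheckpointStorage("oss://example-bucket/flink/checkpoints");

        // A trivial pipeline so that the job graph is non-empty.
        env.fromSequence(0, 9).print();
        env.execute("checkpoint-on-oss-hdfs");
    }
}
```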
Data migration
As a next-generation cloud-native data lake storage solution, OSS-HDFS supports the lift-and-shift migration of on-premises HDFS to the cloud. It optimizes the HDFS user experience and provides the cost benefits of elastic scaling and pay-as-you-go billing, which significantly reduces storage costs. The JindoDistCp tool supports the seamless migration of HDFS file data and metadata, such as file properties, to OSS-HDFS. It also provides a fast comparison feature based on HDFS checksums.
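JindoDistCp is a packaged tool, so the following sketch only illustrates the general copy pattern, using Hadoop's built-in DistCp Java API as a stand-in; JindoDistCp adds metadata migration and checksum-based fast comparison on top of this pattern. The paths are illustrative.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class MigrationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Copy an on-premises HDFS directory to OSS-HDFS; paths are illustrative.
        DistCpOptions options = new DistCpOptions.Builder(
                Collections.singletonList(new Path("hdfs://namenode:8020/data")),
                new Path("oss://example-bucket/data"))
                .withSyncFolder(true) // incremental-update semantics
                .build();

        // Runs the copy as a MapReduce job and blocks until it completes.
        new DistCp(conf, options).execute();
    }
}
```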
Supported engines
| Ecosystem Type | Engine/Platform | References |
| --- | --- | --- |
| Open source ecosystem | Flink | Use open source Flink with JindoSDK to process data in OSS-HDFS |
| | Flume | |
| | Hadoop | |
| | HBase | |
| | Hive | |
| | Impala | |
| | Presto | |
| | Spark | |
| Alibaba Cloud ecosystem | EMR | |
| | Flink | |
| | Flume | Use Flume to synchronize data from an EMR Kafka cluster to OSS-HDFS |
| | HBase | Use OSS-HDFS as the underlying storage of HBase on an EMR cluster |
| | Hive | |
| | Impala | |
| | Presto | |
| | Spark | |
| | Sqoop | Use Sqoop on an EMR cluster to read data from and write data to OSS-HDFS |