OSS-HDFS (JindoFS) is a cloud-native data lake storage feature with unified metadata management and full Hadoop Distributed File System (HDFS) API compatibility for big data and AI workloads.
Usage notes
- After you enable the OSS-HDFS service for a bucket, the service's data is stored in the bucket's .dlsdata/ directory. Do not perform write operations, such as renaming or deleting, on this directory and its objects using non-OSS-HDFS methods. This can cause service disruptions or data loss.
- If your account has an overdue payment or if the dependent RAM role AliyunOSSDlsDefaultRole is deleted, the HDFS background service may enter safe mode. In safe mode, all background tasks, such as audit logging, asynchronous deletion, and automatic storage tiering, are paused. The service automatically resumes after the issue is resolved.
After you enable OSS-HDFS, writing to the .dlsdata/ directory through other OSS features can cause data loss, corruption, or data inaccessibility, as described in Prerequisites.
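One way to reduce the risk of accidental writes is to check object keys against the reserved prefix before issuing any rename or delete through plain OSS calls. The following is a minimal sketch of such a guard. The helper names `is_reserved_key` and `filter_writable` are hypothetical and are not part of any OSS SDK, and the sketch assumes the reserved directory sits at the root of the bucket.

```python
# Hypothetical guard; not part of any OSS SDK.
RESERVED_PREFIX = ".dlsdata/"  # directory used by OSS-HDFS inside the bucket


def is_reserved_key(key: str) -> bool:
    """Return True if the key is the reserved directory or lies under it."""
    normalized = key.lstrip("/")  # tolerate a leading slash in caller input
    return (
        normalized == RESERVED_PREFIX.rstrip("/")
        or normalized.startswith(RESERVED_PREFIX)
    )


def filter_writable(keys):
    """Keep only the keys that are outside the reserved directory."""
    return [k for k in keys if not is_reserved_key(k)]
```

A batch job could pass its candidate keys through `filter_writable` before deleting or renaming objects with non-OSS-HDFS tools.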
Billing
- Data usage fees: OSS-HDFS stores data blocks in OSS. Therefore, standard OSS billing applies to data blocks in OSS-HDFS. For more information, see Billing overview.
Benefits
OSS-HDFS works with existing Hadoop and Spark applications without modification. After basic configuration, you manage data as with native HDFS, with the added benefits of OSS: virtually unlimited capacity, elastic scalability, and enhanced security, reliability, and availability.
OSS-HDFS handles exabytes of data and billions of files at terabyte-level throughput. Beyond the flat namespace of standard object storage, it provides a hierarchical namespace that organizes objects into directories, with automatic namespace conversion through unified metadata management. Instead of active-standby NameNode redundancy in traditional HDFS, OSS-HDFS uses multi-node active-active redundancy for superior resiliency. Hadoop users access data as efficiently as on local HDFS without replication or conversion, improving job performance and reducing maintenance costs.
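The difference between a flat and a hierarchical namespace is easiest to see in a directory rename. The toy model below is only an illustration and does not describe how OSS-HDFS implements its metadata service. It assumes a flat namespace stored as a dictionary of full object keys and a hierarchical namespace stored as nested dictionaries. The function names are hypothetical.

```python
def rename_flat(objects: dict, old_prefix: str, new_prefix: str) -> int:
    """Flat namespace: every key under the prefix is rewritten one by one."""
    moved = 0
    for key in [k for k in objects if k.startswith(old_prefix)]:
        objects[new_prefix + key[len(old_prefix):]] = objects.pop(key)
        moved += 1
    return moved  # number of per-object operations performed


def rename_tree(parent: dict, old_name: str, new_name: str) -> int:
    """Hierarchical namespace: a single directory entry is relinked."""
    parent[new_name] = parent.pop(old_name)
    return 1  # one metadata operation, regardless of how many files are inside


flat = {"logs/a": 1, "logs/b": 2, "tmp/c": 3}
tree = {"logs": {"a": 1, "b": 2}, "tmp": {"c": 3}}

flat_ops = rename_flat(flat, "logs/", "archive/")
tree_ops = rename_tree(tree, "logs", "archive")
```

In this model the flat rename touches one entry per object, while the tree rename touches one entry in total, which is why directory renames in a hierarchical namespace can be both fast and atomic.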
Features
| Feature | Description | References |
| --- | --- | --- |
| RootPolicy | Configure a custom prefix for OSS-HDFS so jobs run without changing their original | |
| ProxyUser | Authorize a user to perform file system operations on behalf of other users, such as accessing sensitive data. | |
| UserGroupsMapping | Configure mappings between users and user groups. | |
Use cases
OSS-HDFS supports big data and AI use cases:
Hive and Spark
OSS-HDFS suits offline data warehouses built with Hive and Spark. It natively supports file and directory semantics, permissions, atomic directory operations, millisecond-level renames, setTimes, extended attributes (XAttrs), ACLs, and local read cache acceleration. In ETL workloads, OSS-HDFS significantly outperforms standard OSS buckets.
OLAP
OSS-HDFS supports append, truncate, flush, sync, and pwrite with full POSIX support through JindoFuse. This lets you replace local disks in OLAP scenarios (such as ClickHouse) to decouple storage and compute. Built-in caching accelerates performance.
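The operations named above map onto ordinary POSIX file calls. The sketch below exercises them with Python's `os` module against a local temporary directory so that it runs anywhere on a Unix-like system. Pointing `directory` at a JindoFuse mount point is the intended use, but the mount path and the exact behavior of each call on such a mount are assumptions that this example does not verify. The function name `exercise_posix_ops` is hypothetical.

```python
import os
import tempfile


def exercise_posix_ops(directory: str) -> bytes:
    """Run write, pwrite, truncate, append, and sync on one file; return its content."""
    path = os.path.join(directory, "demo.bin")
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"hello world")   # sequential write
        os.pwrite(fd, b"HELLO", 0)     # positional write at offset 0
        os.ftruncate(fd, 5)            # truncate the file to 5 bytes
        os.lseek(fd, 0, os.SEEK_END)   # move to the end of the file
        os.write(fd, b"!")             # append one byte
        os.fsync(fd)                   # flush file data to the underlying storage
        return os.pread(fd, 16, 0)     # read back from offset 0
    finally:
        os.close(fd)


# A local temporary directory stands in for a mount point here.
with tempfile.TemporaryDirectory() as tmp:
    result = exercise_posix_ops(tmp)
```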
HBase decoupling
OSS-HDFS natively supports file and directory semantics and flush operations, enabling it to replace HDFS in a decoupled storage-compute architecture for HBase. Unlike a deployment on standard OSS, this setup stores the Write-Ahead Log (WAL) directly in OSS-HDFS, which simplifies the architecture. For more information, see Use OSS-HDFS as the underlying storage for HBase.
Real-time computing
OSS-HDFS supports flush and truncate and seamlessly replaces HDFS for sinks and checkpoints in Flink real-time computing.
Data migration
OSS-HDFS enables smooth migration of HDFS data from on-premises to cloud, reducing storage costs through elastic scaling and pay-as-you-go pricing. JindoDistCp migrates HDFS data (including file attributes and metadata) to OSS-HDFS and provides fast data comparison using HDFS checksums.
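Checksum-based comparison after a migration can be pictured as a diff between two listings. The sketch below is a simplified illustration and is not JindoDistCp's implementation. It assumes that the checksums have already been collected into two dictionaries that map a relative path to a checksum string. The function name `compare_listings` is hypothetical.

```python
def compare_listings(source: dict, target: dict) -> dict:
    """Compare two {relative_path: checksum} listings after a migration."""
    # Files present at the source but absent at the target.
    missing = sorted(p for p in source if p not in target)
    # Files present on both sides whose checksums differ.
    mismatched = sorted(
        p for p in source if p in target and source[p] != target[p]
    )
    # Files present only at the target.
    extra = sorted(p for p in target if p not in source)
    return {"missing": missing, "mismatched": mismatched, "extra": extra}


report = compare_listings(
    {"a.parquet": "c1", "b.parquet": "c2", "c.parquet": "c3"},
    {"a.parquet": "c1", "b.parquet": "XX", "d.parquet": "c4"},
)
```

An empty `missing` and `mismatched` list indicates that every source file arrived with a matching checksum.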
Supported engines
| Ecosystem | Engine/Platform | References |
| --- | --- | --- |
| Open source ecosystem | Flink | Use open source Flink with JindoSDK to process data in OSS-HDFS |
| | Flume | |
| | Hadoop | |
| | HBase | |
| | Hive | |
| | Impala | |
| | Presto | |
| | Spark | |
| Alibaba Cloud ecosystem | EMR | |
| | Flink | |
| | Flume | Use Flume to synchronize data from an EMR Kafka cluster to OSS-HDFS |
| | HBase | Use OSS-HDFS as the underlying storage for HBase on an EMR cluster |
| | Hive | |
| | Impala | |
| | Presto | |
| | Spark | |
| | Sqoop | Use Sqoop on an EMR cluster to read and write data in OSS-HDFS |