What is OSS-HDFS?

Last Updated: Jan 16, 2024

OSS-HDFS (JindoFS) is a cloud-native data lake storage feature. OSS-HDFS provides centralized metadata management capabilities and is fully compatible with the Hadoop Distributed File System (HDFS) API. You can use OSS-HDFS to manage data in data lake-based computing scenarios in the big data and AI fields.

Usage notes

Warning

After OSS-HDFS is enabled for a bucket, data that is written by using OSS-HDFS is stored in the .dlsdata/ directory of the bucket. To ensure the availability of OSS-HDFS and prevent data loss, do not write to the .dlsdata/ directory or to the objects in it by using methods that OSS-HDFS does not support. For example, do not rename the directory, delete the directory, or delete objects in the directory.

After you enable OSS-HDFS, writing data to the .dlsdata/ directory by using other Object Storage Service (OSS) features may cause data loss, data contamination, or data inaccessibility. For more information, see Usage notes.

Billing rules

  • Metadata management fees

    When you use OSS-HDFS, object metadata management is a billable item. However, this billable item is currently free of charge.

  • Data storage fees

    When you use OSS-HDFS, data blocks are stored in Object Storage Service (OSS). Therefore, the billing methods of OSS apply to the data blocks in OSS-HDFS. For more information, see Billing overview.

Benefits

You can use OSS-HDFS without modifying your existing Hadoop and Spark applications. After simple configuration, you can access and manage data in a manner similar to how you manage data in HDFS, while taking advantage of OSS characteristics such as unlimited storage space, elastic scalability, and high security, reliability, and availability.

OSS-HDFS serves as the storage foundation of cloud-native data lakes. You can use OSS-HDFS to analyze exabytes of data, manage hundreds of millions of objects, and obtain terabytes of throughput. OSS-HDFS provides both a flat namespace and a hierarchical namespace to meet your big data storage requirements. The hierarchical namespace lets you manage objects in a hierarchical directory structure, and OSS-HDFS automatically converts the storage structure between the two namespaces so that object metadata is managed in a centralized manner. Hadoop users can access objects in OSS-HDFS without copying the objects or converting their format, which improves job performance and reduces maintenance costs.
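
For example, because OSS-HDFS is compatible with the HDFS API, existing Hadoop FileSystem code can target an OSS-HDFS bucket by changing only the URI. The following minimal sketch assumes that JindoSDK is installed and on the classpath; the bucket name, region endpoint, and the fs.oss.accessKeyId and fs.oss.accessKeySecret configuration keys are placeholders based on common JindoSDK setups, not values taken from this page.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListOssHdfsDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; in practice these are usually set in
        // core-site.xml after JindoSDK is installed.
        conf.set("fs.oss.accessKeyId", "<yourAccessKeyId>");
        conf.set("fs.oss.accessKeySecret", "<yourAccessKeySecret>");

        // OSS-HDFS paths look like ordinary Hadoop paths, so existing
        // HDFS-style code works unchanged apart from the URI. The bucket
        // and endpoint below are hypothetical.
        URI uri = URI.create("oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/");
        try (FileSystem fs = FileSystem.get(uri, conf)) {
            fs.mkdirs(new Path("/warehouse/demo"));
            for (FileStatus status : fs.listStatus(new Path("/warehouse"))) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }
}
```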

Features

  • Snapshot (trial): You can use snapshots created by using the Snapshot command to restore data that is accidentally deleted or to back up data to ensure service continuity when an error occurs. The snapshot feature of OSS-HDFS works in the same manner as the snapshot feature of HDFS and supports directory-level operations; see the sketch after this list. For more information, see Snapshot.

  • RootPolicy: You can use RootPolicy to configure a custom prefix for OSS-HDFS. This way, jobs can run on OSS-HDFS without modifying the original hdfs:// access prefix. For more information, see Access OSS-HDFS by using RootPolicy.

  • ProxyUser: You can use the ProxyUser command to authorize a user to perform operations, such as accessing sensitive data, on behalf of other users. For more information, see ProxyUser.

  • UserGroupsMapping: You can use the UserGroupsMapping command to manage mappings between users and user groups. For more information, see UserGroupsMapping.
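
Because the snapshot feature is described above as working in the same manner as HDFS snapshots, a minimal sketch can rely on the standard snapshot calls of org.apache.hadoop.fs.FileSystem. The directory path and snapshot name below are hypothetical, and the sketch assumes that the OSS-HDFS FileSystem implementation accepts createSnapshot and deleteSnapshot just as HDFS does.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path dir = new Path("/warehouse/demo"); // hypothetical directory

            // Create a named snapshot of the directory. The returned path
            // points at the read-only snapshot copy.
            Path snapshot = fs.createSnapshot(dir, "backup-2024-01");
            System.out.println("Snapshot created at " + snapshot);

            // Delete the snapshot once it is no longer needed.
            fs.deleteSnapshot(dir, "backup-2024-01");
        }
    }
}
```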

Scenarios

OSS-HDFS is suitable for computing scenarios in the big data and AI fields. You can use OSS-HDFS in the following scenarios:

Offline data warehousing with Hive and Spark

OSS-HDFS supports operations on files and directories and allows you to manage permissions on them. OSS-HDFS also supports atomic directory operations and millisecond-level rename operations, as well as features such as time configuration by using setTimes, extended attributes (XAttrs), ACLs, and local cache-based access acceleration. This makes OSS-HDFS suitable for offline data warehousing with Hive and Spark. When you run extract, transform, and load (ETL) jobs, OSS-HDFS provides better performance than OSS Standard buckets.
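
As an illustration of the setTimes and XAttr operations mentioned above, the following sketch uses the standard Hadoop FileSystem calls; the file path and attribute name are hypothetical, and the sketch assumes OSS-HDFS handles these calls as HDFS does.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataSketch {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/warehouse/demo/part-00000"); // hypothetical

            // setTimes takes modification and access times in milliseconds;
            // passing -1 leaves a field unchanged.
            fs.setTimes(file, System.currentTimeMillis(), -1);

            // Extended attributes (XAttrs) attach small key-value metadata
            // to a file; "user." is a standard XAttr namespace prefix.
            fs.setXAttr(file, "user.etl-stage",
                    "cleansed".getBytes(StandardCharsets.UTF_8));
            byte[] value = fs.getXAttr(file, "user.etl-stage");
            System.out.println(new String(value, StandardCharsets.UTF_8));
        }
    }
}
```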

OLAP

OSS-HDFS supports basic file operations, such as append, truncate, flush, and pwrite, and supports POSIX semantics by using JindoFuse. This way, when you use ClickHouse for online analytical processing (OLAP), you can replace on-premises disks with OSS-HDFS to decouple storage from computing. The caching system of OSS-HDFS helps reduce operation latency and improve performance at a low cost.
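
A minimal sketch of the append, flush, and truncate operations mentioned above, again through the standard Hadoop FileSystem API; the file path is hypothetical, and the sketch assumes these calls behave on OSS-HDFS as they do on HDFS.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendFlushSketch {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path log = new Path("/olap/ingest.log"); // hypothetical path

            // Append to an existing file and flush so that readers can see
            // the new bytes without waiting for the stream to close.
            try (FSDataOutputStream out = fs.append(log)) {
                out.write("event-42\n".getBytes(StandardCharsets.UTF_8));
                out.hflush(); // push buffered bytes to the service
            }

            // truncate cuts the file back to the given length in bytes.
            fs.truncate(log, 0L);
        }
    }
}
```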

Decoupling of storage from computing for HBase

OSS-HDFS supports operations on files and directories, including flush operations. You can use OSS-HDFS instead of HDFS to decouple storage from computing for HBase. Compared with the combination of HBase and OSS Standard buckets, the combination of HBase and OSS-HDFS provides a more streamlined architecture because write-ahead logs (WALs), which would otherwise require a separate HDFS deployment, can also be stored in OSS-HDFS. For more information, see Use OSS-HDFS as the storage backend of HBase.

Real-time computing

OSS-HDFS supports flush and truncate operations. You can use OSS-HDFS instead of HDFS to store sink data and checkpoints in Flink real-time computing scenarios.
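
As an illustration, a Flink job can keep its checkpoints in OSS-HDFS by pointing the checkpoint storage at an oss:// path. The sketch below assumes Flink 1.13 or later, where CheckpointConfig.setCheckpointStorage accepts a URI string, and that the required OSS/JindoSDK filesystem support is installed; the bucket and path are placeholders.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds and keep the checkpoint data in
        // OSS-HDFS instead of an HDFS cluster. The bucket and endpoint
        // below are hypothetical.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointStorage(
                "oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/flink/checkpoints");

        // ... define sources, transformations, and sinks here, then:
        // env.execute("job-with-oss-hdfs-checkpoints");
    }
}
```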

Data migration

As a cloud-native data lake storage service, OSS-HDFS allows you to migrate data from self-managed HDFS clusters in data centers to Alibaba Cloud, provides a familiar experience for HDFS users, and offers scalable, cost-effective storage. You can use Jindo DistCp to migrate data from HDFS to OSS-HDFS. During migration, HDFS checksums can be used to verify data integrity.

Supported engines

Open source ecosystem:

  • Flink: Use Apache Flink to write data to OSS-HDFS
  • Flume: Use JindoSDK with Apache Flume to write data to OSS-HDFS
  • Hadoop: Use Hadoop to access OSS-HDFS by using JindoSDK
  • HBase: Use OSS-HDFS as the underlying storage of HBase
  • Hive: Use JindoSDK with Hive to process data stored in OSS-HDFS
  • Impala: Use JindoSDK with Impala to query data stored in OSS-HDFS
  • Presto: Use JindoSDK with Presto to query data stored in OSS-HDFS
  • Spark: Use JindoSDK with Spark to query data stored in OSS-HDFS

Alibaba Cloud ecosystem:

  • EMR: Use OSS-HDFS in EMR Hive or Spark
  • Flink
  • Flume: Use Flume to synchronize data from an EMR Kafka cluster to a bucket with OSS-HDFS enabled
  • HBase: Use OSS-HDFS as the underlying storage of HBase on an EMR cluster
  • Hive: Use Hive on an EMR cluster to process data stored in OSS-HDFS
  • Impala: Use Impala on an EMR cluster to query data stored in OSS-HDFS
  • Presto: Use Presto on an EMR cluster to query data stored in OSS-HDFS
  • Spark: Use Spark on an EMR cluster to process data stored in OSS-HDFS
  • Sqoop: Use Apache Sqoop on an EMR cluster to implement read and write access to data stored in OSS-HDFS