
Object Storage Service: Overview

Last Updated: Jul 28, 2023

OSS-HDFS (JindoFS) is a cloud-native data lake storage service. OSS-HDFS provides centralized metadata management capabilities and is fully compatible with the Hadoop Distributed File System (HDFS) API. OSS-HDFS also supports Portable Operating System Interface (POSIX). You can use OSS-HDFS to manage data in data lake-based computing scenarios in the big data and AI fields.

Usage notes

Warning

After you enable the OSS-HDFS service for a bucket, the data that is written by the service is stored in the .dlsdata/ directory of the bucket. To keep the OSS-HDFS service available and prevent data loss, do not write to the .dlsdata/ directory or to the objects in it by using methods that the OSS-HDFS service does not support. For example, do not rename the directory, delete the directory, or delete objects in the directory.

After you enable OSS-HDFS, risks such as data loss, data contamination, and data inaccessibility may arise when you use other Object Storage Service (OSS) features that involve writing data to the .dlsdata/ directory. For more information, see Usage notes.

Billing rules

  • Metadata management fees

Metadata management for objects is a billable item of OSS-HDFS. However, you are not currently charged for this item.

  • Data storage fees

    When you use OSS-HDFS, data blocks are stored in OSS. Therefore, the billing method of OSS is applicable to data blocks in OSS-HDFS. For more information, see Billing overview.

Benefits

You can use OSS-HDFS without modifying your existing Hadoop and Spark applications. After simple configuration, you can access and manage data in OSS-HDFS in a manner similar to how you manage data in HDFS. You can also take advantage of OSS characteristics, such as unlimited storage space, elastic scalability, and high security, reliability, and availability.

OSS-HDFS serves as the foundation of cloud-native data lakes. You can use OSS-HDFS to analyze exabytes of data, manage hundreds of millions of objects, and obtain terabytes of throughput. OSS-HDFS provides both a flat namespace and a hierarchical namespace to meet big data storage requirements. The hierarchical namespace lets you manage objects in a directory tree, and OSS-HDFS automatically converts the storage structure between the flat namespace and the hierarchical namespace so that object metadata is managed in a centralized manner. Hadoop users can access objects in OSS-HDFS without copying the objects or converting their format, which improves job performance and reduces maintenance costs.
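To illustrate the idea behind the two namespaces, the following Python sketch (a simplified model, not the OSS-HDFS implementation) derives a hierarchical directory listing from a flat object key space by splitting keys on "/":

```python
# Illustrative sketch: deriving a hierarchical directory view from a
# flat object namespace. This models the concept only; OSS-HDFS
# performs the conversion internally with centralized metadata.

def list_directory(flat_keys, prefix):
    """Return the immediate children of `prefix` in a flat key space."""
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    children = set()
    for key in flat_keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            # First path component under the prefix; a trailing "/"
            # marks it as a subdirectory.
            head, sep, _ = rest.partition("/")
            children.add(head + ("/" if sep else ""))
    return sorted(children)

keys = [
    "warehouse/db1/table1/part-0000",
    "warehouse/db1/table1/part-0001",
    "warehouse/db1/table2/part-0000",
    "warehouse/tmp.txt",
]
print(list_directory(keys, "warehouse"))      # ['db1/', 'tmp.txt']
print(list_directory(keys, "warehouse/db1"))  # ['table1/', 'table2/']
```

The same flat keys can thus be listed either as raw objects or as a directory tree, which is the view that Hadoop applications expect.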

Scenarios

OSS-HDFS is suitable for computing scenarios in the big data and AI fields. You can use OSS-HDFS in the following scenarios:

Hive and Spark for offline data warehousing

OSS-HDFS supports file and directory operations, atomic operations on directories, and millisecond-granularity rename operations. This makes OSS-HDFS well suited for Hive and Spark offline data warehousing. In extract, transform, and load (ETL) workloads, OSS-HDFS provides better performance than OSS Standard buckets.
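The value of atomic, millisecond renames can be seen in a small Python sketch (an illustrative model, not the OSS-HDFS implementation): in a purely flat key space, renaming a "directory" rewrites every key under it, while a hierarchical metadata layer relabels a single directory entry:

```python
# Illustrative contrast: directory rename cost in a flat key space
# versus a hierarchical metadata layer. This is a model of the idea,
# not the actual OSS-HDFS mechanism.

def rename_flat(objects, src, dst):
    """Flat namespace: every key under src/ must be copied and deleted."""
    ops = 0
    for key in list(objects):
        if key.startswith(src + "/"):
            objects[dst + key[len(src):]] = objects.pop(key)
            ops += 1          # one copy-and-delete per object
    return ops

def rename_hierarchical(tree, src, dst):
    """Hierarchical namespace: one atomic metadata update."""
    tree[dst] = tree.pop(src)
    return 1                  # a single operation, regardless of size

objects = {f"stage/part-{i:04d}": b"" for i in range(1000)}
tree = {"stage": [f"part-{i:04d}" for i in range(1000)]}

print(rename_flat(objects, "stage", "final"))       # 1000 operations
print(rename_hierarchical(tree, "stage", "final"))  # 1 operation
```

This is why rename-heavy steps such as Hive and Spark job commits benefit from the hierarchical namespace.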

OLAP

OSS-HDFS supports basic file operations, such as append, truncate, flush, and pwrite, and provides POSIX support through JindoFuse. This way, when you use ClickHouse for online analytical processing (OLAP), you can replace on-premises disks with OSS-HDFS to decouple storage from computing. The caching system of OSS-HDFS reduces operation latency and improves performance at low cost.

AI training and inference

OSS-HDFS supports basic file operations, such as append, truncate, flush, and pwrite, and provides POSIX support through JindoFuse. Therefore, OSS-HDFS can be used together with the AI ecosystem and with existing training and inference programs written in Python.
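The operations named above are standard POSIX file semantics. The following sketch demonstrates them with Python's os module against a local temporary file; with OSS-HDFS mounted through JindoFuse, the same calls would target a path under the mount point (the mount path here is hypothetical):

```python
import os
import tempfile

# Demonstrates append, pwrite, flush (fsync), and truncate semantics
# using Python's os module on a local temp file. On a JindoFuse mount
# of OSS-HDFS, the same POSIX calls apply to the mounted path
# (e.g. a hypothetical /mnt/oss-hdfs/... directory).

path = os.path.join(tempfile.mkdtemp(), "demo.log")

fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_APPEND)
os.write(fd, b"hello ")          # append
os.write(fd, b"world\n")         # append again
os.fsync(fd)                     # flush to durable storage
os.close(fd)

fd = os.open(path, os.O_WRONLY)
os.pwrite(fd, b"HELLO", 0)       # positioned write at offset 0
os.ftruncate(fd, 5)              # truncate the file to 5 bytes
os.close(fd)

with open(path, "rb") as f:
    print(f.read())              # b'HELLO'
```

Because these are ordinary file-system calls, Python training and inference programs can read and write data without any OSS-specific client code.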

Decoupling of storage from computing for HBase

OSS-HDFS supports file and directory operations, including flush operations. You can use OSS-HDFS instead of HDFS to decouple storage from computing for HBase. Compared with the combination of HBase and OSS Standard buckets, which still requires an HDFS cluster to store write-ahead logs (WALs), the combination of HBase and OSS-HDFS stores WALs directly and therefore enables a more streamlined service architecture. For more information, see Use OSS-HDFS as the storage backend of HBase.

Real-time analysis

OSS-HDFS supports flush and truncate operations. You can use OSS-HDFS instead of HDFS to store sink outputs and checkpoints in Flink real-time computing scenarios.

Data migration

As a cloud-native data lake storage service, OSS-HDFS allows you to migrate data from HDFS to Alibaba Cloud while preserving the HDFS user experience, and provides storage that is scalable and cost-effective. You can use Jindo DistCp to migrate data from HDFS to OSS-HDFS. During the migration, HDFS checksums can be used to verify data integrity.
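The integrity check reduces to comparing a digest of the source bytes with a digest of the copied bytes. Jindo DistCp relies on HDFS checksum mechanisms for this; the following Python sketch uses MD5 only as a stand-in to illustrate the principle:

```python
import hashlib

# Illustrative sketch of post-copy integrity verification: a copy is
# accepted only if its digest matches the source digest. MD5 here is a
# stand-in; Jindo DistCp uses HDFS checksum mechanisms.

def digest(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

source = b"block-0" * 1024
copied = bytes(source)            # simulate a successful copy
corrupt = source[:-1] + b"X"      # simulate a corrupted copy

print(digest(source) == digest(copied))   # True
print(digest(source) == digest(corrupt))  # False
```

A mismatch signals that the destination object must be re-copied before the migration is considered complete.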

Features in various scenarios

The following list describes the features of OSS-HDFS in different scenarios.

Scenario: Hive and Spark for data warehousing

  • Native support for files and directories and related operations
  • Permission management on files and directories
  • Atomic operations on directories and rename operations within milliseconds
  • Time configurations by using setTimes
  • Extended attributes (XAttrs)
  • Access control lists (ACLs)
  • On-premises read caching acceleration

Scenario: Alternative to HDFS

  • Snapshots
  • Flush, sync, truncate, and append operations on files
  • Checksum verification
  • Automatic clean-up of the HDFS recycle bin

Scenario: Support for POSIX

  • Random writes to files
  • File-related operations, such as truncate, append, and flush