OSS-HDFS (JindoFS) is a cloud-native data lake storage service that provides unified metadata management and is fully compatible with the Hadoop Distributed File System (HDFS) API. OSS-HDFS also supports the Portable Operating System Interface (POSIX). You can use OSS-HDFS to manage data in data lake-based computing scenarios in the big data and AI fields.

Usage notes

Warning After you enable the OSS-HDFS service for a bucket, data written by using the service is stored in the .dlsdata/ directory. To ensure the availability of the OSS-HDFS service and prevent data loss, do not perform write operations on the .dlsdata/ directory or on objects in the directory by using methods that are not supported by the OSS-HDFS service. For example, do not rename or delete the directory, and do not delete objects in it.

After you enable OSS-HDFS, writing data to the .dlsdata/ directory by using other Object Storage Service (OSS) features may cause data loss or data contamination, or make the data inaccessible. For more information, see Usage notes.

Benefits

You can use OSS-HDFS without modifying your Hadoop and Spark applications. After simple configuration, you can access and manage data in much the same way as you manage data in HDFS. In addition, you benefit from Object Storage Service (OSS) characteristics such as unlimited storage space, elastic scalability, and high security, reliability, and availability.

OSS-HDFS serves as the basis for cloud-native data lakes. You can use OSS-HDFS to analyze exabytes of data, manage hundreds of millions of objects, and obtain terabytes of throughput. OSS-HDFS provides both a flat namespace and a hierarchical namespace to meet your big data storage requirements. The hierarchical namespace allows you to manage objects in a hierarchical directory structure. By managing object metadata in a unified manner, OSS-HDFS automatically converts the storage structure between the flat namespace and the hierarchical namespace. Hadoop users can access objects in OSS-HDFS without copying objects or converting their format, which improves job performance and reduces maintenance costs.

Characteristics

OSS-HDFS provides the following characteristics:

  • Compatibility with HDFS

    OSS-HDFS is fully compatible with the HDFS API and supports directory-level operations. With JindoSDK, Apache Hadoop-based computing and analysis applications, such as MapReduce, Hive, Spark, and Flink, can access OSS-HDFS. This way, you can access and manage your data in OSS-HDFS in the same manner in which you manage data in HDFS. For more information, see Get started with OSS-HDFS.

  • Support for POSIX

    OSS-HDFS supports POSIX by using JindoFuse, which allows you to mount objects in OSS-HDFS to the local file system. This way, you can manage objects in OSS-HDFS in the same manner in which you manage files in the local file system. For more information, see Use JindoFuse to access the OSS-HDFS service.

  • High performance, high scalability, and low cost

    Self-managed Hadoop clusters are restricted by physical resources and are difficult to scale out. For example, in a Hadoop cluster that contains hundreds of nodes and approximately 400 million files, the NameNode becomes a bottleneck: as the size of the metadata grows, the queries per second (QPS) of the cluster decreases.

    OSS-HDFS is designed to support multiple tenants and store large volumes of data. The cluster that performs metadata management can be scaled out while maintaining high concurrency, high throughput, and low latency. The performance of OSS-HDFS remains stable and the service remains highly available even if the number of objects exceeds one billion. OSS-HDFS provides unified metadata management capabilities to handle ultra-large files and supports multiple hierarchy policies. This helps improve the efficiency of system resource usage, reduce costs, and adapt to rapid changes in business workloads.

  • Data durability and service availability

    OSS-HDFS stores data in OSS, the core infrastructure for data storage in Alibaba Cloud. OSS has a proven record of smoothly handling traffic spikes during the Double 11 shopping festival and provides high availability and high reliability. OSS provides the following features:

    • 99.995% or higher service availability.
    • 99.9999999999% (twelve 9's) or higher data durability.
    • Automatic scalability without service interruptions.
    • Automatic backup for redundancy.
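
The HDFS compatibility described above means that a standard Hadoop client can address an OSS-HDFS bucket once JindoSDK is installed. The following core-site.xml fragment is a sketch only: the property names follow the JindoSDK convention, but the exact class names, endpoint, and credential settings are assumptions that you should confirm against the JindoSDK documentation for your version.

```xml
<configuration>
  <!-- Route the oss:// scheme to the JindoSDK file system implementation
       (class names are assumptions; check your JindoSDK version). -->
  <property>
    <name>fs.AbstractFileSystem.oss.impl</name>
    <value>com.aliyun.jindodata.oss.OSS</value>
  </property>
  <property>
    <name>fs.oss.impl</name>
    <value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
  </property>
  <!-- Credentials and the OSS-HDFS endpoint of the bucket's region. -->
  <property>
    <name>fs.oss.accessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.oss.accessKeySecret</name>
    <value>YOUR_ACCESS_KEY_SECRET</value>
  </property>
  <property>
    <name>fs.oss.endpoint</name>
    <value>cn-hangzhou.oss-dls.aliyuncs.com</value>
  </property>
</configuration>
```

After a configuration like this takes effect, commands such as hadoop fs -ls oss://&lt;bucket&gt;/&lt;dir&gt; and Hadoop-based engines can read and write OSS-HDFS paths directly.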

Scenarios

OSS-HDFS is suitable for the big data and AI fields. You can use OSS-HDFS in the following scenarios:

  • Hive and Spark for offline data warehousing

    OSS-HDFS supports operations on files and directories, including atomic directory operations and millisecond-granularity rename operations. This makes OSS-HDFS suitable for offline data warehousing with Hive and Spark. In extract, transform, load (ETL) workloads, OSS-HDFS provides better performance than OSS Standard buckets.

  • OLAP

    OSS-HDFS supports basic file operations, such as append, truncate, flush, and pwrite, and supports POSIX by using JindoFuse. This way, when you use ClickHouse for online analytical processing (OLAP), OSS-HDFS can replace local disks to decouple storage from computing. The caching system of OSS-HDFS reduces operation latency and improves performance at a low cost.

  • AI training and inference

    OSS-HDFS supports basic file operations, such as append, truncate, flush, and pwrite, and supports POSIX by using JindoFuse. Therefore, OSS-HDFS can be used together with the AI ecosystem and existing training and inference programs in Python.

  • Decoupling of storage from computing for HBase

    OSS-HDFS supports operations on files and directories as well as flush operations. You can use OSS-HDFS instead of HDFS to decouple storage from computing for HBase. Compared with storing HBase data in OSS Standard buckets, OSS-HDFS can store write-ahead log (WAL) files for HBase without a separate HDFS cluster, which streamlines the service architecture.

  • Real-time computing

    OSS-HDFS supports flush and truncate operations. You can use OSS-HDFS instead of HDFS to store sinks and checkpoints in real-time Flink computing scenarios.

  • Data migration

    As a cloud-native data lake storage service, OSS-HDFS allows HDFS users to migrate data to the cloud with a familiar experience and provides scalable, cost-effective storage. You can use Jindo DistCp to migrate data from HDFS to OSS-HDFS. During data migration, HDFS checksums can be used to verify data integrity.
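
The atomic, millisecond-granularity directory rename mentioned in the Hive and Spark scenario is what makes the classic ETL commit pattern safe: write output to a staging directory, then rename it into place in a single step. The sketch below illustrates the pattern with local paths and an illustrative function name; through JindoFuse or an HDFS client, the same rename targets OSS-HDFS paths.

```python
import os
import shutil

def commit_etl_output(staging_dir: str, final_dir: str) -> None:
    """Publish ETL output by renaming a fully written staging directory
    into its final location. Because the rename is a single atomic
    operation, readers see either the old state or the complete new
    state, never a half-written directory."""
    if os.path.exists(final_dir):
        shutil.rmtree(final_dir)        # drop the previous version, if any
    os.rename(staging_dir, final_dir)   # single atomic directory rename
```

A job would write all part files under staging_dir, verify them, and only then call commit_etl_output to expose the partition to downstream readers.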

Features

The OSS-HDFS service provides the following features for various scenarios.

Hive and Spark for offline data warehousing:

  • Operations on files and directories
  • Permission management on files and directories
  • Atomic directory operations and millisecond-granularity rename operations
  • Setting access time and modification time by using setTimes
  • Extended attributes (XAttrs)
  • Access control lists (ACLs)
  • Local read cache acceleration
  • A replacement for HDFS snapshots
  • File operations such as flush, sync, truncate, and append
  • Checksum verification
  • Automatic clean-up of the HDFS recycle bin

Support for POSIX:

  • Random writes to files
  • File operations such as truncate, append, and flush
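
The POSIX entries above map to ordinary system calls once a bucket is mounted through JindoFuse. The sketch below exercises random writes, flush, and truncate with nothing but the Python standard library; the mount point is a hypothetical example, and the same code behaves identically against any local directory.

```python
import os

# Hypothetical JindoFuse mount point; a mounted bucket is accessed
# through the same POSIX calls as any local directory.
MOUNT_POINT = "/mnt/jindofs"

def patch_file(path: str, offset: int, data: bytes) -> None:
    """Random write: overwrite `data` at `offset` without rewriting the
    whole file, then flush the change to storage."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, data, offset)  # positioned (random) write
        os.fsync(fd)                 # flush the data to storage
    finally:
        os.close(fd)

def shrink_file(path: str, new_size: int) -> None:
    """Truncate a file to `new_size` bytes."""
    os.truncate(path, new_size)
```

For example, patching five bytes at offset 6 of an existing file rewrites only that range, which is the access pattern that ClickHouse-style OLAP engines and AI training programs rely on.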