JindoData is a data lake storage acceleration suite developed by the Alibaba Cloud open source big data team. It provides comprehensive access acceleration solutions for major data lake storage systems from Alibaba Cloud and the industry for big data and AI ecosystems. This topic describes the features supported by each version of JindoData.
Background information
JindoData is an upgraded version of the original Alibaba Cloud EMR SmartData component. For more information, see JindoData (available only to existing users).
JindoData 4.6.x versions
Overview
JindoData 4.6.x versions introduce a smooth migration feature that supports migration from Hadoop Distributed File System (HDFS) to OSS-HDFS, which significantly simplifies data migration. The JindoFS storage system supports a file inventory feature, which helps you better understand data distribution and ownership. For performance, the JindoFS storage system speeds up `du` and `count` operations through full and incremental optimization. JindoSDK 4.6.x versions support file-level and block-level verification to improve the stability of the write path. JindoSDK also supports a multi-path access protocol, which lets you use different protocol modes to access the same backend path.
JindoData 4.6.11
JindoData 4.6.11 includes the following fixes and optimizations:
JindoSDK: Fixed an issue where JindoCommitter used an old mapred API to write data in an Alibaba Cloud EMR Hadoop 2.8.5 environment.
JindoTable: Optimized the feature for restoring tables or partitions in Object Storage Service (OSS). You can now set the number of days for the restoration. For more information, see Use JindoTable to archive and restore tables or partitions in OSS.
JindoData 4.6.10
JindoData 4.6.10 includes the following fixes and optimizations:
JindoFS: Optimized the `pread` prefetch logic.
JindoSDK: Added support for concurrent commit tasks to optimize job commit performance.
JindoSDK: Optimized the path rewrite logic.
JindoFuse: Fixed an issue that occurred when appending objects.
JindoData 4.6.8
JindoData 4.6.8 includes the following fixes and new features:
JindoFS: Added support for clients to set the retention period for the recycle bin.
JindoSDK: Added support for using `MALLOC_CONF` to optimize memory usage.
JindoFuse: Added support for graceful shutdown when mounting OSS-HDFS.
JindoFSx: Added support for using wildcard characters to filter the file list for cache prefetching.
JindoFSx: Fixed an issue where clearing the cache did not take effect.
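The wildcard filtering of a prefetch file list can be pictured with a minimal sketch. It only illustrates shell-style wildcard semantics and is not the JindoFSx implementation or its command syntax, which this topic does not show. The function name `filter_prefetch_list`, the bucket name, and the file names are hypothetical.

```python
from fnmatch import fnmatchcase


def filter_prefetch_list(paths, pattern):
    """Keep only the paths whose final component matches the wildcard pattern."""
    selected = []
    for path in paths:
        name = path.rstrip("/").rsplit("/", 1)[-1]  # last path component
        if fnmatchcase(name, pattern):  # case-sensitive shell-style match
            selected.append(path)
    return selected


# Hypothetical candidate list; the bucket and file names are made up.
candidates = [
    "oss://example-bucket/warehouse/part-0000.parquet",
    "oss://example-bucket/warehouse/part-0001.parquet",
    "oss://example-bucket/warehouse/_SUCCESS",
]
selected = filter_prefetch_list(candidates, "part-*.parquet")
```

In this sketch, only the two `part-*.parquet` objects would be passed on for prefetching, and the `_SUCCESS` marker is skipped.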
JindoData 4.6.7
JindoData 4.6.7 includes the following fixes and optimizations:
JindoFuse: Added support for a graceful shutdown mechanism.
JindoFuse: Optimized log output.
JindoFuse: Fixed an issue where `O_APPEND` or `O_TRUNC` was not supported when mounting OSS.
JindoData 4.6.6
Optimized the degree of parallelism for `distjob` and `distcp` tasks. The maximum degree of parallelism is now limited to the number of tasks.
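The capping rule can be expressed as a one-line calculation. The following is a minimal sketch under the assumption that the effective value is simply the smaller of the requested parallelism and the task count. The function name `effective_parallelism` is hypothetical and is not part of the `distjob` or `distcp` tooling.

```python
def effective_parallelism(requested: int, num_tasks: int) -> int:
    """Cap the requested degree of parallelism at the number of tasks."""
    if requested < 1:
        raise ValueError("requested parallelism must be at least 1")
    if num_tasks < 0:
        raise ValueError("number of tasks cannot be negative")
    # Running more workers than there are tasks would leave workers idle.
    return min(requested, num_tasks)
```

For example, a request for 100 parallel workers on a job with 8 tasks would run with a parallelism of 8, whereas a request for 4 workers is left unchanged.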
JindoData 4.6.5
JindoData 4.6.5 includes many fixes and optimizations based on version 4.6.4. The updates include the following:
Added a ServiceLoader for the OSS scheme that points to `JindoOssFileSystem`.
Optimized the exception handling for the `isDirectory()` method. When called with a path that contains a wildcard, such as `Path *`, the method now returns `false` instead of throwing an `IllegalPath` exception.
Optimized the Hadoop software development kit (SDK) to prevent the `ConcurrentModificationException` that could occur in some scenarios when Hadoop configurations were modified concurrently.
Optimized the retry logic for the JindoMagicCommitter client when writing to OSS to handle cases where temporary directories are abnormal or disks are damaged. This optimization improves the success rate of job writes and prevents the `InvalidPart` exception: `One or more of the specified parts could not be found or the specified entity tag might not have matched the part's entity tag.`
JindoData 4.6.4
JindoData 4.6.4 adds multi-platform support.
For more information about the supported platforms, see Download JindoData.
For the Java platform, you can deploy multiple `jindo-core` packages to implement multi-platform support. By default, `jindo-core` supports mainstream Linux systems. To use it on other platforms, you must add the corresponding platform extension package.
The dependency packages for multi-platform support have been uploaded to the JindoData Maven repository. For example, to access OSS when you build a project using Maven, see the dependency configurations in jindosdk_ide_hadoop.md.
For example, to deploy a Hadoop cluster on a mainstream Linux system, add `jindo-core-4.6.4.jar` and `jindo-sdk-4.6.4.jar` to the specified classpath. To run and debug on macOS, you need `jindo-core-4.6.4.jar` and `jindo-sdk-4.6.4.jar` in addition to the `jindo-core-macos-10_14-x86_64-4.6.4.jar` extension package.
Go to the Download JindoData page to download `jindosdk-4.6.4-macos-10_14-x86_64.tar.gz`. This package contains `jindo-core-4.6.4.jar` and `jindo-sdk-4.6.4.jar`, as well as the `jindo-core-macos-10_14-x86_64-4.6.4.jar` extension package required for this example.
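The jar selection described for Linux and macOS can be summarized in a short sketch. The jar names are taken from the example above. The platform labels `linux` and `macos`, the helper names, and the `/opt/jindosdk/lib` directory are assumptions made for illustration only.

```python
# Jars needed on every platform (names from the 4.6.4 example in this topic).
BASE_JARS = ["jindo-core-4.6.4.jar", "jindo-sdk-4.6.4.jar"]

# Extension packages keyed by a hypothetical platform label.
EXTENSION_JARS = {
    "macos": "jindo-core-macos-10_14-x86_64-4.6.4.jar",
}


def jars_for(platform_label):
    """Return the base jars plus the platform extension jar, if one is needed."""
    jars = list(BASE_JARS)
    extension = EXTENSION_JARS.get(platform_label)
    if extension is not None:
        jars.append(extension)
    return jars


def classpath_for(platform_label, lib_dir):
    """Build a colon-separated classpath fragment for the given platform."""
    return ":".join(f"{lib_dir}/{jar}" for jar in jars_for(platform_label))


linux_cp = classpath_for("linux", "/opt/jindosdk/lib")
macos_cp = classpath_for("macos", "/opt/jindosdk/lib")
```

On mainstream Linux the fragment contains only the two base jars. On macOS it additionally contains the extension package.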
JindoData 4.6.2
JindoData 4.6.2 includes many fixes based on version 4.6.1. The fixes for the JindoFS storage system are as follows:
JindoFS storage system
Fixed an issue where the service became stuck when converting from Standard (STD) to STD in tiered storage.
Fixed an issue where the service became stuck due to an empty manifest file generated by tiered storage.
Accelerated the execution of tiered storage tasks.
Fixed the logic of the RootPolicy feature.
Fixed an issue where the `setAcl` operation occasionally caused the service to crash.
Fixed a low-probability issue where DB manifest files filled up the disk.
Fixed the batch metadata import feature of the migration service.
JindoData 4.6.1
JindoFS storage system
JindoFS: Reduced redundant log output.
JindoFS: Fixed an issue where the file size was incorrect when a metadata inventory was exported for a file that was not closed.
JindoFSx storage acceleration system
JindoFSx: Added support for automatic cleanup of temporary cache directories.
JindoSDK and tools support
JindoSDK: Reduced oversized log output.
JindoSDK: Enabled server-side path optimization for `du` and `count` operations by default.
JindoSDK: Reduced the Security Token Service (STS) token update frequency to prevent throttling caused by frequent requests.
JindoSDK: Changed the Resource Access Management (RAM) role name in the credential-free URL to lowercase to prevent token refresh failures within the ECS credential-free service.
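One common way to lower the token update frequency is to cache the token and refresh it only shortly before it expires. The following sketch illustrates that general idea. It is not the JindoSDK implementation, and the class name `CachedToken`, the `fetch` callback, and the 300-second refresh margin are all assumptions.

```python
import time


class CachedToken:
    """Cache a short-lived token and refresh it only when it is about to expire."""

    def __init__(self, fetch, refresh_margin_s=300.0, clock=time.monotonic):
        self._fetch = fetch              # callable returning (token, ttl_seconds)
        self._margin = refresh_margin_s  # refresh this long before expiry
        self._clock = clock              # injectable clock, useful for testing
        self._token = None
        self._expires_at = 0.0
        self.fetch_count = 0             # number of requests sent to the token service

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, ttl_s = self._fetch()
            self._expires_at = now + ttl_s
            self.fetch_count += 1
        return self._token
```

With a token lifetime of one hour, repeated calls to `get()` in this sketch reuse the cached value and contact the token service again only in the last five minutes before expiry.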
JindoData 4.6.0
JindoFS storage system
JindoFS: Supports exporting file inventories from OSS-HDFS. This feature helps you better understand data distribution and perform custom development.
JindoFS: Significantly improves the performance of `du` and `count` operations through full and incremental server-side optimizations.
JindoFS: Supports smooth migration from HDFS to OSS-HDFS, which significantly simplifies the data migration process.
JindoFS: Supports multi-path protocol access. You can use different protocols to access the same backend path.
JindoFSx storage acceleration system
JindoFSx: Fixed an issue where the client exited unexpectedly when writing to the cache.
JindoFSx: Fixed an issue where the client exited unexpectedly during metrics reporting.
JindoFSx: Fixed a memory leak issue when using Ranger.
JindoSDK and tools support
JindoSDK: Supports CRC and MD5 checksum verification for writes at the file and block levels.
JindoSDK: Supports the Jindo Sync tool for data synchronization without requiring a Hadoop environment.
JindoSDK: Supports the OSS-HDFS TensorFlow Connector.
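File-level and block-level verification can be illustrated with standard checksums. The sketch below pairs a CRC32 per block with an MD5 digest for the whole file. That pairing, the block size, and the function names are assumptions for illustration, because this topic does not specify how JindoSDK combines the two checksums.

```python
import hashlib
import zlib


def block_crcs(data: bytes, block_size: int):
    """Compute one CRC32 value per fixed-size block."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [zlib.crc32(data[i:i + block_size]) for i in range(0, len(data), block_size)]


def file_md5(data: bytes) -> str:
    """Compute the MD5 digest of the whole file as a hex string."""
    return hashlib.md5(data).hexdigest()


def verify_write(written: bytes, block_size: int, expected_crcs, expected_md5):
    """Return (indexes of mismatching blocks, whether the file digest matches)."""
    actual = block_crcs(written, block_size)
    bad_blocks = [
        i for i in range(max(len(actual), len(expected_crcs)))
        if i >= len(actual) or i >= len(expected_crcs) or actual[i] != expected_crcs[i]
    ]
    return bad_blocks, file_md5(written) == expected_md5
```

The block-level result narrows a corruption down to the affected block, so that only that block would need to be rewritten, while the file-level digest confirms the object as a whole.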
JindoData 4.5.x versions
JindoData 4.5.1
Overview
Version 4.5.1 is a minor upgrade to version 4.5.0 that includes important fixes and improvements. JindoFS improves service stability and exception handling. JindoFS and JindoFSx further improve the adaptive prefetch algorithm to increase prefetch efficiency. JindoDistCp includes many fixes and optimizations to enhance the stability of the data copy process. JindoFuse uses a new underlying design to significantly improve performance.
Major features
JindoFS storage system
JindoFS: Improved memory usage.
JindoFS: Added exception handling and log-based alerting for `ASSUME_ROLE` errors.
JindoFS: Supports updating dynamic AccessKeys during retries.
JindoFS: Further improved the adaptive prefetch algorithm to increase prefetch efficiency.
JindoFS: Fixed read and write paths for random file write scenarios.
JindoFS: Supports the `CheckAccess` API.
JindoFSx storage acceleration system
JindoFSx: Further improved the adaptive prefetch algorithm to increase prefetch efficiency.
JindoFSx: Supports spaces in paths.
JindoFSx: Reduced the occurrence of hot spots during multi-replica reads.
JindoSDK and tools support
Jindo commands now provide full coverage of Hadoop commands.
Jindo commands now include native support for HDFS, which significantly improves performance and user experience.
JindoDistCp supports integration with Alibaba Cloud CloudMonitor.
JindoDistCp supports checksum verification for data migrated from OSS to an HDFS path.
JindoDistCp supports job splitting parameters.
JindoDistCp fixed the error handling logic for source file deletion during the copy process.
JindoSDK optimizes memory usage for random reads.
JindoFuse POSIX support
JindoFuse was redesigned using low-level APIs to significantly improve the performance of operations such as `readdir`.
JindoFuse fixed an issue where an abnormal program listed the root directory after JindoFSx was mounted.
JindoData 4.5.0
Overview
This version focuses on optimizing the metadata operation performance of the JindoFS storage system, resulting in significant performance improvements. The JindoFS tiered storage feature is enhanced to support Infrequent Access (IA) and Cold Archive storage. Support for batch writes is added to optimize the performance of large-scale extract, transform, and load (ETL) jobs. For SDKs and ecosystem components, a Java SDK that is independent of Hadoop is provided.
Major features
JindoFS storage system
JindoFS: Optimized metadata operations, which significantly improves performance.
JindoFS: Enhanced the tiered storage feature to support IA and Cold Archive storage types.
JindoFS: Added a batch write feature to optimize the performance of large-scale ETL jobs.
JindoFS: Fixed an issue where accessing OSS caused a service exception due to a server-side authorization error.
JindoFSx storage acceleration system
JindoFSx: Fixed a file handle leak issue in the Storage service.
JindoFSx: Fixed a thread safety issue in client-side metrics reporting.
JindoFSx: Optimized the performance of recursively creating parent directories.
JindoFSx: Optimized the performance of the path rewrite feature.
JindoSDK and tools support
JindoSDK: Supports an adaptive prefetch algorithm to increase prefetch efficiency.
JindoSDK: Supports atomic rename operations based on Tablestore.
JindoDistCp: Optimized the `diff` feature to support outputting diff files.
JindoSDK: Implemented unified handling for retry errors, which resolves client retry failures caused by server IP address changes.
JindoSDK: Provides a Java SDK that is independent of Hadoop, offering functionality comparable to the Hadoop SDK and Object SDK.
JindoFuse POSIX support
JindoFuse: Fixed a memory leak issue caused by list operations when caching is enabled in JindoFSx.
JindoData 4.4.x versions
Overview
The JindoFS storage system now includes tiered storage and data archiving features. It uses the tiered storage capabilities of Alibaba Cloud OSS and is compatible with HDFS tiered storage policies. This feature lets you select lower-cost storage policies for infrequently accessed data to reduce total storage costs. In addition, JindoFS adds support for the HDFS AuditLog feature, which significantly improves API compatibility, feature parity, and data migration capabilities with Apache HDFS. It also improves rapid data import for OSS and migration from semi-managed JindoFS instances. The JindoFS features are provided through the Alibaba Cloud OSS-HDFS service. For more information, see What is the OSS-HDFS service?.
On the JindoFSx storage acceleration system, JindoData 4.4.x versions support client-side local cache (LocalCache), which provides client-side-only cache acceleration. This significantly improves metadata caching capabilities and enhances cache acceleration for Alibaba Cloud NAS.
For SDKs and ecosystem components, the performance and throughput of multiple operations are significantly improved. The Object SDK is now supported. It is compatible with OSS object storage APIs while significantly improving the performance of various operations and seamlessly integrating with the JindoFSx acceleration capability. The JindoDistJob tool is introduced to support full and incremental migration of file metadata from semi-managed JindoFS. This lets you smoothly switch to the JindoFS service-based solution without migrating data blocks. The JindoDistCp migration tool is greatly enhanced to achieve lossless migration from Apache HDFS to the JindoFS service, ensuring that file metadata is also copied.
Major features
JindoFS storage system
JindoFS supports tiered storage and data archiving, and is compatible with HDFS storage policies.
JindoFS supports `BatchImport` for importing file data in batches.
JindoFS supports HDFS AuditLog.
JindoFS supports `Concat` and `SymLink` APIs.
JindoFS optimizes the background cleanup process for file data.
JindoFS optimizes the performance of `Lease` and `Lock` related operations.
JindoFSx storage acceleration system
JindoFSx supports cache plugins and provides a client-side cache mode.
JindoFSx supports plug-in-based authorization. By default, you do not need to install KRB5 and SASL library dependencies.
JindoFSx significantly optimizes metadata cache performance and improves cache acceleration support for Alibaba Cloud NAS.
JindoSDK and tools support
JindoSDK improves support for HTTPS and enhances fault tolerance in weak network environments.
JindoSDK improves deployment by removing the dependency on KRB5 and SASL libraries by default.
JindoSDK adds support for OSS object storage APIs, which significantly improves operation performance and seamlessly integrates with the JindoFSx cache acceleration capability.
The JindoDistJob tool is added to support rapid migration of data from semi-managed JindoFS in Block mode to the JindoFS service.
JindoDistCp significantly improves the data migration capability from Apache HDFS to the JindoFS service and supports lossless migration of file metadata.
JindoFuse POSIX support
JindoFuse optimizes the performance of sequential reads of large files.
JindoData 4.3.x versions
Overview
JindoData 4.3.0 fully supports a multicloud architecture. It is a data lake storage solution that supports multicloud, multiple storage systems, various acceleration extensions, multiple protocols, and multiple programming languages. The POSIX support in the JindoFS storage system has been significantly improved. The JindoFSx system supports the Kerberos+Ranger security extension for the first time. JindoSDK and ecosystem tools have also been significantly improved in terms of test coverage.
Major features
JindoSDK and tools support
JindoSDK supports multicloud storage, such as Amazon S3, COS, and OBS.
JindoSDK provides the JindoTable tool.
JindoSDK optimizes the Flink Connector plugin.
JindoSDK improves JindoDistCp.
JindoFSx storage acceleration system
JindoFSx supports multicloud storage, such as Amazon S3, COS, and OBS.
JindoFSx optimizes data caching and metadata caching.
JindoFSx supports the Kerberos+Ranger authorization solution.
JindoFSx significantly improves observability metrics.
JindoFSx is integrated with Fluid.
JindoFS storage system
JindoFS supports POSIX Lock and Fallocate capabilities.
JindoFS supports upgrades for clusters of older JindoFS versions in Block mode.
JindoFuse POSIX support
JindoFuse adds support for XAttr-related APIs, such as Setxattr, Getxattr, Listxattr, and Removexattr.
JindoFuse supports POSIX Lock and Fallocate capabilities.
JindoFuse supports appendable objects in OSS, including Append, Flush, and read-while-writing operations.
JindoData 4.2.x versions
Overview
JindoData 4.2.0 significantly improves the JindoFSx storage acceleration system. It adds cache acceleration for Apache HDFS and Alibaba Cloud NAS storage products, and enhances tools such as JindoFuse, JindoDistCp, and JindoTable.
Major features
JindoFSx storage acceleration system
Supports transparent cache acceleration for Alibaba Cloud Apache HDFS (keeps `hdfs://` unchanged) and unified mount acceleration (`fsx://`).
Supports unified mount acceleration (`fsx://`) for Alibaba Cloud NAS storage products.
Fully integrates with and supports the Alibaba Cloud OSS-HDFS service (JindoFS service) and improves write path support.
JindoSDK and tools support
Introduces the first C/C++ version of JindoSDK, which provides POSIX-like API methods.
Supports JindoFuse POSIX. The JindoFuse tool is improved and built based on the C/C++ version of JindoSDK.
Supports JindoDistCp data migration. The JindoDistCp tool is refactored and improved by simplifying and removing less-used features from the 3.x versions to enhance usability and robustness.
Supports the JindoTable tool. The JindoTable tool is refactored and improved by simplifying and removing less-used features from the 3.x versions to enhance usability and robustness.
JindoData 4.1.x versions
Overview
JindoData 4.1.0 introduces important features such as random writes on the Alibaba Cloud OSS-HDFS service (JindoFS service). It also adds the JindoFSx storage acceleration system, which supports distributed caching for native Alibaba Cloud OSS and the OSS-HDFS service (JindoFS service).
Major features
JindoFS storage system
JindoFS service capabilities
Supports random file writes, which allows files to be modified.
Supports the HDFS recycle bin. The system backend cleans up files in the recycle bin based on their expiration time.
Improves the HDFS snapshot feature to support random file modifications.
Improves the directory deletion mechanism to significantly increase operation performance.
Implements the NsWorker framework, which allows the global meta service to offload some heavy processing to Follower and Learner nodes.
JindoShell CLI support
Allows you to use commands to set the expiration time for the HDFS recycle bin.
Improves the `dumpFile` command to output information about random write files.
JindoFuse POSIX support
Supports random file modification (Seek and Write).
JindoFSx storage acceleration system
JindoFSx core capabilities
Supports transparent cache acceleration for Alibaba Cloud OSS (keeps `oss://` unchanged).
Supports transparent cache acceleration for the Alibaba Cloud OSS-HDFS service (JindoFS service) (keeps `oss://` unchanged).
Provides a unified namespace feature that lets you mount OSS or OSS-HDFS to the same namespace and perform unified operations using the `fsx://` prefix.
Supports cache acceleration for large-scale file metadata.
Supports acceleration for small file training.
Supports P2P acceleration, which significantly improves cache read performance in scenarios where many training nodes prefetch and load model files simultaneously.
JindoSDK Hadoop support
Provides `JindoOssFileSystem` to support transparent cache acceleration for OSS and OSS-HDFS.
Provides `JindoFsxFileSystem` to support usage in unified namespace mode.
JindoShell CLI support
Supports JindoFSx data cache commands.
Supports JindoFSx metadata cache commands.
Supports JindoFSx unified namespace management commands.
JindoFuse POSIX support
Supports mounting an `oss://` path with Fuse to read from and write to the JindoFSx cache.
Supports mounting an `fsx://` path with Fuse to read from and write to the JindoFSx cache.
JindoData 4.0.x versions
Overview
JindoData 4.0.0 is the first version released after the architecture upgrade of the original Alibaba Cloud EMR SmartData component (up to major version 3.8.0). This version focuses on integrating with and supporting Alibaba Cloud OSS products and the Alibaba Cloud OSS-HDFS service (JindoFS service).
Note: The JindoFSx storage acceleration system is not released in JindoData 4.0.0.
Major features
Alibaba Cloud OSS service
JindoSDK Hadoop support
Provides a Java Hadoop SDK for Alibaba Cloud OSS that is fully compatible with the Hadoop OSS Connector and significantly improves performance.
Supports multiple ways to set a credential provider, such as configuration, ECS Role, and the EMR credential-free mechanism.
Supports archiving upon write to storage classes such as Archive and Deep Cold Archive.
JindoShell CLI support
Provides additional command extensions for Hadoop and HDFS Shell, offering Hadoop-oriented operations for Alibaba Cloud OSS.
Supports the `ls2` extended command, which can display the storage status of a file or object in OSS, such as Standard, IA, or Archive, in addition to the standard `ls` command output.
Supports the `archive` command, which lets you specify a directory for archiving operations.
Supports the `restore` command, which lets you specify a directory for restoration operations.
JindoFuse POSIX support
This is an optimized Fuse client for Alibaba Cloud OSS. Its native code implementation significantly improves performance.
JindoDistCp data migration
Supports migrating data from self-managed HDFS clusters to Alibaba Cloud OSS, with optimizations for large files and many small files.
Alibaba Cloud OSS-HDFS service (JindoFS service)
JindoFS service
Adds a new bucket storage option for Alibaba Cloud OSS products. It provides a metadata acceleration feature, is binary compatible, and is fully aligned with Apache HDFS features, supporting lift-and-shift migration for HDFS.
Natively supports file system directory semantics, significantly optimizes directory operations, and supports atomic and millisecond-level rename capabilities for extra-large directories.
Natively supports file system file semantics, such as HDFS write leases, single-writer and multiple-reader access, and read-while-writing.
Supports `append`, `flush`, `sync`, and `truncate` operations on files.
Supports HDFS snapshots with a nearly unlimited number of snapshots, which facilitates data backup, disaster recovery, and restoration.
Supports file permissions. You can import and set user group information (UserGroupsMapping) using `JindoShell` commands.
Supports the Hadoop Proxy User access control mechanism.
JindoSDK Hadoop support
JindoSDK has built-in support for accessing the Alibaba Cloud OSS-HDFS service (JindoFS service), providing a comprehensive HDFS API access and usage experience.
JindoShell CLI support
Provides additional command extensions for Hadoop and HDFS Shell, offering Hadoop-oriented operations for the Alibaba Cloud OSS-HDFS service (JindoFS service).
Allows you to import and set user group information (UserGroupsMapping) using commands.
Allows you to set Hadoop Proxy User rules using commands.
JindoFuse POSIX support
Provides an optimized Fuse client for the Alibaba Cloud OSS-HDFS service (JindoFS service). It benefits from a full native code implementation, which significantly improves performance.
Known issues
JindoSDK does not support writing files larger than 80 GB to OSS.
JindoSDK does not support writing to OSS in append mode.
JindoSDK does not support client-based encryption for OSS.
JindoSDK does not support older versions of JindoFS in Block mode or Cache mode.
The Alibaba Cloud OSS-HDFS service (JindoFS service) does not support system upgrades from older versions of JindoFS in Block mode. You must use the JindoDistCp migration tool to migrate data from the old system to the new service.