- Provides two storage modes:
- Block mode: JindoFS uses Object Storage Service (OSS) as the backend storage. In the
block storage mode, JindoFS stores data as blocks in OSS and uses Namespace Service
to maintain metadata. The block mode provides better performance than the cache mode
when you read and write data or query metadata. The block mode supports multiple storage
policies, including WARM, COLD, HOT, TEMP, and ALL_HDD. The WARM policy stores one
local backup and one backup in OSS. The COLD policy stores only one backup in OSS.
The HOT policy stores multiple local backups and one backup in OSS. The TEMP policy
stores only one local backup. The ALL_HDD policy stores multiple local backups. The
default storage policy is WARM. You can set different storage policies for directories.
- Cache mode: This mode is compatible with OSS. In the cache mode, JindoFS stores data
files as objects in OSS. When you access OSS objects stored in JindoFS, JindoFS can
cache data and metadata of these OSS objects in the local cluster so that you can
quickly access them next time. The cache mode provides multiple policies for you to
synchronize metadata as required.
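The block-mode storage policies described above can be summarized as a lookup table. The following Python sketch is illustrative only: `StoragePolicy`, `POLICIES`, and `describe` are hypothetical names, not part of any JindoFS API. Because these notes do not state how many copies "multiple" means, the local backup count is recorded as a label rather than a number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePolicy:
    """Where a block-mode policy keeps its backups (hypothetical model)."""
    local_backups: str  # "none", "one", or "multiple" (exact count not stated)
    oss_backup: bool    # True if one backup is kept in OSS


# Values transcribed from the policy descriptions in these release notes.
POLICIES = {
    "WARM": StoragePolicy(local_backups="one", oss_backup=True),
    "COLD": StoragePolicy(local_backups="none", oss_backup=True),
    "HOT": StoragePolicy(local_backups="multiple", oss_backup=True),
    "TEMP": StoragePolicy(local_backups="one", oss_backup=False),
    "ALL_HDD": StoragePolicy(local_backups="multiple", oss_backup=False),
}

DEFAULT_POLICY = "WARM"


def describe(policy_name: str = DEFAULT_POLICY) -> str:
    """Return a one-line summary of where a policy stores its backups."""
    policy = POLICIES[policy_name]
    oss = "one backup in OSS" if policy.oss_backup else "no backup in OSS"
    return f"{policy_name}: {policy.local_backups} local, {oss}"
```

Calling `describe()` with no argument summarizes the default WARM policy. How a policy is actually assigned to a directory is not covered by this sketch.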
- Offers an external client:
- JindoFS provides an external client so that you can access JindoFS from outside
an E-MapReduce cluster. You can use the external client to access namespaces in
the block mode. However, you cannot use the external client to access the data cached
by JindoFS in E-MapReduce clusters. In addition, data access through the external
client is slower than data access from within an E-MapReduce cluster.
- The cache mode is compatible with the existing OSS semantics. JindoFS accelerates
data access in E-MapReduce clusters through caching. Therefore, you can use the OSS
client to directly access JindoFS from outside E-MapReduce clusters. For example, you can
use the OSS SDK or OssFileSystem of E-MapReduce to access JindoFS from outside
E-MapReduce clusters.
- Supports multiple ecosystem components:
- Currently, JindoFS supports various computing engines in E-MapReduce, such as Spark,
Flink, Hive, MapReduce, Impala, and Presto.
- If you need to separate computing from storage, you can store data processing
logs and storage logs, such as Spark event logs and YARN container logs, in JindoFS.
- JindoFS can be used as the backend storage of HBase to store HFile files. This enhances
the storage capability of HBase.
- Added automatic detection of bad disks. This resolves cache write failures caused
by bad disks when you write data to OSS.
- Completed the configurations of OssFileSystem.
- Upgraded Bigboot to version 2.0.0.
- Supports multiple namespaces, OSS, multiple storage modes, and access from external
clients.
- Fixed the issue where the Bigboot monitor status is abnormal during server restart.
- Improved the service specifications of Kudu.
- Supports a validity check for the specification of each component.
- Relational cache
Supports using a relational cache to accelerate data queries through pre-computing.
You can create a relational cache to pre-compute data. During a data query, Spark
Optimizer automatically detects an appropriate relational cache, optimizes the SQL
execution plan, and continues data computing based on the relational cache. This accelerates
data queries. For example, you can use relational caches to implement multidimensional
online analytical processing (MOLAP), generate data reports, create data dashboards,
and synchronize data across clusters.
- Supports using DDL to perform operations such as CACHE, UNCACHE, ALTER, and SHOW.
A relational cache supports all data sources and data formats of Spark.
- Supports updating caches automatically or by using the REFRESH command. Supports incremental
caching based on the specified partitions.
- Supports optimizing the SQL execution plan based on a relational cache.
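The idea behind a relational cache, pre-computing a result once and letting later queries continue from it, can be illustrated without Spark. The following pure-Python sketch is a toy model, not the E-MapReduce implementation: `build_cache`, `total_sales`, and the sample rows are all hypothetical, and the real optimizer rewrites SQL execution plans rather than calling functions.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, amount).
SALES = [
    ("east", "a", 10),
    ("east", "b", 5),
    ("west", "a", 7),
    ("west", "a", 3),
]


def build_cache(rows):
    """Pre-compute SUM(amount) grouped by (region, product)."""
    cache = defaultdict(int)
    for region, product, amount in rows:
        cache[(region, product)] += amount
    return dict(cache)


def total_sales(region, rows, cache=None):
    """Answer SUM(amount) for one region.

    If a cache grouped by (region, product) is available, roll it up instead
    of scanning the base rows. This mimics continuing the computation from a
    pre-computed result.
    """
    if cache is not None:
        return sum(value for (r, _), value in cache.items() if r == region)
    return sum(amount for r, _, amount in rows if r == region)


cache = build_cache(SALES)
# Both paths must agree; the cached path reads pre-aggregated entries only.
assert total_sales("west", SALES, cache) == total_sales("west", SALES)
```

The cached path gives the same answer as the full scan because the cache is grouped on a superset of the columns the query filters on, which is the kind of match the optimizer has to detect before it can use a cache.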
- Streaming SQL
- Normalized the parameter settings of Stream Query Writer.
- Optimized the schema compatibility check of Kafka data tables.
- Supports automatically registering a schema with Schema Registry for a Kafka data
table that does not have a schema.
- Optimized log information recorded when a Kafka schema is incompatible.
- Fixed the issue where the column name must be explicitly specified when the query
result is written to a Kafka data table.
- Removed the restriction that streaming SQL queries only support the Kafka and LogHub
data sources.
- Added the Delta component. You can use Spark to create a Delta data source to perform
streaming data writing, transactional reading and writing, data verification, and
data backtracking. For more information, see Delta details.
- You can call the DataFrame API to read data from or write data to Delta.
- You can call the Structured Streaming API to read or write data by using Delta as
the data source or sink.
- You can call the Delta API to update, delete, merge, vacuum, and optimize data.
- You can use SQL statements to create Delta tables, import data to Delta, and read
data from Delta tables.
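The DataFrame and Structured Streaming access paths listed above might look like the following PySpark sketch. It assumes that the Delta component registers the same `delta` data source name as open source Delta Lake, which these notes do not confirm. The function names and all paths are hypothetical placeholders, and the caller supplies an existing `SparkSession` and DataFrame.

```python
def write_batch(df, path):
    """Append a Spark DataFrame to the Delta table at `path` (DataFrame API)."""
    df.write.format("delta").mode("append").save(path)


def read_batch(spark, path):
    """Load the Delta table at `path` as a DataFrame (DataFrame API)."""
    return spark.read.format("delta").load(path)


def stream_copy(spark, source_path, sink_path, checkpoint_path):
    """Use one Delta table as a streaming source and another as the sink.

    Returns the handle of the started streaming query.
    """
    stream = spark.readStream.format("delta").load(source_path)
    return (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .start(sink_path)
    )
```

The functions take the session and DataFrame as arguments so that the sketch does not depend on how a particular cluster creates its `SparkSession`.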
- Supports primary key and foreign key constraints.
- Resolved JAR conflicts such as the servlet conflict.
- Supports the log rollback feature of Log4j.
- Upgraded Fastjson.
- Upgraded the dependent commons-lang3 package to version 3.7 to fix the issue where
PySpark cannot write data to OSS. For more information, see Spark 2.4 incompatibility with commons-lang3 in Zeppelin.
- Supports the SHOW GRANTS command.
- Fixed the NumPy installation error.
- Supports Apache Kudu 1.10.0.
- Upgraded Presto to version 0.221.
- Upgraded ZooKeeper to version 3.5.5.