SmartData is a storage service for the E-MapReduce (EMR) Jindo engine. SmartData provides centralized storage and optimized caching and computing for EMR computing engines and extends storage features. SmartData is composed of JindoFS, JindoTable, and related tools. This topic describes the updates in SmartData 3.2.X.
OSS storage scalability on JindoFS
- JindoFS uses multiple password-free methods to obtain a token that is used to access OSS. The token can be customized or extended.
- Alibaba Cloud Tablestore is used to enforce mutual exclusion between rename operations that run at the same time.
- Data can be written to OSS by using Delta or Hudi.
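The mutual exclusion idea for concurrent rename operations can be illustrated with a minimal sketch. It is not the JindoFS implementation and does not call the Tablestore API: `LockTable` is a hypothetical in-memory stand-in for a table that supports insert-if-absent writes, and `exclusive_rename`, the owner names, and the paths are assumptions made for illustration only.

```python
import threading


class LockTable:
    """Hypothetical in-memory stand-in for a table with insert-if-absent rows."""

    def __init__(self):
        self._rows = {}
        self._guard = threading.Lock()

    def put_if_absent(self, key, owner):
        # Succeeds only when no row exists for this key yet.
        with self._guard:
            if key in self._rows:
                return False
            self._rows[key] = owner
            return True

    def delete_if_owner(self, key, owner):
        # Removes the row only when it is held by the given owner.
        with self._guard:
            if self._rows.get(key) == owner:
                del self._rows[key]
                return True
            return False


def exclusive_rename(table, src, dst, owner, do_rename):
    """Run do_rename(src, dst) only if this caller wins the lock row for src."""
    if not table.put_if_absent(src, owner):
        return False  # another rename on the same source is in progress
    try:
        do_rename(src, dst)
        return True
    finally:
        # Release the lock row whether or not the rename succeeded.
        table.delete_if_owner(src, owner)
```

In this sketch, a second caller that tries to rename the same source path while the lock row exists gets `False` and can retry or fail, instead of racing with the first rename.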
JindoFS-based caching optimization
JindoFS optimizes metadata caching for workloads that contain a large number of small files, such as AI training scenarios, and improves the performance of metadata preloading operations and list operations.
JindoTable-based computing optimization
- JindoTable is integrated with AliORC to provide a native Optimized Row Columnar (ORC) reader. Spark and Presto can use this reader to read ORC files, which accelerates data reading and improves computing performance.
- JindoTable can be used to collect access frequency statistics of Hive tables for Presto.
Ecosystem support for JindoFS
When you use Spark to write data to OSS, you can set spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs to false to avoid generating a _SUCCESS file.
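One way to pass this setting is as a `--conf` option of `spark-submit`. The following minimal sketch builds such a command line. The property name comes from the text above. The helper `spark_submit_args` and the script name `write_to_oss.py` are hypothetical and used only for illustration.

```python
# Property name taken from the release note; everything else is illustrative.
MARK_SUCCESS_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"


def spark_submit_args(app, extra_conf=None):
    """Build a spark-submit argument list that disables the _SUCCESS marker."""
    conf = {MARK_SUCCESS_KEY: "false"}
    conf.update(extra_conf or {})
    args = ["spark-submit"]
    # Sort keys so that the generated command line is deterministic.
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Hypothetical application script name.
print(" ".join(spark_submit_args("write_to_oss.py")))
```

The same key-value pair can also be set in the Spark configuration of the job or cluster instead of on the command line.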