SmartData FAQ - E-MapReduce - Alibaba Cloud Documentation Center

Basic concepts

What is JindoFS?
Why do I need to use JindoFS when OSS is available for me?
What storage modes does JindoFS support? What are the use scenarios of each mode?
What are the differences between the client-only mode and cache mode?
What are the differences between the cache mode and block storage mode?
Can JindoFS in block storage mode read data over the OSS API?
OSS does not support rename operations. Does JindoFS support rename operations?
Does JindoFS provide high rename performance?
Does JindoFS support magic committers, such as Hadoop S3A Magic Committer?
How does JindoFS support a directory that contains millions of files or even tens of millions of files?
How does JindoFS ensure data reliability?
Does JindoFS provide strong consistency of operations on files and directories?
Does JindoFS support atomicity of operations on files and directories?
How does JindoFS in block storage mode ensure high availability?
JindoFS in block storage mode stores file metadata in each cluster. How do I recover the metadata of a stopped cluster to a newly created cluster?
Why do I need to use JindoFS in block storage mode while I can use HDFS in EMR?
What are the technical differences between JindoFS and Alluxio? What are the advantages of JindoFS over Alluxio?
Are JindoFS and OSS more cost-effective than HDFS?
Open source Hadoop can also access OSS. What are the advantages of JindoFS over open source Hadoop?
Does JindoFS provide Fuse? What are the advantages of Fuse provided by JindoFS over Fuse provided by OSS?
What is the relationship between SmartData and JindoFS in EMR?
What is the relationship between Bigboot and JindoFS in EMR?

Open source ecosystems

What open source components does JindoFS support?
Does JindoFS deliver a high data throughput that is required by Spark or Hive large-scale data processing?
Does JindoFS provide high write performance?
Does JindoFS support Realtime Compute for Apache Flink?
Can I use Presto for interactive analysis of data in OSS based on JindoFS?
Can I query data in OSS by using JindoFS when I use Impala?
Can I store data in OSS by using JindoFS when I use Delta Lake, Hudi, or Iceberg?
Does JindoFS support machine learning training for data stored in OSS?
How does JindoFS support machine learning training for data in MaxCompute?
How does JindoFS support machine learning training for data in Hive?
What do I do if Bigboot logs occupy too much space?

Upgrade and data migration

How do I migrate data in HDFS to JindoFS?
The version of my EMR cluster does not support JindoFS. How do I use JindoFS?
Which Hadoop versions and Hadoop vendors does JindoFS support?
Can I use JindoFS in a self-managed cluster that is deployed on Elastic Compute Service (ECS) instances?
Can I use JindoFS in Alibaba Cloud Container Service for Kubernetes (ACK)?
Is JindoFS bound to EMR?
Can I use JindoFS in a Hadoop cluster of a data center?

Known version-specific issues

What do I do if the disk space occupied by the /opt/bignode directory is extremely large and continues to increase?
What do I do if a data error occurs when I read cached data?

OSS

How do I view the data volume in JindoFS?
In which scenarios do I need to enable OSS bucket versioning?
What is the impact of OSS bucket versioning on EMR and JindoFS?
The Archive storage class of OSS helps reduce storage costs. Does JindoFS support the Archive storage class?

Security

Is it possible for my AccessKey pair to be disclosed when I use JindoFS?
What is AccessKey-free access?
How do I grant different users different permissions when I use the AccessKey-free access feature?
How do I use different AccessKey pairs to access different OSS buckets from JindoFS?
What do I do if I want to use a self-managed cluster in a data center but do not want to configure an AccessKey pair on the cluster nodes?
Does JindoFS support the AuditLog feature?
Does JindoFS support the Ranger service?

Usage

How do I manually undeploy a node in JindoFS?

What is JindoFS?

JindoFS is a Hadoop file system provided by Alibaba Cloud E-MapReduce (EMR). JindoFS provides multi-level encapsulation and optimization features to enable Hadoop and Spark to better use Alibaba Cloud Object Storage Service (OSS).

JindoFS is adapted to OSS and allows you to use JindoFS SDK to access OSS. You can use JindoFS in cache mode to cache OSS data in distributed mode. You can also use JindoFS in block storage mode together with some other advanced features of JindoFS to meet your business requirements in some special scenarios.

Why do I need to use JindoFS when OSS is available for me?

OSS is an object storage system that provides object-based RESTful APIs and SDKs for various programming languages. JindoFS provides Hadoop Compatible File System (HCFS) interfaces that you can use to access OSS. JindoFS also supports cache-based acceleration and allows you to customize advanced optimization features. Hadoop and Spark components rely on abstract HCFS interfaces. Therefore, JindoFS is required.

What storage modes does JindoFS support? What are the use scenarios of each mode?

You can use JindoFS in client-only, cache, or block storage mode. In client-only mode, JindoFS SDK (jindo-sdk_xxx.jar) is used.

Use scenarios of each mode:

Client-only mode: If you do not need to change the manner in which objects are organized in OSS, use this mode. To use JindoFS in this mode, you need only to upload the JAR package of JindoFS SDK to the classpath directory of your big data computing component.
Cache mode: If computing performance is limited by IOPS and storage throughput, use this mode. In cache mode, you can add disks or expand the capacity of disks on the core nodes of your EMR cluster and enable data caching.
Block storage mode: This mode is suitable for some special scenarios. For example, if you have high requirements for the performance of metadata-related operations and the consistency of metadata, use this mode.

What are the differences between the client-only mode and cache mode?

Both modes are fully compatible with OSS and allow you to read data from and write data to OSS by using the API and SDKs provided by OSS.

To use JindoFS in cache mode, you must deploy and configure the Jindo distributed caching service and enable data caching. If you use JindoFS in client-only mode, you do not need to perform these operations. If the caching service fails, the system switches to the client-only mode and directly reads data from or writes data to OSS.

What are the differences between the cache mode and block storage mode?

In block storage mode, you can manage the metadata of files, organize data blocks, and use OSS as disks in a similar way to HDFS. In block storage mode, you must use the JindoFS SDK client to read data from or write data to OSS.

In cache mode, JindoFS is compatible with OSS. Therefore, you can directly read data from and write data to OSS. For example, in cache mode, after you write a large file to OSS, you can find the file in an OSS directory in the OSS console. In block storage mode, you can find many file blocks, which you can combine into a large file only by using the JindoFS SDK client.

How can the consistency between data cached in JindoFS and data stored in OSS be ensured after data in OSS is updated?

When JindoFS reads data from an OSS object, it checks the metadata and MD5 information of the object and compares the information with the cached data. If the object has been modified, overwritten, or deleted, JindoFS reads data directly from OSS instead of from the cache and updates the cache.
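
The check can be pictured with a small sketch. This is not JindoFS code: two temporary files stand in for an OSS object and its cached copy, the helper names digest and read_through_cache are hypothetical, and md5sum from GNU coreutils is assumed to be installed. A real implementation would presumably compare digests kept in object metadata instead of re-reading the whole object.

```shell
#!/usr/bin/env bash
# Illustrative only: local files stand in for an OSS object and its cached copy.
origin="$(mktemp)"   # plays the role of the OSS object
cached="$(mktemp)"   # plays the role of the locally cached data

# Hypothetical helper: print the MD5 digest of a file.
digest() { md5sum < "$1" | cut -d' ' -f1; }

# Hypothetical helper: serve from the cache only if it still matches the origin.
read_through_cache() {
  if [ "$(digest "$origin")" != "$(digest "$cached")" ]; then
    cp "$origin" "$cached"   # stale cache: read from the origin and refresh the cache
  fi
  cat "$cached"
}

printf 'version 1\n' > "$origin"
first_read="$(read_through_cache)"    # the cache is empty, so it is filled from the origin

printf 'version 2\n' > "$origin"      # the object is overwritten
second_read="$(read_through_cache)"   # the digests differ, so the cache is refreshed

echo "$first_read"
echo "$second_read"
```

The second read returns the new content because the digest mismatch forces a refresh before the cached copy is served.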

Can JindoFS in block storage mode read data over the OSS API?

No, JindoFS in block storage mode cannot read data over the OSS API. JindoFS in block storage mode can read data only by using the JindoFS SDK client.

OSS does not support rename operations. Does JindoFS support rename operations?

Yes, JindoFS supports rename operations. You can use JindoFS to rename files and directories because JindoFS supports HCFS interfaces.

You cannot directly rename directories or files in OSS. JindoFS allows you to perform rename operations by simulating a file system. To rename an object, JindoFS first copies the object to the new location and then deletes the original object.
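
The copy-then-delete pattern can be sketched as follows. This is an illustration under stated assumptions, not JindoFS code: a temporary directory stands in for an OSS bucket, files stand in for objects, and the helper name rename_object is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: a temporary directory stands in for an OSS bucket.
bucket="$(mktemp -d)"
mkdir -p "$bucket/logs"
printf 'a\n' > "$bucket/logs/part-0"
printf 'b\n' > "$bucket/logs/part-1"

# Hypothetical helper: "rename" one object by copying it and deleting the source.
rename_object() {
  local src="$1" dst="$2"
  mkdir -p "$(dirname "$dst")"
  cp "$src" "$dst" || return 1   # step 1: copy to the new key
  # At this point both keys exist, so the operation is not atomic.
  rm "$src"                      # step 2: delete the original key
}

# Renaming a "directory" repeats the copy and delete for every object under it.
for src in "$bucket"/logs/*; do
  rename_object "$src" "$bucket/archive/$(basename "$src")"
done

ls "$bucket/archive"
```

Because every object is handled separately, a failure partway through the loop would leave some objects under the old prefix and some under the new one.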

Does JindoFS provide high rename performance?

The rename performance of JindoFS is better than that of open source solutions. OSS provides the Fast Copy feature to optimize the performance of copying large files. This enables JindoFS to rename a file much faster than the open source solutions. Each directory may contain a large number of files. JindoFS has fully optimized the parallelism of rename operations. This ensures that JindoFS can rename a directory much faster than the open source solutions.

Does JindoFS support magic committers, such as Hadoop S3A Magic Committer?

JindoFS supports magic committers that do not require rename operations.

How does JindoFS support a directory that contains millions of files or even tens of millions of files?

JindoFS supports parallel access to and memory optimization for large directories. Parallel access to a large directory does not cause out of memory (OOM) errors or task suspension.

How does JindoFS ensure data reliability?

Data reliability is ensured by OSS. JindoFS stores data in OSS no matter which storage mode is used, and local disks are used only to cache data.

Does JindoFS provide strong consistency of operations on files and directories?

Yes, JindoFS in block storage mode implements HDFS semantics, which supports strong consistency.

Does JindoFS support atomicity of operations on files and directories?

JindoFS in cache mode is compatible with OSS and is limited by the storage features of OSS. Therefore, JindoFS in cache mode does not support atomicity of cross-object operations. For example, a rename operation involves at least a source object and a destination object, and a rename operation on a directory involves even more objects.

JindoFS in block storage mode strictly implements HDFS semantics and supports atomicity of various operations such as rename operations.

How does JindoFS in block storage mode ensure high availability?

JindoFS in block storage mode allows you to deploy multiple Jindo Namespace Service nodes based on the distributed consistency protocol Raft. In addition, the metadata of a cluster can be asynchronously synchronized to an Alibaba Cloud Tablestore instance.

JindoFS in block storage mode stores file metadata in each cluster. How do I recover the metadata of a stopped cluster to a newly created cluster?

JindoFS in block storage mode can asynchronously synchronize the metadata of a cluster to an Alibaba Cloud Tablestore instance. If you stop a cluster after the cluster stops updating data and synchronizes all updates to Tablestore, all metadata of the cluster is stored in Tablestore. When you create a cluster, you can restore data from OSS and restore metadata from Tablestore to the cluster.

Note You must take note of the version compatibility when you create a cluster. For example, you can recover the metadata of an EMR V2.7.X cluster to another EMR V2.7.X cluster. However, the same operation between an EMR V2.6.X cluster and an EMR V2.7.X cluster may fail. If such a failure occurs, we recommend that you use the Jindo DistCp tool to synchronize data to OSS.

Why do I need to use JindoFS in block storage mode while I can use HDFS in EMR?

The technical architecture and features of JindoFS in block storage mode are similar to those of HDFS. Both JindoFS in block storage mode and HDFS can be used to manage file metadata and organize data, and both deliver strong consistency.

However, JindoFS in block storage mode has the following advantages over HDFS: JindoFS in block storage mode backs up data to OSS, supports auto scaling, is cost-effective, and does not require disk maintenance.

What are the technical differences between JindoFS and Alluxio? What are the advantages of JindoFS over Alluxio?


Item	JindoFS	Alluxio
Similarities	The technical architecture of JindoFS in cache mode is similar to that of Alluxio. Both JindoFS in cache mode and Alluxio support cache-based acceleration for OSS data and work on one master node and multiple worker nodes. The master node is used to maintain the location information of cache blocks, and the worker nodes are used to manage cache blocks and read and write data.
Differences	To access an oss:// path by using JindoFS, you need only to enable data caching.	To access an oss:// path by using Alluxio, you must mount an OSS bucket to a namespace and use alluxio:// in a command to access the OSS path.
	JindoFS supports only OSS data sources, and all performance optimizations are dedicated to OSS.	Alluxio supports many types of data sources and allows you to mount multiple data sources to the same namespace.
	JindoFS provides the SDK-based client-only mode to support access from various open source engines to OSS.	N/A.

Are JindoFS and OSS more cost-effective than HDFS?

HDFS does not support auto scaling. In some cases, your storage space may be insufficient or purchased space may be wasted.

OSS allows you to store large volumes of data, supports auto scaling, and allows you to archive cold data at low costs. JindoFS supports the tiered storage of hot and cold data and data archiving based on OSS. This way, JindoFS can help you reduce costs.

Open source Hadoop can also access OSS. What are the advantages of JindoFS over open source Hadoop?

OSS support in open source Hadoop depends on what the community provides, and only some basic features are supported.

JindoFS has the following advantages:

More comprehensive: Various open source engines can connect to OSS.
More active: JindoFS is updated or upgraded to support new features provided by OSS.
More advanced: JindoFS supports advanced cache-based acceleration, and JindoFS in block storage mode can meet your requirements for advanced customization.
More performant: The core code of JindoFS is written in native C++, which ensures higher performance for basic operations.

Does JindoFS provide Fuse? What are the advantages of Fuse provided by JindoFS over Fuse provided by OSS?

Yes, JindoFS provides Fuse. Fuse provided by JindoFS can fully use the capabilities enabled by the cache mode and block storage mode.

What open source components does JindoFS support?

JindoFS supports components that read and write data over HCFS interfaces. The components include Hadoop, Hive, Spark, Flink, Presto, HBase, Impala, Druid, Kafka, and Flume.

Does JindoFS deliver a high data throughput that is required by Spark or Hive large-scale data processing?

JindoFS supports asynchronous concurrent read operations based on the high concurrency capability of OSS. JindoFS allows you to use the concurrent multipart upload feature of OSS to split a large object into multiple parts and upload them concurrently. JindoFS has large advantages over open source Hadoop in terms of read and write throughput.
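
The idea behind a concurrent multipart upload can be sketched locally. The sketch below does not call OSS or JindoFS: directories under a temporary path stand in for the client side and the storage side, background cp jobs stand in for part uploads, and the part size of 10 bytes is chosen only to keep the example small.

```shell
#!/usr/bin/env bash
# Illustrative only: local directories stand in for the client and the object store.
work="$(mktemp -d)"
mkdir "$work/parts" "$work/uploaded"
printf '%s' "0123456789abcdefghijklmnopqrstuvwxyz" > "$work/object"

# Split the object into fixed-size parts (10 bytes each, the last one smaller).
split -b 10 "$work/object" "$work/parts/part-"

# "Upload" every part concurrently as a background job, then wait for all of them.
for part in "$work"/parts/part-*; do
  cp "$part" "$work/uploaded/" &
done
wait

# Assemble the parts in order and confirm that the result matches the source.
cat "$work"/uploaded/part-* > "$work/assembled"
cmp -s "$work/object" "$work/assembled" && echo "assembled object matches the source"
```

The parts can be transferred in any order because the final assembly step sorts them by part name.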

Both JindoFS in cache mode and JindoFS in block storage mode can cache data in local disks or memory of clusters. This feature significantly accelerates the queries of new data and data that is repeatedly read. Under the same cluster conditions, JindoFS delivers an equivalent or even higher throughput for Spark or Hive large-scale data processing than HDFS.

Does JindoFS provide high write performance?

By default, HDFS writes three replicas of each data block. JindoFS writes data to OSS and does not need to write additional replicas. Therefore, in most cases, JindoFS has higher write performance than HDFS.

Does JindoFS support Realtime Compute for Apache Flink?

Yes, JindoFS supports Realtime Compute for Apache Flink. JindoFS allows you to use Realtime Compute for Apache Flink to read data from OSS and use the checkpoint and sink mechanisms of Realtime Compute for Apache Flink to write data to OSS. Exactly-Once semantics are supported.

Can I use Presto for interactive analysis of data in OSS based on JindoFS?

Yes, both JindoFS in cache mode and JindoFS in block storage mode allow you to use Presto to perform interactive analysis of data in OSS.

Can I query data in OSS by using JindoFS when I use Impala?

Yes, Impala 3.4 and later support JindoFS. You can use JindoFS to read and write data.

Can I store data in OSS by using JindoFS when I use Delta Lake, Hudi, or Iceberg?

Yes, you can store data in OSS by using JindoFS when you use Delta Lake, Hudi, or Iceberg.

Does JindoFS support machine learning training for data stored in OSS?

Yes, JindoFS supports machine learning training for data stored in OSS. You can use JindoFS in cache mode to preload data in OSS to the memory or an SSD of your cluster. This way, a training engine can use JindoFuse to read the data from the memory or SSD.

How does JindoFS support machine learning training for data in MaxCompute?

You can use one of the following methods:

MaxCompute jobs write data to OSS by using external tables created in MaxCompute. Then, the training cluster uses JindoFS in cache mode and JindoFuse to load training data.
Use JindoTable to pull data from MaxCompute and write the data to JindoFS in cache mode. Then, use JindoFuse to load training data.

How does JindoFS support machine learning training for data in Hive?

The methods are similar to those for data in MaxCompute. For more information, see How does JindoFS support machine learning training for data in MaxCompute?

How do I migrate data in HDFS to JindoFS?

You can use Jindo DistCp to synchronize data in HDFS to JindoFS or OSS. Jindo DistCp has higher performance than Hadoop DistCp and can archive data in OSS.

The version of my EMR cluster does not support JindoFS. How do I use JindoFS?

If the cluster scale is small, we recommend that you create a cluster of a version that supports JindoFS and use JindoFS in the new cluster.

Which Hadoop versions and Hadoop vendors does JindoFS support?

The JindoFS SDK provides the OSS adaptation feature and supports minor versions later than Hadoop 2.7 and Hadoop 3.X.

Hortonworks Data Platform (HDP) developed by Hortonworks and Cloudera's Distribution Including Apache Hadoop (CDH) developed by Cloudera can be used, but a conflict may occur. To use HDP or CDH, you must set fs.oss.impl to JindoOssFileSystem.

Can I use JindoFS in a self-managed cluster that is deployed on Elastic Compute Service (ECS) instances?

Yes, you can use JindoFS in a self-managed cluster that is deployed on ECS instances. You need only to download JindoFS SDK and manually deploy JindoFS in the cluster. If you want to use JindoFS in cache mode or block storage mode, we recommend that you log on to the EMR console and use an EMR cluster.

Can I use JindoFS in Alibaba Cloud Container Service for Kubernetes (ACK)?

Yes, you can use JindoFS in Alibaba Cloud ACK.

Is JindoFS bound to EMR?

No, JindoFS is not bound to EMR. JindoFS provides standard HCFS interfaces and is fully compatible with open source ecosystems.

Can I use JindoFS in a Hadoop cluster of a data center?

Yes, you can use JindoFS in a Hadoop cluster of a data center. You can download the open source JindoFS SDK and deploy it by following instructions in the documentation.

How do I view the data volume in JindoFS?

You can run one of the following commands to view the data volume. In the commands, <path> specifies the path that you want to check.

hadoop fs -du <path>
hadoop fs -count <path>

In which scenarios do I need to enable OSS bucket versioning?

We recommend that you enable OSS bucket versioning for important data. OSS bucket versioning ensures that data is not lost if you delete data by mistake.

What is the impact of OSS bucket versioning on EMR and JindoFS?

We recommend that you do not enable OSS bucket versioning for intermediate result data of Hive or Spark and data that is frequently modified. OSS bucket versioning affects computing performance.

The Archive storage class of OSS helps reduce storage costs. Does JindoFS support the Archive storage class?

Yes, JindoFS supports the Archive storage class. JindoFS in block storage mode provides specific storage policies to support the Archive storage class.

Is it possible for my AccessKey pair to be disclosed when I use JindoFS?

JindoFS allows you to configure and use an AccessKey pair on a cluster, but an AccessKey pair that is stored in the cluster configuration may be disclosed. If the nodes of an EMR cluster or the ECS instances in your environment are bound to an ECS role, JindoFS can access OSS with the permissions of that role instead of an AccessKey pair.

What is AccessKey-free access?

EMR clusters support AccessKey-free access. This feature allows you to obtain an Alibaba Cloud Security Token Service (STS) token and use this token to access Alibaba Cloud resources, such as OSS buckets.

How do I grant different users different permissions when I use the AccessKey-free access feature?

AccessKey-free access alone is not suitable for scenarios in which different users require different permissions.

You can use one of the following methods to grant permissions to different users:

Use RAM to implement access control. If this method is used, RAM users are used to access OSS.
Use JindoFS to implement access control. If this method is used, Ranger is used to grant permissions.

Important JindoFS supports access control only on namespaces.

How do I use different AccessKey pairs to access different OSS buckets from JindoFS?

You can use multiple namespaces in JindoFS and configure the information about a single OSS bucket and the corresponding AccessKey pair for each namespace.

What do I do if I want to use a self-managed cluster in a data center but do not want to configure an AccessKey pair on the cluster nodes?

You can use a Hadoop credential provider. For more information, see Use a credential provider.

Does JindoFS support the AuditLog feature?

Yes, JindoFS supports the AuditLog feature. JindoFS allows you to configure multiple namespaces and enable the AuditLog feature on each namespace. The AuditLog feature is disabled by default.

Does JindoFS support the Ranger service?

Yes, JindoFS supports the Ranger service. JindoFS allows you to configure multiple namespaces and enable the Ranger service on each namespace. The Ranger service is disabled by default.

What is the relationship between SmartData and JindoFS in EMR?

SmartData is an EMR component, which provides the JindoFS service.

What is the relationship between Bigboot and JindoFS in EMR?

Bigboot is the infrastructure of the SmartData component. Bigboot provides features such as millisecond-level process monitoring and log cleaning for the services contained in the component.

What do I do if Bigboot logs occupy too much space?

This issue may occur in minor versions earlier than EMR V3.36.1 or EMR V5.2.1. To free up space, you must manually delete some log files. To reduce the volume of logs that are generated afterward, perform the following steps to change the log level from INFO to WARN.

1. Add a configuration item in the EMR console.
   a. On the Configure tab of the SmartData service page, click the namespace tab.
   b. Click Custom Configuration in the upper-right corner.
   c. In the Add Configuration Item dialog box, add the logger.level parameter and set it to 1.
      Note The default value is 0, which indicates INFO. The value 1 indicates WARN.
   d. Click OK.
2. Save the configuration.
   a. Click Save in the upper-right corner.
   b. In the Confirm Changes dialog box, specify Description and click OK.
3. Restart Jindo Namespace Service.
   a. Choose Actions > Restart Jindo Namespace Service in the upper-right corner.
   b. In the Cluster Activities dialog box, specify Description and click OK.
   c. In the Confirm message, click OK.

What do I do if the disk space occupied by the /opt/bignode directory is extremely large and continues to increase?

Problem description: In most cases, the /opt/bignode directory occupies several gigabytes of disk space. However, in SmartData 2.X, the disk space occupied by the /opt/bignode directory is extremely large and continues to increase.
Possible cause: The monitoring process unexpectedly exits. As a result, the Jindo Storage Service process is deemed abnormal and is repeatedly started.
Solution: Perform the following steps to troubleshoot the issue:
Find the node on which the issue occurs and run the following command as the root user to view information about the processes:
```
ps -aux | grep b2-storageservice | grep -v grep
```
In most cases, the monitoring process xxxx/services/b2-storageservice.spec and the Jindo Storage Service process xxxx/bin/b2-storageservice are started. If no information about the xxxx/services/b2-storageservice.spec process is displayed, the process unexpectedly exits and causes the issue. In this case, you must run the following command to manually kill the xxxx/bin/b2-storageservice process:
```
kill -9 <PID of b2-storageservice>
```
Note <PID of b2-storageservice> specifies the ID of the xxxx/bin/b2-storageservice process. The process ID is obtained when you view process information.
After you run the preceding command, the monitoring process and the Jindo Storage Service process are automatically recovered. The disk space occupied by the /opt/bignode directory becomes normal.
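
The manual check can also be scripted. The following sketch is an assumption-laden illustration, not an official tool: the helper name orphaned_storageservice_pid is hypothetical, and the sample process line uses a made-up path and PID so that the script can be run anywhere. On an affected node, you would pipe the output of the ps command into the helper instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads ps-style lines on stdin. If no line shows the
# monitoring process (services/b2-storageservice.spec), it prints the PID
# (second column) of the bin/b2-storageservice process. Otherwise it prints nothing.
orphaned_storageservice_pid() {
  awk '
    /services\/b2-storageservice\.spec/ { monitor_seen = 1; next }
    /bin\/b2-storageservice/            { pid = $2 }
    END { if (!monitor_seen && pid != "") print pid }
  '
}

# Sample input with a made-up path and PID.
sample_ps_output='root 4321 0.5 1.2 123456 7890 ? Sl 10:00 0:42 /opt/example/bin/b2-storageservice'

pid="$(printf '%s\n' "$sample_ps_output" | orphaned_storageservice_pid)"
if [ -n "$pid" ]; then
  echo "Monitoring process is missing; run: kill -9 $pid"
else
  echo "Monitoring process is running; no action is needed."
fi
```

The script only prints the kill command so that you can review the PID before you terminate the process.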

What do I do if a data error occurs when I read cached data?

Problem description: In SmartData 3.0 to SmartData 3.7, if JindoFS in block storage mode or cache mode is used and data caching is enabled, data is written as expected but may be corrupted when it is read. (In block storage mode, data caching is enabled by default.) For example, a data format error is reported when you read data from ORC or Parquet files, or an HFile format error is reported when you read HBase data.
Possible cause: Invalid cached data blocks are read due to a known defect in the program that you use to read cached data blocks.
Solution: Use one of the following methods based on your business requirements to work around the issue and prevent it from continuing to affect your business. Then, contact EMR technical support engineers to update the program.
- Method 1: Disable data caching.
  In most cases, the defect is caused by data caching. You can disable data caching to work around the issue. To disable data caching, perform the following steps:
  1. Go to the SmartData service page and click the Configure tab.
  2. In the Service Configuration section, click the client tab.
  3. Configure the parameter that matches your storage mode. If you use JindoFS in block storage mode, set jfs.data-cache.enable to false. If you use JindoFS in cache mode, set jfs.cache.data-cache.enable to false.
- Method 2: Clear cached data of a file in which an error occurs.
  If Method 1 is not applicable due to considerations such as performance, you can use this method.
  1. Run the following command to submit a cache clearing task:
```
jindo jfs -uncache <Full path of the file in which an error occurs>
```
  2. Run the following command to view the task status. Make sure that the cached data of the file is cleared.
```
jindo jfs -status <Full path of the file in which an error occurs>
```
  3. On the client tab in the Service Configuration section of the Configure tab on the SmartData service page, add a custom parameter whose name is storage.compaction.enable and value is false.
    For more information, see Add parameters.
  4. Restart Jindo Storage Service.
    For more information, see Restart a service.

How do I manually undeploy a node in JindoFS?

You can run the following command to undeploy a node:

jindo jfs -decommission <excludeFile>

Note In the command, <excludeFile> specifies the file that contains the names of nodes to be undeployed. The name of each node occupies one line. You can obtain the names of nodes by running the hostname command. The following code shows an example of the content in the file:

emr-worker-1.cluster-29****
emr-worker-2.cluster-29****
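
As a minimal sketch, the exclude file can be generated from a list of node names before you run the command. The helper name write_exclude_file and the node names below are placeholders for illustration. The decommission command is printed instead of executed so that the sketch has no side effects.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the given node names to a file, one per line.
write_exclude_file() {
  local out="$1"
  shift
  printf '%s\n' "$@" > "$out"
}

exclude_file="$(mktemp)"
# Placeholder node names; on a real node, the hostname command prints the actual name.
write_exclude_file "$exclude_file" \
  "emr-worker-1.cluster-example" \
  "emr-worker-2.cluster-example"

cat "$exclude_file"
# Print the command to run once the file content has been reviewed.
echo "jindo jfs -decommission $exclude_file"
```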