Download data by using the distributed P2P caching feature - E-MapReduce

Peer-to-peer (P2P) caching in the JindoFSx client can be regarded as a form of local caching. Unlike the original local caching feature, P2P caching also allows a client to read data from other clients. When a client requests a data block that is not in its local cache, the client pulls the block from other clients that hold it. If the client cannot pull the data from other clients, the data is read from remote servers or by using the Security Token Service (STS) token. This topic describes how to use the distributed P2P caching feature to download data.
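The read order described above can be sketched in a few lines of Python. This is a conceptual illustration only, not JindoFSx code: the `read_block` function, the dictionaries that stand in for caches, and the `fetch_remote` callback are hypothetical names introduced for this example, and storing a copy of each fetched block locally is an assumption.

```python
from typing import Callable, Dict, Iterable, Tuple


def read_block(
    block_id: str,
    local_cache: Dict[str, bytes],
    peer_caches: Iterable[Dict[str, bytes]],
    fetch_remote: Callable[[str], bytes],
) -> Tuple[bytes, str]:
    """Return (data, source) for a block, trying the nearest source first."""
    # 1. Serve the block from this client's own cache if it is present.
    if block_id in local_cache:
        return local_cache[block_id], "local"

    # 2. Otherwise try to pull it from another client that holds it.
    for peer in peer_caches:
        if block_id in peer:
            data = peer[block_id]
            local_cache[block_id] = data  # assumption: keep a local copy
            return data, "peer"

    # 3. Fall back to remote storage when no peer can serve the block.
    data = fetch_remote(block_id)
    local_cache[block_id] = data  # assumption: keep a local copy
    return data, "remote"
```

In this sketch, a second read of the same block is answered from the local cache, which is the behavior that makes P2P caching resemble local caching.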

Prerequisites

An EMR cluster of V3.42.0 or a later minor version, or of V5.6.0 or a later minor version, is created in the EMR console, and the JindoData service is selected from the optional services during cluster creation. For more information, see Create a cluster.

Procedure

Note: In this topic, an EMR V3.42.0 cluster is used.

Step 1: Configure the server
Step 2: Configure JindoSDK
Step 3: Use the distributed P2P caching feature

Step 1: Configure the server

Go to the common tab of the JindoData service.
1. Log on to the EMR on ECS console.
2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
3. On the EMR on ECS page, find the cluster that you want to manage and click Services in the Actions column.
4. Click Configure in the JindoData section.
5. Click the common tab.

Add configuration items.

On the common tab, click Add Configuration Item.

In the Add Configuration Item dialog box, add the configuration items described in the following table.

For more information about how to add configuration items, see the "Add configuration items" section in the Add configuration items topic.


Category	Parameter	Description
Server	jindofsx.p2p.tracker.thread.number	The number of threads that can be used by a coordinator node. In most cases, this parameter is set to 1. If the number of clients exceeds 1,000, you can specify a larger value. If the value is less than 1, the distributed P2P caching feature is disabled.
Server	jindofsx.p2p.file.prefix	The prefixes of paths from which files are downloaded by using the P2P caching feature. If multiple file paths are included, separate the paths with commas (,). Files are downloaded by using the P2P caching feature only if their paths match one of the prefixes. When you use a path that is mounted to JindoFSx by using a unified namespace at the application layer to download data, set this parameter to the actual path of the object. Example: `oss://bucket1/data-dir1/,oss://bucket2/data-dir2/`.
Client	fs.jindofsx.p2p.cache.capacity.limit	The maximum amount of memory that can be occupied by downloads that are performed by using the distributed P2P caching feature. Unit: bytes. Default value: 5 GB. Minimum value: 1 GB.
Client	fs.jindofsx.p2p.download.parallelism.per.file	The number of concurrent downloads when a single file is downloaded by using the P2P caching feature. For example, you can set this parameter to 5.
Client	fs.jindofsx.p2p.download.thread.pool.size	The total size of the thread pool used by downloads that are performed by using the P2P caching feature. For example, you can set this parameter to 5.

Click OK.
In the Save dialog box, configure the Execution Reason parameter and click Save.
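Before you save, it can help to sanity-check the values against the rules in the preceding table. The following Python sketch is one way to do that offline. The `check_p2p_settings` helper is a hypothetical name introduced for this example, and the sketch assumes that 1 GB means 2**30 bytes, which the table does not state.

```python
GIB = 1024 ** 3  # assumption: 1 GB is taken as 2**30 bytes

DEFAULT_CAPACITY = 5 * GIB  # default stated in the table
MIN_CAPACITY = 1 * GIB      # minimum stated in the table


def check_p2p_settings(settings):
    """Return a list of problems found in a dict of parameter -> string value."""
    problems = []

    # A tracker thread number below 1 disables distributed P2P caching.
    threads = settings.get("jindofsx.p2p.tracker.thread.number")
    if threads is not None and int(threads) < 1:
        problems.append(
            "jindofsx.p2p.tracker.thread.number < 1 disables P2P caching"
        )

    # Only paths that match one of the comma-separated prefixes use P2P.
    raw_prefixes = settings.get("jindofsx.p2p.file.prefix", "")
    prefixes = [p.strip() for p in raw_prefixes.split(",") if p.strip()]
    if not prefixes:
        problems.append("jindofsx.p2p.file.prefix is empty, so no path can match")

    # The capacity limit is given in bytes and has a 1 GB minimum.
    capacity = int(
        settings.get("fs.jindofsx.p2p.cache.capacity.limit", DEFAULT_CAPACITY)
    )
    if capacity < MIN_CAPACITY:
        problems.append(
            "fs.jindofsx.p2p.cache.capacity.limit is below the 1 GB minimum"
        )

    return problems


example = {
    "jindofsx.p2p.tracker.thread.number": "1",
    "jindofsx.p2p.file.prefix": "oss://bucket1/data-dir1/,oss://bucket2/data-dir2/",
    "fs.jindofsx.p2p.cache.capacity.limit": str(5 * GIB),
    "fs.jindofsx.p2p.download.parallelism.per.file": "5",
    "fs.jindofsx.p2p.download.thread.pool.size": "5",
}
print(check_p2p_settings(example))  # prints []
```

An empty list means that none of the documented constraints is violated. The sketch does not contact the cluster and does not check the two download parallelism parameters, because the table gives only example values for them.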

Restart the JindoData service.
1. On the Services tab of the JindoData service, choose More > Restart.
2. In the Restart JINDODATA Services dialog box, specify the execution reason and click OK.
3. In the Confirm message, click OK.

Step 2: Configure JindoSDK

Important: This step configures the client. You do not need to restart the JindoData service after you complete this step.

Go to the Configure tab.
1. Log on to the EMR on ECS console.
2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
3. On the EMR on ECS page, find the cluster that you want to manage and click Services in the Actions column.
4. On the Services tab, click Configure in the HADOOP-COMMON section.
5. Click the core-site.xml tab.

On the core-site.xml tab, modify the following configuration items.

For more information about how to add configuration items, see the "Add configuration items" section in the Add configuration items topic. For more information about how to modify configuration items, see the "Modify configuration items" section in the Modify configuration items topic.


Item	Parameter	Description
Configure the implementation class of Object Storage Service (OSS)	fs.AbstractFileSystem.oss.impl	Set the value to `com.aliyun.jindodata.oss.OSS`.
Configure the implementation class of Object Storage Service (OSS)	fs.oss.impl	Set the value to `com.aliyun.jindodata.oss.JindoOssFileSystem`.
Specify the engine type	fs.xengine	Set the value to `jindofsx`.
Configure the endpoint of the JindoFSx Namespace service	fs.jindofsx.namespace.rpc.address	Specify the value in the `${headerhost}:8101` format. Example: `master-1-1:8101`. Note: For more information about how to configure and use the Namespace service in high availability mode, see Configure and use the JindoFSx Namespace service in high availability mode.
Configure data caching for query acceleration	fs.jindofsx.data.cache.enable	Specifies whether to enable data caching for query acceleration. Valid values: `false` (default): disables the feature. `true`: enables the feature. Note: After you enable this feature, hot data blocks are cached on local disks. By default, this feature is disabled, and data is read from OSS or OSS-HDFS.
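For reference, the items in the preceding table correspond to standard Hadoop `<property>` entries in core-site.xml. The following Python sketch renders them in that layout, which can be useful for reviewing the values or for a client that is not managed through the console. The `render_core_site` helper is a hypothetical name, `master-1-1:8101` is the example address from the table, and the `false` value for `fs.jindofsx.data.cache.enable` is simply the documented default, not a recommendation.

```python
import xml.etree.ElementTree as ET

# Values taken from the table above; adjust them for your own cluster.
JINDOSDK_PROPERTIES = {
    "fs.AbstractFileSystem.oss.impl": "com.aliyun.jindodata.oss.OSS",
    "fs.oss.impl": "com.aliyun.jindodata.oss.JindoOssFileSystem",
    "fs.xengine": "jindofsx",
    "fs.jindofsx.namespace.rpc.address": "master-1-1:8101",  # example host
    "fs.jindofsx.data.cache.enable": "false",  # documented default
}


def render_core_site(properties):
    """Render a dict of name -> value as Hadoop configuration XML text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


print(render_core_site(JINDOSDK_PROPERTIES))
```

On an EMR cluster, the console remains the place to make these changes. The rendered XML is only a way to see the resulting entries in one place.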

Save the modifications.
1. On the Configure tab, click Save.
2. In the Save dialog box, configure the Execution Reason parameter, turn on Automatically Update Configurations, and then click Save.

Step 3: Use the distributed P2P caching feature

After you complete the preceding configurations, read requests for files whose paths match one of the prefixes specified by the jindofsx.p2p.file.prefix parameter are automatically processed by using the distributed P2P caching feature. You do not need to call other API operations. For example, if you run Hadoop shell commands to download such files to your local machine, the distributed P2P caching feature takes effect automatically.
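The prefix rule described above can be sketched as follows. The sketch assumes a plain string-prefix comparison, which may differ in detail from how JindoFSx matches paths. `matches_p2p_prefix` and `build_get_command` are hypothetical helper names, and the object path and local directory are made up for the example. The command that is built is the standard `hadoop fs -get` shell command.

```python
def matches_p2p_prefix(path, prefix_setting):
    """Return True if path starts with one of the comma-separated prefixes."""
    prefixes = [p.strip() for p in prefix_setting.split(",") if p.strip()]
    return any(path.startswith(prefix) for prefix in prefixes)


def build_get_command(src, dst):
    """Build the argument list for downloading src to the local path dst."""
    return ["hadoop", "fs", "-get", src, dst]


# Value of jindofsx.p2p.file.prefix from the example in Step 1.
setting = "oss://bucket1/data-dir1/,oss://bucket2/data-dir2/"
src = "oss://bucket1/data-dir1/part-00000"  # hypothetical object path

print(matches_p2p_prefix(src, setting))       # prints True
print(build_get_command(src, "/tmp/p2p-demo"))
```

To run the download on a cluster node, the argument list can be passed to `subprocess.run(..., check=True)`. The sketch only builds the command so that it can be inspected first.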

To verify whether the read request for a specific file is processed by using the distributed P2P caching feature, check the client logs. If your program prints INFO-level logs on the client, the logs contain the following information for read requests that are processed by using the distributed P2P caching feature:

P2P record for path:

If the preceding information exists, the read request of the file is processed by using the distributed P2P caching feature.
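The log check can be automated with a small script. The following Python sketch looks only for the marker text quoted above and makes no assumption about the rest of the log line. The function names are hypothetical, and the location of the client log depends on how your program is configured.

```python
P2P_MARKER = "P2P record for path:"


def p2p_log_lines(lines):
    """Return the lines that contain the P2P marker, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if P2P_MARKER in line]


def file_has_p2p_record(log_path):
    """Return True if the log file at log_path contains at least one P2P record."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        return any(P2P_MARKER in line for line in handle)
```

For example, `file_has_p2p_record("client.log")` returns `True` when at least one read request in that (hypothetical) log file was served through P2P caching, provided that the client logs at the INFO level.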