If you want to mount OSS-HDFS to a local file system and then read, write, and delete objects in OSS-HDFS in the same way as local files, you can use JindoFuse. JindoFuse is a POSIX-compatible tool for accessing open source distributed file systems. With JindoFuse, AI applications can use OSS-HDFS directly for data storage and processing.
Preparations
You can use one of the following methods to access OSS-HDFS:
If you want to access OSS-HDFS by using an Alibaba Cloud EMR cluster, make sure that an EMR cluster whose version is 3.44.0 or later or 5.10.0 or later is created. EMR clusters that meet the preceding requirements are integrated with JindoFuse by default. For more information, see Create a cluster.
If you do not want to access OSS-HDFS by using an Alibaba Cloud EMR cluster, make sure that JindoSDK 4.6.2 or later is installed and deployed. For more information, see Deploy JindoSDK in an environment other than EMR.
Procedure
Configure environment variables.
If you want to access OSS-HDFS by using an Alibaba Cloud EMR cluster, skip this step and proceed to Step 2.
If you do not want to access OSS-HDFS by using an Alibaba Cloud EMR cluster, perform the following steps to configure JindoFuse:
Connect to the ECS instance. For more information, see Connect to an instance.
Modify environment variables.
In this example, jindosdk-x.x.x is installed in the /root/ path. x.x.x indicates the version number of JindoSDK. Modify the environment variables based on the actual path in which JindoSDK is installed.
export JINDOSDK_HOME=/root/jindosdk-x.x.x
export HADOOP_CLASSPATH=`hadoop classpath`:${JINDOSDK_HOME}/lib/*
export JINDOSDK_CONF_DIR=/root/jindosdk-x.x.x/conf
export PATH=$PATH:$JINDOSDK_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${JINDOSDK_HOME}/lib/native
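Before you continue, it can help to confirm that the variables point at a real installation. The following is a minimal sketch and not part of JindoSDK: check_jindosdk_env is a hypothetical helper, and it only assumes the bin/, lib/, and conf/ layout implied by the exports above.

```shell
# Hypothetical helper (not shipped with JindoSDK): checks that the
# directories referenced by the exports above exist under the given
# JindoSDK home directory. Defaults to $JINDOSDK_HOME.
check_jindosdk_env() {
  local home="${1:-$JINDOSDK_HOME}"
  local sub
  local rc=0
  if [ -z "$home" ]; then
    echo "JINDOSDK_HOME is not set" >&2
    return 1
  fi
  for sub in bin lib conf; do
    if [ ! -d "$home/$sub" ]; then
      echo "missing directory: $home/$sub" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

For example, run check_jindosdk_env after you source the exports. A non-zero exit status means that at least one expected directory was not found.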
Configure the configuration file.
Create a configuration file named jindosdk.cfg in the conf/ directory of JindoSDK.
Add the following configuration items to the jindosdk.cfg configuration file:
[common]
logger.dir = /tmp/fuse-log
[jindosdk]
<!-- In this example, the China (Hangzhou) region is used. Specify your actual region. -->
fs.oss.endpoint = cn-hangzhou.oss-dls.aliyuncs.com
<!-- Configure the AccessKey ID and AccessKey secret that are used to access OSS-HDFS. -->
fs.oss.accessKeyId = LTAI5tJCTj5SxJepqxQ2****
fs.oss.accessKeySecret = i0uLwyd0mHxXetZo7b4j4CXP16****
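If you prefer not to type credentials into the file by hand, the configuration above could be generated from environment variables. This is a sketch under stated assumptions: write_jindosdk_cfg is a hypothetical helper, and OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET are variable names chosen for this example, not names that JindoSDK reads.

```shell
# Hypothetical helper: writes a minimal jindosdk.cfg into the given conf
# directory. OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET are assumed
# environment variable names used only by this sketch.
write_jindosdk_cfg() {
  local conf_dir="$1"
  local endpoint="${2:-cn-hangzhou.oss-dls.aliyuncs.com}"
  if [ -z "$OSS_ACCESS_KEY_ID" ] || [ -z "$OSS_ACCESS_KEY_SECRET" ]; then
    echo "set OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET first" >&2
    return 1
  fi
  mkdir -p "$conf_dir" || return 1
  cat > "$conf_dir/jindosdk.cfg" <<EOF || return 1
[common]
logger.dir = /tmp/fuse-log
[jindosdk]
fs.oss.endpoint = ${endpoint}
fs.oss.accessKeyId = ${OSS_ACCESS_KEY_ID}
fs.oss.accessKeySecret = ${OSS_ACCESS_KEY_SECRET}
EOF
  # The file holds credentials, so restrict it to the current user.
  chmod 600 "$conf_dir/jindosdk.cfg"
}
```

For example, write_jindosdk_cfg "$JINDOSDK_CONF_DIR" writes the file for the China (Hangzhou) region. Pass a different endpoint as the second argument for another region.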
Mount OSS-HDFS.
Run the following command to create a mount point:
mkdir -p <mount_point>
Run the following command to mount OSS-HDFS:
jindo-fuse <mount_point> -ouri=[<oss_path>]
You must set -ouri to the dls path that you want to map. The path can be the root directory or a subdirectory of the bucket. After you run the command, a background daemon process mounts <oss_path> to the local mount point that is specified by <mount_point>.
For more information about the mount options that you can configure when you mount OSS-HDFS, see Appendix 2: Mount options.
Run the following command to check whether OSS-HDFS is mounted:
ps -ef | grep jindo-fuse
If the following result is returned, OSS-HDFS is mounted:
root 2162 1 0 13:21 ? 00:00:00 jindo-fuse <mount_point> -ouri=[<oss_path>]
root 2714 2640 0 13:39 pts/0 00:00:00 grep --color=auto jindo-fuse
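The ps output shows that the process is running, but not that the directory is actually mounted. As a complementary check, the mount table can be inspected. The following sketch assumes a Linux host where mounts are listed in /proc/mounts. is_mounted is a hypothetical helper, and the second argument exists only so that the mount table file can be overridden.

```shell
# Hypothetical helper: succeeds if the given directory appears as a mount
# point (second field) in the mount table. The mount table defaults to
# /proc/mounts, which is assumed to exist (Linux).
is_mounted() {
  local mount_point="${1%/}"            # drop a trailing slash
  local mounts_file="${2:-/proc/mounts}"
  [ -n "$mount_point" ] || mount_point="/"
  awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"
}
```

For example, is_mounted /mnt/oss && echo "mounted" prints a message only if /mnt/oss is currently a mount point. Paths that contain spaces are escaped in /proc/mounts and are not handled by this sketch.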
Use JindoFuse to perform read and write operations on objects in OSS-HDFS.
Create a directory
mkdir /mnt/oss/dir1
List the contents of the /mnt/oss/ directory
ls /mnt/oss/
Write an object
echo "hello world" > /mnt/oss/dir1/hello.txt
Read an object
cat /mnt/oss/dir1/hello.txt
Delete a directory
rm -rf /mnt/oss/dir1/
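The operations above can be combined into a quick write-and-read-back check of a new mount. This is only a sketch: fuse_smoke_test is a hypothetical helper, and it works in a scratch directory named after the shell's process ID so that existing data such as dir1 is not touched.

```shell
# Hypothetical smoke test: creates a scratch directory under the given
# base path, writes a file, reads it back, and cleans up.
fuse_smoke_test() {
  local base="${1:-/mnt/oss}"
  local dir="$base/jindofuse-smoke-$$"
  local content
  mkdir -p "$dir" || return 1
  echo "hello world" > "$dir/hello.txt" || return 1
  content=$(cat "$dir/hello.txt") || return 1
  rm -rf "$dir"
  if [ "$content" = "hello world" ]; then
    echo "OK"
  else
    echo "MISMATCH" >&2
    return 1
  fi
}
```

For example, fuse_smoke_test /mnt/oss prints OK if the data that is read back matches the data that was written.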
Optional. Unmount OSS-HDFS.
You can unmount OSS-HDFS by using one of the following methods:
Manually unmount OSS-HDFS
umount <mount_point>
Automatically unmount OSS-HDFS
-oauto_unmount
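If you specify this option when you mount OSS-HDFS, OSS-HDFS is automatically unmounted after the jindo-fuse process exits, for example, after you stop the process by running killall -9 jindo-fuse. Note that killall -9 sends SIGKILL. To send SIGINT instead, run killall -INT jindo-fuse.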
FAQ
How do I troubleshoot JindoFuse errors?
If you use JindoSDK to call API operations, you can view the details of the error messages when errors are reported. If you use JindoFuse, you can view only the preset error messages of the operating system.
ls: /mnt/oss/: Input/output error
To identify the cause of an error, you must find the jindosdk.log file in the path that is specified by the logger.dir configuration item of JindoSDK. The following message is a common authentication error message that may appear when you use JindoFuse:
EMMDD HH:mm:ss jindofs_connectivity.cpp:13] Please check your Endpoint/Bucket/RoleArn.
Failed test connectivity, operation: mkdir, errMsg: [RequestId]: 618B8183343EA53531C62B74 [HostId]: oss-cn-shanghai-internal.aliyuncs.com [ErrorMessage]: [E1010]HTTP/1.1 403 Forbidden ...
If the preceding error message appears, check whether the endpoint, the bucket, and the role ARN are properly configured. For more information, see Connect non-EMR clusters to OSS-HDFS.
If a program error occurs, submit a ticket.
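To find the underlying cause of an Input/output error more quickly, the log can be filtered for lines that carry an [ErrorMessage] field, as in the sample above. The following is a sketch: show_jindo_errors is a hypothetical helper, and its default directory /tmp/fuse-log is the logger.dir value used in this topic's example configuration.

```shell
# Hypothetical helper: prints the most recent lines of jindosdk.log that
# contain an [ErrorMessage] field. Arguments: log directory (defaults to
# the logger.dir value from the example configuration) and line count.
show_jindo_errors() {
  local log_dir="${1:-/tmp/fuse-log}"
  local count="${2:-20}"
  local log_file="$log_dir/jindosdk.log"
  if [ ! -f "$log_file" ]; then
    echo "log file not found: $log_file" >&2
    return 1
  fi
  grep -F '[ErrorMessage]' "$log_file" | tail -n "$count"
}
```

For example, show_jindo_errors /tmp/fuse-log 5 prints up to the last five matching lines. No output means that no line in the log contains the field.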
Appendix 1: Supported operations
The following table describes the POSIX-based API operations that are supported by JindoFuse.
Operation | Description |
getattr() | Queries the attributes of an object. This operation is similar to the ls command. |
mkdir() | Creates a directory. This operation is similar to the mkdir command. |
rmdir() | Deletes an empty directory. This operation is similar to the rmdir command. |
unlink() | Deletes an object. This operation is similar to the unlink command. |
rename() | Renames an object or a directory. This operation is similar to the mv command. |
read() | Reads data sequentially. |
pread() | Reads data at a specified offset (random read). |
write() | Writes data sequentially. |
pwrite() | Writes data at a specified offset (random write). |
flush() | Flushes data from the memory to the kernel cache. |
fsync() | Flushes data from the memory to disks. |
release() | Closes an object. |
readdir() | Reads a directory. |
create() | Creates an object. |
open() O_APPEND | Opens an object by using the append mode. |
open() O_TRUNC | Opens an object by using the overwrite mode. |
ftruncate() | Truncates an opened object. |
truncate() | Truncates a closed object. This operation is similar to the truncate -s command. |
lseek() | Specifies the read and write location in an open object. |
chmod() | Modifies the permissions on an object. This operation is similar to the chmod command. |
access() | Queries the permissions on an object. |
utimes() | Modifies the access time and modification time of an object. |
setxattr() | Sets an extended attribute (xattr) of an object. |
getxattr() | Queries an extended attribute (xattr) of an object. |
listxattr() | Lists the extended attributes (xattrs) of an object. |
removexattr() | Deletes an extended attribute (xattr) of an object. |
lock() | Supports POSIX locks. This operation is similar to the fcntl() system call. |
fallocate() | Pre-allocates physical space to an object. |
symlink() | Creates a symbolic link. The symbolic link is available only in OSS-HDFS and does not support cache acceleration. |
readlink() | Reads a symbolic link. |
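To see how a few of these operations correspond to everyday commands, the following sketch runs a short sequence in a directory of your choice. exercise_posix_ops is a hypothetical helper, and the operation names in the comments are approximate: which calls actually reach JindoFuse depends on the kernel and on how each tool is implemented. The truncate command is assumed to be available (GNU coreutils).

```shell
# Hypothetical helper: runs a short sequence of file operations in the
# given directory. Comments name the operation each command roughly
# corresponds to in the table above.
exercise_posix_ops() {
  local dir="$1"
  local f="$dir/sample.txt"
  printf 'abcdef' > "$f" || return 1       # create() and write()
  printf 'gh' >> "$f" || return 1          # open() with O_APPEND
  truncate -s 4 "$f" || return 1           # truncate()
  chmod 600 "$f" || return 1               # chmod()
  mv "$f" "$dir/renamed.txt" || return 1   # rename()
  cat "$dir/renamed.txt" || return 1       # open() and read()
  echo
  rm "$dir/renamed.txt"                    # unlink()
}
```

For example, exercise_posix_ops /mnt/oss/dir1 prints the first four bytes of the written data and leaves no file behind.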
Appendix 2: Mount options
The following table describes the options that you can configure to use JindoFuse to mount objects from OSS-HDFS to a local file system.
Option | Required | Description | Example |
uri | Yes | The dls path that you want to map. The path can be the root directory of the bucket, such as -ouri=oss://bucket.endpoint/. It can also be a subdirectory of the bucket, such as -ouri=oss://bucket.endpoint/subdir. | -ouri=oss://examplebucket.cn-beijing.oss-dls.aliyuncs.com/ |
f | No | Starts the JindoFuse process in the foreground. By default, JindoFuse runs as a daemon process in the background. If you use this option, we recommend that you enable terminal logs. | -f |
d | No | Enables the debug mode. If you enable the debug mode, the JindoFuse process starts in the foreground. If you use this option, we recommend that you enable terminal logs. | -d |
auto_unmount | No | Automatically unmounts the mount point after the JindoFuse process exits. | -oauto_unmount |
ro | No | Mounts objects from OSS-HDFS in read-only mode. After you enable this option, you cannot perform write operations. | -oro |
direct_io | No | Allows object read and write without the need for page cache. | -odirect_io |
kernel_cache | No | Uses the kernel cache to optimize read performance. | -okernel_cache |
auto_cache | No | Enables automatic caching by default. Unlike kernel_cache, auto_cache enables automatic flushing of the cache if the object size or the time at which the object is modified changes. | -oauto_cache |
entry_timeout | No | The retention period of the object name in the cache when the object is read. Unit: seconds. This option is used to optimize performance. The value 0 specifies that the object name is not cached. Default value: 0.1. | -oentry_timeout=60 |
attr_timeout | No | The retention period of the object attributes in the cache. Unit: seconds. This option is used to optimize performance. The value 0 specifies that the object attributes are not cached. Default value: 0.1. | -oattr_timeout=60 |
negative_timeout | No | The retention period of the object name in the cache if the object fails to be read. Unit: seconds. This option is used to optimize performance. The value 0 specifies that the object name is not cached. Default value: 0.1. | -onegative_timeout=0 |
jindo_entry_size | No | The number of directories that are cached. This option is used to optimize readdir performance. The value of 0 indicates that the directories are not cached. Default value: 5000. | -ojindo_entry_size=5000 |
jindo_attr_size | No | The number of object attributes that are cached. This option is used to optimize getattr performance. The value 0 specifies that the object attributes are not cached. Default value: 50000. | -ojindo_attr_size=50000 |
max_idle_threads | No | The maximum number of idle threads. Default value: 10. | -omax_idle_threads=10 |
metrics_port | No | Enables the HTTP port to output metrics, such as http://localhost:9090/brpc_metrics. Default value: 9090. | -ometrics_port=9090 |
enable_pread | No | Calls the pread operation to read objects. | -oenable_pread |
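Several of these options are often used together, for example a read-only mount with longer metadata cache timeouts. The following sketch only prints such a command so that it can be reviewed before it is run. build_mount_cmd is a hypothetical helper, and it assumes that options can be combined as separate -o flags on one command line, as is common for FUSE programs.

```shell
# Hypothetical helper: prints (does not run) a jindo-fuse command for a
# read-only mount. The third argument sets both entry_timeout and
# attr_timeout, in seconds. Combining options as separate -o flags is an
# assumption of this sketch.
build_mount_cmd() {
  local mount_point="$1"
  local oss_path="$2"
  local timeout="${3:-60}"
  printf 'jindo-fuse %s -ouri=%s -oro -oentry_timeout=%s -oattr_timeout=%s\n' \
    "$mount_point" "$oss_path" "$timeout" "$timeout"
}
```

For example, build_mount_cmd /mnt/oss oss://examplebucket.cn-beijing.oss-dls.aliyuncs.com/ prints a command that uses the default timeout of 60 seconds.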
Appendix 3: Configuration items
Item | Configuration node | Description |
logger.dir | common | The directory in which logs are stored. Default value: /tmp/jindodata-log. |
logger.sync | common | The mode in which logs are returned. Valid values: |
logger.consolelogger | common | Specifies whether to display logs. Valid values: |
logger.level | common | Returns logs whose levels are greater than or equal to the value of this configuration item. If terminal logs are enabled, the valid values are 0 to 6, which map to the following log levels: 0: TRACE, 1: DEBUG, 2 (default): INFO, 3: WARN, 4: ERROR, 5: CRITICAL, 6: OFF. If terminal logs are disabled, a value that is less than or equal to 1 indicates WARN, and a value that is greater than 1 indicates INFO. |
logger.verbose | common | Returns Verbose logs whose levels are greater than or equal to the value of this configuration item. Valid values: 0 to 99. Default value: 0. The value 0 specifies that no Verbose logs are returned. |
logger.cleaner.enable | common | Specifies whether to enable log cleanup. Valid values: |
fs.oss.endpoint | jindosdk | The endpoint that is used to access OSS-HDFS. Example: cn-hangzhou.oss-dls.aliyuncs.com. |
fs.oss.accessKeyId | jindosdk | The AccessKey ID that is used to access OSS-HDFS. |
fs.oss.accessKeySecret | jindosdk | The AccessKey secret that is used to access OSS-HDFS. |