If you run a self-managed Hadoop cluster and want to use OSS-HDFS as your storage layer, JindoSDK lets you do that without modifying your existing Hadoop or Spark applications. This guide walks you through setting up a single-node Hadoop environment on an Alibaba Cloud ECS instance, installing JindoSDK, and verifying read/write access to OSS-HDFS using standard hdfs dfs commands.
How it works
JindoSDK acts as a FileSystem bridge: it intercepts HDFS API calls made to oss:// paths and routes them to OSS-HDFS. Because OSS-HDFS is fully compatible with the HDFS API and POSIX, your existing Hadoop and Spark jobs run unchanged.
OSS-HDFS supports both flat and hierarchical namespaces and automatically converts between them, giving you centralized metadata management. Its metadata layer uses a multi-node active-active redundancy mechanism (unlike the active-standby NameNode in traditional HDFS), which makes it more reliable and scalable. You can manage exabytes of data and hundreds of millions of objects with terabyte-scale throughput.
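To make the routing concrete, the following is a minimal Python sketch of the scheme-based dispatch Hadoop performs: the scheme of each path is looked up against a `fs.<scheme>.impl` configuration key, which is how the `fs.oss.impl` setting you add in Step 3 sends oss:// paths to JindoSDK. The class names and configuration keys mirror the real ones; the dispatch function itself is illustrative only, not Hadoop code.

```python
from urllib.parse import urlparse

# Mirrors the fs.<scheme>.impl keys set in core-site.xml (see Step 3).
FILESYSTEM_IMPLS = {
    "fs.oss.impl": "com.aliyun.jindodata.oss.JindoOssFileSystem",
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
}

def resolve_filesystem(path: str) -> str:
    """Return the FileSystem implementation class for a path's scheme."""
    scheme = urlparse(path).scheme or "hdfs"  # fall back to fs.defaultFS
    key = f"fs.{scheme}.impl"
    try:
        return FILESYSTEM_IMPLS[key]
    except KeyError:
        raise ValueError(f"No FileSystem for scheme: {scheme}") from None
```

This lookup is also why a missing JindoSDK classpath surfaces as the "No FileSystem for scheme: oss" error covered in Troubleshooting: without the JAR files, Hadoop cannot load the class that `fs.oss.impl` names.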
Prerequisites
Before you begin, make sure that you have:
An OSS bucket with OSS-HDFS enabled. For setup instructions, see Enable OSS-HDFS.
An Alibaba Cloud account with an AccessKey ID and AccessKey secret
Step 1: Create a VPC and an ECS instance
Using an internal endpoint gives you lower latency and better security when accessing OSS-HDFS from Alibaba Cloud resources. Create a virtual private cloud (VPC) and launch an Elastic Compute Service (ECS) instance inside it.
Create a VPC
Log on to the VPC console.
On the VPC page, click Create VPC.
Create the VPC in the same region as the bucket for which OSS-HDFS is enabled. For detailed steps, see Create a VPC and a vSwitch.
Create an ECS instance
Click the VPC ID. On the detail page, click the Resource Management tab.
In the Basic Cloud Resources Included section, click the icon next to Elastic Compute Service.
Create an ECS instance from the Instances page.
Create the ECS instance in the same region as the VPC. For detailed steps, see Create an instance.
Step 2: Set up a Hadoop runtime environment
Connect to the ECS instance, then complete the following setup in order.
Install Java
JindoSDK requires Java Development Kit (JDK) 1.8.0 or later.
Click Connect on the ECS instance. For connection methods, see Methods for connecting to an ECS instance.
Check the installed JDK version:
java -version

(Optional) If the JDK version is earlier than 1.8.0, remove it:
rpm -qa | grep java | xargs rpm -e --nodeps

Install JDK 1.8.0:
sudo yum install java-1.8.0-openjdk* -y

Open /etc/profile and add the following environment variables. If /usr/lib/jvm/java-1.8.0-openjdk does not exist, check /usr/lib/jvm/ for the correct path.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/jre/lib/rt.jar
export PATH=$PATH:$JAVA_HOME/bin

Apply the changes:
source /etc/profile
Enable SSH
Hadoop's startup scripts use SSH to manage local daemons, even in single-node deployments.
Install the SSH service:
sudo yum install -y openssh-clients openssh-server

Enable and start the SSH service:
systemctl enable sshd && systemctl start sshd

Generate an SSH key and add it to the trusted list:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Install Hadoop
This guide uses Hadoop 3.4.0. For other versions, see Apache Hadoop downloads.
Download and extract Hadoop:
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
tar xzf hadoop-3.4.0.tar.gz
mv hadoop-3.4.0 /usr/local/hadoop

Add Hadoop to your environment. Open /etc/profile and append:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$PATH

Then apply the changes:
source /etc/profile

Set JAVA_HOME in the Hadoop environment file. Open $HADOOP_HOME/etc/hadoop/hadoop-env.sh and replace ${JAVA_HOME} with the actual path:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

Configure $HADOOP_HOME/etc/hadoop/core-site.xml:
<configuration>
    <!-- Address of the HDFS NameNode. Replace localhost with the actual hostname if needed. -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <!-- Hadoop temporary directory. Grant the admin user ownership with:
         sudo chown -R admin:admin /opt/module/hadoop-3.4.0 -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.4.0/data/tmp</value>
    </property>
</configuration>

Configure $HADOOP_HOME/etc/hadoop/hdfs-site.xml:
<configuration>
    <!-- Replication factor. Set to 1 for single-node deployments. -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Format the HDFS filesystem (run this only once):
hdfs namenode -format

Start HDFS (NameNode, DataNode, and secondary NameNode):
cd /usr/local/hadoop/
sbin/start-dfs.sh

Verify the daemons are running:
jps

The output should list NameNode, DataNode, and SecondaryNameNode. You can also open http://<instance-ip>:9870 in a browser to view the HDFS web UI.

Confirm Hadoop is working:
hadoop version
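If you automate the verification above, the jps output can be checked programmatically. The following is a small sketch; the helper name and structure are my own, not part of Hadoop:

```python
# Expected HDFS daemons for a single-node deployment.
REQUIRED_DAEMONS = {"NameNode", "DataNode", "SecondaryNameNode"}

def missing_daemons(jps_output: str) -> set:
    """Given the raw output of `jps`, return the required daemons that are not running.

    Each jps line has the form "<pid> <MainClass>"; lines that do not match
    (e.g. blank lines) are ignored.
    """
    running = {
        parts[1]
        for line in jps_output.splitlines()
        if len(parts := line.split()) == 2
    }
    return REQUIRED_DAEMONS - running
```

In practice you would feed it `subprocess.run(["jps"], capture_output=True, text=True).stdout` and treat a non-empty result as a failed startup.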
Step 3: Install JindoSDK and configure OSS-HDFS access
Install JindoSDK
Switch to the target directory and download the latest JindoSDK package. For the download link, see JindoSDK downloads on GitHub.
cd /usr/lib/
# Download the package. Replace x.x.x with the actual version number.
wget <jindosdk-download-url>

Extract the package:
tar zxvf jindosdk-x.x.x-linux.tar.gz

Open /etc/profile and add the following:
export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${JINDOSDK_HOME}/lib/*

Apply the changes:
. /etc/profile
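A classpath mistake here is the most common cause of the "No FileSystem for scheme: oss" error later, so it can be worth checking programmatically. The following hypothetical Python helper (not part of JindoSDK) verifies that the lib/* wildcard entry exported above is present; Hadoop expands entries ending in /* itself, so the literal wildcard is what should appear:

```python
import os

def classpath_has_jindosdk(classpath: str, jindosdk_home: str) -> bool:
    """Check whether the JindoSDK lib/* wildcard entry is on the Hadoop classpath.

    classpath: the value of $HADOOP_CLASSPATH (colon-separated on Linux).
    jindosdk_home: the value of $JINDOSDK_HOME.
    """
    expected = os.path.join(jindosdk_home, "lib", "*")
    return expected in classpath.split(os.pathsep)
```

For example, `classpath_has_jindosdk(os.environ.get("HADOOP_CLASSPATH", ""), os.environ.get("JINDOSDK_HOME", ""))` should return True in a correctly configured shell.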
Configure OSS-HDFS access in core-site.xml
Add the following properties to $HADOOP_HOME/etc/hadoop/core-site.xml.
JindoSDK implementation classes — tell Hadoop to route oss:// paths through JindoSDK:
<configuration>
<property>
<name>fs.AbstractFileSystem.oss.impl</name>
<value>com.aliyun.jindodata.oss.JindoOSS</value>
</property>
<property>
<name>fs.oss.impl</name>
<value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
</property>
</configuration>

AccessKey credentials — replace the placeholder values with your actual AccessKey ID and AccessKey secret:
<configuration>
<property>
<name>fs.oss.accessKeyId</name>
<value>xxx</value>
</property>
<property>
<name>fs.oss.accessKeySecret</name>
<value>xxx</value>
</property>
</configuration>

Configure the OSS-HDFS endpoint
Use the oss://<Bucket>.<Endpoint>/<Object> path format when accessing OSS-HDFS. The endpoint follows the pattern <region>.oss-dls.aliyuncs.com. For example:
oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt

JindoSDK reads the endpoint from the path and routes the request to the correct OSS-HDFS service automatically. For other ways to configure the endpoint, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.
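Because every command in the next step repeats the bucket-plus-endpoint prefix, a small helper to assemble the path can reduce typos when you script access. This is a hypothetical convenience function, not part of JindoSDK:

```python
def oss_hdfs_path(bucket: str, region: str, key: str = "") -> str:
    """Build an oss://<Bucket>.<Endpoint>/<Object> path for OSS-HDFS.

    The endpoint follows the <region>.oss-dls.aliyuncs.com pattern
    described above; key is the object or directory path in the bucket.
    """
    return f"oss://{bucket}.{region}.oss-dls.aliyuncs.com/{key.lstrip('/')}"
```

For example, `oss_hdfs_path("examplebucket", "cn-shanghai", "exampleobject.txt")` produces the sample path shown above.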
Step 4: Access OSS-HDFS
All commands below use the hdfs dfs client. Replace examplebucket and cn-hangzhou with your actual bucket name and region.
Create a directory:
hdfs dfs -mkdir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/

Upload an object:
hdfs dfs -put /root/workspace/examplefile.txt oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/examplefile.txt

List a directory:
hdfs dfs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/

List an object:
hdfs dfs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/examplefile.txt

Print object content:
hdfs dfs -cat oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/examplefile.txt

The -cat command prints content as plain text. If the object is encoded or in binary format, use the HDFS Java API to decode and read it.
Copy a directory:
The following command copies subdir1 into subdir2/subdir1, preserving the directory structure and all its contents:
hdfs dfs -cp oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/subdir1/ \
oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/subdir2/subdir1/

Move a directory:
hdfs dfs -mv oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/srcdir/ \
oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/destdir/

Download an object:
hdfs dfs -get oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt /tmp/

Delete a directory:
Run the following command to delete a directory named destfolder/ and all objects in the directory from a bucket named examplebucket:
hdfs dfs -rm -r oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/destfolder/

Troubleshooting
Commands fail with "No FileSystem for scheme: oss"
The JindoSDK JAR files are not on the Hadoop classpath. Check:
Confirm HADOOP_CLASSPATH includes the JindoSDK lib/ directory:
echo $HADOOP_CLASSPATH

Make sure you sourced /etc/profile in the current shell session:
. /etc/profile

Confirm the JAR files exist in $JINDOSDK_HOME/lib/:
ls $JINDOSDK_HOME/lib/
Commands fail with "AccessDenied" or authentication errors
Confirm the AccessKey ID and AccessKey secret in core-site.xml are correct and have not expired.
Verify that the RAM user or role associated with the AccessKey has the required permissions on the target bucket.
Commands fail with connection timeout or "UnknownHost"
Confirm the endpoint in the oss:// path matches the region of your bucket (for example, cn-hangzhou.oss-dls.aliyuncs.com).
Confirm the ECS instance and the bucket are in the same region.
If you are using a public endpoint instead of an internal endpoint, make sure the ECS instance has internet access.
HDFS daemons fail to start
Confirm SSH passwordless login works by running ssh localhost. If it fails, re-run the key generation steps in Enable SSH.
Check that JDK 1.8.0 or later is installed and JAVA_HOME is set correctly in hadoop-env.sh.
What's next
OSS-HDFS overview — learn about OSS-HDFS features and use cases
Appendix 1: Other methods used to configure the endpoint of OSS-HDFS — alternative endpoint configuration options