If you run a self-managed Hadoop cluster and want to use OSS-HDFS as your storage layer, JindoSDK lets you do that without modifying your existing Hadoop or Spark applications. This guide walks you through setting up a single-node Hadoop environment on an Alibaba Cloud ECS instance, installing JindoSDK, and verifying read/write access to OSS-HDFS using standard hdfs dfs commands.
How it works
JindoSDK acts as a FileSystem bridge: it intercepts HDFS API calls made to oss:// paths and routes them to OSS-HDFS. Because OSS-HDFS is fully compatible with the HDFS API and POSIX, your existing Hadoop and Spark jobs run unchanged.
OSS-HDFS supports both flat and hierarchical namespaces and automatically converts between them, giving you centralized metadata management. Its metadata layer uses a multi-node active-active redundancy mechanism (unlike the active-standby NameNode in traditional HDFS), which makes it more reliable and scalable. You can manage exabytes of data and hundreds of millions of objects with terabyte-scale throughput.
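To make the routing concrete, the following is a minimal Python sketch of the scheme-based dispatch Hadoop performs: the scheme of each path is looked up against a `fs.<scheme>.impl` configuration key, which is how the `fs.oss.impl` setting you add in Step 3 sends oss:// paths to JindoSDK. The class names and configuration keys mirror the real ones; the dispatch function itself is illustrative only, not Hadoop code.

```python
from urllib.parse import urlparse

# Mirrors the fs.<scheme>.impl keys set in core-site.xml (see Step 3).
FILESYSTEM_IMPLS = {
    "fs.oss.impl": "com.aliyun.jindodata.oss.JindoOssFileSystem",
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
}

def resolve_filesystem(path: str) -> str:
    """Return the FileSystem implementation class for a path's scheme."""
    scheme = urlparse(path).scheme or "hdfs"  # fall back to fs.defaultFS
    key = f"fs.{scheme}.impl"
    try:
        return FILESYSTEM_IMPLS[key]
    except KeyError:
        raise ValueError(f"No FileSystem for scheme: {scheme}") from None
```

This lookup is also why a missing JindoSDK classpath surfaces as the "No FileSystem for scheme: oss" error covered in Troubleshooting: without the JAR files, Hadoop cannot load the class that `fs.oss.impl` names.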
Prerequisites
Before you begin, make sure that you have:
An OSS bucket with OSS-HDFS enabled. For setup instructions, see Enable OSS-HDFS.
An Alibaba Cloud account with an AccessKey ID and AccessKey secret
Step 1: Create a VPC and an ECS instance
Using an internal endpoint gives you lower latency and better security when accessing OSS-HDFS from Alibaba Cloud resources. Create a virtual private cloud (VPC) and launch an Elastic Compute Service (ECS) instance inside it.
Create a VPC
Log on to the VPC console.
On the VPC page, click Create VPC.
Create the VPC in the same region as the bucket for which OSS-HDFS is enabled. For detailed steps, see Create a VPC and a vSwitch.
Create an ECS instance
Click the VPC ID. On the detail page, click the Resource Management tab.
In the Basic Cloud Resources Included section, click the icon next to Elastic Compute Service.
Create an ECS instance from the Instances page.
Create the ECS instance in the same region as the VPC. For detailed steps, see Create an instance.
Step 2: Set up a Hadoop runtime environment
Connect to the ECS instance, then complete the following setup in order.
Install Java
JindoSDK requires Java Development Kit (JDK) 1.8.0 or later.
Click Connect on the ECS instance. For connection methods, see Methods for connecting to an ECS instance.
Check the installed JDK version:
java -version

(Optional) If the JDK version is earlier than 1.8.0, remove it:
rpm -qa | grep java | xargs rpm -e --nodeps

Install JDK 1.8.0:
sudo yum install java-1.8.0-openjdk* -y

Open /etc/profile and add the following environment variables. If /usr/lib/jvm/java-1.8.0-openjdk does not exist, check /usr/lib/jvm/ for the correct path.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/jre/lib/rt.jar
export PATH=$PATH:$JAVA_HOME/bin

Apply the changes:
source /etc/profile
Enable SSH
Hadoop's startup scripts use SSH to manage local daemons, even in single-node deployments.
Install the SSH service:
sudo yum install -y openssh-clients openssh-server

Enable and start the SSH service:
systemctl enable sshd && systemctl start sshd

Generate an SSH key and add it to the trusted list:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Install Hadoop
This guide uses Hadoop 3.4.0. For other versions, see Apache Hadoop downloads.
Download and extract Hadoop:
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
tar xzf hadoop-3.4.0.tar.gz
mv hadoop-3.4.0 /usr/local/hadoop

Add Hadoop to your environment. Open /etc/profile and append:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$PATH

Then apply the changes:
source /etc/profile

Set JAVA_HOME in the Hadoop environment file. Open $HADOOP_HOME/etc/hadoop/hadoop-env.sh and replace ${JAVA_HOME} with the actual path:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

Configure $HADOOP_HOME/etc/hadoop/core-site.xml:
<configuration>
    <!-- Address of the HDFS NameNode. Replace localhost with the actual hostname if needed. -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <!-- Hadoop temporary directory. Grant the admin user ownership with:
         sudo chown -R admin:admin /opt/module/hadoop-3.4.0 -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.4.0/data/tmp</value>
    </property>
</configuration>

Configure $HADOOP_HOME/etc/hadoop/hdfs-site.xml:
<configuration>
    <!-- Replication factor. Set to 1 for single-node deployments. -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Format the HDFS filesystem (run this only once):
hdfs namenode -format

Start HDFS (NameNode, DataNode, and secondary NameNode):
cd /usr/local/hadoop/
sbin/start-dfs.sh

Verify the daemons are running:
jps

The output should list NameNode, DataNode, and SecondaryNameNode. You can also open http://<instance-ip>:9870 in a browser to view the HDFS web UI.

Confirm Hadoop is working:
hadoop version
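If you automate the verification above, the jps output can be checked programmatically. The following is a small sketch; the helper name and structure are my own, not part of Hadoop:

```python
# Expected HDFS daemons for a single-node deployment.
REQUIRED_DAEMONS = {"NameNode", "DataNode", "SecondaryNameNode"}

def missing_daemons(jps_output: str) -> set:
    """Given the raw output of `jps`, return the required daemons that are not running.

    Each jps line has the form "<pid> <MainClass>"; lines that do not match
    (e.g. blank lines) are ignored.
    """
    running = {
        parts[1]
        for line in jps_output.splitlines()
        if len(parts := line.split()) == 2
    }
    return REQUIRED_DAEMONS - running
```

In practice you would feed it `subprocess.run(["jps"], capture_output=True, text=True).stdout` and treat a non-empty result as a failed startup.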
Step 3: Install JindoSDK and configure OSS-HDFS access
Install JindoSDK
Switch to the target directory and download the latest JindoSDK package. For the download link, see JindoSDK downloads on GitHub.
cd /usr/lib/
# Download the package. Replace x.x.x with the actual version number.
wget <jindosdk-download-url>

Extract the package:
tar zxvf jindosdk-x.x.x-linux.tar.gz

Open /etc/profile and add the following:
export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${JINDOSDK_HOME}/lib/*

Apply the changes:
. /etc/profile
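A classpath mistake here is the most common cause of the "No FileSystem for scheme: oss" error later, so it can be worth checking programmatically. The following hypothetical Python helper (not part of JindoSDK) verifies that the lib/* wildcard entry exported above is present; Hadoop expands entries ending in /* itself, so the literal wildcard is what should appear:

```python
import os

def classpath_has_jindosdk(classpath: str, jindosdk_home: str) -> bool:
    """Check whether the JindoSDK lib/* wildcard entry is on the Hadoop classpath.

    classpath: the value of $HADOOP_CLASSPATH (colon-separated on Linux).
    jindosdk_home: the value of $JINDOSDK_HOME.
    """
    expected = os.path.join(jindosdk_home, "lib", "*")
    return expected in classpath.split(os.pathsep)
```

For example, `classpath_has_jindosdk(os.environ.get("HADOOP_CLASSPATH", ""), os.environ.get("JINDOSDK_HOME", ""))` should return True in a correctly configured shell.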
Configure OSS-HDFS access in core-site.xml
Add the following properties to $HADOOP_HOME/etc/hadoop/core-site.xml.
JindoSDK implementation classes — tell Hadoop to route oss:// paths through JindoSDK:
<configuration>
<property>
<name>fs.AbstractFileSystem.oss.impl</name>
<value>com.aliyun.jindodata.oss.JindoOSS</value>
</property>
<property>
<name>fs.oss.impl</name>
<value>com.aliyun.jindodata.oss.JindoOssFileSystem</value>
</property>
</configuration>

AccessKey credentials — replace the placeholder values with your actual AccessKey ID and AccessKey secret:
<configuration>
<property>
<name>fs.oss.accessKeyId</name>
<value>xxx</value>
</property>
<property>
<name>fs.oss.accessKeySecret</name>
<value>xxx</value>
</property>
</configuration>

Configure the OSS-HDFS endpoint
Use the oss://<Bucket>.<Endpoint>/<Object> path format when accessing OSS-HDFS. The endpoint follows the pattern <region>.oss-dls.aliyuncs.com. For example:
oss://examplebucket.cn-shanghai.oss-dls.aliyuncs.com/exampleobject.txt

JindoSDK reads the endpoint from the path and routes the request to the correct OSS-HDFS service automatically. For other ways to configure the endpoint, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.
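Because every command in the next step repeats the bucket-plus-endpoint prefix, a small helper to assemble the path can reduce typos when you script access. This is a hypothetical convenience function, not part of JindoSDK:

```python
def oss_hdfs_path(bucket: str, region: str, key: str = "") -> str:
    """Build an oss://<Bucket>.<Endpoint>/<Object> path for OSS-HDFS.

    The endpoint follows the <region>.oss-dls.aliyuncs.com pattern
    described above; key is the object or directory path in the bucket.
    """
    return f"oss://{bucket}.{region}.oss-dls.aliyuncs.com/{key.lstrip('/')}"
```

For example, `oss_hdfs_path("examplebucket", "cn-shanghai", "exampleobject.txt")` produces the sample path shown above.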
Step 4: Access OSS-HDFS
All commands below use the hdfs dfs client. Replace examplebucket and cn-hangzhou with your actual bucket name and region.
Create a directory:
hdfs dfs -mkdir oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/

Upload an object:
hdfs dfs -put /root/workspace/examplefile.txt oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/examplefile.txt

List a directory:
hdfs dfs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/

List an object:
hdfs dfs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/examplefile.txt

Print object content:
hdfs dfs -cat oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/examplefile.txt

The -cat command prints content as plain text. If the object is encoded or in binary format, use the HDFS Java API to decode and read it.
Copy a directory:
The following command copies subdir1 into subdir2/subdir1, preserving the directory structure and all its contents:
hdfs dfs -cp oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/subdir1/ \
oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/subdir2/subdir1/

Move a directory:
hdfs dfs -mv oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/srcdir/ \
oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/destdir/

Download an object:
hdfs dfs -get oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/exampleobject.txt /tmp/

Delete a directory:
Run the following command to delete a directory named destfolder/ and all objects in the directory from a bucket named examplebucket:
hdfs dfs -rm -r oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/destfolder/

Troubleshooting
Commands fail with "No FileSystem for scheme: oss"
The JindoSDK JAR files are not on the Hadoop classpath. Check:
Confirm HADOOP_CLASSPATH includes the JindoSDK lib/ directory:
echo $HADOOP_CLASSPATH

Make sure you sourced /etc/profile in the current shell session:
. /etc/profile

Confirm the JAR files exist in $JINDOSDK_HOME/lib/:
ls $JINDOSDK_HOME/lib/
Commands fail with "AccessDenied" or authentication errors
Confirm the AccessKey ID and AccessKey secret in core-site.xml are correct and have not expired.
Verify that the RAM user or role associated with the AccessKey has the required permissions on the target bucket.
Commands fail with connection timeout or "UnknownHost"
Confirm the endpoint in the oss:// path matches the region of your bucket (for example, cn-hangzhou.oss-dls.aliyuncs.com).
Confirm the ECS instance and the bucket are in the same region.
If you are using a public endpoint instead of an internal endpoint, make sure the ECS instance has internet access.
HDFS daemons fail to start
Confirm SSH passwordless login works by running ssh localhost. If it fails, re-run the key generation steps in Enable SSH.
Check that JDK 1.8.0 or later is installed and JAVA_HOME is set correctly in hadoop-env.sh.
What's next
OSS-HDFS overview — learn about OSS-HDFS features and use cases
Appendix 1: Other methods used to configure the endpoint of OSS-HDFS — alternative endpoint configuration options