
Object Storage Service:Use Spark on an EMR cluster to process data stored in OSS-HDFS

Last Updated: Mar 20, 2026

OSS-HDFS provides an HDFS-compatible interface for Object Storage Service (OSS), so Apache Spark can read and write OSS data using standard HDFS APIs. This topic covers the interactive approach: starting Spark Shell on an E-MapReduce (EMR) cluster node and running Spark SQL statements against data stored in OSS-HDFS.

Prerequisites

Before you begin, ensure that you have:

  - Created an E-MapReduce (EMR) cluster.
  - Enabled OSS-HDFS for an Object Storage Service (OSS) bucket and obtained the bucket's OSS-HDFS endpoint.

Run Spark SQL against OSS-HDFS

Step 1: Log on to the EMR cluster

  1. Log on to the EMR on ECS console.

  2. Click the EMR cluster you created.

  3. Click the Nodes tab, then click the expand icon to the left of the node group.

  4. Click the ECS instance ID. On the Instances page, click Connect next to the instance ID.

For SSH key pair and SSH password login instructions on Windows and Linux, see Log on to a cluster.

Step 2: Start Spark Shell

Run the following command:

spark-shell

Step 3: Access OSS-HDFS with Spark SQL

OSS-HDFS paths use the format oss://<bucket>.<region>.oss-dls.aliyuncs.com/<path>.
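The path format above can be sketched as a small string helper. Note that `ossHdfsPath` is an illustrative function, not part of the Spark or OSS API, and the bucket, region, and directory names are placeholders:

```scala
// Assemble an OSS-HDFS path of the form
// oss://<bucket>.<region>.oss-dls.aliyuncs.com/<path>.
// All argument values below are placeholders for your own bucket.
def ossHdfsPath(bucket: String, region: String, dir: String): String =
  s"oss://$bucket.$region.oss-dls.aliyuncs.com/$dir"

// Example: build the path used by the CREATE TABLE statement below.
val tablePath = ossHdfsPath("examplebucket", "cn-hangzhou", "dir")
println(tablePath)
```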

  1. Create a table backed by OSS-HDFS:

    spark.sql("CREATE TABLE test_oss (`c1` string) OPTIONS (PATH 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir')")
  2. Insert a row:

    spark.sql("INSERT INTO TABLE test_oss SELECT 'testdata' AS c1")
  3. Query the table. Call show() so that the query result is printed in Spark Shell instead of only returning a DataFrame reference:

    spark.sql("SELECT c1 FROM test_oss").show()

Replace examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir with the endpoint and path of your OSS-HDFS bucket.