OSS-HDFS provides an HDFS-compatible interface for Object Storage Service (OSS), so Apache Spark can read and write OSS data using standard HDFS APIs. This topic covers the interactive approach: starting Spark Shell on an E-MapReduce (EMR) cluster node and running Spark SQL statements against data stored in OSS-HDFS.
Prerequisites
Before you begin, ensure that you have:
An EMR cluster running V3.42.0 or later, or V5.8.0 or later. For more information, see Create a cluster.
OSS-HDFS enabled for your OSS bucket. For more information, see Enable OSS-HDFS.
Access permissions granted on OSS-HDFS. For more information, see Grant access permissions.
Run Spark SQL against OSS-HDFS
Step 1: Log on to the EMR cluster
Log on to the EMR on ECS console.
Click the ID of the EMR cluster that you created.
Click the Nodes tab, and then click the expand icon on the left side of the node group.
Click the ID of an ECS instance. On the Instances page, click Connect next to the instance ID.
For information about how to log on by using an SSH key pair or an SSH password on Windows or Linux, see Log on to a cluster.
Step 2: Start Spark Shell
Run the following command:
spark-shell
Step 3: Access OSS-HDFS with Spark SQL
OSS-HDFS paths use the format oss://<bucket>.<region>.oss-dls.aliyuncs.com/<path>.
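The endpoint portion of the path is assembled from your bucket name and region. As a quick sanity check before running any SQL, the following shell snippet builds the path in this format; the bucket name, region, and directory are the placeholder values used in the examples in this topic, so substitute your own.

```shell
# Placeholder values from this topic's examples; replace with your own.
BUCKET=examplebucket
REGION=cn-hangzhou
DIR=dir

# Assemble the path in the oss://<bucket>.<region>.oss-dls.aliyuncs.com/<path> format.
OSS_PATH="oss://${BUCKET}.${REGION}.oss-dls.aliyuncs.com/${DIR}"
echo "${OSS_PATH}"
```

Note that `.oss-dls.aliyuncs.com` is the dedicated OSS-HDFS endpoint suffix; a plain OSS endpoint in its place will not go through the HDFS-compatible interface.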
Create a table backed by OSS-HDFS:
spark.sql("CREATE TABLE test_oss (`c1` string) OPTIONS (PATH 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir')")
Insert a row:
spark.sql("INSERT INTO TABLE test_oss SELECT 'testdata' AS c1")
Query the table:
spark.sql("SELECT c1 FROM test_oss").show()
Replace examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir with the endpoint and path of your OSS-HDFS bucket.
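Because OSS-HDFS exposes an HDFS-compatible interface, you can also inspect the table's storage from the cluster node with standard HDFS shell commands. The following is a sketch using the placeholder path from the examples above; the actual data file names are generated by Spark and will differ, and the commands must be run on an EMR node where OSS-HDFS access is already configured.

```shell
# List the files that Spark wrote under the table's OSS-HDFS directory.
hadoop fs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir

# Print the contents of the data files in the directory.
hadoop fs -cat oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/*
```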