OSS-HDFS provides an HDFS-compatible interface for Object Storage Service (OSS), so Apache Spark can read and write OSS data using standard HDFS APIs. This topic covers the interactive approach: starting Spark Shell on an E-MapReduce (EMR) cluster node and running Spark SQL statements against data stored in OSS-HDFS.
Prerequisites
Before you begin, ensure that you have:
An EMR cluster running V3.42.0 or later, or V5.8.0 or later. For more information, see Create a cluster.
OSS-HDFS enabled for your OSS bucket. For more information, see Enable OSS-HDFS.
Access permissions granted on OSS-HDFS. For more information, see Grant access permissions.
Run Spark SQL against OSS-HDFS
Step 1: Log on to the EMR cluster
Log on to the EMR on ECS console.
Click the ID of the EMR cluster that you created.
Click the Nodes tab, and then click the expand icon on the left side of the node group.
Click the ID of an ECS instance. On the Instances page, click Connect next to the instance ID.
For information about how to log on by using an SSH key pair or an SSH password on Windows or Linux, see Log on to a cluster.
Step 2: Start Spark Shell
Run the following command:
spark-shell
Step 3: Access OSS-HDFS with Spark SQL
OSS-HDFS paths use the format oss://<bucket>.<region>.oss-dls.aliyuncs.com/<path>.
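The endpoint portion of the path is assembled from your bucket name and region. As a quick sanity check before running any SQL, the following shell snippet builds the path in this format; the bucket name, region, and directory are the placeholder values used in the examples in this topic, so substitute your own.

```shell
# Placeholder values from this topic's examples; replace with your own.
BUCKET=examplebucket
REGION=cn-hangzhou
DIR=dir

# Assemble the path in the oss://<bucket>.<region>.oss-dls.aliyuncs.com/<path> format.
OSS_PATH="oss://${BUCKET}.${REGION}.oss-dls.aliyuncs.com/${DIR}"
echo "${OSS_PATH}"
```

Note that `.oss-dls.aliyuncs.com` is the dedicated OSS-HDFS endpoint suffix; a plain OSS endpoint in its place will not go through the HDFS-compatible interface.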
Create a table backed by OSS-HDFS:
spark.sql("CREATE TABLE test_oss (`c1` string) OPTIONS (PATH 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir')")
Insert a row:
spark.sql("INSERT INTO TABLE test_oss SELECT 'testdata' AS c1")
Query the table:
spark.sql("SELECT c1 FROM test_oss").show()
Replace examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir with the endpoint and path of your OSS-HDFS bucket.
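Because OSS-HDFS exposes an HDFS-compatible interface, you can also inspect the table's storage from the cluster node with standard HDFS shell commands. The following is a sketch using the placeholder path from the examples above; the actual data file names are generated by Spark and will differ, and the commands must be run on an EMR node where OSS-HDFS access is already configured.

```shell
# List the files that Spark wrote under the table's OSS-HDFS directory.
hadoop fs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir

# Print the contents of the data files in the directory.
hadoop fs -cat oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir/*
```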