This topic describes how to use Spark to read data from and write data to Object Storage Service (OSS).

Background information

E-MapReduce (EMR) provides the following methods for you to access OSS:
  • You can access OSS by using MetaService.
  • You can access OSS without using an AccessKey pair.
  • You can access OSS by explicitly writing an AccessKey pair and an endpoint.
    Note: You must use an internal endpoint of OSS. For more information about endpoints, see OSS endpoints.
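
The third method, explicitly supplying an AccessKey pair and an endpoint, can be sketched in PySpark as follows. This is a minimal sketch, not the definitive EMR procedure: it assumes that the OSS connector on the cluster reads the fs.oss.accessKeyId, fs.oss.accessKeySecret, and fs.oss.endpoint keys (as the Apache hadoop-aliyun module does), and the helper names oss_spark_conf and build_session are hypothetical. Verify the key names against the documentation for your EMR version.

```python
def oss_spark_conf(access_key_id, access_key_secret, endpoint):
    """Build Spark settings that pass OSS credentials to the Hadoop layer.

    Assumption: the cluster's OSS connector reads the fs.oss.* keys, as the
    Apache hadoop-aliyun module does. The "spark.hadoop." prefix makes Spark
    copy each setting into the Hadoop configuration.
    """
    if not (access_key_id and access_key_secret and endpoint):
        raise ValueError("AccessKey ID, AccessKey secret, and endpoint are all required")
    return {
        "spark.hadoop.fs.oss.accessKeyId": access_key_id,
        "spark.hadoop.fs.oss.accessKeySecret": access_key_secret,
        "spark.hadoop.fs.oss.endpoint": endpoint,
    }


def build_session(app_name, settings):
    """Create a SparkSession with every entry of `settings` applied."""
    # Imported lazily so that the helper above can be used without PySpark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For example, build_session("Explicit AccessKey example", oss_spark_conf(key_id, key_secret, "oss-cn-hangzhou-internal.aliyuncs.com")) would return a session that uses an internal endpoint in the China (Hangzhou) region. Here key_id and key_secret are placeholders. Load them from a secure source such as environment variables instead of hard-coding them.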

Example of using Spark to access OSS

The following example shows how to use Spark to read data from OSS and write the processed data back to OSS without using an AccessKey pair:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Test OSS")
val sc = new SparkContext(conf)
// Read the input files from OSS as an RDD of lines.
val pathIn = "oss://bucket/path/to/read"
val inputData = sc.textFile(pathIn)
val cnt = inputData.count()
println(s"count: $cnt")
// Append a marker to each line and write the result back to OSS.
val outputPath = "oss://bucket/path/to/write"
val outputData = inputData.map(e => s"$e has been processed.")
outputData.saveAsTextFile(outputPath)

For the complete sample code, visit GitHub.

Example of using PySpark to access OSS

The following example shows how to use PySpark to read data from OSS and write the data back to OSS in Parquet format without using an AccessKey pair:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Spark SQL OSS example").getOrCreate()
# Read the input files from OSS; each line becomes a row in the "value" column.
pathIn = "oss://bucket/path/to/read"
df = spark.read.text(pathIn)
cnt = df.count()
print(cnt)
# Write the rows to OSS in Parquet format, replacing any existing output.
outputPath = "oss://bucket/path/to/write"
df.write.format("parquet").mode("overwrite").save(outputPath)
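
To mirror the per-line processing step of the Scala example in PySpark, one option is the RDD API. The following is a minimal sketch under that assumption. The function names mark_processed and process_oss_text are hypothetical and do not come from the sample code.

```python
def mark_processed(line):
    """Append the same suffix that the Scala example adds to each line."""
    return f"{line} has been processed."


def process_oss_text(spark, path_in, path_out):
    """Read text lines from path_in, tag each line, and write them to path_out."""
    lines = spark.sparkContext.textFile(path_in)
    # saveAsTextFile fails if the output path already exists.
    lines.map(mark_processed).saveAsTextFile(path_out)
```

With the names from the PySpark example, the call would be process_oss_text(spark, pathIn, outputPath).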