This topic describes how to connect Spark to OSS.
Background information
EMR provides the following features for accessing OSS:
- Supports MetaService.
- Allows you to access OSS without using an AccessKey pair.
- Allows you to access OSS by explicitly writing an AccessKey pair and an endpoint.
Note: The OSS endpoint must be an internal domain name. For more information, see Regions and endpoints.
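As a minimal sketch of the explicit method, the following helper assembles an OSS path that carries the AccessKey pair and endpoint. The URI layout `oss://<AccessKeyId>:<AccessKeySecret>@<bucket>.<endpoint>/<path>` is an assumption that is not stated in this topic, so verify it against your EMR version. The `OssUri` object and `explicitUri` function are hypothetical names used only for illustration. The check for `-internal` is a simple heuristic for the internal domain name requirement.

```scala
// Hypothetical helper, not part of EMR or Spark.
// Assumed URI layout: oss://<AccessKeyId>:<AccessKeySecret>@<bucket>.<endpoint>/<path>
object OssUri {
  def explicitUri(accessKeyId: String,
                  accessKeySecret: String,
                  bucket: String,
                  endpoint: String,
                  path: String): String = {
    // Heuristic only: internal OSS endpoints such as
    // oss-cn-hangzhou-internal.aliyuncs.com contain "-internal".
    require(endpoint.contains("-internal"),
      s"Expected an internal endpoint, got: ${endpoint}")
    // Drop a leading slash so the result never contains "//" before the object path.
    val objectPath = path.stripPrefix("/")
    s"oss://${accessKeyId}:${accessKeySecret}@${bucket}.${endpoint}/${objectPath}"
  }
}
```

The returned string could then be passed to `sc.textFile` in place of a plain `oss://bucket/...` path. Avoid hardcoding a real AccessKey pair in source code.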
Example
The following example shows how Spark reads data from OSS and writes the processed
data back to OSS without using an AccessKey pair.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Test OSS")
val sc = new SparkContext(conf)

// Read the input data from OSS and count the records.
val pathIn = "oss://bucket/path/to/read"
val inputData = sc.textFile(pathIn)
val cnt = inputData.count
println(s"count: $cnt")

// Process each record and write the result back to OSS.
val outputPath = "oss://bucket/path/to/write"
val outputData = inputData.map(e => s"$e has been processed.")
outputData.saveAsTextFile(outputPath)
Note: PySpark reads data from OSS in the same way as Spark.
Appendix
For the complete sample code, see Use Spark to access OSS.