This topic describes how to connect Spark to OSS.

Background information

EMR provides the following features for accessing OSS:
  • Supports MetaService.
  • Allows you to access OSS without using an AccessKey pair.
  • Allows you to access OSS by explicitly writing an AccessKey pair and an endpoint.
    Note: The OSS endpoint must be an internal domain name. For more information, see Regions and endpoints.
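
The explicit option in the list above can be sketched as follows. This is a minimal sketch rather than the definitive EMR configuration. The property names fs.oss.accessKeyId, fs.oss.accessKeySecret, and fs.oss.endpoint are those used by the Apache Hadoop OSS connector and are assumed to apply to your EMR version. The object name, the environment variable names, and the China (Hangzhou) internal endpoint are illustrative placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExplicitAccessKeyExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical environment variable names; avoid hard-coding real credentials.
    val accessKeyId = sys.env.getOrElse("OSS_ACCESS_KEY_ID", "<yourAccessKeyId>")
    val accessKeySecret = sys.env.getOrElse("OSS_ACCESS_KEY_SECRET", "<yourAccessKeySecret>")
    // Example internal endpoint; replace it with the internal endpoint of your region.
    val endpoint = "oss-cn-hangzhou-internal.aliyuncs.com"

    val conf = new SparkConf()
      .setAppName("Test OSS with an explicit AccessKey pair")
      // Spark copies properties prefixed with spark.hadoop. into the Hadoop configuration.
      // Assumption: these key names match the OSS connector on your cluster.
      .set("spark.hadoop.fs.oss.accessKeyId", accessKeyId)
      .set("spark.hadoop.fs.oss.accessKeySecret", accessKeySecret)
      .set("spark.hadoop.fs.oss.endpoint", endpoint)

    val sc = new SparkContext(conf)
    // Same placeholder path as in the example in this topic.
    val cnt = sc.textFile("oss://bucket/path/to/read").count()
    println(s"count: $cnt")
    sc.stop()
  }
}
```

Reading the AccessKey pair from the environment keeps it out of the source code. If your cluster supports access without an AccessKey pair, that approach is usually simpler and avoids handling credentials in the job at all.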

Example

The following example shows how Spark reads data from OSS and writes the processed data back to OSS without using an AccessKey pair.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("Test OSS")
    val sc = new SparkContext(conf)
    // Read the input data from OSS and count its lines.
    val pathIn = "oss://bucket/path/to/read"
    val inputData = sc.textFile(pathIn)
    val cnt = inputData.count()
    println(s"count: $cnt")
    // Process each line and write the result back to OSS.
    val outputPath = "oss://bucket/path/to/write"
    val outputData = inputData.map(e => s"$e has been processed.")
    outputData.saveAsTextFile(outputPath)
Note: PySpark reads data from OSS in the same way as Spark.

Appendix

For the complete sample code, see Use Spark to access OSS.