When using E-MapReduce, you can use two types of OSS URIs:
native URI: oss://[accessKeyId:accessKeySecret@]bucket[.endpoint]/object/path
This URI specifies the input and output data sources of a job, in the same way as an hdfs:// URI. When operating on OSS data, you can set accessKeyId, accessKeySecret, and endpoint either in the Configuration or directly in the URI.
ref URI: ossref://bucket/object/path
This URI is valid only in the E-MapReduce job configuration, where it specifies the resources needed to run the job. For example:
--class org.apache.spark.examples.SparkPi --master yarn-client --executor-memory 1G --num-executors 2 ossref://my-bucket/spark-examples-0.1-SNAPSHOT.jar 1000
The “oss” and “ossref” prefixes are called the “scheme” of the URI. Take care to use the correct scheme: oss for the input and output data of a job, and ossref for the resources named in the job configuration.
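As an illustration of the two URI forms, here is a minimal Python sketch that assembles a native URI from its optional parts and checks which scheme a URI uses. The helper names build_oss_uri and uri_scheme are hypothetical and not part of E-MapReduce; the sketch only follows the URI layout given above and assumes the credentials contain no characters that need escaping.

```python
from urllib.parse import urlsplit


def build_oss_uri(bucket, path, access_key_id=None, access_key_secret=None, endpoint=None):
    """Assemble oss://[accessKeyId:accessKeySecret@]bucket[.endpoint]/object/path.

    Hypothetical helper: credentials and endpoint are optional, mirroring the
    bracketed parts of the native URI layout. No escaping is applied.
    """
    credentials = ""
    if access_key_id and access_key_secret:
        credentials = f"{access_key_id}:{access_key_secret}@"
    host = f"{bucket}.{endpoint}" if endpoint else bucket
    return f"oss://{credentials}{host}/{path.lstrip('/')}"


def uri_scheme(uri):
    """Return 'oss' or 'ossref'; raise ValueError for any other scheme."""
    scheme = urlsplit(uri).scheme
    if scheme not in ("oss", "ossref"):
        raise ValueError(f"not an OSS URI: {uri!r}")
    return scheme
```

For example, build_oss_uri("my-bucket", "data/input") gives oss://my-bucket/data/input, which suits job input or output, while uri_scheme("ossref://my-bucket/spark-examples-0.1-SNAPSHOT.jar") reports ossref, which belongs only in the job configuration.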
When writing data to OSS, E-MapReduce uses the OSS multipart upload method. Note that if a job is interrupted abnormally, partially generated data may be left in OSS, and you must delete it manually. This matches the behavior of job output to HDFS, where an abnormally interrupted job can also leave data behind that you must delete manually. There is one difference, however: OSS keeps the parts of unfinished multipart uploads in Fragment Management. So you must not only delete the leftover files from the output directory in OSS File Management, but also clear the fragments in OSS Fragment Management. Otherwise, storage charges can be incurred for that data.
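Clearing leftover fragments can also be scripted instead of done in the console. The following Python sketch is one possible approach, not an E-MapReduce feature: the function name abort_leftover_uploads is hypothetical, and the bucket argument is assumed to behave like a Bucket object from the OSS Python SDK (oss2), that is, to provide list_multipart_uploads and abort_multipart_upload with the result fields used below.

```python
def abort_leftover_uploads(bucket, prefix):
    """Abort every unfinished multipart upload under `prefix`.

    `bucket` is assumed to expose the oss2.Bucket-style methods
    list_multipart_uploads() and abort_multipart_upload().
    Returns the object keys whose uploads were aborted.
    """
    aborted = []
    key_marker, upload_id_marker = "", ""
    while True:
        # Fetch one page of unfinished uploads, starting after the markers.
        result = bucket.list_multipart_uploads(
            prefix=prefix,
            key_marker=key_marker,
            upload_id_marker=upload_id_marker,
        )
        for upload in result.upload_list:
            # Aborting an upload discards the parts stored for it.
            bucket.abort_multipart_upload(upload.key, upload.upload_id)
            aborted.append(upload.key)
        if not result.is_truncated:
            return aborted
        # More pages remain: continue from where this page ended.
        key_marker = result.next_key_marker
        upload_id_marker = result.next_upload_id_marker
```

With oss2 installed, the bucket would typically be created as oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name). Run such a cleanup only when no job is still writing under the prefix, because it also aborts uploads that are in progress. The leftover files in the output directory still need to be deleted separately.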