EMR integrates Spark with Object Storage Service (OSS), letting you read and write OSS data using Spark RDD (Scala), PySpark, or Spark SQL. You can access OSS either without specifying an AccessKey pair (password-free access) or by explicitly specifying one.
Choose an access method
| Method | When to use |
|---|---|
| Password-free access (recommended) | Your EMR cluster supports password-free OSS access. No credentials to manage. |
| Explicit AccessKey | You need a specific AccessKey pair, or your cluster does not support password-free access. |
Prerequisites
- An EMR cluster with Spark installed
- SSH access to the master node. For details, see Log on to the master node of a cluster.
- An OSS bucket with data to read, or a writable OSS path for output
Access OSS without specifying an AccessKey pair
EMR clusters use password-free OSS access by default. The following examples use the oss:// URI scheme to read from and write to OSS.
Use Spark Shell (Scala)
- Log on to the master node via SSH.
- Start Spark Shell:

      spark-shell

- Run the following code. Replace <yourBucket> with your OSS bucket name.

      scala> val pathIn = "oss://<yourBucket>/path/to/read"
      scala> val inputData = sc.textFile(pathIn)
      scala> val cnt = inputData.count
      cnt: Long = ...
      scala> println(s"count: $cnt")
      scala> val outputPath = "oss://<yourBucket>/path/to/write"
      scala> val outputData = inputData.map(e => s"$e has been processed.")
      scala> outputData.saveAsTextFile(outputPath)

  The code reads all lines from the input path, counts them, appends a suffix to each line, and writes the result to the output path. For the complete sample, see SparkOssDemo.scala on GitHub.
Use PySpark
- Log on to the master node via SSH.
- Start PySpark:

      pyspark

- Run the following code. Replace <yourBucket> with your OSS bucket name.

      >>> pathIn = "oss://<yourBucket>/path/to/read"
      >>> df = spark.read.text(pathIn)
      >>> cnt = df.count()
      >>> print(cnt)
      >>> outputPath = "oss://<yourBucket>/path/to/write"
      >>> df.write.format("parquet").mode("overwrite").save(outputPath)

  The code reads the input path as a text DataFrame, prints the row count, and writes the result in Parquet format to the output path.
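The interactive PySpark steps above can also be packaged as a script. The following is a minimal sketch, not a definitive implementation. It assumes a hypothetical file name `oss_demo.py`, a hypothetical helper `oss_path`, and that jobs launched with `spark-submit` on the cluster use the same OSS access configuration as the `pyspark` shell.

```python
"""oss_demo.py (hypothetical name): script form of the PySpark steps."""
import sys


def oss_path(bucket: str, key: str) -> str:
    """Build an oss:// URI from a bucket name and an object key prefix."""
    return f"oss://{bucket}/{key.lstrip('/')}"


def main(bucket: str) -> None:
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oss-demo").getOrCreate()

    # Read the input path as a text DataFrame (one row per line).
    df = spark.read.text(oss_path(bucket, "path/to/read"))
    print(f"count: {df.count()}")

    # Write the rows back to OSS in Parquet format, replacing earlier output.
    out = oss_path(bucket, "path/to/write")
    df.write.format("parquet").mode("overwrite").save(out)

    spark.stop()


# Run only when a bucket name is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Under these assumptions, the script would be launched from the master node with `spark-submit oss_demo.py <yourBucket>`.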
Use Spark SQL
- Log on to the master node via SSH.
- Start the Spark SQL CLI:

      spark-sql

- Create a database stored in OSS, create a CSV table, and insert a row:

      CREATE DATABASE test_db LOCATION "oss://<yourBucket>/test_db";
      USE test_db;
      CREATE TABLE student (id INT, name STRING, age INT)
      USING CSV OPTIONS ("delimiter"=";", "header"="true");
      INSERT INTO student VALUES(1, "ab", 12);
      SELECT * FROM student;

  Replace <yourBucket> with your OSS bucket name. The CSV options are:

  | Parameter | Description |
  |---|---|
  | delimiter | The character used to separate fields in the CSV file. |
  | header | Set to true if the first row contains column names; false otherwise. |

  The SELECT statement returns:

      1 ab 12

- To verify the result, check the CSV file in OSS. The file uses semicolon delimiters, with the first row as headers:

      id;name;age
      1;ab;12
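To make the effect of the `delimiter` and `header` options concrete, the following sketch parses the file contents shown above with Python's standard `csv` module. It assumes you have already copied the file contents into a string; the function name `parse_student_csv` is hypothetical.

```python
import csv
import io

# Contents of the CSV file as shown in the verification step.
csv_text = "id;name;age\n1;ab;12\n"


def parse_student_csv(text: str) -> list:
    """Parse semicolon-delimited CSV whose first row holds column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [
        {"id": int(row["id"]), "name": row["name"], "age": int(row["age"])}
        for row in reader
    ]


rows = parse_student_csv(csv_text)
print(rows)  # [{'id': 1, 'name': 'ab', 'age': 12}]
```

The header row supplies the keys and the semicolon separates the fields, which matches the `student` table definition.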
Access OSS by specifying an AccessKey pair
Use this method when password-free access is unavailable, or when you need to authenticate with a specific AccessKey pair.
Step 1: Remove the password-free configuration
Remove the fs.oss.credentials.provider parameter from the core-site.xml file of the Hadoop-Common service.
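If you want to check whether a copy of core-site.xml still contains the parameter, a sketch like the following may help. It assumes the standard Hadoop configuration layout (`<configuration>` containing `<property>` elements with `<name>` and `<value>`). The function name `has_property` is hypothetical and the provider value is a placeholder, not a real class name. How you obtain and edit the file on your cluster is outside the scope of this sketch.

```python
import xml.etree.ElementTree as ET


def has_property(core_site_xml: str, name: str) -> bool:
    """Return True if a <property> with the given <name> is present."""
    root = ET.fromstring(core_site_xml)
    return any(p.findtext("name") == name for p in root.iter("property"))


# Sample content; the value below is a placeholder, not a real class name.
sample = """<configuration>
  <property>
    <name>fs.oss.credentials.provider</name>
    <value>PLACEHOLDER_PROVIDER_CLASS</value>
  </property>
</configuration>"""

print(has_property(sample, "fs.oss.credentials.provider"))  # True
print(has_property(sample, "fs.oss.accessKeyId"))           # False
```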
Step 2: Verify that password-free access is removed
Run the following command:

    hadoop fs -ls oss://<yourBucket>/test_db

If the removal succeeded, you see:

    ls: ERROR: not found login secrets, please configure the accessKeyId and accessKeySecret.
Step 3: Add the AccessKey parameters to core-site.xml
In the core-site.xml file of the Hadoop-Common service, add the following parameters:
| Key | Example value | Description |
|---|---|---|
| fs.oss.accessKeyId | LTAI5tM85Z4sc**** | Your AccessKey ID |
| fs.oss.accessKeySecret | HF7P1L8PS6Eqf**** | Your AccessKey secret |
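For reference, the two parameters correspond to `<property>` entries in the standard Hadoop XML configuration layout. The sketch below renders them with Python's standard library. The helper name `render_property` is hypothetical, and the values are placeholders that you would replace with your own AccessKey pair.

```python
import xml.etree.ElementTree as ET


def render_property(name: str, value: str) -> str:
    """Render one Hadoop-style <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Placeholder values; never commit a real AccessKey pair to source control.
settings = {
    "fs.oss.accessKeyId": "YOUR_ACCESS_KEY_ID",
    "fs.oss.accessKeySecret": "YOUR_ACCESS_KEY_SECRET",
}

for key, value in settings.items():
    print(render_property(key, value))
```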
Step 4: Verify the AccessKey configuration
Run the following command:

    hadoop fs -ls oss://<yourBucket>/test_db

If the AccessKey pair is configured correctly, the output lists the OSS path:

    drwxrwxrwx - root root 0 2025-02-24 11:45 oss://<yourBucket>/test_db/student
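The checks in Step 2 and Step 4 can be scripted. The following is a rough sketch that runs the same `hadoop fs -ls` command and looks only for the missing-credentials message shown in Step 2. The function names are hypothetical, and other failures (for example an invalid AccessKey pair or a network error) are not detected by this check.

```python
import subprocess

# Substring of the error shown in Step 2.
NO_CREDENTIALS_MARKER = "not found login secrets"


def missing_credentials_error(ls_output: str) -> bool:
    """Return True if the output contains the missing-credentials error."""
    return NO_CREDENTIALS_MARKER in ls_output


def run_ls(path: str) -> str:
    """Run `hadoop fs -ls <path>` and return stdout and stderr combined."""
    result = subprocess.run(
        ["hadoop", "fs", "-ls", path], capture_output=True, text=True
    )
    return result.stdout + result.stderr


# The two sample outputs from Step 2 and Step 4.
step2 = ("ls: ERROR: not found login secrets, please configure "
         "the accessKeyId and accessKeySecret.")
step4 = ("drwxrwxrwx - root root 0 2025-02-24 11:45 "
         "oss://<yourBucket>/test_db/student")

print(missing_credentials_error(step2))  # True
print(missing_credentials_error(step4))  # False
```

On the master node, `missing_credentials_error(run_ls("oss://<yourBucket>/test_db"))` would apply the same check to live output.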
Step 5: Restart Spark services
Restart all Spark-related services. After they are running, use Spark Shell, PySpark, or Spark SQL to read from and write to OSS.
FAQ
How do I read from one bucket and write to another when they use different credentials?
Configure a bucket-level credential provider using the fs.oss.bucket.<BucketName>.credentials.provider parameter, where <BucketName> is the name of the bucket you want to configure. For details, see Configure a credential provider of OSS or OSS-HDFS by bucket.
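As an illustration of the parameter naming, the sketch below builds the key for two buckets. The bucket names `source-bucket` and `target-bucket` and the helper name are hypothetical. The value to assign to each key is described in the referenced topic and is not shown here.

```python
def bucket_provider_key(bucket_name: str) -> str:
    """Build the bucket-level credential provider parameter name."""
    return f"fs.oss.bucket.{bucket_name}.credentials.provider"


# Hypothetical bucket names, one per set of credentials.
for bucket in ("source-bucket", "target-bucket"):
    print(bucket_provider_key(bucket))
# fs.oss.bucket.source-bucket.credentials.provider
# fs.oss.bucket.target-bucket.credentials.provider
```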
How do I access a bucket in a different region?
Use the format oss://<BucketName>.<public endpoint of the bucket>/ to specify the bucket's public endpoint. Cross-region access incurs data transfer fees and can be less stable than access within the same region.
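A small sketch of how such a URI is assembled follows. The bucket name `my-bucket` and the helper name are hypothetical, and `oss-cn-hangzhou.aliyuncs.com` is used only as an example of a public endpoint; substitute the endpoint of the region where your bucket resides.

```python
def cross_region_uri(bucket_name: str, public_endpoint: str, key: str = "") -> str:
    """Build oss://<BucketName>.<public endpoint>/<key>."""
    return f"oss://{bucket_name}.{public_endpoint}/{key.lstrip('/')}"


uri = cross_region_uri("my-bucket", "oss-cn-hangzhou.aliyuncs.com", "path/to/read")
print(uri)  # oss://my-bucket.oss-cn-hangzhou.aliyuncs.com/path/to/read
```

The resulting string can be used wherever the earlier examples use an `oss://<yourBucket>/...` path.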
How do I use Amazon S3 SDKs to access OSS?
OSS provides Amazon S3-compatible API operations. After migrating data from Amazon S3 to OSS, update your client configuration to point to OSS endpoints. For details, see Use Amazon S3 SDKs to access OSS.