E-MapReduce: Connect Spark to OSS

Last Updated: Mar 26, 2026

EMR integrates Spark with Object Storage Service (OSS), so you can read and write OSS data with Spark RDD (Scala), PySpark, or Spark SQL. You can access OSS in two ways: without specifying an AccessKey pair (password-free access), or by explicitly specifying one.

Choose an access method

Method                               When to use
Password-free access (recommended)   Your EMR cluster supports password-free OSS access. No credentials to manage.
Explicit AccessKey                   You need a specific AccessKey pair, or your cluster does not support password-free access.

Prerequisites

  • An EMR cluster with Spark installed

  • SSH access to the master node. For details, see Log on to the master node of a cluster

  • An OSS bucket with data to read, or a writable OSS path for output

Access OSS without specifying an AccessKey pair

EMR clusters use password-free OSS access by default. The following examples use the oss:// URI scheme to read from and write to OSS.

Use Spark Shell (Scala)

  1. Log on to the master node via SSH.

  2. Start Spark Shell:

    spark-shell

  3. Run the following code. Replace <yourBucket> with your OSS bucket name.

    scala> val pathIn = "oss://<yourBucket>/path/to/read"
    scala> val inputData = sc.textFile(pathIn)
    scala> val cnt = inputData.count
    cnt: Long = ...
    scala> println(s"count: $cnt")
    scala> val outputPath = "oss://<yourBucket>/path/to/write"
    scala> val outputData = inputData.map(e => s"$e has been processed.")
    scala> outputData.saveAsTextFile(outputPath)

    The code reads all lines from the input path, counts them, appends a suffix to each line, and writes the result to the output path. For the complete sample, see SparkOssDemo.scala on GitHub.

Use PySpark

  1. Log on to the master node via SSH.

  2. Start PySpark:

    pyspark

  3. Run the following code. Replace <yourBucket> with your OSS bucket name.

    >>> pathIn = "oss://<yourBucket>/path/to/read"
    >>> df = spark.read.text(pathIn)
    >>> cnt = df.count()
    >>> print(cnt)
    >>> outputPath = "oss://<yourBucket>/path/to/write"
    >>> df.write.format("parquet").mode("overwrite").save(outputPath)

    The code reads the input path as a text DataFrame, prints the row count, and writes the result in Parquet format to the output path.

Use Spark SQL

  1. Log on to the master node via SSH.

  2. Start the Spark SQL CLI:

    spark-sql

  3. Create a database stored in OSS, create a CSV table, and insert a row. Replace <yourBucket> with your OSS bucket name.

    CREATE DATABASE test_db LOCATION "oss://<yourBucket>/test_db";
    USE test_db;
    CREATE TABLE student (id INT, name STRING, age INT)
        USING CSV OPTIONS ("delimiter"=";", "header"="true");
    INSERT INTO student VALUES(1, "ab", 12);
    SELECT * FROM student;

    The CREATE TABLE statement uses the following CSV options:

    Parameter   Description
    delimiter   The character used to separate fields in the CSV file.
    header      Set to true if the first row contains column names; false otherwise.

    The SELECT statement returns:

    1    ab    12

  4. To verify the result, check the CSV file in OSS. The file uses semicolon delimiters, with the first row as headers:

    id;name;age
    1;ab;12
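
To make the effect of the delimiter and header options concrete, the following minimal sketch parses the same two lines with Python's standard csv module instead of Spark. The function name parse_semicolon_csv is hypothetical and the sample text is copied from the file contents shown above. It only illustrates how a semicolon-delimited file with a header row is interpreted; it does not read from OSS.

```python
import csv
import io


def parse_semicolon_csv(text):
    """Parse semicolon-delimited CSV text whose first row holds column names."""
    # delimiter=";" mirrors the "delimiter" option of the table;
    # DictReader treating the first row as field names mirrors "header"="true".
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return reader.fieldnames, list(reader)


# Sample content copied from the CSV file written by the INSERT statement.
sample = "id;name;age\n1;ab;12\n"

columns, rows = parse_semicolon_csv(sample)
print(columns)
print(rows)
```

All values come back as strings here. In Spark, the column types declared in the CREATE TABLE statement determine how each field is read.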

Access OSS by specifying an AccessKey pair

Use this method when password-free access is unavailable, or when you need to authenticate with a specific AccessKey pair.

Step 1: Remove the password-free configuration

Remove the fs.oss.credentials.provider parameter from the core-site.xml file of the Hadoop-Common service.

Step 2: Verify that password-free access is removed

Run the following command:

hadoop fs -ls oss://<yourBucket>/test_db

If the removal succeeded, the command fails with the following error:

ls: ERROR: not found login secrets, please configure the accessKeyId and accessKeySecret.

Step 3: Add the AccessKey parameters to core-site.xml

In the core-site.xml file of the Hadoop-Common service, add the following parameters:

Key                      Example value       Description
fs.oss.accessKeyId       LTAI5tM85Z4sc****   Your AccessKey ID
fs.oss.accessKeySecret   HF7P1L8PS6Eqf****   Your AccessKey secret
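
If you have not worked with Hadoop configuration files before, the sketch below renders the two parameters in the standard Hadoop property layout (a name element and a value element inside a property element). The helper property_xml and the placeholder values are hypothetical, and how you apply the change (console or file edit) depends on your cluster setup, which this page does not specify.

```python
from xml.sax.saxutils import escape


def property_xml(name, value):
    """Render one Hadoop-style <property> entry, escaping XML special characters."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )


# Placeholder values only; substitute your own AccessKey pair.
settings = {
    "fs.oss.accessKeyId": "YOUR_ACCESS_KEY_ID",
    "fs.oss.accessKeySecret": "YOUR_ACCESS_KEY_SECRET",
}

for key, value in settings.items():
    print(property_xml(key, value))
```

Escaping matters because an AccessKey secret that happens to contain a character such as & would otherwise produce an invalid XML file.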

Step 4: Verify the AccessKey configuration

Run the following command:

hadoop fs -ls oss://<yourBucket>/test_db

If the AccessKey pair is configured correctly, the output lists the contents of the OSS path:

drwxrwxrwx   - root root          0 2025-02-24 11:45 oss://<yourBucket>/test_db/student
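
The checks in Step 2 and Step 4 can also be scripted. The following sketch assumes the hadoop command is on the PATH of the node where it runs. The function names are hypothetical, and the marker string is taken from the error message shown in Step 2; if your version prints a different message, adjust the marker.

```python
import subprocess

# Substring of the error shown in Step 2 when no credentials are configured.
MISSING_CREDENTIALS_MARKER = "not found login secrets"


def credentials_missing(output):
    """Return True if the command output contains the missing-credentials error."""
    return MISSING_CREDENTIALS_MARKER in output


def list_oss_path(path):
    """Run `hadoop fs -ls <path>` and return (exit code, combined stdout and stderr)."""
    result = subprocess.run(
        ["hadoop", "fs", "-ls", path],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr


def describe_access(path):
    """Summarize whether the path is readable with the current configuration."""
    code, output = list_oss_path(path)
    if credentials_missing(output):
        return "no credentials configured"
    return "accessible" if code == 0 else "failed for another reason"
```

For example, describe_access("oss://<yourBucket>/test_db") should report "no credentials configured" after Step 2 and "accessible" after Step 4, provided the marker matches the message on your cluster.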

Step 5: Restart Spark services

Restart all Spark-related services. After they are running, use Spark Shell, PySpark, or Spark SQL to read from and write to OSS.

FAQ

How do I read from one bucket and write to another when they use different credentials?

Configure a bucket-level credential provider using the fs.oss.bucket.<BucketName>.credentials.provider parameter, where <BucketName> is the name of the bucket you want to configure. For details, see Configure a credential provider of OSS or OSS-HDFS by bucket.
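
As a small illustration of the naming pattern, the sketch below builds the parameter key for two buckets. The bucket names and the function bucket_provider_key are hypothetical, and the provider value is left as a placeholder because it depends on the credential provider you choose (see the linked topic).

```python
def bucket_provider_key(bucket_name):
    """Build the bucket-level credential provider key for one bucket."""
    return f"fs.oss.bucket.{bucket_name}.credentials.provider"


# Hypothetical bucket names: one to read from, one to write to.
buckets = ["source-bucket", "target-bucket"]

# Placeholder value; replace it with the provider described in the linked topic.
overrides = {bucket_provider_key(name): "<provider for this bucket>" for name in buckets}

for key, value in overrides.items():
    print(f"{key} = {value}")
```

Each bucket gets its own key, so the read bucket and the write bucket can use different credentials in the same job.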

How do I access a bucket in a different region?

Include the bucket's public endpoint in the URI, using the format oss://<BucketName>.<public endpoint of the bucket>/. Cross-region access incurs data transfer fees and can be less stable than access within the same region.
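
The following sketch shows how such a URI could be assembled. The function cross_region_uri and the bucket name are hypothetical, and oss-cn-hangzhou.aliyuncs.com is used only as an example endpoint; look up the actual public endpoint for your bucket's region.

```python
def cross_region_uri(bucket, endpoint, key=""):
    """Build an OSS URI that names the bucket's public endpoint explicitly."""
    # Strip a leading slash from the key so the URI never contains "//" after the host.
    return f"oss://{bucket}.{endpoint}/{key.lstrip('/')}"


# Example values only: a hypothetical bucket and an example public endpoint.
uri = cross_region_uri("my-bucket", "oss-cn-hangzhou.aliyuncs.com", "path/to/read")
print(uri)
```

The resulting string can be used anywhere the earlier examples use an oss://<yourBucket>/... path.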

How do I use Amazon S3 SDKs to access OSS?

OSS provides Amazon S3-compatible API operations. After migrating data from Amazon S3 to OSS, update your client configuration to point to OSS endpoints. For details, see Use Amazon S3 SDKs to access OSS.