Object Storage Service: Use Spark on an EMR cluster to process data stored in OSS-HDFS

Last Updated: Oct 24, 2023

This topic describes how to use Spark on an E-MapReduce (EMR) cluster to process data stored in OSS-HDFS.

Prerequisites

Procedure

  1. Log on to the EMR cluster.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. Click the EMR cluster that you created.

    3. Click the Nodes tab, and then click the expand icon on the left side of the node group.

    4. Click the ID of the ECS instance. On the Instances page, click Connect next to the instance ID.

    For more information about how to log on to a cluster in Windows or Linux by using an SSH key pair or SSH password, see Log on to a cluster.

  2. Run the following command in the terminal to start the Spark shell:

    spark-shell

  3. Use Spark to access OSS-HDFS.

    1. Create a table.

      spark.sql("CREATE TABLE test_oss (`c1` string) OPTIONS (PATH 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/dir')")

    2. Insert data into the table.

      spark.sql("INSERT INTO TABLE test_oss SELECT 'testdata' AS c1")

    3. Query data in the table.

      spark.sql("SELECT c1 FROM test_oss").show()
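
The three statements in step 3 differ between environments only in the OSS-HDFS path, so it can help to build that path from its parts. The following is a minimal sketch and not part of the official procedure. The helper name ossHdfsPath and the values tablePath and statements are hypothetical, and the path layout (bucket, region, then the oss-dls.aliyuncs.com suffix) is inferred from the example path above, so check it against the endpoint of your own bucket.

```scala
// Hypothetical helper: composes an OSS-HDFS URI with the same shape as the
// example path in the CREATE TABLE step. The layout is an assumption
// inferred from that example.
def ossHdfsPath(bucket: String, region: String, dir: String): String =
  s"oss://${bucket}.${region}.oss-dls.aliyuncs.com/${dir}"

// Same bucket, region, and directory as in the example above.
val tablePath = ossHdfsPath("examplebucket", "cn-hangzhou", "dir")

// The statements from step 3, in execution order.
val statements = Seq(
  s"CREATE TABLE test_oss (`c1` string) OPTIONS (PATH '${tablePath}')",
  "INSERT INTO TABLE test_oss SELECT 'testdata' AS c1",
  "SELECT c1 FROM test_oss"
)
```

In a Spark Shell session, where the spark session object is already defined, each statement could then be run with statements.foreach(q => spark.sql(q).show()). Only the final SELECT is expected to print table rows.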