Spark can process data in JindoFileSystem (JindoFS) in two ways: by calling Spark methods on JindoFS paths, or by using Spark SQL to read data from tables stored in JindoFS.

JindoFS configuration

The examples in this topic assume that a namespace named emr-jfs is created with the following configuration:

  • jfs.namespaces=emr-jfs
  • jfs.namespaces.emr-jfs.oss.uri=oss://oss-bucket/oss-dir
  • jfs.namespaces.emr-jfs.mode=block

Process data in JindoFS

  • Call methods

    Spark reads and writes data in JindoFS in the same way as in other file systems. For example, to access data in JindoFS, use a path with the jfs:// prefix in the following Resilient Distributed Dataset (RDD) operation:

    val a = sc.textFile("jfs://emr-jfs/README.md")

    To write data to JindoFS, call the following method:

    a.saveAsTextFile("jfs://emr-jfs/output")

  • Use Spark SQL

    When you create a database, table, or partition, set its storage location (for example, by using the LOCATION clause) to a directory in JindoFS. For more information, see Use Hive to query data in JindoFS. You can then use Spark SQL to query data from tables stored in JindoFS.
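
The two methods above can be combined in one Spark application. The following is a minimal sketch, not a reference implementation. It assumes that the emr-jfs namespace from this topic exists, that the job runs on a cluster where the jfs:// scheme and Hive support are available to Spark, and that Spark 2.2 or later is used for the CREATE TABLE ... USING ... LOCATION syntax. The object name, the jfsPath helper, the table name jfs_demo_lines, and the tables/jfs_demo_lines directory are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

object JindoFsSparkExample {
  // Hypothetical helper: builds a jfs:// path from a namespace and a relative path.
  def jfsPath(namespace: String, relative: String): String =
    s"jfs://$namespace/${relative.stripPrefix("/")}"

  def main(args: Array[String]): Unit = {
    // Assumption: Hive support is available so that table metadata is
    // stored in the metastore and shared with Hive.
    val spark = SparkSession.builder()
      .appName("JindoFsSparkExample")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    // Method 1: call RDD methods on a path with the jfs:// prefix.
    val lines = sc.textFile(jfsPath("emr-jfs", "README.md"))

    // Write the RDD itself, not the collected array.
    // saveAsTextFile fails if the output directory already exists.
    lines.filter(_.nonEmpty).saveAsTextFile(jfsPath("emr-jfs", "output"))

    // Method 2: create a table whose storage location is a JindoFS directory,
    // then load and query it by using Spark SQL.
    val tableDir = jfsPath("emr-jfs", "tables/jfs_demo_lines")
    spark.sql(
      s"""CREATE TABLE IF NOT EXISTS jfs_demo_lines (line STRING)
         |USING parquet
         |LOCATION '$tableDir'""".stripMargin)

    lines.toDF("line").write.insertInto("jfs_demo_lines")
    spark.sql("SELECT COUNT(*) AS line_count FROM jfs_demo_lines").show()

    spark.stop()
  }
}
```

Package the object into a JAR file and submit it with spark-submit, or paste the body of main into spark-shell, where spark and sc are already defined.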