Spark can process data in JindoFileSystem (JindoFS) in two ways: by calling resilient distributed dataset (RDD) read and write methods on JindoFS paths, or by using Spark SQL to query tables stored in JindoFS.

JindoFS configuration

The examples in this topic assume that a namespace named emr-jfs is created with the following configuration:

  • jfs.namespaces=emr-jfs
  • jfs.namespaces.emr-jfs.uri=oss://oss-bucket/oss-dir
  • jfs.namespaces.emr-jfs.mode=block
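
With this configuration, paths in the namespace are addressed as jfs://emr-jfs/. The following minimal sketch checks that the namespace is reachable by listing its root directory through the standard Hadoop FileSystem API. It assumes that the jfs scheme is already registered in the Hadoop configuration on the cluster. The object name ListJindoFsRoot and the helper jfsUri are hypothetical names used only for illustration.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListJindoFsRoot {
  // Hypothetical helper: builds a jfs:// URI from a namespace and a relative path.
  def jfsUri(namespace: String, relativePath: String = ""): String =
    s"jfs://$namespace/" + relativePath.stripPrefix("/")

  def main(args: Array[String]): Unit = {
    val root = jfsUri("emr-jfs")
    // new Configuration() loads the Hadoop settings found on the classpath.
    // Assumption: those settings map the jfs scheme to JindoFS.
    val fs = FileSystem.get(new URI(root), new Configuration())
    // Print every entry directly under the namespace root.
    fs.listStatus(new Path(root)).foreach(status => println(status.getPath))
  }
}
```

If the listing fails with an error about an unknown file system scheme, the jfs scheme is probably not configured on the node that runs the code.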

Process data in JindoFS

  • Call methods

    Spark reads data from and writes data to JindoFS in the same way as it does with other file systems. For example, to read data from JindoFS, pass a path that starts with the jfs:// prefix to an RDD operation:

    val a = sc.textFile("jfs://emr-jfs/")

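    To write data to JindoFS, call saveAsTextFile on the RDD and specify an output directory that starts with the jfs:// prefix. By default, the output directory must not already exist.

    a.saveAsTextFile("jfs://emr-jfs/output")

  • Use Spark SQL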

    When you create databases, tables, or partitions, set the storage location (the LOCATION parameter) to a directory in JindoFS. For more information, see Use Hive to query data in JindoFS. You can then use Spark SQL to query the tables stored in JindoFS.
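
The two approaches above can be sketched together in one small Spark application. This is an illustration under stated assumptions, not a reference implementation: the input directory jfs://emr-jfs/input, the output directory jfs://emr-jfs/output-upper, the database jfs_demo, the table events, and the warehouse paths are all hypothetical, and Hive support is assumed to be available on the cluster.

```scala
import org.apache.spark.sql.SparkSession

object JindoFsSparkExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JindoFsSparkExample")
      .enableHiveSupport() // assumption: Hive classes and a metastore are available
      .getOrCreate()
    val sc = spark.sparkContext

    // Call methods: read text files from JindoFS and write the result back.
    // Both directories are hypothetical; the output directory must not exist yet.
    val lines = sc.textFile("jfs://emr-jfs/input")
    lines.map(_.toUpperCase).saveAsTextFile("jfs://emr-jfs/output-upper")

    // Use Spark SQL: point the database and table locations at JindoFS.
    spark.sql(
      """CREATE DATABASE IF NOT EXISTS jfs_demo
        |LOCATION 'jfs://emr-jfs/warehouse/jfs_demo.db'""".stripMargin)
    spark.sql(
      """CREATE TABLE IF NOT EXISTS jfs_demo.events (id BIGINT, name STRING)
        |USING parquet
        |LOCATION 'jfs://emr-jfs/warehouse/jfs_demo.db/events'""".stripMargin)
    spark.sql("INSERT INTO jfs_demo.events VALUES (1, 'a'), (2, 'b')")

    // Query the table whose data files are stored in JindoFS.
    spark.sql("SELECT id, name FROM jfs_demo.events").show()

    spark.stop()
  }
}
```

The RDD calls and the SQL statements are the same ones you would use with any other Hadoop-compatible file system. Only the jfs:// paths are specific to JindoFS.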