By Jun Shi and Mingzhou Zhou
In the machine learning community, Apache Spark is widely used for data processing due to its efficiency in SQL-style operations, while TensorFlow is one of the most popular frameworks for model training. Although there are some data formats supported by both tools, TFRecord—the data format native to TensorFlow—is not fully supported by Spark. While there have been prior attempts to bridge the gap between these two systems (Spark-Tensorflow-Connector, for example), existing implementations leave out some important features provided by Spark.
In this post, we introduce and open source a new data source for Spark, Spark-TFRecord. The goal of Spark-TFRecord is to provide full support for the native TensorFlow data format in Spark. The intent of this project is to uplevel TFRecord as a first-class citizen in the Spark data source community, on par with other internal formats such as Avro, JSON, and Parquet. Spark-TFRecord provides not only simple functions, such as DataFrame read and write, but also advanced ones, such as PartitionBy. As a result, a smooth data processing and training pipeline in TFRecord is possible.
Both TensorFlow and Spark are widely used at LinkedIn. Spark is used in many data processing and preparation pipelines. It is also the leading tool for data analytics. As more business units employ deep learning models, TensorFlow has become the mainstream modeling and serving tool. Open source TensorFlow models mainly use the TFRecord data format, while most of our internal datasets are in Avro format. In order to use open source models, we have to either change the model source code to take Avro files, or convert our datasets to TFRecord. This project facilitates the latter.
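To make the conversion concrete, here is a minimal sketch of turning an Avro dataset into TFRecord files with Spark-TFRecord. The paths are hypothetical, and it assumes the spark-avro and spark-tfrecord packages are on the Spark classpath.

```scala
// Sketch only: paths are hypothetical; assumes spark-avro and
// spark-tfrecord are available to the Spark session.
val avroDf = spark.read.format("avro").load("/data/training/features.avro")

// Write the same rows out as TFRecord "Example" records for TensorFlow.
avroDf.write
  .format("tfrecord")
  .option("recordType", "Example")
  .save("/data/training/features-tfrecord")
```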
Prior to Spark-TFRecord, the most popular tool to read and write TFRecord in Spark has been Spark-Tensorflow-Connector. It is part of the TensorFlow ecosystem and has been promoted by Databricks, the creator of Spark. Although it supports basic functions such as read and write, we noticed two disadvantages of its implementation for our use cases at LinkedIn. First, it is based on the RelationProvider interface. This interface is mainly for connecting Spark to a database (hence the name "connector"). In that setting, the disk read and write operations are provided by the database. However, the main use case of Spark-Tensorflow-Connector is disk I/O, rather than connecting to a database. In the absence of a database, the I/O operations have to be provided by the developers who implement the RelationProvider interface. This is why a considerable amount of code in Spark-Tensorflow-Connector is dedicated to various disk read and write scenarios.
In addition, Spark-Tensorflow-Connector lacks important functions such as PartitionBy, which splits the dataset according to a certain column. We find this function useful at LinkedIn when we need to train models for each entity, because it allows us to partition the training data by the entity IDs. Demand for this function runs high in the TensorFlow community, as well.
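As an illustration of that per-entity use case, the sketch below partitions training data by an entity ID column. The column name `memberId`, the sample data, and the output path are all hypothetical.

```scala
// Hypothetical training data; "memberId" is an assumed entity-ID column.
val trainingDf = Seq((1L, 0.5), (1L, 0.9), (2L, 0.7))
  .toDF("memberId", "feature")

// PartitionBy writes one subdirectory per entity, e.g. memberId=1/,
// memberId=2/, so each per-entity model reads only its own slice.
trainingDf.write
  .partitionBy("memberId")
  .format("tfrecord")
  .option("recordType", "Example")
  .save("/data/per-entity-training")
```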
Spark-TFRecord fills these gaps by realizing the more versatile FileFormat interface, which is also used by other native formats such as Avro and Parquet. With this interface, all the DataFrame and DataSet I/O APIs are automatically available to TFRecord, including the sought-after PartitionBy function. In addition, future Spark I/O enhancements are automatically available through the interface.
We initially considered patching Spark-Tensorflow-Connector to obtain the PartitionBy function that we needed. But after examining its source code, we realized that RelationProvider, which Spark-Tensorflow-Connector is based on, is a Spark interface to SQL databases, making it unsuitable for our purpose. There is no simple fix, since RelationProvider is not designed to provide disk I/O operations. Instead, we took a different route and implemented FileFormat, which is designed for file-based I/O operations. This fits our use cases at LinkedIn, where datasets are typically read from and written to disk directly, making FileFormat the more appropriate interface for those tasks.
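For orientation, the core of Spark's FileFormat interface looks roughly like the sketch below. This is a simplified outline of the trait's shape (signatures abbreviated from Spark's internal trait), not Spark-TFRecord's actual implementation.

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.execution.datasources.OutputWriterFactory

// Simplified sketch of the methods a FileFormat implementation such as
// Spark-TFRecord must provide.
trait SimplifiedFileFormat {
  // Derive a Spark schema by sampling the underlying files
  // (TFRecord files, in Spark-TFRecord's case).
  def inferSchema(
      spark: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType]

  // Return a factory that creates per-partition writers, which serialize
  // rows into the on-disk format (tf.Example records for TFRecord).
  def prepareWrite(
      spark: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory
}
```

Because Spark's DataFrame readers and writers are built against this interface, functions such as partitionBy come along automatically once the format implements it.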
The following diagram shows the building blocks.
[Diagram: Building blocks of Spark-TFRecord]
Spark-TFRecord is fully backward-compatible with Spark-Tensorflow-Connector. Migration is easy: just include the spark-tfrecord jar file and specify the data format as "tfrecord". The example below shows how to use Spark-TFRecord to read, write, and partition TFRecord files. More examples can be found at our GitHub repository.
```scala
// Launch spark-shell with the following command:
// SPARK_HOME/bin/spark-shell --jars target/spark-tfrecord_2.11-0.1.jar

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

val df = Seq((8, "bat"), (8, "abc"), (1, "xyz"), (2, "aaa")).toDF("number", "word")
df.show
// scala> df.show
// +------+----+
// |number|word|
// +------+----+
// |     8| bat|
// |     8| abc|
// |     1| xyz|
// |     2| aaa|
// +------+----+

val tf_output_dir = "/tmp/tfrecord-test"

// Dump the TFRecords to files, partitioned by the "number" column.
df.repartition(3, col("number"))
  .write.mode(SaveMode.Overwrite)
  .partitionBy("number")
  .format("tfrecord")
  .option("recordType", "Example")
  .save(tf_output_dir)

// ls /tmp/tfrecord-test
// _SUCCESS  number=1  number=2  number=8

// Read the TFRecords back from files.
val new_df = spark.read.format("tfrecord").option("recordType", "Example").load(tf_output_dir)
new_df.show
// scala> new_df.show
// +----+------+
// |word|number|
// +----+------+
// | bat|     8|
// | abc|     8|
// | xyz|     1|
// | aaa|     2|
// +----+------+
```
Spark-TFRecord elevates TFRecord to be a first-class citizen within Spark, on par with other internal data formats. The full set of DataFrame APIs, such as read, write, and partitionBy, is supported by this library. Currently, we limit the schemas to those supported by Spark-Tensorflow-Connector. Future work will expand to more complex schemas.
The authors would like to thank Min Shen, Liang Tang, Fangshi Li, Jun Jia, and Leon Gao for technical discussions, and Huiji Gao for help with resources.