This topic describes how to use the Spark connector provided by Hologres to read data in Apache Spark and write the data to Hologres.
Background information
Apache Spark is a unified analytics engine that processes large amounts of data. Hologres is integrated with both the Apache Spark community and Apache Spark on Amazon EMR to help you build data warehouses. Hologres provides the Spark connector so that you can write data from Apache Spark to Hologres in batches. In Spark, you can read data from multiple types of sources, such as files, Hive tables, MySQL tables, and PostgreSQL tables, and then use the connector to write that data to Hologres.
Hologres is compatible with PostgreSQL. You can use the Spark connector to read Hologres data through the PostgreSQL Java Database Connectivity (JDBC) driver. You can then perform extract, transform, and load (ETL) operations on the data and write the results to Hologres or other destinations.
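
The JDBC read path described above can be sketched in PySpark as follows. This is a minimal sketch, not the connector's own API: it only assembles options for Spark's built-in jdbc data source. The helper names hologres_jdbc_options and read_hologres_table are hypothetical, every endpoint, database, table, and credential value is a placeholder, and the PostgreSQL JDBC driver JAR is assumed to be on the Spark classpath.

```python
def hologres_jdbc_options(host, port, database, table, user, password):
    """Build options for Spark's built-in JDBC source (hypothetical helper).

    Hologres speaks the PostgreSQL protocol, so a standard PostgreSQL
    JDBC URL and driver class are assumed here.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_hologres_table(spark, options):
    """Load a table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder values only; replace them with your own instance details.
opts = hologres_jdbc_options(
    "<endpoint>", 80, "<database>", "public.<table>", "<user>", "<password>"
)
```

In a pyspark shell, where a SparkSession named spark already exists, read_hologres_table(spark, opts) would return a DataFrame that you can transform before writing it to a destination.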
Prerequisites
- The version of your Hologres instance is V0.9 or later. You can view the version of your Hologres instance on the instance details page in the Hologres console. If the version of your Hologres instance is earlier than V0.9, submit a ticket to update the instance.
- A version of Apache Spark that supports the spark-shell command is installed.
Use the Spark connector to write data to Hologres (recommended)
We recommend that you use the built-in Spark connector of Hologres to write data to Hologres. The Spark connector works together with Holo Client. Compared with other methods of writing data, the Spark connector provides better write performance. To use the Spark connector to write data, perform the following steps. For information about the sample code, see Example of using the Spark connector to write data to Hologres.
Example of using the Spark connector to write data to Hologres
The following example shows how to use the Spark connector to write data to Hologres.
Use the Spark connector to read data and write the data to Hologres
Use the Spark connector to write data to Hologres in real time
Data type mapping
The following table describes the data type mappings between Spark and Hologres.
Spark data type | Hologres data type |
---|---|
IntegerType | INT |
LongType | BIGINT |
StringType | TEXT |
DecimalType | NUMERIC(38, 18) |
BooleanType | BOOL |
DoubleType | DOUBLE PRECISION |
FloatType | FLOAT |
TimestampType | TIMESTAMPTZ |
DateType | DATE |
BinaryType | BYTEA |
ArrayType(IntegerType) | int4[] |
ArrayType(LongType) | int8[] |
ArrayType(FloatType) | float4[] |
ArrayType(DoubleType) | float8[] |
ArrayType(BooleanType) | boolean[] |
ArrayType(StringType) | text[] |
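
To show how the mappings in the table can be applied, the following sketch encodes them as a Python dictionary and derives a CREATE TABLE statement for a target Hologres table from a list of Spark column types. The names SPARK_TO_HOLOGRES, hologres_column_type, and create_table_ddl are hypothetical, and the type names are plain strings that mirror the table rather than objects from pyspark.sql.types.

```python
# Mapping copied from the data type mapping table in this topic.
SPARK_TO_HOLOGRES = {
    "IntegerType": "INT",
    "LongType": "BIGINT",
    "StringType": "TEXT",
    "DecimalType": "NUMERIC(38, 18)",
    "BooleanType": "BOOL",
    "DoubleType": "DOUBLE PRECISION",
    "FloatType": "FLOAT",
    "TimestampType": "TIMESTAMPTZ",
    "DateType": "DATE",
    "BinaryType": "BYTEA",
    "ArrayType(IntegerType)": "int4[]",
    "ArrayType(LongType)": "int8[]",
    "ArrayType(FloatType)": "float4[]",
    "ArrayType(DoubleType)": "float8[]",
    "ArrayType(BooleanType)": "boolean[]",
    "ArrayType(StringType)": "text[]",
}


def hologres_column_type(spark_type):
    """Return the Hologres type listed for a Spark type name."""
    try:
        return SPARK_TO_HOLOGRES[spark_type]
    except KeyError:
        raise ValueError(
            f"No Hologres mapping listed for Spark type {spark_type!r}"
        ) from None


def create_table_ddl(table, columns):
    """Build a CREATE TABLE statement from (column name, Spark type) pairs."""
    cols = ",\n".join(
        f"    {name} {hologres_column_type(spark_type)}"
        for name, spark_type in columns
    )
    return f"CREATE TABLE {table} (\n{cols}\n);"


print(create_table_ddl("demo_table", [("id", "LongType"), ("tags", "ArrayType(StringType)")]))
```

A type that is not listed in the table raises a ValueError instead of falling back to a guessed type, so schema mismatches surface before any data is written.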