Processing MaxCompute data with Apache Spark often involves inefficient data transfers between the two systems. The MaxCompute Spark Connector resolves this by providing high-throughput, direct access to your MaxCompute data. By leveraging the MaxCompute Storage API, the connector bypasses the SQL layer, enabling Spark to read data directly from the underlying storage. This eliminates data exports, reduces latency, and significantly improves performance, especially for large-scale workloads. Furthermore, its integration with the Spark Catalog API lets you manage and query MaxCompute tables using standard Spark SQL, just as you would with native Spark tables.
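As a preview of what this looks like in practice, once the connector JAR is installed and the odps catalog is configured (see the procedure below), a MaxCompute table can be queried with plain Spark SQL. The project and table names here are placeholders, not values from your environment:

```sql
-- Example only: my_project and my_table are placeholders for your MaxCompute project and table
SELECT * FROM odps.my_project.my_table LIMIT 10;
```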
Scope and limitations
When a third-party engine accesses MaxCompute:
- You can read standard tables, partitioned tables, clustered tables, Delta Tables, and materialized views.
- You cannot read MaxCompute external tables or logical views.
- Reading the JSON data type is not supported.
Procedure
Purchase a subscription-based Data Transmission Service resource group.
Deploy a Spark developer environment.

Use a compatible Spark version (3.2.x to 3.5.x). Download the package from the Apache Spark website and extract it to a local directory (see the sketch after this list).

- To set up the environment on Linux, see Set up a Linux developer environment.
- To set up the environment on Windows, see Set up a Windows developer environment.
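The following is a minimal sketch of downloading and extracting Spark on Linux and pointing SPARK_HOME at it. The version (3.3.1), download URL, and install path are illustrative assumptions; adjust them to your environment.

```bash
## Example only: download and extract Spark 3.3.1 (adjust the version and mirror as needed)
wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar -xzf spark-3.3.1-bin-hadoop3.tgz -C /opt/

## Point SPARK_HOME at the extracted directory so the later steps can reference it
export SPARK_HOME=/opt/spark-3.3.1-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
```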
Download and compile the Spark connector. The connector version must correspond to your Spark version. This guide uses Spark 3.3.1 as an example. Ensure Git is installed in your environment.

```bash
## Download the Spark connector
git clone https://github.com/aliyun/aliyun-maxcompute-data-collectors.git

## Switch to the spark-connector folder
cd aliyun-maxcompute-data-collectors/spark-connector

## Compile the package
mvn clean package

## Location of the Datasource JAR package:
## datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar

## Copy the Datasource JAR package to the $SPARK_HOME/jars/ folder
cp datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar $SPARK_HOME/jars/
```
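To confirm that the connector is on the Spark classpath, you can check that the JAR is present in $SPARK_HOME/jars/. This is a quick sanity check that assumes the Spark 3.3.1 build shown above:

```bash
## Verify that the connector JAR was copied into the Spark classpath
ls $SPARK_HOME/jars/ | grep spark-odps-datasource
```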
Configure your MaxCompute access credentials. In your $SPARK_HOME/conf directory, create a spark-defaults.conf file:

```bash
cd $SPARK_HOME/conf
vim spark-defaults.conf
```

Add the following configurations to the spark-defaults.conf file:

```properties
## Configure the account in spark-defaults.conf
spark.hadoop.odps.project.name=doc_test
spark.hadoop.odps.access.id=L********************
spark.hadoop.odps.access.key=*******************
spark.hadoop.odps.end.point=http://service.cn-beijing.maxcompute.aliyun.com/api
spark.hadoop.odps.tunnel.quota.name=ot_xxxx_p#ot_xxxx

## Configure the MaxCompute Catalog
spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog
spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions
```
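If you prefer not to edit spark-defaults.conf, the same settings can also be passed as --conf options when launching Spark. This is an optional sketch using the example catalog settings above; the account settings (access ID, access key, endpoint, and tunnel quota) would be passed in the same way:

```bash
## Example only: pass the connector configuration on the command line instead of spark-defaults.conf
$SPARK_HOME/bin/spark-sql \
  --conf spark.hadoop.odps.project.name=doc_test \
  --conf spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions
```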
Use the Spark connector to access MaxCompute.

Run the following command in the bin folder of Spark to start the Spark SQL client:

```bash
cd $SPARK_HOME/bin
spark-sql
```

List the tables in your MaxCompute project:

```sql
SHOW TABLES IN odps.doc_test;
```

doc_test is an example of a MaxCompute project name. Replace it with your actual project name.
Create a table:

```sql
CREATE TABLE odps.doc_test.mc_test_table (name STRING, num BIGINT);
```
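The table is empty right after creation, so the read in the next step would return no rows. If you want the SELECT to return data, you can optionally insert a couple of sample rows first; this uses standard Spark SQL VALUES syntax and is not part of the original procedure:

```sql
-- Example only: add sample rows so the following SELECT returns data
INSERT INTO odps.doc_test.mc_test_table VALUES ('test1', 1), ('test2', 2);
```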
Read data from the table:

```sql
SELECT * FROM odps.doc_test.mc_test_table;
```
Create a partitioned table:

```sql
CREATE TABLE odps.doc_test.mc_test_table_pt (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING);
```
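To reproduce output similar to the example shown below, you can optionally load a couple of rows into one partition first. This is a sketch that uses Spark SQL's static-partition INSERT syntax; the partition values are illustrative:

```sql
-- Example only: write two rows into the partition pt1='2018', pt2='0601'
INSERT INTO odps.doc_test.mc_test_table_pt PARTITION (pt1 = '2018', pt2 = '0601')
VALUES ('test1', 1), ('test2', 2);
```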
Read data from the partitioned table:

```sql
SELECT * FROM odps.doc_test.mc_test_table_pt;
```

Example output:
```
test1   1   2018    0601
test2   2   2018    0601
Time taken: 1.312 seconds, Fetched 2 row(s)
```
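If you only need specific partitions, you can filter on the partition columns. This is an optional example; the connector is expected to push such predicates down for partition pruning, although the exact behavior depends on the connector version:

```sql
-- Example only: read a single partition by filtering on the partition columns
SELECT * FROM odps.doc_test.mc_test_table_pt WHERE pt1 = '2018' AND pt2 = '0601';
```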
Delete the table:

```sql
DROP TABLE IF EXISTS odps.doc_test.mc_test_table;
```