Processing MaxCompute data with Apache Spark often involves inefficient data transfers between the two systems. The MaxCompute Spark Connector resolves this by providing high-throughput, direct access to your MaxCompute data. By leveraging the MaxCompute Storage API, the connector bypasses the SQL layer, enabling Spark to read data directly from the underlying storage. This eliminates data exports, reduces latency, and significantly improves performance, especially for large-scale workloads. Furthermore, its integration with the Spark Catalog API lets you manage and query MaxCompute tables using standard Spark SQL, just as you would with native Spark tables.
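As a preview of what this looks like in practice, once the connector JAR is installed and the odps catalog is configured (see the procedure below), a MaxCompute table can be queried with plain Spark SQL. The project and table names here are placeholders, not values from your environment:

```sql
-- Example only: my_project and my_table are placeholders for your MaxCompute project and table
SELECT * FROM odps.my_project.my_table LIMIT 10;
```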
Scope and limitations
When a third-party engine accesses MaxCompute:
- You can read standard tables, partitioned tables, clustered tables, Delta Tables, and materialized views.
- You cannot read MaxCompute external tables or logical views.
- Reading the JSON data type is not supported.
Procedure
Purchase a subscription-based Data Transmission Service resource group.
Deploy a Spark developer environment.

Use a compatible Spark version (3.2.x to 3.5.x). Download the package from the Apache Spark website and extract it to a local directory (see the sketch after this list).

- To set up the environment on Linux, see Set up a Linux developer environment.
- To set up the environment on Windows, see Set up a Windows developer environment.
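The following is a minimal sketch of downloading and extracting Spark on Linux and pointing SPARK_HOME at it. The version (3.3.1), download URL, and install path are illustrative assumptions; adjust them to your environment.

```bash
## Example only: download and extract Spark 3.3.1 (adjust the version and mirror as needed)
wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
tar -xzf spark-3.3.1-bin-hadoop3.tgz -C /opt/

## Point SPARK_HOME at the extracted directory so the later steps can reference it
export SPARK_HOME=/opt/spark-3.3.1-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
```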
Download and compile the Spark connector. The connector version must correspond to your Spark version. This guide uses Spark 3.3.1 as an example. Ensure Git is installed in your environment.

```bash
## Download the Spark connector
git clone https://github.com/aliyun/aliyun-maxcompute-data-collectors.git

## Switch to the spark-connector folder
cd aliyun-maxcompute-data-collectors/spark-connector

## Compile the package
mvn clean package

## Location of the Datasource JAR package:
## datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar

## Copy the Datasource JAR package to the $SPARK_HOME/jars/ folder
cp datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar $SPARK_HOME/jars/
```
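To confirm that the connector is on the Spark classpath, you can check that the JAR is present in $SPARK_HOME/jars/. This is a quick sanity check that assumes the Spark 3.3.1 build shown above:

```bash
## Verify that the connector JAR was copied into the Spark classpath
ls $SPARK_HOME/jars/ | grep spark-odps-datasource
```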
Configure your MaxCompute access credentials. In your $SPARK_HOME/conf directory, create a spark-defaults.conf file:

```bash
cd $SPARK_HOME/conf
vim spark-defaults.conf
```

Add the following configurations to the spark-defaults.conf file:

```properties
## Configure the account in spark-defaults.conf
spark.hadoop.odps.project.name=doc_test
spark.hadoop.odps.access.id=L********************
spark.hadoop.odps.access.key=*******************
spark.hadoop.odps.end.point=http://service.cn-beijing.maxcompute.aliyun.com/api
spark.hadoop.odps.tunnel.quota.name=ot_xxxx_p#ot_xxxx

## Configure the MaxCompute Catalog
spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog
spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions
```
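If you prefer not to edit spark-defaults.conf, the same settings can also be passed as --conf options when launching Spark. This is an optional sketch using the example catalog settings above; the account settings (access ID, access key, endpoint, and tunnel quota) would be passed in the same way:

```bash
## Example only: pass the connector configuration on the command line instead of spark-defaults.conf
$SPARK_HOME/bin/spark-sql \
  --conf spark.hadoop.odps.project.name=doc_test \
  --conf spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions
```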
Use the Spark connector to access MaxCompute.

Run the following command in the bin folder of Spark to start the Spark SQL client:

```bash
cd $SPARK_HOME/bin
spark-sql
```

List the tables in your MaxCompute project:

```sql
SHOW TABLES IN odps.doc_test;
```

doc_test is an example of a MaxCompute project name. Replace it with your actual project name.
Create a table:

```sql
CREATE TABLE odps.doc_test.mc_test_table (name STRING, num BIGINT);
```
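The table is empty right after creation, so the read in the next step would return no rows. If you want the SELECT to return data, you can optionally insert a couple of sample rows first; this uses standard Spark SQL VALUES syntax and is not part of the original procedure:

```sql
-- Example only: add sample rows so the following SELECT returns data
INSERT INTO odps.doc_test.mc_test_table VALUES ('test1', 1), ('test2', 2);
```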
Read data from the table:

```sql
SELECT * FROM odps.doc_test.mc_test_table;
```
Create a partitioned table:

```sql
CREATE TABLE odps.doc_test.mc_test_table_pt (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING);
```
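To reproduce output similar to the example shown below, you can optionally load a couple of rows into one partition first. This is a sketch that uses Spark SQL's static-partition INSERT syntax; the partition values are illustrative:

```sql
-- Example only: write two rows into the partition pt1='2018', pt2='0601'
INSERT INTO odps.doc_test.mc_test_table_pt PARTITION (pt1 = '2018', pt2 = '0601')
VALUES ('test1', 1), ('test2', 2);
```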
Read data from the partitioned table:

```sql
SELECT * FROM odps.doc_test.mc_test_table_pt;
```

Example output:
```
test1   1   2018    0601
test2   2   2018    0601
Time taken: 1.312 seconds, Fetched 2 row(s)
```
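If you only need specific partitions, you can filter on the partition columns. This is an optional example; the connector is expected to push such predicates down for partition pruning, although the exact behavior depends on the connector version:

```sql
-- Example only: read a single partition by filtering on the partition columns
SELECT * FROM odps.doc_test.mc_test_table_pt WHERE pt1 = '2018' AND pt2 = '0601';
```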
Delete the table:

```sql
DROP TABLE IF EXISTS odps.doc_test.mc_test_table;
```