MaxCompute open storage allows Spark to call the Storage API through a connector and read data directly from MaxCompute. This simplifies the data reading process and improves access performance. By integrating with the storage capabilities of MaxCompute, Spark provides efficient, flexible, and powerful data processing and analysis.
Usage notes
When a third-party engine accesses MaxCompute:
- You can read standard tables, partitioned tables, clustered tables, Delta tables, and materialized views.
- You cannot read MaxCompute external tables or logical views.
- Reading the JSON data type is not supported.
Procedure
Purchase a subscription Data Transmission Service resource group.
Deploy a Spark developer environment.
Use a Spark package from Spark 3.2.x to Spark 3.5.x. You can click Spark to download the package and extract it to a local folder.
To set up the environment on a Linux operating system, see Set up a Linux developer environment.
To set up the environment on a Windows operating system, see Set up a Windows developer environment.
Download and compile the Spark connector. (Only Spark 3.2.x to 3.5.x is supported. This topic uses Spark 3.3.1 as an example.)
Run the `git clone` command to download the Spark connector installation package. Make sure that Git is installed in the environment; otherwise, an error occurs when you run the command.

```shell
## Download the Spark connector
git clone https://github.com/aliyun/aliyun-maxcompute-data-collectors.git

## Switch to the spark-connector folder
cd aliyun-maxcompute-data-collectors/spark-connector

## Compile
mvn clean package

## The Datasource JAR package is generated at:
## datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar

## Copy the Datasource JAR package to the $SPARK_HOME/jars/ folder
cp datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar $SPARK_HOME/jars/
```
Configure the access information for your MaxCompute account.
In the `conf` folder of Spark, create a `spark-defaults.conf` file:

```shell
cd $SPARK_HOME/conf
vim spark-defaults.conf
```

Configure the account information in the `spark-defaults.conf` file:

```properties
## Configure the account in spark-defaults.conf
spark.hadoop.odps.project.name=doc_test
spark.hadoop.odps.access.id=L********************
spark.hadoop.odps.access.key=*******************
spark.hadoop.odps.end.point=http://service.cn-beijing.maxcompute.aliyun.com/api
spark.hadoop.odps.tunnel.quota.name=ot_xxxx_p#ot_xxxx

## Configure the MaxCompute Catalog
spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog
spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions
```
Use the Spark connector to access MaxCompute.
Run the following command in the `bin` folder of Spark to start the Spark SQL client:

```shell
cd $SPARK_HOME/bin
spark-sql
```
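After the client starts, you can optionally set the current catalog and database so that later statements do not need the `odps.` prefix. This uses standard Spark SQL; the catalog name `odps` comes from the configuration above. A minimal sketch:

```sql
-- Set the current catalog and database; "odps" is the catalog name configured in spark-defaults.conf
USE odps.doc_test;
SHOW TABLES;
```

The examples below keep fully qualified names so that they work regardless of the current database.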
Query the tables in the MaxCompute project:

```sql
SHOW TABLES IN odps.doc_test;
```

`doc_test` is an example of a MaxCompute project name. Replace it with your actual project name.
Create a table:

```sql
CREATE TABLE odps.doc_test.mc_test_table (name STRING, num BIGINT);
```
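The new table is empty, so the query in the next step returns no rows until data is written. If your Tunnel quota permits writes through the connector, you can load sample rows first; the following values are hypothetical:

```sql
-- Hypothetical sample rows so that the next query returns data
INSERT INTO odps.doc_test.mc_test_table VALUES ('test1', 1), ('test2', 2);
```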
Read data from the table:

```sql
SELECT * FROM odps.doc_test.mc_test_table;
```
Create a partitioned table:

```sql
CREATE TABLE odps.doc_test.mc_test_table_pt (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING);
```
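The sample result shown below contains two rows in the partition `pt1='2018', pt2='0601'`. Assuming writes are enabled for your quota, equivalent test data can be loaded with a static partition insert, as in this sketch:

```sql
-- Static partition insert; the values mirror the sample result below
INSERT INTO odps.doc_test.mc_test_table_pt PARTITION (pt1 = '2018', pt2 = '0601')
VALUES ('test1', 1), ('test2', 2);
```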
Read data from the partitioned table:

```sql
SELECT * FROM odps.doc_test.mc_test_table_pt;
```

The following code provides an example of the result:
```
test1   1   2018    0601
test2   2   2018    0601
Time taken: 1.312 seconds, Fetched 2 row(s)
```
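Because `pt1` and `pt2` are partition columns, you can filter on them to restrict a query to a single partition. A minimal sketch:

```sql
-- Read only the pt1='2018', pt2='0601' partition
SELECT * FROM odps.doc_test.mc_test_table_pt WHERE pt1 = '2018' AND pt2 = '0601';
```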
Delete the table:

```sql
DROP TABLE IF EXISTS odps.doc_test.mc_test_table;
```
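To also clean up the partitioned example table, run the same statement against it:

```sql
-- Remove the partitioned example table as well
DROP TABLE IF EXISTS odps.doc_test.mc_test_table_pt;
```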