MaxCompute: Spark Connector

Last Updated: Dec 04, 2025

MaxCompute open storage allows Spark to use a connector to call the Storage API and directly read data from MaxCompute. This simplifies the data reading process and improves access performance. By integrating with the data storage capabilities of MaxCompute, Spark provides efficient, flexible, and powerful data processing and analysis.

Usage notes

  • When a third-party engine accesses MaxCompute:

    • You can read standard tables, partitioned tables, clustered tables, Delta Tables, and materialized views.

    • You cannot read MaxCompute foreign tables or logical views.

  • Reading the JSON data type is not supported.

Procedure

  1. Activate MaxCompute and create a MaxCompute project.

  2. Purchase a subscription Data Transmission Service resource group.

  3. Deploy a Spark developer environment.

    Use a Spark package of a version from 3.2.x to 3.5.x. Download the package from the Apache Spark download page and extract it to a local folder. An example download command is shown after the following sub-steps.

    1. To set up the environment on a Linux operating system, see Set up a Linux developer environment.

    2. To set up the environment on a Windows operating system, see Set up a Windows developer environment.
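
    For example, on a Linux system you can download and extract a Spark 3.3.1 package as follows. The download URL, archive name, and destination folder are illustrative; use the Spark version and paths that fit your environment:

    ## Download and extract Spark 3.3.1 (example archive; pick the version you need)
    wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
    tar -xzf spark-3.3.1-bin-hadoop3.tgz -C /opt
    ## Point SPARK_HOME to the extracted folder
    export SPARK_HOME=/opt/spark-3.3.1-bin-hadoop3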

  4. Download and compile the Spark connector. (Only Spark 3.2.x to 3.5.x is supported. This topic uses Spark 3.3.1 as an example.)

    Run the git clone command to download the Spark connector source code. Make sure that Git is installed in your environment; otherwise, the command fails.

    ## Download the Spark connector:
    git clone https://github.com/aliyun/aliyun-maxcompute-data-collectors.git

    ## Switch to the spark-connector folder
    cd aliyun-maxcompute-data-collectors/spark-connector

    ## Compile
    mvn clean package

    ## The Datasource JAR package is generated at:
    ## datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar

    ## Copy the Datasource JAR package to the $SPARK_HOME/jars/ folder
    cp datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar $SPARK_HOME/jars/
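
    You can optionally confirm that the JAR package was copied. The file name below assumes the Spark 3.3.1 build from this example; the version suffix may differ in your build:

    ## Verify that the Datasource JAR package is in the $SPARK_HOME/jars/ folder
    ls $SPARK_HOME/jars/ | grep spark-odps-datasource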
  5. Configure the access information for your MaxCompute account.

    In the conf folder of Spark, create a spark-defaults.conf file:

    cd $SPARK_HOME/conf
    vim spark-defaults.conf

    Configure the account information in the spark-defaults.conf file:

    ## Configure the account in spark-defaults.conf
    spark.hadoop.odps.project.name=doc_test
    spark.hadoop.odps.access.id=L********************
    spark.hadoop.odps.access.key=*******************
    spark.hadoop.odps.end.point=http://service.cn-beijing.maxcompute.aliyun.com/api
    spark.hadoop.odps.tunnel.quota.name=ot_xxxx_p#ot_xxxx
    ## Configure the MaxCompute Catalog
    spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog 
    spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions
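
    Alternatively, you can pass the same settings to the Spark SQL client as --conf options instead of editing spark-defaults.conf. The following command is only a sketch; replace the placeholder values with your own account information:

    ## Start spark-sql with the connector settings passed on the command line
    $SPARK_HOME/bin/spark-sql \
      --conf spark.hadoop.odps.project.name=doc_test \
      --conf spark.hadoop.odps.access.id=<your-access-id> \
      --conf spark.hadoop.odps.access.key=<your-access-key> \
      --conf spark.hadoop.odps.end.point=http://service.cn-beijing.maxcompute.aliyun.com/api \
      --conf spark.hadoop.odps.tunnel.quota.name=<your-tunnel-quota-name> \
      --conf spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog \
      --conf spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions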
  6. Use the Spark connector to access MaxCompute.

    1. Run the following command in the bin folder of Spark to start the Spark SQL client:

      cd $SPARK_HOME/bin
      spark-sql
    2. Query the tables in the MaxCompute project:

      SHOW TABLES IN odps.doc_test;

      doc_test is an example of a MaxCompute project name. Replace it with your actual project name.

    3. Create a table:

      CREATE TABLE odps.doc_test.mc_test_table (name STRING, num BIGINT);
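
      If you want the query in the next step to return data, you can first insert a few sample rows. This assumes that your environment allows writes through the connector; the values are illustrative:

      INSERT INTO odps.doc_test.mc_test_table VALUES ('test1', 1), ('test2', 2);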
    4. Read data from the table:

      SELECT * FROM odps.doc_test.mc_test_table;
    5. Create a partitioned table:

       CREATE TABLE odps.doc_test.mc_test_table_pt (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING);
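
      As with the previous table, you can load sample data into a specific partition before querying it. This again assumes that writes through the connector are allowed in your environment; the partition values below match the sample output shown in the next step:

      INSERT INTO odps.doc_test.mc_test_table_pt PARTITION (pt1='2018', pt2='0601') VALUES ('test1', 1), ('test2', 2);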
    6. Read data from the partitioned table:

      SELECT * FROM odps.doc_test.mc_test_table_pt;

      The following output shows an example of the returned result:

      test1   1       2018    0601
      test2   2       2018    0601
      Time taken: 1.312 seconds, Fetched 2 row(s)
    7. Delete the table:

      DROP TABLE IF EXISTS odps.doc_test.mc_test_table;
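
      If you also created the partitioned table in this walkthrough, you can drop it in the same way:

      DROP TABLE IF EXISTS odps.doc_test.mc_test_table_pt;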