MaxCompute:Spark Connector

Last Updated: Jan 06, 2026

Processing MaxCompute data with Apache Spark often involves inefficient data transfers between the two systems. The MaxCompute Spark Connector resolves this by providing high-throughput, direct access to your MaxCompute data. By leveraging the MaxCompute Storage API, the connector bypasses the SQL layer, enabling Spark to read data directly from the underlying storage. This eliminates data exports, reduces latency, and significantly improves performance, especially for large-scale workloads. Furthermore, its integration with the Spark Catalog API lets you manage and query MaxCompute tables using standard Spark SQL, just as you would with native Spark tables.
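
For example, after the connector is installed and a catalog named odps is registered as described in the procedure below, you can query a MaxCompute table with standard Spark SQL by using the catalog.project.table notation. The following is a minimal sketch; doc_test and mc_test_table are placeholder project and table names used throughout this topic:

    -- List the tables in the doc_test project through the odps catalog
    SHOW TABLES IN odps.doc_test;
    -- Query a MaxCompute table as if it were a native Spark table
    SELECT name, num FROM odps.doc_test.mc_test_table LIMIT 10;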

Scope and limitations

  • When a third-party engine accesses MaxCompute:

    • You can read standard tables, partitioned tables, clustered tables, Delta Tables, and materialized views.

    • You cannot read MaxCompute external tables or logical views.

  • Reading the JSON data type is not supported.

Procedure

  1. Activate MaxCompute and create a MaxCompute project.

  2. Purchase a subscription Data Transmission Service resource group.

  3. Deploy a Spark developer environment.

    Use a compatible Spark version (3.2.x to 3.5.x). Download the package from the official Apache Spark website and extract it to a local directory.

    1. To set up the environment on Linux, see Set up a Linux developer environment.

    2. To set up the environment on Windows, see Set up a Windows developer environment.

  4. Download and compile the Spark connector. The connector version must match your Spark version. This guide uses Spark 3.3.1 as an example. Ensure that Git and Maven are installed in your environment.

    ## Download the Spark connector:
    git clone https://github.com/aliyun/aliyun-maxcompute-data-collectors.git
    
    ## Switch to the spark-connector folder
    cd aliyun-maxcompute-data-collectors/spark-connector 
    
    ## Compile the package
    mvn clean package
    
    ## The compiled Datasource JAR package is generated at:
    ## datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar
    
    ## Copy the Datasource JAR package to the $SPARK_HOME/jars/ folder
    cp datasource/target/spark-odps-datasource-3.3.1-odps0.43.0.jar $SPARK_HOME/jars/
  5. Configure your MaxCompute access credentials. In your $SPARK_HOME/conf directory, create a spark-defaults.conf file:

    cd $SPARK_HOME/conf
    vim spark-defaults.conf

    Add the following configurations to the spark-defaults.conf file:

    ## Configure the account information in spark-defaults.conf. Replace these values with your own project name, AccessKey ID, AccessKey secret, endpoint, and Tunnel quota name.
    spark.hadoop.odps.project.name=doc_test
    spark.hadoop.odps.access.id=L********************
    spark.hadoop.odps.access.key=*******************
    spark.hadoop.odps.end.point=http://service.cn-beijing.maxcompute.aliyun.com/api
    spark.hadoop.odps.tunnel.quota.name=ot_xxxx_p#ot_xxxx
    ## Register the MaxCompute catalog and SQL extensions
    spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog
    spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions
  6. Use the Spark connector to access MaxCompute.

    1. Run the following commands in the bin folder of Spark to start the Spark SQL client:

      cd $SPARK_HOME/bin
      spark-sql
    2. List the tables in your MaxCompute project:

      SHOW TABLES IN odps.doc_test;

      doc_test is an example of a MaxCompute project name. Replace it with your actual project name.

    3. Create a table:

      CREATE TABLE odps.doc_test.mc_test_table (name STRING, num BIGINT);
    4. Read data from the table. The table returns no rows until data is inserted; see the insert example after this procedure:

      SELECT * FROM odps.doc_test.mc_test_table;
    5. Create a partitioned table:

       CREATE TABLE odps.doc_test.mc_test_table_pt (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING);
    6. Read data from the partitioned table:

      SELECT * FROM odps.doc_test.mc_test_table_pt;

      Example output, assuming that sample data has been inserted into the pt1=2018, pt2=0601 partition (see the insert example after this procedure):

      test1   1       2018    0601
      test2   2       2018    0601
      Time taken: 1.312 seconds, Fetched 2 row(s)
    7. Delete the table:

      DROP TABLE IF EXISTS odps.doc_test.mc_test_table;
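
The tables created in the preceding procedure are empty, so the SELECT statements return no rows until data is written to them. The following statements are a minimal sketch of how the sample data shown in the example output might be inserted, assuming that the connector supports INSERT statements through the odps catalog. Run them after the corresponding CREATE TABLE steps and before the DROP TABLE step:

    -- Insert sample rows into the non-partitioned table
    INSERT INTO odps.doc_test.mc_test_table VALUES ('test1', 1), ('test2', 2);
    -- Insert sample rows into a static partition of the partitioned table
    INSERT INTO odps.doc_test.mc_test_table_pt PARTITION (pt1 = '2018', pt2 = '0601') VALUES ('test1', 1), ('test2', 2);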