Spark on MaxCompute is a computing service provided by MaxCompute that is compatible with open-source Spark. It provides a Spark computing framework built on the unified computing resources and dataset permission system of MaxCompute, so you can submit and run Spark jobs using your preferred development method. Spark on MaxCompute supports a wide range of data processing and analysis needs.

Limits

  • Currently, Spark on MaxCompute supports the following scenarios:
    • Offline jobs written in Java, Scala, or Python (PySpark), including GraphX, MLlib, RDD, and Spark SQL jobs.
    • MaxCompute table I/O.
    • Access to unstructured data stored in Object Storage Service (OSS).
    • Read and write access to services in a VPC, for example, RDS, Redis, and services deployed on ECS.
  • Currently, Spark on MaxCompute does not support the following scenarios:
    • Streaming jobs.
    • Interactive services such as Spark-Shell, Spark-SQL-Shell, and PySpark-Shell.
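
The limits above can be encoded as a small pre-submission check. The following is a minimal sketch only: the function name `is_supported` and the job-kind labels are hypothetical and chosen for illustration; they are not identifiers used by MaxCompute or Spark.

```python
# Hypothetical pre-submission check based on the lists above.
# The labels are illustrative, not names defined by MaxCompute.
SUPPORTED = {"graphx", "mllib", "rdd", "spark-sql", "pyspark"}
UNSUPPORTED = {"streaming", "spark-shell", "spark-sql-shell", "pyspark-shell"}


def is_supported(job_kind: str) -> bool:
    """Return True for supported offline job kinds, False for unsupported ones."""
    kind = job_kind.strip().lower()
    if kind in UNSUPPORTED:
        return False
    if kind in SUPPORTED:
        return True
    # Anything not named in either list is rejected explicitly rather than guessed.
    raise ValueError(f"unknown job kind: {job_kind!r}")
```

A call such as `is_supported("PySpark")` returns `True`, while `is_supported("Spark-Shell")` returns `False`.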

Features

  • Supports different versions of native Spark jobs.

    MaxCompute supports native community Spark and is compatible with the APIs of all native Spark versions. Different versions of Spark can run in MaxCompute at the same time. Spark on MaxCompute also provides the native Spark web UI.

  • Runs in unified computing resources.

    Similar to MaxCompute SQL and MapReduce, Spark on MaxCompute runs with the unified computing resources activated for MaxCompute projects.

  • Supports unified data and permission management.

    Spark on MaxCompute follows the MaxCompute permission management system, so you can query the data that you are already authorized to access without requesting additional permissions.

  • Provides the same user experience as open-source systems.

    Spark on MaxCompute provides the same user experience as open-source Spark, in both the application UI and online interaction, which suits users who are already familiar with Spark. It supports the native, real-time open-source UI used to debug applications and provides historical log queries. For some open-source applications, it also supports interactive use with real-time interaction in the background.
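
As an illustration of the unified data access described above, the following PySpark sketch runs a Spark SQL query over a table. It is a sketch under stated assumptions, not the documented procedure: the table name `mc_demo_table` and the helpers `build_query` and `run_job` are hypothetical, and whether a plain `spark.sql` call resolves MaxCompute tables depends on job configuration that this page does not describe.

```python
def build_query(table: str, limit: int = 10) -> str:
    """Build a simple Spark SQL query for a (hypothetical) table name."""
    # Accept only letters, digits, underscores, and dots in the table name.
    if not table.replace("_", "").replace(".", "").isalnum():
        raise ValueError(f"invalid table name: {table!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {table} LIMIT {limit}"


def run_job(table: str) -> None:
    """Run the query as an offline Spark job and print the first rows."""
    # Imported here so build_query can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mc-table-read-sketch").getOrCreate()
    try:
        # Assumption: the job is configured so that Spark SQL can see the table.
        spark.sql(build_query(table)).show()
    finally:
        spark.stop()
```

Calling `run_job("mc_demo_table")` from a submitted job would print up to ten rows, provided the submitting account already has read permission on that table.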

Architecture

Spark on MaxCompute is an Alibaba Cloud solution that enables native Spark to run in MaxCompute.

The left half of the diagram shows the architecture of native Spark. The right half shows the architecture of Spark on MaxCompute, which runs on the Cupid platform developed by Alibaba Cloud. The Cupid platform is fully compatible with the computing frameworks supported by open-source YARN.