MaxCompute Spark supports three running modes: Local, Cluster, and DataWorks.
Local mode
The Local mode is used to facilitate code debugging for applications. In Local mode,
you can use MaxCompute Spark in the same way as you use open source Spark. Additionally,
you can use Tunnel to read data from and write data to MaxCompute tables. In this
mode, you can run MaxCompute Spark from either an IDE or the command line. If you
use this mode, you must add the
spark.master=local[N]
configuration. N indicates the number of CPU cores allocated to the local run. To
use Tunnel to read data from and write data to tables in Local mode, you must also add
the Tunnel configuration item to spark-defaults.conf; a sketch of such a configuration is provided after the command-line example below. Enter the endpoint based on the region and network environment where the MaxCompute
project is located. For more information about how to obtain the endpoint, see Configure endpoints. The following code provides an example on how to use the command line to run MaxCompute
Spark in this mode:
bin/spark-submit --master local[4] \
--class com.aliyun.odps.spark.examples.SparkPi \
${path to aliyun-cupid-sdk}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
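The following is a minimal sketch of what the spark-defaults.conf entries for Local mode might look like. The property names (spark.hadoop.odps.project.name, spark.hadoop.odps.access.id, spark.hadoop.odps.access.key, and spark.hadoop.odps.end.point) and the placeholder values are assumptions based on common MaxCompute Spark setups; verify them against Configure endpoints and your project settings before use.
# Assumed property names; confirm them against the MaxCompute Spark documentation.
spark.hadoop.odps.project.name = <your_project_name>
spark.hadoop.odps.access.id = <your_accesskey_id>
spark.hadoop.odps.access.key = <your_accesskey_secret>
# Endpoint that matches the region and network environment of the MaxCompute project.
spark.hadoop.odps.end.point = <endpoint_of_your_region>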
Cluster mode
In Cluster mode, you must specify the Main method as the entry point of a custom application.
A Spark job ends when the Main method succeeds or fails. This mode is suitable for offline jobs.
You can use MaxCompute Spark in this mode together with DataWorks to schedule jobs.
The following code provides an example on how to use the command line to run MaxCompute
Spark in this mode:
bin/spark-submit --master yarn-cluster \
--class com.aliyun.odps.spark.examples.SparkPi \
${ProjectRoot}/spark/spark-2.x/spark-examples/target/spark-examples_2.11-version-shaded.jar
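To illustrate the entry-point requirement, the following is a minimal Scala sketch of a custom application whose Main method serves as the entry point. The package name, object name, and job logic are hypothetical placeholders for your own code; in Cluster mode, the master (yarn-cluster) is supplied by spark-submit and is not hard-coded in the application.
package com.example.jobs // hypothetical package, used only for illustration

import org.apache.spark.sql.SparkSession

object MyOfflineJob {
  def main(args: Array[String]): Unit = {
    // Build the session; the cluster master is taken from the spark-submit command.
    val spark = SparkSession.builder().appName("MyOfflineJob").getOrCreate()
    try {
      // Replace with your own job logic. The Spark job ends when main() returns or throws.
      val count = spark.sparkContext.parallelize(1 to 1000).count()
      println(s"count = $count")
    } finally {
      spark.stop()
    }
  }
}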
DataWorks mode
You can run offline MaxCompute Spark jobs (in Cluster mode) in DataWorks so that they can be integrated with and scheduled alongside other types of nodes.
Note DataWorks supports the Spark node in the following regions: China (Hangzhou), China
(Beijing), China (Shanghai), China (Shenzhen), China (Hong Kong), US (Silicon Valley),
Germany (Frankfurt), India (Mumbai), and Singapore.
To run a MaxCompute Spark job in DataWorks, perform the following steps:
- Upload the resources in the DataWorks business flow and click Submit.
- In the created business flow, select ODPS Spark from Data Analytics.
- Double-click the Spark node and define the Spark job. Select a Spark version, a development
language, and a resource file. The resource file is the file that you uploaded and published
in the business flow. You can specify configuration items, such as the number of executors
and the memory size, for the job to be submitted. You must also set
spark.hadoop.odps.cupid.webproxy.endpoint
to the endpoint of the region where the project is located, for example, http://service.cn.maxcompute.aliyun-inc.com/api. An example of these configuration items is provided after these steps.
- Run the Spark node, view the operational log of the task, and obtain the URLs of LogView
and JobView from the log for further analysis and diagnosis.
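For reference, the following is a hedged sketch of the configuration items that might be entered for the node. The executor settings are standard Spark properties, the values are placeholders to be adjusted for your job, and whether the node accepts exactly these keys should be confirmed in the DataWorks documentation; the endpoint must match the region of your project.
spark.executor.instances=2
spark.executor.cores=2
spark.executor.memory=4g
spark.hadoop.odps.cupid.webproxy.endpoint=http://service.cn.maxcompute.aliyun-inc.com/api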
After you have defined the Spark job, you can orchestrate and schedule services of different types in the business flow if required.