This topic describes how to use the task orchestration feature of Data Management
(DMS) to train a machine learning model.
Prerequisites
- An Alibaba Cloud account is created.
- DMS is activated.
- Data Lake Analytics (DLA) is activated. For more information, see Activate DLA.
- Object Storage Service (OSS) is activated. For more information, see Activate OSS.
Background information
As big data technologies have developed and computing capabilities have improved in recent years, machine learning and deep learning have been widely adopted. Typical applications include personalized recommendation systems, facial recognition payments, and autonomous driving.
MLlib is the machine learning library of Apache Spark. It provides a variety of algorithms for training machine learning models, such as classification, regression, clustering, collaborative filtering, and dimensionality reduction algorithms. This topic uses the k-means clustering algorithm as an example. You can use the task orchestration feature of DMS to create a Serverless Spark task that trains a machine learning model.
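For reference, given a number of clusters k, the k-means algorithm partitions the input vectors into clusters C_1, ..., C_k by minimizing the within-cluster sum of squares:

\min_{C_1, \dots, C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where \mu_i is the mean of the vectors in cluster C_i. The cluster centers printed by the sample code below are these means.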
Upload data and code to OSS
- Log on to the OSS console.
- Create a data file. In this example, create a file named data.txt and add the following
content to the file:
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
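The six points form two well-separated groups, so a run with two clusters can be expected to converge to centers near (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1), the arithmetic means of the two groups (for example, (0.0 + 0.1 + 0.2) / 3 = 0.1 in each dimension of the first group).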
- Write code and package the code into a fat JAR file. A sample build configuration is shown after the code.
Note: In this example, the following code reads data from the data.txt file and trains a machine learning model by using the k-means clustering algorithm.
package com.aliyun.spark

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SparkMLlib {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark MLlib Kmeans Demo")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    // args(0): OSS path of the input data file.
    val rawDataPath = args(0)
    val data = spark.sparkContext.textFile(rawDataPath)

    // Parse each line of space-separated numbers into a dense vector.
    // Cache the RDD because k-means iterates over it multiple times.
    val parsedData = data
      .map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
      .cache()

    // Train a k-means model with 2 clusters and at most 20 iterations.
    val numClusters = 2
    val numIterations = 20
    val model = KMeans.train(parsedData, numClusters, numIterations)

    // Print the center of each cluster.
    for (c <- model.clusterCenters) {
      println(s"cluster center: ${c.toString}")
    }

    // args(1): OSS path to which the trained model is saved.
    val modelOutputPath = args(1)
    model.save(spark.sparkContext, modelOutputPath)

    spark.stop()
  }
}
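How you produce the fat JAR depends on your build tool. The following build.sbt is a minimal sketch that uses the sbt-assembly plugin; the Scala and Spark versions are assumptions and must match the versions supported by your Spark cluster:

// build.sbt (sketch; versions are assumptions)
name := "spark-mllib"
version := "1.0.0-SNAPSHOT"
scalaVersion := "2.11.12"

// Spark dependencies are "provided" because the cluster supplies them at run time.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-sql"   % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.3" % "provided"
)

With the sbt-assembly plugin enabled in project/plugins.sbt, running sbt assembly produces the fat JAR; the output file name can be set with the assemblyJarName key.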
- Upload the data.txt file and the fat JAR file to OSS. For more information, see Upload objects.
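If you prefer a command-line upload, the ossutil tool can copy the files to OSS. A sketch, assuming ossutil is installed and configured and that oss-bucket-name is a placeholder for your bucket:

ossutil cp data.txt oss://oss-bucket-name/kmeans_demo/data.txt
ossutil cp spark-mllib-1.0.0-SNAPSHOT.jar oss://oss-bucket-name/kmeans_demo/spark-mllib-1.0.0-SNAPSHOT.jar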
Create a Serverless Spark task in DMS
- Log on to the DMS console.
- In the top navigation bar, choose Data Factory > Task Orchestration. The Home tab of the Task Orchestration page appears.
- In the Free orchestration tasks section, click New task flow.
- In the New Task Flow dialog box, enter relevant information in the Task Flow Name and Description fields and click OK. In this example, set the task flow name to Just_Spark and enter Just_Spark demo in the Description field.
- In the navigation tree of the Task Orchestration page, find the Serverless Spark task node and drag the task node to the canvas.
- Click the Serverless Spark task node on the canvas. The Content tab appears on the right. Complete the following configurations on this tab:
- Select the region where the Spark cluster you created resides from the Region drop-down list.
- Select the Spark cluster from the Spark cluster drop-down list.
- Specify the job configuration in the Job configuration field. The configuration describes the Spark job that trains the machine learning model on the Spark cluster. In this example, use the following configuration:
{
  "name": "spark-mllib-test",
  "file": "oss://oss-bucket-name/kmeans_demo/spark-mllib-1.0.0-SNAPSHOT.jar",
  "className": "com.aliyun.spark.SparkMLlib",
  "args": [
    "oss://oss-bucket-name/kmeans_demo/data.txt",
    "oss://oss-bucket-name/kmeans_demo/model/"
  ],
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.executor.resourceSpec": "medium",
    "spark.dla.connectors": "oss"
  }
}
Note:
- file: the absolute path of the fat JAR file in OSS.
- args: the absolute paths in OSS of the data.txt file and of the directory to which the trained model is saved.
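The conf block sizes the job: spark.executor.instances sets the number of executors, and spark.driver.resourceSpec and spark.executor.resourceSpec select a preset CPU and memory size for the driver and each executor (values such as small, medium, and large; see the DLA Serverless Spark documentation for the exact mapping). The spark.dla.connectors setting of oss enables the built-in OSS connector so that the job can read and write oss:// paths.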
- Click Save in the lower part of the Content tab.
- Click Try Run in the upper-left corner to test the Serverless Spark task.
Result
To view the execution result of the Serverless Spark task, click the Operation Centre icon in the left-side navigation pane.
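After the task succeeds, the trained model is stored under the OSS directory passed as the second argument. The following is a minimal sketch of how a later Spark job could load and reuse the model; KMeansModel.load and predict are standard MLlib APIs, and sc stands for an active SparkContext:

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

// Load the model saved by the training job.
val model = KMeansModel.load(sc, "oss://oss-bucket-name/kmeans_demo/model/")

// Assign a new point to one of the two learned clusters.
val clusterIndex = model.predict(Vectors.dense(0.15, 0.1, 0.05))
println(s"point belongs to cluster $clusterIndex")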