This topic describes how to use the task orchestration feature of Data Management (DMS) to train a machine learning model.

Prerequisites

  • An Alibaba Cloud account is created.
  • DMS is activated.
  • Data Lake Analytics (DLA) is activated. For more information, see Activate DLA.
  • Object Storage Service (OSS) is activated. For more information, see Activate OSS.

Background information

As big data technologies and computing capabilities have advanced in recent years, machine learning and deep learning have become widely applied, for example in personalized recommendation systems, facial recognition payments, and autonomous driving. MLlib is the machine learning library of Apache Spark. It provides algorithms for classification, regression, clustering, collaborative filtering, and dimensionality reduction. This topic uses the k-means clustering algorithm and shows how to use the task orchestration feature of DMS to create a Serverless Spark task that trains the model.
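To make the algorithm concrete, the following is a minimal, Spark-free sketch of the two steps that k-means repeats: assign each point to its nearest center, then move each center to the mean of its assigned points. The names KMeansSketch, squaredDistance, closest, mean, and train are hypothetical and are not part of MLlib. The MLlib implementation differs in important ways. For example, it runs distributed and by default chooses initial centers with k-means||, whereas this sketch takes the initial centers as an argument.

```scala
// Minimal, Spark-free illustration of k-means (Lloyd's algorithm).
// All names here are hypothetical; this is not the MLlib implementation.
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center that is closest to the given point.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(_.sum / points.size).toVector

  // Repeats the assignment step and the update step a fixed number of times.
  def train(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial) { (centers, _) =>
      val groups = points.groupBy(p => closest(centers, p))
      centers.indices.map { i =>
        // A center that attracts no points keeps its previous position.
        groups.get(i).map(mean).getOrElse(centers(i))
      }
    }
}
```

For the six three-dimensional sample points used later in this topic, calling KMeansSketch.train with two initial centers taken from the two visibly separate groups should settle on one center near each group.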

Create a Spark cluster in the DLA console

  1. Log on to the DLA console.
  2. Create a Spark cluster. For more information, see View and modify the configuration of a virtual cluster.
  3. Authorize DLA to delete objects in OSS. For more information, see Insert data.

Upload data and code to OSS

  1. Log on to the OSS console.
  2. Create a data file. In this example, create a file named data.txt and add the following content to the file:
    0.0 0.0 0.0
    0.1 0.1 0.1
    0.2 0.2 0.2
    9.0 9.0 9.0
    9.1 9.1 9.1
    9.2 9.2 9.2
  3. Write code and package the code into a fat JAR file.
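    Note: In this example, the following code reads data from the data.txt file and trains a machine learning model by using the k-means clustering algorithm. The first program argument is the path of the input data file, and the second is the path where the trained model is saved.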
    package com.aliyun.spark
    
    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.SparkSession
    
    object SparkMLlib {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("Spark MLlib Kmeans Demo")
        val spark = SparkSession
          .builder()
          .config(conf)
          .getOrCreate()
        // args(0): path of the input data file.
        val rawDataPath = args(0)
    
        // Parse each line of space-separated numbers into a dense vector.
        val data = spark.sparkContext.textFile(rawDataPath)
        val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
        // Train a k-means model with 2 clusters and at most 20 iterations.
        val numClusters = 2
        val numIterations = 20
        val model = KMeans.train(parsedData, numClusters, numIterations)
        // Print the learned cluster centers to the driver's standard output.
        for (c <- model.clusterCenters) {
          println(s"cluster center: ${c.toString}")
        }
        // args(1): path where the trained model is saved.
        val modelOutputPath = args(1)
        model.save(spark.sparkContext, modelOutputPath)
        // Stop the session after the model has been saved.
        spark.stop()
      }
    }
  4. Upload the data.txt file and the fat JAR file to OSS. For more information, see Upload objects.
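The job parses each line with split(' ') and toDouble, so a blank line or a token that is not a number makes the job fail at run time. As a hedged sketch, the data file could be checked locally before it is uploaded. The DataCheck object and its parseLine and badLines methods are hypothetical helpers written for this illustration, and the rule that every line must have the same number of values as the first valid line is an assumption based on the sample data.

```scala
import scala.io.Source
import scala.util.Try

// Hypothetical local pre-upload check; not part of DMS, DLA, or Spark.
object DataCheck {
  // Parses a line the same way the job does: split on a single space,
  // then convert every token to Double. Returns None if any token fails.
  def parseLine(line: String): Option[Vector[Double]] =
    Try(line.split(' ').map(_.toDouble).toVector).toOption

  // Returns the 1-based numbers of lines that do not parse, or whose
  // number of values differs from the first line that does parse.
  def badLines(lines: Seq[String]): Seq[Int] = {
    val parsed = lines.map(parseLine)
    val dim = parsed.flatten.headOption.map(_.size)
    parsed.zipWithIndex.collect {
      case (None, i) => i + 1
      case (Some(v), i) if !dim.contains(v.size) => i + 1
    }
  }

  // Usage: pass the local path of data.txt as the only argument.
  def main(args: Array[String]): Unit = {
    val source = Source.fromFile(args(0))
    try println(s"bad lines: ${badLines(source.getLines().toList)}")
    finally source.close()
  }
}
```

An empty result means every line can be parsed by the job as written.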

Create a Serverless Spark task in DMS

  1. Log on to the DMS console.
  2. In the top navigation bar, choose Data Factory > Task Orchestration. The Home tab of the Task Orchestration page appears.
  3. In the Free orchestration tasks section, click New task flow.
  4. In the New Task Flow dialog box, enter relevant information in the Task Flow Name and Description fields and click OK. In this example, set the task flow name to Just_Spark and enter Just_Spark demo. in the Description field.
  5. In the navigation tree of the Task Orchestration page, find the Serverless Spark task node and drag the task node to the canvas.
  6. Click the Serverless Spark task node on the canvas. The Content tab appears on the right. Complete the following configurations on this tab:
    1. From the Region drop-down list, select the region where the Spark cluster that you created resides.
    2. Select the Spark cluster from the Spark cluster drop-down list.
    3. Enter the job configuration in the Job configuration field. The configuration specifies the JAR file, main class, arguments, and resources used to train the machine learning model on the Spark cluster. In this example, enter the following configuration:
      {
          "name": "spark-mllib-test",
          "file": "oss://oss-bucket-name/kmeans_demo/spark-mllib-1.0.0-SNAPSHOT.jar",
          "className": "com.aliyun.spark.SparkMLlib",
          "args": [
              "oss://oss-bucket-name/kmeans_demo/data.txt",
              "oss://oss-bucket-name/kmeans_demo/model/"
          ],
          "conf": {
              "spark.driver.resourceSpec": "medium",
              "spark.executor.instances": 2,
              "spark.executor.resourceSpec": "medium",
              "spark.dla.connectors": "oss"
          }
      }
      Note
      • file: specifies the absolute path of the fat JAR file in OSS.
      • args: specifies two absolute OSS paths. The first is the path of the data.txt file, and the second is the path where the trained machine learning model is saved.
    4. Click Save in the lower part of the Content tab.
  7. Click Try Run in the upper-left corner to test the Serverless Spark task.
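The job configuration above repeats the same OSS directory in three places: the JAR path, the input path, and the model output path. As an optional sketch, the configuration text could be generated from a single prefix so that the paths stay consistent. JobConfigSketch and jobConfig are hypothetical names. The helper does no JSON escaping, so it assumes that the prefix and the JAR name are plain path strings without quotation marks or backslashes.

```scala
// Hypothetical helper that renders the job configuration shown above.
// It only fills in the three OSS paths; all other values are fixed.
object JobConfigSketch {
  def jobConfig(ossPrefix: String, jarName: String): String = {
    // Drop a trailing slash so the joined paths never contain "//".
    val base = ossPrefix.stripSuffix("/")
    s"""{
       |    "name": "spark-mllib-test",
       |    "file": "$base/$jarName",
       |    "className": "com.aliyun.spark.SparkMLlib",
       |    "args": [
       |        "$base/data.txt",
       |        "$base/model/"
       |    ],
       |    "conf": {
       |        "spark.driver.resourceSpec": "medium",
       |        "spark.executor.instances": 2,
       |        "spark.executor.resourceSpec": "medium",
       |        "spark.dla.connectors": "oss"
       |    }
       |}""".stripMargin
  }
}
```

For example, JobConfigSketch.jobConfig("oss://oss-bucket-name/kmeans_demo/", "spark-mllib-1.0.0-SNAPSHOT.jar") renders the same paths as the configuration in this example, and the resulting text can be pasted into the Job configuration field.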

Result

To view the execution result of the Serverless Spark task, click the Operation Centre icon on the left-side navigation submenu.