You can run Spark jobs in Container Service for Kubernetes (ACK) clusters and use Alluxio to accelerate data processing in a distributed manner. ACK provides Spark Operator to simplify the procedure of submitting Spark jobs, and Spark History Server to record the historical data of Spark jobs, which facilitates troubleshooting. This topic describes how to set up an environment in the ACK console to run Spark jobs.

Background information

To run Spark jobs in the ACK console, you must perform the following operations:

Create an ACK cluster

For more information about how to create an ACK cluster, see Create an ACK managed cluster.
Notice
Take note of the following information when you set the cluster parameters:
  • When you set the instance type of worker nodes, select ecs.d1ne.6xlarge in the Big Data Network Performance Enhanced instance family and set the number of worker nodes to 20.
  • Each worker node of the ecs.d1ne.6xlarge instance type has 12 HDDs. Each HDD has a storage capacity of 5 TB. Before you mount the HDDs, you must partition and format them. For more information, see Partition and format a data disk larger than 2 TiB in size.
  • After you partition and format the HDDs, mount them to the ACK cluster. You can run the df -h command to query the mount information of the HDDs. Figure 1 shows an example of the command output.
  • The 12 file paths under the /mnt directory are used in the configuration file of Alluxio. ACK provides a simplified method to mount data disks when the cluster has a large number of nodes. For more information, see Use LVM to manage local storage.
Figure 1. Mount information
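If you mount the disks manually, the following is a minimal sketch for a single disk. It assumes an ext4 file system and that the first HDD appears as the /dev/vdb device; the device, partition, and mount point names are placeholders, so adapt them and repeat the procedure for each of the 12 disks:
parted /dev/vdb mklabel gpt               # GPT partition table, required for disks larger than 2 TiB
parted /dev/vdb mkpart primary 1 100%     # create one partition that spans the disk
mkfs.ext4 /dev/vdb1                       # format the new partition
mkdir -p /mnt/disk1
mount /dev/vdb1 /mnt/disk1                # mount under /mnt, as expected by the Alluxio configuration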

Create an OSS bucket

You must create an Object Storage Service (OSS) bucket to store data, including the test data generated by TPC-DS, test results, and test logs. In this example, the name of the OSS bucket is cloudnativeai. For more information about how to create an OSS bucket, see Create buckets.
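If you prefer the command line, you can also create the bucket with ossutil. The following sketch assumes that ossutil is installed and configured with your credentials:
ossutil mb oss://cloudnativeai    # create the bucket; the bucket name must be globally unique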

Install ack-spark-operator

You can install ack-spark-operator and use the component to simplify the procedure of submitting Spark jobs.

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, choose Marketplace > App Catalog.
  3. On the Marketplace page, click the App Catalog tab. Find and click ack-spark-operator.
  4. On the ack-spark-operator page, click Deploy.
  5. In the Deploy wizard, select a cluster and namespace, and then click Next.
  6. On the Parameters wizard page, set the parameters and click OK.
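After the deployment is complete, you can check whether the component is running. The spark-operator namespace and the CRD name in the following sketch are assumptions based on a default installation of Spark Operator:
kubectl get pod -n spark-operator
kubectl get crd sparkapplications.sparkoperator.k8s.io   # the CustomResourceDefinition used to submit Spark jobs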

Install ack-spark-history-server

ack-spark-history-server records the logs and events of Spark jobs and provides a user interface to help you troubleshoot issues.

When you install ack-spark-history-server, you must specify parameters related to the OSS bucket on the Parameters wizard page. The OSS bucket is used to store historical data of Spark jobs. For more information about how to install ack-spark-history-server, see Install ack-spark-operator.

The following code block shows the parameters related to the OSS bucket:
oss:
  enableOSS: true
  # Enter your AccessKey ID
  alibabaCloudAccessKeyId: ""
  # Enter your AccessKey secret
  alibabaCloudAccessKeySecret: ""
  # The endpoint of the OSS bucket, such as oss-cn-beijing.aliyuncs.com
  alibabaCloudOSSEndpoint: ""
  # The OSS path used to store historical data, such as oss://bucket-name/path
  eventsDir: "oss://cloudnativeai/spark/spark-events"
Run the following command to check whether ack-spark-history-server is installed:
kubectl get service ack-spark-history-server -n {YOUR-NAMESPACE}
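To open the web UI of ack-spark-history-server, you can forward the service port to your machine. The following sketch assumes that the service listens on port 18080, the default port of Spark History Server:
kubectl port-forward service/ack-spark-history-server 18080:18080 -n {YOUR-NAMESPACE}
After you run the command, you can view the historical data of Spark jobs at http://localhost:18080.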

Install Alluxio

You must use Helm to install Alluxio in the ACK cluster.

  1. Run the following command to download the installation file of Alluxio:
    wget http://kubeflow.oss-cn-beijing.aliyuncs.com/alluxio-0.6.8.tgz
    tar -xvf alluxio-0.6.8.tgz
  2. Create and configure a file named config.yaml in the directory where the installation file of Alluxio is saved.

    For more information about how to configure the file, see config.yaml.

    The following code block shows the key parameters.
    • Modify the following parameters based on the information of the OSS bucket: the AccessKey ID, the AccessKey secret, the endpoint of the OSS bucket, and the under file system (UFS) path.
      # Site properties for all the components
      properties:
        fs.oss.accessKeyId: YOUR-ACCESS-KEY-ID
        fs.oss.accessKeySecret: YOUR-ACCESS-KEY-SECRET
        fs.oss.endpoint: oss-cn-beijing-internal.aliyuncs.com
        alluxio.master.mount.table.root.ufs: oss://cloudnativeai/
        alluxio.master.persistence.blacklist: .staging,_temporary
        alluxio.security.stale.channel.purge.interval: 365d
        alluxio.user.metrics.collection.enabled: 'true'
        alluxio.user.short.circuit.enabled: 'true'
        alluxio.user.file.write.tier.default: 1
        alluxio.user.block.size.bytes.default: 64MB
        alluxio.user.file.writetype.default: CACHE_THROUGH
        alluxio.user.file.metadata.load.type: ONCE
        alluxio.user.file.readtype.default: CACHE
        #alluxio.worker.allocator.class: alluxio.worker.block.allocator.MaxFreeAllocator
        alluxio.worker.allocator.class: alluxio.worker.block.allocator.RoundRobinAllocator
        alluxio.worker.file.buffer.size: 128MB
        alluxio.worker.evictor.class: alluxio.worker.block.evictor.LRUEvictor
        alluxio.job.master.client.threads: 5000
        alluxio.job.worker.threadpool.size: 300
    • In the tieredstore section, mediumtype specifies the IDs of the data disks on a worker node, and path specifies the paths where the data disks are mounted.
      tieredstore:
        levels:
          - level: 0
            alias: HDD
            mediumtype: HDD-0,HDD-1,HDD-2,HDD-3,HDD-4,HDD-5,HDD-6,HDD-7,HDD-8,HDD-9,HDD-10,HDD-11
            path: /mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/disk5,/mnt/disk6,/mnt/disk7,/mnt/disk8,/mnt/disk9,/mnt/disk10,/mnt/disk11,/mnt/disk12
            type: hostPath
            quota: 1024G,1024G,1024G,1024G,1024G,1024G,1024G,1024G,1024G,1024G,1024G,1024G
            high: 0.95
            low: 0.7
  3. Add the alluxio=true label to the worker nodes of the ACK cluster, as shown in the example after this procedure.
    For more information about how to add node labels, see Manage node labels.
  4. Run the following Helm command to install Alluxio:
    kubectl create namespace alluxio
    helm install alluxio ./alluxio -f config.yaml -n alluxio
  5. Run the following command to check whether Alluxio is installed:
    kubectl get pod -n alluxio
  6. Run the following commands in the Alluxio master pod to check whether the disks are mounted to the worker nodes of the cluster:
    kubectl exec -it alluxio-master-0 -n alluxio -- /bin/bash
    ./bin/alluxio fsadmin report capacity
    If the output shows that disks are mounted to each worker node, Alluxio is installed.
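For Step 3, the following example shows how to add the label by using kubectl. The node name is a placeholder; replace it with the name of each worker node:
kubectl label nodes <YOUR-NODE-NAME> alluxio=true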

What to do next

Write test code