Container Service for Kubernetes (ACK) provides the Advanced Horizontal Pod Autoscaler (AHPA) component, which supports predictive scaling. Predictive scaling can prefetch resources for scaling events of applications that have periodic traffic patterns, which accelerates scale-out for these applications. AHPA supports multiple metrics, including CPU utilization, memory utilization, GPU utilization, and queries per second (QPS). This topic describes how to use AHPA to perform predictive scaling based on GPU metrics that are collected by Application Real-Time Monitoring Service (ARMS) Prometheus.
Prerequisites
- An ACK managed cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK managed cluster with GPU-accelerated nodes.
- AHPA is installed and data sources are configured to collect metrics. For more information, see AHPA overview.
- Prometheus Service is enabled, and at least seven days of application statistics are collected by Prometheus Service. The statistics include details about the GPU resources that are used by the application. For more information, see Enable ARMS Prometheus.
Background information
GPU-accelerated computing is widely used in high-performance computing scenarios, such as model training and model inference in deep learning. To reduce resource costs, you can enable cluster auto scaling based on GPU metrics, such as GPU utilization and GPU memory usage. You can use Prometheus Adapter to leverage the GPU metrics that are collected by Prometheus. This way, you can configure AHPA to perform predictive scaling based on GPU utilization.
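For reference, the following is a hypothetical spot check that queries a GPU utilization metric directly from the Prometheus HTTP API. The endpoint placeholders and the DCGM_FI_DEV_GPU_UTIL metric name (exposed by the NVIDIA DCGM exporter) are assumptions; the metric names available in your instance depend on the exporters that are installed.
# Hypothetical spot check: query a GPU utilization metric from the
# Prometheus HTTP API. Replace PROM_URL with the internal HTTP API
# endpoint of your Prometheus instance (obtained in Step 1 below).
PROM_URL="http://<region>-intranet.arms.aliyuncs.com:9090/api/v1/prometheus/<token>/<user-id>/<cluster-id>/<region>"
# DCGM_FI_DEV_GPU_UTIL is the per-GPU utilization metric exposed by the
# NVIDIA DCGM exporter; adjust the name to match your environment.
curl -s -G "${PROM_URL}/api/v1/query" --data-urlencode "query=DCGM_FI_DEV_GPU_UTIL"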
Step 1: Deploy Metrics Adapter
- Obtain the internal HTTP API endpoint of the Prometheus instance that is used by your cluster.
- Log on to the ARMS console. In the left-side navigation pane, choose Prometheus Monitoring.
- In the upper part of the Prometheus Monitoring page, select the region in which your cluster is deployed and click the Prometheus instance that is used by your cluster. You are redirected to the instance details page.
- In the left-side navigation pane of the instance details page, click Settings and record the internal endpoint in the HTTP API Address section.
- Deploy ack-alibaba-cloud-metrics-adapter.
- Log on to the ACK console. In the left-side navigation pane, choose Marketplace.
- On the Marketplace page, click the App Catalog tab. Find and click ack-alibaba-cloud-metrics-adapter.
- In the upper-right corner of the ack-alibaba-cloud-metrics-adapter page, click Deploy.
- On the Basic Information wizard page, set the Cluster and Namespace parameters. Then, click Next.
- On the Parameters wizard page, set the Chart Version parameter and specify the internal HTTP API endpoint that you obtained in Step 1 as the value of the prometheus.url parameter in the Parameters section. Then, click OK.
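As an optional sanity check that is not part of the original procedure, you can verify that the adapter registered its metrics API services with the Kubernetes API server:
# List the API services registered by metrics adapters.
kubectl get apiservice | grep metrics.k8s.io
# Inspect the external metrics that the adapter currently serves.
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .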
Step 2: Use AHPA to perform predictive scaling based on GPU metrics
In this example, an inference service is deployed. Then, requests are sent to the inference service to check whether AHPA can perform predictive scaling based on GPU metrics.
- Deploy an inference service.
- Run the following command to deploy an inference service:
cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bert-intent-detection
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bert-intent-detection
  template:
    metadata:
      labels:
        app: bert-intent-detection
    spec:
      containers:
      - name: bert-container
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/bert-intent-detection:1.0.1
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: "1"
            memory: 2G
            nvidia.com/gpu: "1"
          requests:
            cpu: 200m
            memory: 500M
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: bert-intent-detection-svc
  labels:
    app: bert-intent-detection
spec:
  selector:
    app: bert-intent-detection
  ports:
  - protocol: TCP
    name: http
    port: 80
    targetPort: 80
  type: LoadBalancer
EOF
- Run the following command to query the status of the pod:
kubectl get pods -o wide
Expected output:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
bert-intent-detection-7b486f6bf-f**** 1/1 Running 0 3m24s 10.15.1.17 cn-beijing.192.168.94.107 <none> <none>
- Run the following command to send a request to the inference service and check whether the service is deployed.
You can run the kubectl get svc bert-intent-detection-svc command to query the external IP address of the inference service, which is of the LoadBalancer type. Then, replace 47.95.XX.XX in the following command with the IP address that you obtained:
curl -v "http://47.95.XX.XX/predict?query=Music"
Expected output:
* Trying 47.95.XX.XX...
* TCP_NODELAY set
* Connected to 47.95.XX.XX (47.95.XX.XX) port 80 (#0)
> GET /predict?query=Music HTTP/1.1
> Host: 47.95.XX.XX
> User-Agent: curl/7.64.1
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 9
< Server: Werkzeug/1.0.1 Python/3.6.9
< Date: Wed, 16 Feb 2022 03:52:11 GMT
<
* Closing connection 0
PlayMusic # The query result.
If the HTTP status code 200 and the query result are returned, the inference service is deployed.
- Configure AHPA.
In this example, AHPA is configured to scale pods when the GPU utilization of the pod exceeds 20%.
- Configure data sources to collect metrics for AHPA.
- Create a file named application-intelligence.yaml and copy the following content to the file. Set the armsUrl parameter to the internal endpoint of the Prometheus instance that you obtained in Step 1.
apiVersion: v1
kind: ConfigMap
metadata:
  name: application-intelligence
  namespace: kube-system
data:
  armsUrl: "http://cn-shanghai-intranet.arms.aliyuncs.com:9090/api/v1/prometheus/da9d7dece901db4c9fc7f5b*******/1581204543170*****/c54417d182c6d430fb062ec364e****/cn-shanghai"
- Run the following command to deploy application-intelligence:
kubectl apply -f application-intelligence.yaml
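Optionally, you can confirm that the ConfigMap was created and that the armsUrl value is correct:
# Optional check: display the ConfigMap that AHPA reads its data source from.
kubectl get configmap application-intelligence -n kube-system -o yaml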
- Deploy AHPA.
- Create a file named fib-gpu.yaml and copy the following content to the file. In this example, the observer mode is used. For more information about the parameters related to AHPA, see Parameters.
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: AdvancedHorizontalPodAutoscaler
metadata:
  name: fib-gpu
  namespace: default
spec:
  metrics:
  - resource:
      name: gpu
      target:
        averageUtilization: 20
        type: Utilization
    type: Resource
  minReplicas: 0
  maxReplicas: 100
  prediction:
    quantile: 95
    scaleUpForward: 180
  scaleStrategy: observer
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-intent-detection
  instanceBounds:
  - startTime: "2021-12-16 00:00:00"
    endTime: "2022-12-16 00:00:00"
    bounds:
    - cron: "* 0-8 ? * MON-FRI"
      maxReplicas: 50
      minReplicas: 4
    - cron: "* 9-15 ? * MON-FRI"
      maxReplicas: 50
      minReplicas: 10
    - cron: "* 16-23 ? * MON-FRI"
      maxReplicas: 50
      minReplicas: 12
- Run the following command to deploy AHPA:
kubectl apply -f fib-gpu.yaml
- Run the following command to query the status of AHPA:
kubectl get ahpa
Expected output:
NAME STRATEGY REFERENCE METRIC TARGET(%) CURRENT(%) DESIREDPODS REPLICAS MINPODS MAXPODS AGE
fib-gpu observer bert-intent-detection gpu 20 0 0 1 10 50 6d19h
In the output, 0 is returned in the CURRENT(%) column and 20 is returned in the TARGET(%) column. This indicates that the current GPU utilization is 0% and pod scaling will be triggered if the GPU utilization exceeds 20%.
- Test whether the inference service can be automatically scaled.
- Deploy the fib-loader load generator to send requests to the inference service. Copy the following Deployment manifest to a file and apply it, as shown after the manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fib-loader
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: fib-loader
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: fib-loader
    spec:
      containers:
      - args:
        - -c
        - |
          /ko-app/fib-loader --service-url="http://bert-intent-detection-svc.${NAMESPACE}/predict?query=Music" --save-path=/tmp/fib-loader-chart.html
        command:
        - sh
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: registry.cn-huhehaote.aliyuncs.com/kubeway/knative-sample-fib-loader:20201126-110434
        imagePullPolicy: IfNotPresent
        name: loader
        ports:
        - containerPort: 8090
          name: chart
          protocol: TCP
        resources:
          limits:
            cpu: "8"
            memory: 16000Mi
          requests:
            cpu: "2"
            memory: 4000Mi
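Save the manifest to a file and apply it. The file name fib-loader.yaml below is only an assumption:
# Apply the load generator manifest (assumed file name).
kubectl apply -f fib-loader.yaml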
- Run the following command to query the status of AHPA:
kubectl get ahpa
Expected output:
NAME STRATEGY REFERENCE METRIC TARGET(%) CURRENT(%) DESIREDPODS REPLICAS MINPODS MAXPODS AGE
fib-gpu observer bert-intent-detection gpu 20 189 10 4 10 50 6d19h
The output shows that the current GPU utilization (CURRENT(%)) is higher than the scaling threshold (TARGET(%)). Therefore, pod scaling is triggered, and the expected number of pods is 10, which is the value returned in the DESIREDPODS column.
- Run the following command to query the prediction results:
kubectl get --raw '/apis/metrics.alibabacloud.com/v1beta1/namespaces/default/predictionsobserver/fib-gpu'|jq -r '.content' |base64 -d > observer.html
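You can then open the generated observer.html file in a browser to view the prediction charts.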
The following figures show the prediction results of GPU utilization based on historical data within the last seven days.
- Predict GPU Resource Observer: The blue line represents the actual GPU utilization, and the green line represents the GPU utilization predicted by AHPA. The predicted GPU utilization is higher than the actual GPU utilization.
- Predict POD Observer: The blue line represents the actual number of pods that are added or removed in scaling events, and the green line represents the number of pods that AHPA predicts will be added or removed. The predicted number of pods is less than the actual number of pods. You can set the scaling mode to auto and configure the other settings based on the predicted number of pods. This way, AHPA can save pod resources.
The results show that AHPA can use predictive scaling to handle fluctuating workloads as expected. After you confirm the prediction results, you can set the scaling mode to auto, which allows AHPA to automatically scale pods.
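For example, the following is a minimal sketch of how you might switch the existing AHPA object to auto mode by patching the scaleStrategy field shown in fib-gpu.yaml. It assumes the ahpa short name used earlier in this topic:
# Switch the fib-gpu AHPA object from observer mode to auto mode.
kubectl patch ahpa fib-gpu --type merge -p '{"spec":{"scaleStrategy":"auto"}}'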