This topic provides AGS help examples.
Prerequisites
- You have created a Kubernetes cluster. For more information, see Create a managed Kubernetes cluster.
- You have connected to the Kubernetes cluster. For more information, see Connect to a Kubernetes cluster by using kubectl.
Log
- Run the ags config sls command to configure and install Log Service (SLS) for AGS. Native Argo can only pull Pod logs from the local node, so when a Pod or the node it runs on is deleted, its logs are lost as well, which makes it hard to inspect errors and analyze their causes. If logs are uploaded to Alibaba Cloud Log Service, they are persisted and can be pulled back from Log Service even after the node is gone.
- Run the ags logs command to view the logs of a workflow. In this example, run ags logs POD/WORKFLOW to view the logs of a Pod or Workflow.
# ags logs
view logs of a workflow

Usage:
  ags logs POD/WORKFLOW [flags]

Flags:
  -c, --container string    Print the logs of this container (default "main")
  -f, --follow              Specify if the logs should be streamed.
  -h, --help                help for logs
  -l, --recent-line int     how many lines to show in one call (default 100)
      --since string        Only return logs newer than a relative duration like 5s, 2m, or 3h. Defaults to all logs. Only one of since-time / since may be used.
      --since-time string   Only return logs after a specific date (RFC3339). Defaults to all logs. Only one of since-time / since may be used.
      --tail int            Lines of recent log file to display. Defaults to -1 with no selector, showing all log lines otherwise 10, if a selector is provided. (default -1)
      --timestamps          Include timestamps on each line in the log output
  -w, --workflow            Specify that whole workflow logs should be printed
Note
- If the Pod still exists on the node, AGS queries the Pod logs locally. All flags are compatible with the original Argo command.
- If the Pod has been deleted, AGS queries the logs from Alibaba Cloud Log Service. By default, the most recent 100 log lines are returned; use the -l flag to specify how many lines to return.
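The -l/--recent-line behavior described above (return at most the N most recent lines, default 100) can be sketched as follows; recent_lines is a hypothetical helper for illustration, not part of AGS:

```python
def recent_lines(log_lines, limit=100):
    """Return at most the most recent `limit` lines, oldest first."""
    if limit <= 0:
        return []
    return log_lines[-limit:]

logs = [f"line {i}" for i in range(250)]
tail = recent_lines(logs)
print(len(tail), tail[0], tail[-1])  # 100 line 150 line 249
```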
List
You can use the --limit parameter to choose how many Workflow entries to list.
# ags remote list --limit 8
+-----------------------+-------------------------------+------------+
| JOB NAME | CREATE TIME | JOB STATUS |
+-----------------------+-------------------------------+------------+
| merge-6qk46 | 2020-09-02 16:52:34 +0000 UTC | Pending |
| rna-mapping-gpu-ck4cl | 2020-09-02 14:47:57 +0000 UTC | Succeeded |
| wgs-gpu-n5f5s | 2020-09-02 13:14:14 +0000 UTC | Running |
| merge-5zjhv | 2020-09-02 12:03:11 +0000 UTC | Succeeded |
| merge-jjcw4 | 2020-09-02 10:44:51 +0000 UTC | Succeeded |
| wgs-gpu-nvxr2 | 2020-09-01 22:18:44 +0000 UTC | Succeeded |
| merge-4vg42 | 2020-09-01 20:52:13 +0000 UTC | Succeeded |
| rna-mapping-gpu-2ss6n | 2020-09-01 20:34:45 +0000 UTC | Succeeded  |
+-----------------------+-------------------------------+------------+
Integrate kubectl commands
# ags get test-v2
Name: test-v2
Namespace: default
ServiceAccount: default
Status: Running
Created: Thu Nov 22 11:06:52 +0800 (2 minutes ago)
Started: Thu Nov 22 11:06:52 +0800 (2 minutes ago)
Duration: 2 minutes 46 seconds
STEP PODNAME DURATION MESSAGE
● test-v2
└---● bcl2fq test-v2-2716811808 2m
# ags kubectl describe pod test-v2-2716811808
Name: test-v2-2716811808
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: cn-shenzhen.i-wz9gwobtqrbjgfnqxl1k/192.168.0.94
Start Time: Thu, 22 Nov 2018 11:06:52 +0800
Labels: workflows.argoproj.io/completed=false
workflows.argoproj.io/workflow=test-v2
Annotations: workflows.argoproj.io/node-name=test-v2[0].bcl2fq
workflows.argoproj.io/template={"name":"bcl2fq","inputs":{},"outputs":{},"metadata":{},"container":{"name":"main","image":"registry.cn-hangzhou.aliyuncs.com/dahu/curl-jp:1.2","command":["sh","-c"],"ar...
Status: Running
IP: 172.16.*.***
Controlled By: Workflow/test-v2
By using the ags kubectl command, you can view the status information from describe pod. AGS supports all native kubectl commands.
Integrate ossutil commands
After AGS is initialized, you can run the following commands to upload and view files.
# ags oss cp test.fq.gz oss://my-test-shenzhen/fasq/
Succeed: Total num: 1, size: 690. OK num: 1(upload 1 files).
average speed 3000(byte/s)
0.210685(s) elapsed
# ags oss ls oss://my-test-shenzhen/fasq/
LastModifiedTime Size(B) StorageClass ETAG ObjectName
2020-09-02 17:20:34 +0800 CST 690 Standard 9FDB86F70C6211B2EAF95A9B06B14F7E oss://my-test-shenzhen/fasq/test.fq.gz
Object Number is: 1
0.117591(s) elapsed
By using the ags oss command, you can upload and download files. AGS supports all native ossutil commands.
View Workflow resource usage
- Create the arguments-workflow-resource.yaml file, copy the following content into it, and run the ags submit arguments-workflow-resource.yaml command to specify resource requests.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test-resource
spec:
  arguments: {}
  entrypoint: test-resource-
  templates:
  - inputs: {}
    metadata: {}
    name: test-resource-
    outputs: {}
    parallelism: 1
    steps:
    - - arguments: {}
        name: bcl2fq
        template: bcl2fq
  - container:
      args:
      - id > /tmp/yyy;echo `date` > /tmp/aaa;ps -e -o comm,euid,fuid,ruid,suid,egid,fgid,gid,rgid,sgid,supgid > /tmp/ppp;ls -l /tmp/aaa;sleep 100;pwd
      command:
      - sh
      - -c
      image: registry.cn-hangzhou.aliyuncs.com/dahu/curl-jp:1.2
      name: main
      resources: # don't use too much resources
        requests:
          memory: 320Mi
          cpu: 1000m
    inputs: {}
    metadata: {}
    name: bcl2fq
    outputs: {}
- Run the ags get test456 --show command to view the Workflow resource usage. In this example, the output shows the core-hours used by each Pod and by test456.
# ags get test456 --show
Name:                test456
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Thu Nov 22 14:41:49 +0800 (2 minutes ago)
Started:             Thu Nov 22 14:41:49 +0800 (2 minutes ago)
Finished:            Thu Nov 22 14:43:30 +0800 (27 seconds ago)
Duration:            1 minute 41 seconds
Total CPU:           0.02806 (core*hour)
Total Memory:        0.00877 (GB*hour)

STEP           PODNAME             DURATION  MESSAGE  CPU(core*hour)  MEMORY(GB*hour)
 ✔ test456                                            0               0
 └---✔ bcl2fq  test456-4221301428  1m                 0.02806         0.00877
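The totals above follow directly from the requested resources and the run duration: the step requested cpu: 1000m (1 core) and memory: 320Mi for a run of 1 minute 41 seconds (101 s). The arithmetic below is a sketch inferred from the numbers shown, not taken from the AGS implementation:

```python
duration_s = 101                  # 1 minute 41 seconds
cpu_cores = 1.0                   # requests: cpu: 1000m
mem_gb = 320 / 1024               # requests: memory: 320Mi, expressed in GB

cpu_core_hours = cpu_cores * duration_s / 3600
mem_gb_hours = mem_gb * duration_s / 3600

print(round(cpu_core_hours, 5))   # 0.02806, matches Total CPU
print(round(mem_gb_hours, 5))     # 0.00877, matches Total Memory
```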
securityContext security support
- Run the ags submit arguments-security-context.yaml command to bind the corresponding PSP (PodSecurityPolicy) for permission control.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test
spec:
  arguments: {}
  entrypoint: test-security-
  templates:
  - inputs: {}
    metadata: {}
    name: test-security-
    outputs: {}
    parallelism: 1
    steps:
    - - arguments: {}
        name: bcl2fq
        template: bcl2fq
  - container:
      args:
      - id > /tmp/yyy;echo `date` > /tmp/aaa;ps -e -o comm,euid,fuid,ruid,suid,egid,fgid,gid,rgid,sgid,supgid > /tmp/ppp;ls -l /tmp/aaa;sleep 100;pwd
      command:
      - sh
      - -c
      image: registry.cn-hangzhou.aliyuncs.com/dahu/curl-jp:1.2
      name: main
      resources: # don't use too much resources
        requests:
          memory: 320Mi
          cpu: 1000m
    inputs: {}
    metadata: {}
    name: bcl2fq
    outputs: {}
    securityContext:
      runAsUser: 800
Define automatic retries in YAML
A bash command may fail for unclear reasons, and simply retrying it often resolves the problem. AGS provides a YAML-based automatic retry mechanism: when a command in a Pod fails, the Pod is automatically restarted, and you can set the number of retries.
- Run the ags submit arguments-auto-retry.yaml command to configure the automatic retry mechanism for the Workflow.
# This example demonstrates the use of retries for a single container.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-container-
spec:
  entrypoint: retry-container
  templates:
  - name: retry-container
    retryStrategy:
      limit: 10
    container:
      image: python:alpine3.6
      command: ["python", "-c"]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
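For intuition, with retryStrategy limit: 10 the container gets the initial run plus up to 10 retries (this attempt counting is an assumption about the semantics, not stated by the example). Since each attempt exits non-zero with probability 2/3, the step as a whole still succeeds almost always:

```python
p_fail = 2 / 3        # random.choice([0, 1, 1]) exits non-zero two times out of three
attempts = 1 + 10     # the initial run plus limit: 10 retries (assumed semantics)

p_step_succeeds = 1 - p_fail ** attempts
print(round(p_step_succeeds, 3))  # 0.988
```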
Retry a whole Workflow from the most recent failure point
During a Workflow run, a step sometimes fails, and you may want to retry the Workflow from the failed node, similar to resuming an interrupted download from a breakpoint.
- Run the ags get test456 --show command to find the step at which workflow test456 failed.
# ags get test456 --show
Name:                test456
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Thu Nov 22 14:41:49 +0800 (2 minutes ago)
Started:             Thu Nov 22 14:41:49 +0800 (2 minutes ago)
Finished:            Thu Nov 22 14:43:30 +0800 (27 seconds ago)
Duration:            1 minute 41 seconds
Total CPU:           0.0572 (core*hour)
Total Memory:        0.01754 (GB*hour)

STEP           PODNAME             DURATION  MESSAGE  CPU(core*hour)  MEMORY(GB*hour)
 ✔ test456                                            0               0
 └---✔ bcl2fq  test456-4221301428  1m                 0.02806         0.00877
 └---X bcl2fq  test456-4221301238  1m                 0.02806         0.00877
- Run the ags retry test456 command to retry workflow test456 from its most recent failure point.
Run a workflow on ECI
For ECI operations, see Elastic Container Instance (ECI).
Before you configure and use ECI, install AGS first. For more information, see AGS download and installation.
- Run the kubectl get cm -n argo command to obtain the name of the YAML file that corresponds to the Workflow.
# kubectl get cm -n argo
NAME                            DATA   AGE
workflow-controller-configmap   1      4d
- Run the kubectl get cm -n argo workflow-controller-configmap -o yaml command to open the workflow-controller-configmap.yaml file, and overwrite its current content with the following.
apiVersion: v1
data:
  config: |
    containerRuntimeExecutor: k8sapi
kind: ConfigMap
- Run the kubectl delete pod <podName> command to restart the argo controller.
Note: podName is the name of the Pod in which the workflow controller runs.
- Create the arguments-workflow-eci.yaml file, copy the following content into it, and run the ags submit arguments-workflow-eci.yaml command to add the nodeSelector and tolerations settings to the containers that run on ECI.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay
      command: [env]
      #args: ["hello world"]
      resources:
        limits:
          memory: 32Mi
          cpu: 100m
    nodeSelector:        # add nodeSelector
      type: virtual-kubelet
    tolerations:         # add tolerations
    - key: virtual-kubelet.io/provider
      operator: Exists
    - key: alibabacloud.com
      effect: NoSchedule
View actual Workflow resource usage and peaks
The ags workflow controller automatically obtains the actual per-minute resource usage of each Pod through metrics-server, and aggregates both the totals and the peak usage of each Pod.
- Run the ags get steps-jr6tw --metrics command to view the actual resource usage of the Workflow and its peaks.
➜ ags get steps-jr6tw --metrics
Name: steps-jr6tw
Namespace: default
ServiceAccount: default
Status: Succeeded
Created: Tue Apr 16 16:52:36 +0800 (21 hours ago)
Started: Tue Apr 16 16:52:36 +0800 (21 hours ago)
Finished: Tue Apr 16 19:39:18 +0800 (18 hours ago)
Duration: 2 hours 46 minutes
Total CPU: 0.00275 (core*hour)
Total Memory: 0.04528 (GB*hour)
STEP PODNAME DURATION MESSAGE CPU(core*hour) MEMORY(GB*hour) MaxCpu(core) MaxMemory(GB)
✔ steps-jr6tw 0 0 0 0
└---✔ hello1 steps-jr6tw-2987978173 2h 0.00275 0.04528 0.000005 0.00028
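The totals and the MaxCpu/MaxMemory columns above can be thought of as a fold over the per-minute samples that the controller collects from metrics-server: totals accumulate usage multiplied by the sample interval, while the Max columns keep the largest single sample. The sketch below is illustrative only; the sample values and names are made up:

```python
# hypothetical per-minute CPU samples for one Pod, in cores
samples = [0.000004, 0.000005, 0.000003]
interval_h = 1 / 60                       # one sample per minute, in hours

total_core_hours = sum(s * interval_h for s in samples)
max_cores = max(samples)

print(max_cores)                          # 5e-06, cf. the MaxCpu(core) column
```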
Set Workflow priority
When some tasks are already running and an urgent task needs to run immediately, you can assign a high, medium, or low priority to a Workflow. High-priority tasks preempt the resources of low-priority tasks.
- You can set a high priority for a Pod. For example, create the arguments-high-priority-taskA.yaml file, copy the following content into it, and run the ags submit arguments-high-priority-taskA.yaml command to set a high priority for task A.
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
- You can set a medium priority for a Pod. For example, create the arguments-high-priority-taskB.yaml file, copy the following content into it, and run the ags submit arguments-high-priority-taskB.yaml command to set a medium priority for task B.
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: medium-priority
value: 100
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
- You can also set a high priority for a whole Workflow. For example, create the arguments-high-priority-Workflow.yaml file, copy the following content into it, and run the ags submit arguments-high-priority-Workflow.yaml command to set a high priority for all Pods in the Workflow.
apiVersion: argoproj.io/v1alpha1
kind: Workflow                         # new type of k8s spec
metadata:
  generateName: high-proty-            # name of the workflow spec
spec:
  entrypoint: whalesay                 # invoke the whalesay template
  podPriorityClassName: high-priority  # workflow level priority
  templates:
  - name: whalesay                     # name of the template
    container:
      image: ubuntu
      command: ["/bin/bash", "-c", "sleep 1000"]
      resources:
        requests:
          cpu: 3
The following example uses a Workflow that contains two Pods: one Pod is assigned a medium priority and the other a high priority. The high-priority Pod can then preempt the resources of the lower-priority Pod.
- Create the arguments-high-priority-steps.yaml file, copy the following content into it, and run the ags submit arguments-high-priority-steps.yaml command to set the Pod priorities.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello
  templates:
  - name: hello-hello-hello
    steps:
    - - name: low
        template: low
    - - name: low-2
        template: low
      - name: high
        template: high
  - name: low
    container:
      image: ubuntu
      command: ["/bin/bash", "-c", "sleep 30"]
      resources:
        requests:
          cpu: 3
  - name: high
    priorityClassName: high-priority   # step level priority
    container:
      image: ubuntu
      command: ["/bin/bash", "-c", "sleep 30"]
      resources:
        requests:
          cpu: 3
- As a result, the high-priority Pod preempts and deletes the old Pod. The execution result is as follows.
Name:                steps-sxvrv
Namespace:           default
ServiceAccount:      default
Status:              Failed
Message:             child 'steps-sxvrv-1724235106' failed
Created:             Wed Apr 17 15:06:16 +0800 (1 minute ago)
Started:             Wed Apr 17 15:06:16 +0800 (1 minute ago)
Finished:            Wed Apr 17 15:07:34 +0800 (now)
Duration:            1 minute 18 seconds

STEP            PODNAME                 DURATION  MESSAGE
 ✖ steps-sxvrv                                    child 'steps-sxvrv-1724235106' failed
 ├---✔ low      steps-sxvrv-3117418100  33s
 └-·-✔ high     steps-sxvrv-603461277   45s
   └-⚠ low-2    steps-sxvrv-1724235106  45s       pod deleted
Note: A high-priority task automatically preempts the resources occupied by lower-priority Pods. This stops the lower-priority tasks and interrupts their running processes, so use this feature with great caution.
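Conceptually, the preemption described in the note works like this: when a high-priority Pod cannot be scheduled, a running Pod of strictly lower priority is evicted to free its resources. The following is an illustration of the concept only, not the actual Kubernetes scheduler logic:

```python
def pick_victim(pending_priority, running_pods):
    """Return the lowest-priority running Pod that a pending Pod of
    pending_priority may evict, or None if none has a lower priority."""
    candidates = [p for p in running_pods if p["priority"] < pending_priority]
    return min(candidates, key=lambda p: p["priority"]) if candidates else None

running = [{"name": "low-2", "priority": 100},
           {"name": "high", "priority": 1000000}]
print(pick_victim(1000000, running)["name"])  # low-2
print(pick_victim(100, running))              # None: nothing to preempt
```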
Workflow Filter
In ags get workflow, for large Workflows you can use a filter to list only the Pods in a specified state.
- Run the ags get <pod name> --status Running command to list the Pods in the specified state.
# ags get pod-limits-n262v --status Running
Name:                pod-limits-n262v
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Wed Apr 17 15:59:08 +0800 (1 minute ago)
Started:             Wed Apr 17 15:59:08 +0800 (1 minute ago)
Duration:            1 minute 17 seconds
Parameters:
  limit: 300

STEP                 PODNAME                      DURATION  MESSAGE
 ● pod-limits-n262v
 ├-● run-pod(13:13)  pod-limits-n262v-3643890604  1m
 ├-● run-pod(14:14)  pod-limits-n262v-4115394302  1m
 ├-● run-pod(16:16)  pod-limits-n262v-3924248206  1m
 ├-● run-pod(17:17)  pod-limits-n262v-3426515460  1m
 ├-● run-pod(18:18)  pod-limits-n262v-824163662   1m
 ├-● run-pod(20:20)  pod-limits-n262v-4224161940  1m
 ├-● run-pod(22:22)  pod-limits-n262v-1343920348  1m
 ├-● run-pod(2:2)    pod-limits-n262v-3426502220  1m
 ├-● run-pod(32:32)  pod-limits-n262v-2723363986  1m
 ├-● run-pod(34:34)  pod-limits-n262v-2453142434  1m
 ├-● run-pod(37:37)  pod-limits-n262v-3225742176  1m
 ├-● run-pod(3:3)    pod-limits-n262v-2455811176  1m
 ├-● run-pod(40:40)  pod-limits-n262v-2302085188  1m
 ├-● run-pod(6:6)    pod-limits-n262v-1370561340  1m
- Run the ags get <pod name> --sum-info command to view summary statistics of the current Pod states.
# ags get pod-limits-n262v --sum-info --status Error
Name:                pod-limits-n262v
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Wed Apr 17 15:59:08 +0800 (2 minutes ago)
Started:             Wed Apr 17 15:59:08 +0800 (2 minutes ago)
Duration:            2 minutes 6 seconds
Pending:             198
Running:             47
Succeeded:           55
Parameters:
  limit: 300

STEP                 PODNAME  DURATION  MESSAGE
 ● pod-limits-n262v
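The Pending/Running/Succeeded counters printed by --sum-info amount to a histogram over the Pod states of the Workflow. A minimal sketch of that aggregation (the status list is fabricated to match the output above):

```python
from collections import Counter

# hypothetical Pod phases matching the summary shown above
pod_statuses = ["Pending"] * 198 + ["Running"] * 47 + ["Succeeded"] * 55
summary = Counter(pod_statuses)

for status in ("Pending", "Running", "Succeeded"):
    print(f"{status}: {summary[status]}")
# Pending: 198
# Running: 47
# Succeeded: 55
```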
Use the Autoscaler in the agile edition
- You already have a VPC.
- You already have a vSwitch.
- You have configured a security group.
- You have obtained the internal APIServer address of the agile edition cluster.
- You have determined the instance type of the nodes to scale out.
- You have created an ECS instance that has internet access.
Run the ags config autoscaler command and enter the corresponding values as prompted.
$ ags config autoscaler
Please input vswitchs with comma separated
vsw-hp3cq3fnv47bpz7x58wfe
Please input security group id
sg-hp30vp05x6tlx13my0qu
Please input the instanceTypes with comma separated
ecs.c5.xlarge
Please input the new ecs ssh password
xxxxxxxx
Please input k8s cluster APIServer address like(192.168.1.100)
172.24.61.156
Please input the autoscaling mode (current: release. Type enter to skip.)
Please input the min size of group (current: 0. Type enter to skip.)
Please input the max size of group (current: 1000. Type enter to skip.)
Create scaling group successfully.
Create scaling group config successfully.
Enable scaling group successfully.
Succeed
After the configuration is complete, log on to the Auto Scaling console to view the scaling group that was created.
Configure and use the ags configmap
In this example, hostNetwork is used by default.
- Run the kubectl get cm -n argo command to obtain the name of the YAML file that corresponds to the Workflow.
# kubectl get cm -n argo
NAME                            DATA   AGE
workflow-controller-configmap   1      6d23h
- Run the kubectl edit cm workflow-controller-configmap -n argo command to open the workflow-controller-configmap.yaml file and add the following content to it.
data:
  config: |
    extraConfig:
      enableHostNetwork: true
      defaultDnsPolicy: Default
After the content is added, the complete workflow-controller-configmap.yaml is as follows.
apiVersion: v1
data:
  config: |
    extraConfig:
      enableHostNetwork: true
      defaultDnsPolicy: Default
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
- After the configuration is complete, all newly deployed Workflows use hostNetwork by default, and their dnsPolicy is Default.
- Optional: If a PSP is configured, add the following content to the corresponding PSP YAML file.
hostNetwork: true
Note: If the YAML file already contains the hostNetwork parameter, change its value to true. A complete example YAML template is as follows:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
  annotations:
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
    apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
    apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'
spec:
  privileged: false
  # Required to prevent escalations to root.
  allowPrivilegeEscalation: false
  # This is redundant with non-root + disallow privilege escalation,
  # but we can provide it for defense in depth.
  requiredDropCapabilities:
  - ALL
  # Allow core volume types.
  volumes:
  - 'configMap'
  - 'emptyDir'
  - 'projected'
  - 'secret'
  - 'downwardAPI'
  # Assume that persistentVolumes set up by the cluster admin are safe to use.
  - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    # Require the container to run without root privileges.
    rule: 'MustRunAsNonRoot'
  seLinux:
    # This policy assumes the nodes are using AppArmor rather than SELinux.
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
    # Forbid adding the root group.
    - min: 1
      max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
    # Forbid adding the root group.
    - min: 1
      max: 65535
  readOnlyRootFilesystem: false