GetBatchMetrics

Last Updated: Jun 19, 2020

This topic describes the metrics used for monitoring resource usage in Batch Compute and how to query the metrics and their statistics. Batch Compute allocates resources for running jobs on clusters. The metrics in Batch Compute are available in the cluster and job dimensions.

Cluster metrics

Metrics

The following table describes the metrics that Batch Compute provides in the cluster dimension.

| Metric | Name | Unit | Aggregated statistics |
|---|---|---|---|
| cls_dataVfsFsSizePused | Data disk usage | % | Average, maximum, and minimum |
| cls_systemCpuLoad | CPU load | % | Average, maximum, and minimum |
| cls_systemCpuUtilIdle | CPU idle rate | % | Average, maximum, and minimum |
| cls_systemCpuUtilUsed | CPU usage | % | Average, maximum, and minimum |
| cls_vfsFsSizePused | System disk usage | % | Average, maximum, and minimum |
| cls_vmMemorySizePused | Memory usage | % | Average, maximum, and minimum |

  • Batch Compute collects the metrics for each instance.
  • Batch Compute collects the statistics of each metric by cluster, group, and instance.
  • The statistics of each metric include the average, maximum, and minimum values in the past minute.
  • By default, Batch Compute pushes the statistics to CloudMonitor every 10 seconds.
  • When you call the DescribeMetricData operation to query the statistics of a metric, you can set the Period parameter to specify the period for aggregating the statistics. The default aggregation period is 1 minute.
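To make the idea of an aggregation period concrete, the following minimal sketch groups a series of samples into fixed windows and computes the average, maximum, and minimum of each window. It is an illustration only, not the way Batch Compute or CloudMonitor computes statistics internally. The function name `aggregate_by_period` and the `(timestamp, value)` input format are assumptions made for this example.

```python
from collections import defaultdict


def aggregate_by_period(samples, period_seconds=60):
    """Group (timestamp, value) samples into fixed windows and compute statistics.

    `samples` is an iterable of (unix_timestamp_seconds, value) pairs. This
    input format is an assumption made for illustration only.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the window that contains it.
        window_start = int(timestamp) // period_seconds * period_seconds
        buckets[window_start].append(value)

    result = []
    for window_start in sorted(buckets):
        values = buckets[window_start]
        result.append({
            "timestamp": window_start,
            "Average": sum(values) / len(values),
            "Maximum": max(values),
            "Minimum": min(values),
        })
    return result


if __name__ == "__main__":
    # Hypothetical CPU usage samples taken 10 seconds apart. The last sample
    # falls into the next 60-second window.
    cpu_samples = [(0, 10.0), (10, 20.0), (20, 30.0), (30, 40.0),
                   (40, 50.0), (50, 60.0), (60, 90.0)]
    for point in aggregate_by_period(cpu_samples, period_seconds=60):
        print(point)
```

A larger `period_seconds` value yields fewer, coarser data points, which is the trade-off that the Period parameter controls.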

Sample statistics

[Figure: sample statistics of cluster metrics]

Job metrics

Metrics

The following table describes the metrics that Batch Compute provides in the job dimension.

| Metric | Name | Unit | Aggregated statistics |
|---|---|---|---|
| job_dataVfsFsSizePused | Data disk usage | % | Average, maximum, and minimum |
| job_systemCpuLoad | CPU load | % | Average, maximum, and minimum |
| job_systemCpuUtilIdle | CPU idle rate | % | Average, maximum, and minimum |
| job_systemCpuUtilUsed | CPU usage | % | Average, maximum, and minimum |
| job_vfsFsSizePused | System disk usage | % | Average, maximum, and minimum |
| job_vmMemorySizePused | Memory usage | % | Average, maximum, and minimum |

Sample statistics

[Figure: sample statistics of job metrics]

Use API operations to query metrics and related statistics

Batch Compute pushes the statistics of all metrics to CloudMonitor. You can query the metrics and related statistics of Batch Compute instances by calling the API operations that CloudMonitor provides.

DescribeMetricMetaList

You can call this operation to query metrics.

DescribeMetricData

You can call this operation to query the statistics of each metric collected for a cluster or job.
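A statistics query needs a namespace, a metric name, a dimension that identifies the cluster or job, and a time range. The sketch below assembles these values as a plain dictionary and sends no request to CloudMonitor. The helper name `build_metric_query` and the dictionary layout are hypothetical. The namespace `acs_batchcomputenew`, the `jobId` and `clusterId` dimension keys, the 60-second period, and the time format follow the sample code in this topic.

```python
import datetime
import json


def build_metric_query(obj_id, metric_name, end_time=None, days=7, period=60):
    """Assemble the query parameters for one metric of a job or cluster.

    Hypothetical helper: it only builds a dictionary and sends no request.
    """
    obj_id = obj_id.strip()
    end_time = end_time or datetime.datetime.now()
    start_time = end_time - datetime.timedelta(days=days)
    # IDs whose prefix contains "job" are treated as jobs; all other IDs are
    # treated as clusters, mirroring the sample code in this topic.
    prefix = obj_id.split("-")[0]
    dimension_key = "jobId" if "job" in prefix else "clusterId"
    time_format = "%Y-%m-%d %H:%M:%S"
    return {
        "Namespace": "acs_batchcomputenew",
        "MetricName": metric_name,
        "Period": str(period),
        "StartTime": start_time.strftime(time_format),
        "EndTime": end_time.strftime(time_format),
        "Dimensions": json.dumps([{dimension_key: obj_id}]),
    }


if __name__ == "__main__":
    query = build_metric_query(
        "job-000000005D16F74B00006883000332DD", "job_systemCpuUtilUsed")
    for key, value in query.items():
        print(key, "=", value)
```

Keeping parameter assembly separate from the network call makes it possible to check the dimension and time range before any request is sent.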

Sample code

#!/usr/bin/env python3
# coding=utf-8
# https://www.alibabacloud.com/help/doc-detail/51936.htm
import datetime
import json
import sys
import time
from functools import wraps

from aliyunsdkcore.client import AcsClient
from aliyunsdkcms.request.v20190101.DescribeMetricListRequest import DescribeMetricListRequest
from aliyunsdkcms.request.v20190101.DescribeMetricMetaListRequest import DescribeMetricMetaListRequest

akId = 'your key Id'
akKey = 'your key'
region = 'cn-hangzhou'
# jobId = "job-000000005D16F74B00006883000303E9"
jobId = "job-000000005D16F74B00006883000332DD"


def retryWrapper(func):
    """Retry the wrapped call with exponential backoff (0.5 s, 1 s, 2 s, ...)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        index = 0
        while True:
            try:
                return func(*args, **kwargs)
            except Exception:
                # Re-raise the last error once the retry budget is used up.
                if index > 6:
                    raise
                time.sleep(0.5 * pow(2, index))
                index += 1
    return wrapper


@retryWrapper
def listBatchMetricMeta(client, objId):
    """Return the names of the metrics whose name contains the ID prefix (for example, "job")."""
    metrics = []
    request = DescribeMetricMetaListRequest()
    request.set_accept_format('json')
    request.set_Namespace("acs_batchcomputenew")
    response = client.do_action_with_exception(request)
    res = json.loads(response)
    prefix = objId.strip().split("-")[0]
    for metric in res["Resources"]["Resource"]:
        if prefix not in metric["MetricName"]:
            continue
        metrics.append(metric["MetricName"])
    return metrics


@retryWrapper
def getSpecJobMetricsInfo(client, objId, metrics, startTime=None):
    """Return all data points of one metric for the specified job or cluster."""
    nextToken = None
    request = DescribeMetricListRequest()
    request.set_accept_format('json')
    request.set_Period("60")
    request.set_Length("1000")
    request.set_EndTime(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time())))
    # By default, the statistics in the past seven days are queried.
    if not startTime:
        sevenDayAgo = datetime.datetime.now() - datetime.timedelta(days=7)
        startTime = sevenDayAgo.strftime("%Y-%m-%d %H:%M:%S")
    request.set_StartTime(startTime)
    # Choose the dimension key based on the ID prefix.
    prefix = objId.strip().split("-")[0]
    if "job" in prefix:
        dimensionInfo = [{"jobId": objId}]
    else:
        dimensionInfo = [{"clusterId": objId}]
    request.set_Dimensions(json.dumps(dimensionInfo))
    request.set_MetricName(metrics)
    request.set_Namespace("acs_batchcomputenew")
    metricsInfo = []
    # Follow NextToken until all pages are read.
    while True:
        if nextToken:
            request.set_NextToken(nextToken)
        response = client.do_action_with_exception(request)
        res = json.loads(response)
        if res.get("Datapoints"):
            # Datapoints is returned as a JSON-encoded string.
            metricsInfo.extend(json.loads(res["Datapoints"]))
        else:
            print(res)
        if res.get("NextToken"):
            nextToken = res["NextToken"]
            continue
        else:
            break
    return metricsInfo


if __name__ == "__main__":
    client = AcsClient(akId, akKey, region)
    # metricsName = ['job_systemCpuUtilIdle', 'job_systemCpuLoad', 'job_vmMemorySizePused', 'job_vfsFsSizePused', 'job_dataVfsFsSizePused']
    metricsName = listBatchMetricMeta(client, jobId)
    for metrics in metricsName:
        try:
            ret = getSpecJobMetricsInfo(client, jobId, metrics)
        except Exception as e:
            print("get metrics info failed, %s" % str(e))
            sys.exit(1)
        if not len(ret):
            continue
        # You can aggregate the returned data.
        print(ret)

  • Before running the sample code, install the Alibaba Cloud SDK for Python by running the following commands:
      pip install aliyun_python_sdk_cms
      pip install aliyun_python_sdk_core
  • Make sure that the account with the AccessKey specified in the code has the AliyunCloudMonitorReadOnlyAccess permission. For more information about how to grant permissions, see section 5.2 in Activate Batch Compute.
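The sample code leaves aggregation of the returned data to the caller. One possible way to reduce the returned data points to a single summary is sketched below. The function name `summarize_datapoints` is hypothetical, and the field names `Average`, `Maximum`, and `Minimum` are assumed from the aggregated statistics listed in this topic; check them against an actual response before relying on them.

```python
def summarize_datapoints(datapoints):
    """Reduce a list of per-period data points to one overall summary.

    Each data point is assumed to be a dict with "Average", "Maximum", and
    "Minimum" keys. Verify these field names against a real response.
    Returns None when the list is empty.
    """
    if not datapoints:
        return None
    averages = [float(point["Average"]) for point in datapoints]
    return {
        "count": len(datapoints),
        "mean_of_averages": sum(averages) / len(averages),
        "peak": max(float(point["Maximum"]) for point in datapoints),
        "lowest": min(float(point["Minimum"]) for point in datapoints),
    }


if __name__ == "__main__":
    # Two hypothetical data points for a CPU usage metric.
    sample = [
        {"Average": 20.0, "Maximum": 35.0, "Minimum": 5.0},
        {"Average": 40.0, "Maximum": 80.0, "Minimum": 10.0},
    ]
    print(summarize_datapoints(sample))
```

Note that the mean of per-period averages equals the true average over the whole window only when every period contains the same number of samples.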

Use OpenAPI Explorer to query metrics

Alibaba Cloud provides OpenAPI Explorer to simplify API usage. You can use OpenAPI Explorer to automatically generate the sample code by configuring the basic information required for the target operation.