In addition to the existing system events, status change events are released in CloudMonitor for Elastic Compute Service (ECS). The status change events include interruption notification events that are applicable to preemptible ECS instances. A status change event is triggered when the status of an ECS instance changes. Instance status changes can be caused by operations that you perform by using the ECS console or SDKs or by calling API operations. Instance status changes can also be caused by automatic scaling, overdue payments, or system exceptions.
Background information
The existing system events of ECS are used to notify you of alerts that require manual intervention. Status change events are not alerts. Status change events are regular notifications for automated audit and O&M scenarios. CloudMonitor allows you to automatically handle the status change events of ECS instances by using Function Compute or Message Service (MNS).
Preparations for automatically handling status change events of ECS instances
- Create an MNS queue.
  - Log on to the MNS console.
  - In the left-side navigation pane, click Queues. In the upper-left corner of the top navigation bar, select a region. On the Queues page, click Create Queue. The Create Queue panel appears.
  - Enter a queue name, such as ecs-cms-event, configure other parameters, and then click OK.
- Create an event-triggered alert rule.
  - Log on to the CloudMonitor console.
  - In the left-side navigation pane, choose Event Monitoring > System Event.
  - On the System Event page, click the Event-triggered Alert Rules tab. Then, click Create Alert Rule in the upper-left corner.
  - In the Basic Info section of the Create/Modify Event-triggered Alert Rule panel, enter a name for the alert rule. Example: ecs-test-rule.
  - In the Event-triggered Alert Rules section, perform the following operations:
    - Set the Product Type parameter to Elastic Compute Service (ECS).
    - Set the Event Type parameter to Status Notification.
    - Configure the Event Name parameter based on your business requirements.
    - Configure the Resource Range parameter based on your business requirements. If you set the Resource Range parameter to All Resources, CloudMonitor sends alert notifications for all resource-related events. If you set the Resource Range parameter to Application Groups, CloudMonitor sends alert notifications only for events that are related to the resources in the specified application group.
  - In the Notification Method section, perform the following operations:
    - Configure the Contact Group and Notification Method parameters based on your business requirements.
    - Select Message Service - Queue and configure the Region and Queue parameters based on your business requirements. For example, set the Queue parameter to ecs-cms-event.
  - Click OK.
- Install Python dependencies.
  The following code is tested in Python 3.6. You can use other programming languages, such as Java, based on your business requirements.
  Install the following Python dependencies from Python Package Index (PyPI):
  - aliyun-python-sdk-core-v3>=2.12.1
  - aliyun-python-sdk-ecs>=4.16.0
  - aliyun-mns>=1.1.5
Procedure for automatically handling status change events
CloudMonitor sends all status change events of ECS instances to MNS. Then, you can write code to receive messages from MNS and handle the messages.
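To make the message flow more concrete, the following minimal sketch parses one message body and extracts the fields that the listeners in this topic read. The sample payload is illustrative: it contains only the name, content.resourceId, and content.state fields, the instance ID is a made-up placeholder, and real events may carry additional fields. The parse_event helper is a hypothetical name, not part of any SDK.

```python
import json

# Illustrative payload: only the fields used in this topic are shown.
# The instance ID is a placeholder, not a real resource.
SAMPLE_MESSAGE_BODY = json.dumps({
    "name": "Instance:StateChange",
    "content": {
        "resourceId": "i-example0000000000",
        "state": "Stopped",
    },
})


def parse_event(message_body):
    """Return (event_name, resource_id, state) from a message body string."""
    event = json.loads(message_body)
    content = event.get("content", {})
    return event.get("name"), content.get("resourceId"), content.get("state")


if __name__ == "__main__":
    name, resource_id, state = parse_event(SAMPLE_MESSAGE_BODY)
    print(f"{name}: {resource_id} -> {state}")
```

Missing fields come back as None instead of raising an exception, so a caller can skip messages that do not look like status change events.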
- Practice 1: Record all creation and release events of ECS instances
You cannot query released ECS instances in the ECS console. If you want to query released ECS instances, you can store the status change events of all ECS instances in your databases or Log Service. When an ECS instance is created, a Pending event is triggered. When an ECS instance is released, a Deleted event is triggered. CloudMonitor records both types of events.
- Create a Conf file. Add the following parameters that are related to MNS to the Conf file:
  - endpoint: the endpoint that is used to access MNS. To obtain the endpoint, click Get Endpoint on the Queues page in the MNS console.
  - access_key and access_key_secret: the AccessKey ID and the AccessKey secret that are used to access MNS. You can obtain the AccessKey ID and the AccessKey secret in the User Management console.
  - region_id and queue_name: the region where the MNS queue resides and the name of the MNS queue. You can obtain the region ID and queue name on the Queues page in the MNS console.
    class Conf:
        endpoint = 'http://<id>.mns.<region>.aliyuncs.com/'
        access_key = '<access_key>'
        access_key_secret = '<access_key_secret>'
        region_id = 'cn-beijing'
        queue_name = 'test'
        # Used only in Practice 3: the ID of the vServer group of the SLB instance.
        vsever_group_id = '<your_vserver_group_id>'
- Use the MNS SDK to develop an MNS client that is used to receive messages from MNS.
    # -*- coding: utf-8 -*-
    import json
    import logging

    from mns.account import Account
    from mns.mns_exception import MNSExceptionBase

    from .config import Conf


    class MNSClient(object):
        def __init__(self):
            self.account = Account(Conf.endpoint, Conf.access_key, Conf.access_key_secret)
            self.queue_name = Conf.queue_name
            # Maps an event name to the listeners that are registered for it.
            self.listeners = dict()

        def regist_listener(self, listener, eventname='Instance:StateChange'):
            if eventname in self.listeners:
                self.listeners[eventname].append(listener)
            else:
                self.listeners[eventname] = [listener]

        def run(self):
            queue = self.account.get_queue(self.queue_name)
            while True:
                try:
                    # Long polling: wait up to 5 seconds for a message.
                    message = queue.receive_message(wait_seconds=5)
                    event = json.loads(message.message_body)
                    if event['name'] in self.listeners:
                        for listener in self.listeners.get(event['name']):
                            listener.process(event)
                    # Delete the message after the listeners consume it.
                    queue.delete_message(receipt_handle=message.receipt_handle)
                except MNSExceptionBase as e:
                    if e.type == 'QueueNotExist':
                        logging.error('Queue %s not exist, please create queue before receive message.', self.queue_name)
                    else:
                        logging.error('No Message, continue waiting')


    class BasicListener(object):
        def process(self, event):
            pass
The preceding code is used to receive messages from MNS and delete the messages after the listener is called to consume the messages.
- Register a listener to consume events. The listener generates a log entry each time the listener receives a Pending or Deleted event.
    # -*- coding: utf-8 -*-
    import logging

    from .mns_client import BasicListener


    class ListenerLog(BasicListener):
        def process(self, event):
            state = event['content']['state']
            resource_id = event['content']['resourceId']
            if state == 'Pending':
                logging.info(f'The instance {resource_id} state is {state}')
            elif state == 'Deleted':
                logging.info(f'The instance {resource_id} state is {state}')

Add the following code to the Main function:

    mns_client = MNSClient()
    mns_client.regist_listener(ListenerLog())
    mns_client.run()
In the production environment, you can store the events in your databases or Log Service for subsequent queries and audits.
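As one possible way to persist events for later queries, the following sketch stores Pending and Deleted events in a SQLite database by using the Python standard library. The ListenerSQLite class, the instance_events table, and its columns are hypothetical names chosen for illustration. The class does not import BasicListener so that the example stays self-contained, but it exposes the same process(event) method.

```python
import json
import sqlite3
import time


class ListenerSQLite(object):
    """Hypothetical listener that stores Pending and Deleted events in SQLite.

    It exposes the same process(event) method as BasicListener, so it can be
    registered with MNSClient.regist_listener in the same way as ListenerLog.
    """

    def __init__(self, db_path=':memory:'):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS instance_events ('
            'received_at REAL, resource_id TEXT, state TEXT, raw TEXT)'
        )

    def process(self, event):
        content = event['content']
        # Only creation and release events are recorded.
        if content['state'] not in ('Pending', 'Deleted'):
            return
        self.conn.execute(
            'INSERT INTO instance_events VALUES (?, ?, ?, ?)',
            (time.time(), content['resourceId'], content['state'], json.dumps(event)),
        )
        self.conn.commit()

    def history(self, resource_id):
        """Return the recorded states of an instance, oldest first."""
        rows = self.conn.execute(
            'SELECT state FROM instance_events WHERE resource_id = ? ORDER BY rowid',
            (resource_id,),
        )
        return [row[0] for row in rows]
```

Pass a file path instead of ':memory:' to keep the records across restarts. The history method can then answer questions about instances that are no longer visible in the ECS console.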
- Practice 2: Automatically start ECS instances that are shut down
In scenarios in which ECS instances may be unexpectedly shut down, you may want to automatically start the ECS instances.
You can reuse the MNS client that is developed in Practice 1 and create another listener to automatically start ECS instances. When the listener receives a Stopped event of an ECS instance, you can run the start command on the ECS instance to start the instance.
    # -*- coding: utf-8 -*-
    import logging

    from aliyunsdkcore.client import AcsClient
    from aliyunsdkecs.request.v20140526 import StartInstanceRequest

    from .config import Conf
    from .mns_client import BasicListener


    class ECSClient(object):
        def __init__(self, acs_client):
            self.client = acs_client

        # Start the ECS instance.
        def start_instance(self, instance_id):
            logging.info(f'Start instance {instance_id} ...')
            request = StartInstanceRequest.StartInstanceRequest()
            request.set_accept_format('json')
            request.set_InstanceId(instance_id)
            self.client.do_action_with_exception(request)


    class ListenerStart(BasicListener):
        def __init__(self):
            acs_client = AcsClient(Conf.access_key, Conf.access_key_secret, Conf.region_id)
            self.ecs_client = ECSClient(acs_client)

        def process(self, event):
            detail = event['content']
            instance_id = detail['resourceId']
            if detail['state'] == 'Stopped':
                self.ecs_client.start_instance(instance_id)
In the production environment, you can listen for Starting, Running, or Stopped events after you run the start command. Then, you can perform O&M by using a timer and a counter based on whether the ECS instance is started.
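The timer-and-counter idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: StartRetryTracker is a hypothetical helper, the limit of three attempts within 600 seconds is an arbitrary default, and start_func stands for any callable that starts an instance, such as ECSClient.start_instance.

```python
import time


class StartRetryTracker(object):
    """Hypothetical helper that limits automatic restarts per instance."""

    def __init__(self, start_func, max_attempts=3, window_seconds=600, clock=time.time):
        self.start_func = start_func      # for example, ECSClient.start_instance
        self.max_attempts = max_attempts  # counter limit within the time window
        self.window_seconds = window_seconds
        self.clock = clock                # injectable so that the timer can be tested
        self.attempts = {}                # instance ID -> timestamps of start attempts

    def on_event(self, instance_id, state):
        """Handle a state and return 'started', 'gave_up', 'recovered', or 'ignored'."""
        now = self.clock()
        if state == 'Running':
            # The instance is up again: forget earlier attempts.
            was_tracked = self.attempts.pop(instance_id, None) is not None
            return 'recovered' if was_tracked else 'ignored'
        if state != 'Stopped':
            return 'ignored'
        # Keep only the attempts that fall inside the time window.
        recent = [t for t in self.attempts.get(instance_id, [])
                  if now - t < self.window_seconds]
        if len(recent) >= self.max_attempts:
            self.attempts[instance_id] = recent
            return 'gave_up'
        recent.append(now)
        self.attempts[instance_id] = recent
        self.start_func(instance_id)
        return 'started'
```

A listener could call on_event from its process method and raise an alert when the result is 'gave_up', which avoids restarting an instance that repeatedly fails to start.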
- Practice 3: Automatically remove preemptible instances from Server Load Balancer (SLB) instances before the preemptible instances are released
An interruption notification event is triggered 5 minutes before a preemptible instance is released. During the 5 minutes, you can perform specific operations to prevent your services from being interrupted. For example, you can remove the preemptible instance from a Server Load Balancer (SLB) instance.
You can reuse the MNS client that is developed in Practice 1 and create another listener. When the listener receives an interruption notification event for a preemptible instance, you can call the SLB SDK to remove the preemptible instance from an SLB instance.
    # -*- coding: utf-8 -*-
    from aliyunsdkcore.client import AcsClient
    from aliyunsdkcore.request import CommonRequest

    from .config import Conf
    from .mns_client import BasicListener


    class SLBClient(object):
        def __init__(self):
            self.client = AcsClient(Conf.access_key, Conf.access_key_secret, Conf.region_id)
            self.request = CommonRequest()
            self.request.set_method('POST')
            self.request.set_accept_format('json')
            self.request.set_version('2014-05-15')
            self.request.set_domain('slb.aliyuncs.com')
            self.request.add_query_param('RegionId', Conf.region_id)

        def remove_vserver_group_backend_servers(self, vserver_group_id, instance_id):
            self.request.set_action_name('RemoveVServerGroupBackendServers')
            self.request.add_query_param('VServerGroupId', vserver_group_id)
            self.request.add_query_param('BackendServers',
                                         "[{'ServerId':'" + instance_id + "','Port':'80','Weight':'100'}]")
            response = self.client.do_action_with_exception(self.request)
            return str(response, encoding='utf-8')


    class ListenerSLB(BasicListener):
        def __init__(self, vsever_group_id):
            self.slb_caller = SLBClient()
            # Use the vServer group ID that is passed in by the caller.
            self.vsever_group_id = vsever_group_id

        def process(self, event):
            detail = event['content']
            instance_id = detail['instanceId']
            # The action is 'delete' when the preemptible instance is about to be released.
            if detail['action'] == 'delete':
                self.slb_caller.remove_vserver_group_backend_servers(self.vsever_group_id, instance_id)
Important: For interruption notification events, specify the event name when you register the listener, in the following format:

    mns_client.regist_listener(ListenerSLB(Conf.vsever_group_id), 'Instance:PreemptibleInstanceInterruption')

In the production environment, you can create another preemptible instance and add the instance as a backend server to the SLB instance. This ensures that your services are not interrupted.
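When you add a replacement instance, a request that adds backend servers (for example, the AddVServerGroupBackendServers operation) needs a BackendServers value of the same shape as the one used for removal. The SLB listener above assembles that value by string concatenation. The following sketch shows a less error-prone alternative that uses json.dumps. The build_backend_servers helper is a hypothetical name, and the default port 80 and weight 100 only mirror the values in the listener code. Whether the API expects strings or numbers for Port and Weight is not verified here, so the sketch keeps strings.

```python
import json


def build_backend_servers(instance_ids, port='80', weight='100'):
    """Return a JSON array string that describes the backend servers."""
    servers = [
        {'ServerId': instance_id, 'Port': port, 'Weight': weight}
        for instance_id in instance_ids
    ]
    return json.dumps(servers)


if __name__ == '__main__':
    # The instance ID is a placeholder, not a real resource.
    print(build_backend_servers(['i-example0000000000']))
```

The returned string can be passed to add_query_param('BackendServers', ...) in place of the concatenated literal.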