Automate O&M based on status change events of ECS instances

This topic describes how CloudMonitor automatically processes the status change events of Elastic Compute Service (ECS) instances by using Message Service (MNS) queues.

Prerequisites

A queue is created in the MNS console. Example: ecs-cms-event.
For more information, see Manage queues in the console.
A system event-triggered alert rule is created in the CloudMonitor console.
For more information, see Manage system event-triggered alert rules.
Python dependencies are installed.
All the code in this topic is written in Python 3.6. You can also use other programming languages, such as Java and PHP.
For information about how to install CloudMonitor SDK for Python, see the Install CloudMonitor SDK for Python section of the "CloudMonitor SDK for Python" topic.

Background information

In addition to the existing system events, CloudMonitor supports status change events for ECS instances, including interruption notification events for preemptible instances. A status change event is triggered whenever the status of an ECS instance changes. The change can result from operations that you perform in the ECS console, by using SDKs, or by calling API operations. It can also result from automatic scaling, overdue payments, or system exceptions.

CloudMonitor provides the following notification methods for system events: MNS queues, Function Compute, callback URLs, and Simple Log Service. This topic uses MNS queues to demonstrate three practices for automatically processing the status change events of ECS instances.

Procedure

CloudMonitor sends all status change events of ECS instances to an MNS queue. Your code then receives the messages from the queue and handles them.

Practice 1: Record all creation and release events of ECS instances

In the ECS console, you cannot query ECS instances that have been released. If you need to query released ECS instances, you can store the status change events of all ECS instances in a database or Simple Log Service. When an ECS instance is created, CloudMonitor sends a Pending event. When an ECS instance is released, CloudMonitor sends a Deleted event.

Create a Conf file.

The Conf file must contain the following parameters: endpoint, access_key, access_key_secret, region_id (example: cn-beijing), and queue_name.

Note

To obtain the endpoint, you can log on to the MNS console, go to the Queues page, and then click Get Endpoint.

import os

# Make sure that the ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables are configured in the code runtime environment. 
# If the project code is leaked, the AccessKey pair may be leaked and the security of resources within your account may be compromised. The following sample code shows how to use environment variables to obtain an AccessKey pair and use the AccessKey pair to call API operations. The sample code is for reference only. We recommend that you use Security Token Service (STS) tokens, which provide higher security.
class Conf:
    endpoint = 'http://<id>.mns.<region>.aliyuncs.com/'
    access_key = os.environ['ALIBABA_CLOUD_ACCESS_KEY_ID']
    access_key_secret = os.environ['ALIBABA_CLOUD_ACCESS_KEY_SECRET']
    region_id = 'cn-beijing'
    queue_name = 'ecs-cms-event'
    vsever_group_id = '<your_vserver_group_id>'
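
Because os.environ[...] raises a bare KeyError when a variable is missing, it can help to check the credentials before the client starts. The following is a minimal sketch of such a check. The helper name load_credentials and its error message are illustrative assumptions and are not part of any Alibaba Cloud SDK.

```python
import os


def load_credentials(env=None):
    """Return (access_key, access_key_secret) or raise with a readable message.

    Hypothetical helper for illustration; not part of any Alibaba Cloud SDK.
    """
    env = os.environ if env is None else env
    names = ('ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET')
    # Treat unset and empty variables the same way.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return env[names[0]], env[names[1]]
```

Passing a plain dictionary as env makes the check easy to exercise without touching the real environment.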

Use the MNS SDK to develop an MNS client for receiving messages from MNS.

# -*- coding: utf-8 -*-
import json
from mns.mns_exception import MNSExceptionBase
import logging
from mns.account import Account
from .config import Conf


class MNSClient(object):
    def __init__(self):
        self.account =  Account(Conf.endpoint, Conf.access_key, Conf.access_key_secret)
        self.queue_name = Conf.queue_name
        self.listeners = dict()

    def regist_listener(self, listener, eventname='Instance:StateChange'):
        if eventname in self.listeners.keys():
            self.listeners.get(eventname).append(listener)
        else:
            self.listeners[eventname] = [listener]

    def run(self):
        queue = self.account.get_queue(self.queue_name)
        while True:
            try:
                # Long polling: wait up to 5 seconds for a message.
                message = queue.receive_message(wait_seconds=5)
                event = json.loads(message.message_body)
                for listener in self.listeners.get(event['name'], []):
                    listener.process(event)
                # Delete the message only after all listeners have processed it.
                queue.delete_message(receipt_handle=message.receipt_handle)
            except MNSExceptionBase as e:
                if e.type == 'QueueNotExist':
                    logging.error('Queue %s does not exist. Create the queue before you receive messages.', self.queue_name)
                    return
                elif e.type == 'MessageNotExist':
                    # The queue is empty; keep polling.
                    logging.debug('No message, continue waiting')
                else:
                    logging.error('Failed to process message: %s', e)


class BasicListener(object):
    def process(self, event):
        pass

The preceding code is used to receive messages from MNS and delete the messages after the listener is called to consume the messages.
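
The dispatch step can be tried without an MNS queue. The following sketch feeds a hand-written message body through the same lookup-by-event-name logic. The sample payload contains only the fields that the listeners in this topic read, which is an assumption for illustration; a real event carries more fields. RecordingListener and dispatch are hypothetical names.

```python
import json
from collections import defaultdict

# Illustrative payload: only the fields that the listeners in this topic read.
SAMPLE_BODY = json.dumps({
    'name': 'Instance:StateChange',
    'content': {'resourceId': 'i-example', 'state': 'Stopped'},
})


class RecordingListener:
    """Hypothetical listener that remembers the state of every event it receives."""

    def __init__(self):
        self.seen = []

    def process(self, event):
        self.seen.append(event['content']['state'])


def dispatch(message_body, listeners):
    """Decode one message body and pass it to the listeners registered for its name."""
    event = json.loads(message_body)
    handlers = listeners.get(event['name'], [])
    for listener in handlers:
        listener.process(event)
    return len(handlers)


listeners = defaultdict(list)
recorder = RecordingListener()
listeners['Instance:StateChange'].append(recorder)
handled = dispatch(SAMPLE_BODY, listeners)
```

Events whose name has no registered listener are decoded and then ignored, which mirrors how the client deletes a message even when nothing consumes it.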

Register a listener to consume events. The listener generates a log entry each time the listener receives a Pending or Deleted event.

# -*- coding: utf-8 -*-
import logging
from .mns_client import BasicListener


class ListenerLog(BasicListener):
    def process(self, event):
        state = event['content']['state']
        resource_id = event['content']['resourceId']
        # Log only the events that mark the creation or release of an instance.
        if state in ('Pending', 'Deleted'):
            logging.info(f'The instance {resource_id} state is {state}')

Add the following code to the Main function:

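import logging

# The root logger drops INFO records by default. Raise the level so that the listener output is visible.
logging.basicConfig(level=logging.INFO)

mns_client = MNSClient()
mns_client.regist_listener(ListenerLog())
mns_client.run()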

In the production environment, you can store the events in a database or Simple Log Service for subsequent queries and audits.
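
As a rough illustration of the database option, the following sketch writes Pending and Deleted events to SQLite by using the sqlite3 module of the Python standard library. The class name ListenerSqlite, the table layout, and the history method are assumptions made for this example. In the project described in this topic, the class would inherit from BasicListener.

```python
import sqlite3


class ListenerSqlite:
    """Hypothetical listener that stores Pending and Deleted events in SQLite."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS instance_events ('
            'instance_id TEXT NOT NULL, state TEXT NOT NULL, '
            'recorded_at TEXT DEFAULT CURRENT_TIMESTAMP)'
        )

    def process(self, event):
        content = event['content']
        if content['state'] not in ('Pending', 'Deleted'):
            return
        with self.conn:  # Commits on success and rolls back on error.
            self.conn.execute(
                'INSERT INTO instance_events (instance_id, state) VALUES (?, ?)',
                (content['resourceId'], content['state']),
            )

    def history(self, instance_id):
        """Return the recorded states of one instance in insertion order."""
        rows = self.conn.execute(
            'SELECT state FROM instance_events WHERE instance_id = ? ORDER BY rowid',
            (instance_id,),
        )
        return [row[0] for row in rows]
```

With a file path instead of ':memory:', the records survive restarts, so a released instance can still be looked up by its ID.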

Practice 2: Automatically restart ECS instances that are shut down

In scenarios where ECS instances may be shut down unexpectedly, you may need to automatically restart the ECS instances.

You can reuse the MNS client developed in Practice 1 and create another listener. When the listener receives a Stopped event for an ECS instance, it calls the StartInstance operation to start the instance.

# -*- coding: utf-8 -*-
import logging
from aliyunsdkecs.request.v20140526 import StartInstanceRequest
from aliyunsdkcore.client import AcsClient
from .mns_client import BasicListener
from .config import Conf


class ECSClient(object):
    def __init__(self, acs_client):
        self.client = acs_client

    # Start the ECS instance.
    def start_instance(self, instance_id):
        logging.info(f'Start instance {instance_id} ...')
        request = StartInstanceRequest.StartInstanceRequest()
        request.set_accept_format('json')
        request.set_InstanceId(instance_id)
        self.client.do_action_with_exception(request)


class ListenerStart(BasicListener):
    def __init__(self):
        acs_client = AcsClient(Conf.access_key, Conf.access_key_secret, Conf.region_id)
        self.ecs_client = ECSClient(acs_client)

    def process(self, event):
        detail = event['content']
        instance_id = detail['resourceId']
        if detail['state'] == 'Stopped':
            self.ecs_client.start_instance(instance_id)

In the production environment, you can listen to Starting, Running, or Stopped events after the start command is run. Then, you can perform further O&M by using a timer and a counter based on whether the ECS instance is started.
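
One possible shape for the timer and counter is sketched below. It limits the number of start attempts per instance within a time window and resets the count when a Running event arrives. RestartTracker, its parameters, and the default values are hypothetical. The start_instance argument can be any callable that takes an instance ID, such as ECSClient.start_instance.

```python
import time


class RestartTracker:
    """Hypothetical helper that limits how often an instance is restarted."""

    def __init__(self, start_instance, max_attempts=3, window_seconds=600,
                 clock=time.monotonic):
        self.start_instance = start_instance  # Callable that takes an instance ID.
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock
        self.attempts = {}  # Instance ID -> timestamps of recent start attempts.

    def on_state(self, instance_id, state):
        """Handle one state change. Return True if a start was requested."""
        if state == 'Running':
            self.attempts.pop(instance_id, None)  # Recovered: reset the counter.
            return False
        if state != 'Stopped':
            return False
        now = self.clock()
        recent = [t for t in self.attempts.get(instance_id, [])
                  if now - t < self.window_seconds]
        if len(recent) >= self.max_attempts:
            self.attempts[instance_id] = recent
            return False  # Too many attempts: leave the instance for manual inspection.
        recent.append(now)
        self.attempts[instance_id] = recent
        self.start_instance(instance_id)
        return True
```

Injecting the clock keeps the window logic testable, and capping the attempts avoids an endless restart loop when an instance stops again immediately after each start.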

Practice 3: Automatically remove preemptible instances from SLB instances before the preemptible instances are released

An interruption event notification is triggered 5 minutes before a preemptible instance is released. During the 5 minutes, you can perform specific operations to prevent your services from being interrupted. For example, you can remove the preemptible instance from a Server Load Balancer (SLB) instance.

You can reuse the MNS client developed in Practice 1 and create another listener. When the listener receives the interruption event notification for a preemptible instance, it calls the SLB API to remove the preemptible instance from the vServer group of an SLB instance.

# -*- coding: utf-8 -*-
from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest
from .mns_client import BasicListener
from .config import Conf


class SLBClient(object):
    def __init__(self):
        self.client = AcsClient(Conf.access_key, Conf.access_key_secret, Conf.region_id)
        self.request = CommonRequest()
        self.request.set_method('POST')
        self.request.set_accept_format('json')
        self.request.set_version('2014-05-15')
        self.request.set_domain('slb.aliyuncs.com')
        self.request.add_query_param('RegionId', Conf.region_id)

    def remove_vserver_group_backend_servers(self, vserver_group_id, instance_id):
        self.request.set_action_name('RemoveVServerGroupBackendServers')
        self.request.add_query_param('VServerGroupId', vserver_group_id)
        self.request.add_query_param('BackendServers',
                                     "[{'ServerId':'" + instance_id + "','Port':'80','Weight':'100'}]")
        response = self.client.do_action_with_exception(self.request)
        return str(response, encoding='utf-8')


class ListenerSLB(BasicListener):
    def __init__(self, vsever_group_id):
        self.slb_caller = SLBClient()
        self.vsever_group_id = vsever_group_id

    def process(self, event):
        detail = event['content']
        instance_id = detail['instanceId']
        if detail['action'] == 'delete':
            self.slb_caller.remove_vserver_group_backend_servers(self.vsever_group_id, instance_id)

Important

For interruption event notifications, specify the event name when you register the listener: mns_client.regist_listener(ListenerSLB(Conf.vsever_group_id), 'Instance:PreemptibleInstanceInterruption').

In the production environment, you can apply for another preemptible instance and add it as a backend server of an SLB instance to ensure the performance of your services.
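
Both removing a backend server and adding a replacement require a BackendServers value. The following sketch builds that value with json.dumps instead of string concatenation, which avoids quoting mistakes when several instances are involved. The helper name backend_servers_param is hypothetical, the port and weight defaults are copied from the listing above, and the assumption that the API accepts standard double-quoted JSON should be checked against the SLB API reference.

```python
import json


def backend_servers_param(instance_ids, port=80, weight=100):
    """Build a BackendServers value for the given ECS instance IDs.

    Hypothetical helper for illustration; port and weight are sent as strings,
    as in the listing in this topic.
    """
    servers = [
        {'ServerId': instance_id, 'Port': str(port), 'Weight': str(weight)}
        for instance_id in instance_ids
    ]
    return json.dumps(servers)
```

The returned string can be passed to add_query_param('BackendServers', ...) in a request such as the one that SLBClient builds.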