Self-built MySQL database on ECS with manual disaster tolerance
Posted time: Jan 25, 2017 10:22 AM
1. About the environment
Many cloud customers purchase only ECS because their database is small, but a database server is still required, so they often resort to a self-built MySQL database. Alibaba Cloud RDS provides a full range of database O&M services, including monitoring, optimization and a master-slave mechanism for high availability. Even so, it is useful for new users to understand how to build a MySQL master-slave mechanism and the master-slave switch procedure, both for business continuity and for learning. Without a master-slave mechanism, a single server or instance has no real-time data replication scheme that provides another intact, independent instance for quick failover in case of an instance fault, so the business cannot be recovered within a short span of time.
2. Implement master-slave mechanism for databases
2.1. Principle of master-slave mechanism
MySQL provides a simple and reliable master-slave mechanism and its basic principle is illustrated in the figure below. There are three steps involved:
(1) The master node logs the changes to the binary log (these logs are called binary log events);
(2) The slave node copies the binary log events of the master node to its relay logs;
(3) The slave node redoes the events in the relay logs and reflects the changes in its own data.
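Each of the three stages can be observed from a SQL client; a minimal sketch (output fields shown as comments):

```sql
-- Step 1: on the master, every change is recorded in the binlog
SHOW MASTER STATUS;   -- current binlog file name and position
SHOW BINARY LOGS;     -- all binlog files the master currently keeps

-- Steps 2 and 3: on the slave, the I/O and SQL threads report their progress
SHOW SLAVE STATUS\G
-- Relay_Master_Log_File / Read_Master_Log_Pos: how far the I/O thread has copied
-- Exec_Master_Log_Pos: how far the SQL thread has replayed the relay log
```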
2.2. Plan the master-slave mechanism
MySQL provides master-slave replication components, but no automated monitoring or switching for HA. Alibaba Group has self-developed tools for HA, and the open-source combination Keepalived + LVS offers similar functions. Here we propose a solution based on SLB and leave the implementation of automated HA to the reader. Suppose we have two MySQL instances, A and B, and we implement two-way master-slave replication, that is, the MM (Master-Master) deployment schema. The front end is mounted on SLB. At any point in time, SLB designates only one master instance by setting its weight to 100; the other instance (the slave) has its weight set to 0 and its read_only parameter enabled. When a fault occurs, the weight of the slave instance is set to 100 and that of the master instance to 0, so SLB achieves manual HA. Applications may be unavailable for a short period (1 to 2 minutes); after SLB completes the switch, applications resume service through the reconnect mechanism of the connection pool.
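In this scheme, a manual switch therefore boils down to three actions (the SLB weight change itself is done in the console, shown here as a comment):

```sql
-- On instance B (the current slave): allow writes before it takes over
SET GLOBAL read_only = 0;

-- In the SLB console: change weight of B from 0 to 100, weight of A from 100 to 0

-- On instance A, once it is reachable again: refuse application writes
SET GLOBAL read_only = 1;
```

Note that read_only does not restrict users with the SUPER privilege, so the replication account and administrators can still write to the read-only instance.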
2.3. Implement master-slave mechanism
(1) Make the production instance ready on an existing ECS. Apply for a new ECS and ensure that it is not on the same physical host (NC) as the production instance's ECS. If the MySQL of the production instance is installed on the system disk, you can create a snapshot of the system disk and make an image from it. The new ECS can then be created from this user-defined image, which saves the following steps:
(2) Install and configure MySQL on the new ECS. Try to make its configuration as consistent as possible with that of the production instance;
(3) Edit the /etc/my.cnf file of the production instance and add the following content:
The binlog configuration is the focus of this step. Apart from server_id, which you should set based on the last two digits of the IP address plus the port number, you can copy the other items directly. In the example, server_id 3017441834 corresponds to a MySQL server with IP address *.*.124.182 and port 3306.
At the same time, make sure the /home/mysql/log directory exists. You must have the permission to this directory to start or stop MySQL.
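A minimal example of the binlog-related items described here (all values are illustrative assumptions, not the original configuration):

```ini
[mysqld]
server_id        = 1823306                    # example only: derive yours from IP/port per the rule above
log-bin          = /home/mysql/log/mysql-bin  # binlog files under /home/mysql/log
binlog_format    = ROW
expire_logs_days = 7                          # assumption: prune binlogs after a week
```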
(4) Restart the MySQL production instance, then log in to MySQL to check whether the configuration has taken effect.
You can stop the instance with the following command:
mysqladmin -h127.0.0.1 -uroot -P3306 shutdown
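To bring the instance back up after the shutdown so the new configuration is loaded, a common start command is (paths are illustrative):

```shell
# stop, then start in the background with the edited config file
mysqladmin -h127.0.0.1 -uroot -P3306 shutdown
mysqld_safe --defaults-file=/etc/my.cnf --user=mysql &
```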
You can use statements similar to the one below to check whether the configuration has taken effect:
show variables like 'log_bin%';
(5) If the configuration is effective on the production instance, perform the same operations on the slave instance;
(6) Execute "flush tables with read lock" to lock the production instance;
(7) Synchronize the data on the production instance to the slave instance. Two schemes are available: identical deployment on the production instance and the slave instance, or different deployment on the production instance and the slave instance. In the former scheme, you can directly copy all the files and directories under the datadir directory to the slave instance remotely. In the latter scheme, you can run the “mysqldump” command to export the data on the production instance by database, and then import the data to the slave instance. Keep the production instance locked throughout the process;
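For the mysqldump scheme, the export/import pair looks like the following sketch (host addresses and the database name are placeholders):

```shell
# On the production instance: dump one database; --master-data=2 records the
# binlog coordinates as a comment at the top of the dump file
mysqldump -h127.0.0.1 -P3306 -uroot -p --master-data=2 \
          --databases mydb > mydb.sql

# On the slave instance: load the dump
mysql -h127.0.0.1 -P3306 -uroot -p < mydb.sql
```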
(8) Unlock the production instance: unlock tables;
(9) Query the status of the master production instance and note down the binlog file name and the corresponding position serial number;
(10) Execute the following statements on the slave instance:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 10000;
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
grant replication slave, replication client on *.* to 'repuser'@'%' identified by 'repuser';
Here the default password for repuser is also "repuser", but MySQL 5.7 enforces password complexity requirements, so use a more complex password there.
change master to master_host='<production instance IP>', master_port=3306, master_user='repuser', master_password='repuser', master_log_file='<binlog file name recorded in Step 9>', master_log_pos=<position serial number recorded in Step 9>;
(11) Start replication from the production instance on the slave instance. Execute: start slave;
(12) Check the slave instance status with show slave status\G and pay attention to two parameters:
Slave_IO_Running and Slave_SQL_Running. If both values are "Yes", the configuration is correct and replication has started;
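A healthy slave reports something like the following excerpt (values are illustrative):

```sql
mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
      Seconds_Behind_Master: 0
```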
(13) Next, configure the reverse data replication from the slave instance to the production instance. Check the master status on the slave instance with show master status and note down the binlog file name and position serial number;
(14) Repeat steps 10-12 on the production instance to complete the reverse replica configuration.
(15) Test services through SLB and observe whether data written to the production instance is synchronized to the slave instance in real time;
(16) Add the following content to the /etc/my.cnf file of all the instances:
Meanwhile, we should set the slave instance to read-only:
set global read_only=1;
3. Disaster failover procedures
In the flowchart, A represents the production instance and B represents the slave instance.
(1) When instance A fails, first consider restarting or repairing it. If A starts normally within a short period (30 minutes), and the startup logs and business-side tests are normal, the services can be deemed recovered and the emergency handled successfully;
(2) If A restarts successfully but the logs report errors, analyze the errors first and identify their impact scope and severity to determine whether to ignore or repair them. If the repairs can be completed within 30 minutes, we can still deem the services recovered successfully;
(3) If A restarts successfully but the errors reported in the logs cannot be fixed within 30 minutes, start the failover procedure, that is:
enable reads and writes on the slave instance: set global read_only=0;
on SLB, adjust the weight of instance A from 100 to 0 and the weight of instance B from 0 to 100 to recover the services first;
(4) Once instance B has been promoted to master and is serving external traffic, try to fix instance A in the background as fast as possible. If A is repaired successfully, check whether the replication relationship between A and B is intact. If it is, the emergency handling is over;
(5) If instance A is fixed but the replication relationship between the two instances fails to recover, re-configure the master-slave mechanism following the steps in Section 2.3;
(6) If instance A cannot be recovered after the fault, for example the ECS cannot be restarted, initiate the failover procedure directly and promote instance B to master. Then repair and recover instance A in the background as soon as possible.
4. Fix MySQL synchronization failure
4.1. Synchronization initialization failure
On the slave database, execute: show slave status\G
If Slave_IO_Running or Slave_SQL_Running shows NO, rebuild the replication position as follows.
On the master database, execute:
On the slave database, execute:
change master to master_host='10.101.3.10', master_port=3301, master_user='repuser', master_password='repuser', master_auto_position=1;
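Note that master_auto_position=1 only works when GTID-based replication is enabled on both instances, which the earlier my.cnf steps did not cover; if it is not already on, both configuration files would need the standard GTID settings:

```ini
[mysqld]
gtid_mode                = ON
enforce_gtid_consistency = ON
log_slave_updates        = ON   # required for GTID replication on MySQL 5.6
```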
4.2. Interrupted data synchronization
On the slave database, execute: show slave status\G
If Slave_SQL_Running shows NO, you can do the following:
On the slave database, execute:
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
Execute show slave status\G again and check whether both parameters show "Yes".