[Others]What can we learn from GitLab’s failure in fully restoring deleted databases
Created#Posted time:Mar 30, 2017 4:21 PM
On February 1, GitLab made big news: they deleted their database!
On February 1, after a long, exhausting workday, an engineer at the famous code hosting website GitLab accidentally deleted production data. By the time he realized what he was doing, he had deleted 495.5 GB of the 500 GB of production data, leaving only 4.5 GB. He did not try to hide it: he live-streamed the progress of the recovery. After nearly seven hours of effort, the data was finally restored, but six hours' worth of data was still lost.
Fortunately, the code repositories were not lost; only PR and Issue discussions were. Many programmers build their version control on GitLab or use the enterprise services GitLab provides.
Let's get back to the point. Since we are not involved in GitLab's management, we should not second-guess their internal decisions. Instead, let's talk about the problems behind the accident.
GitLab provides public services and has always paid special attention to data backup. GitLab.com had five backup mechanisms: regular backups (every 24 hours), automatic synchronization, LVM snapshots (every 24 hours), Azure backups (used only for NFS, not for databases), and S3 backups. Yet when this accident occurred, all of the backups turned out to be invalid!
Fortunately, there was a "possibly viable" backup six hours before the accident, and the data was restored successfully.
Regarding the accident, GitLab published their own remediation plan and a to-do list for future backups:
1. Assign different colors to server terminals to distinguish the production environment (red) from the test environment (yellow).
2. Check the database backup directory regularly, and check whether the data is backed up successfully.
3. Add backup reminders, and check the size of the S3 bucket regularly. If the backup is smaller than the database or older than the configured backup interval, a warning should be sent promptly.
4. Adjust the PostgreSQL configuration. Because there are too many simultaneous PostgreSQL connections in the production environment, backups may fail. Reduce max_connections from 8,000 to 2,000, or use a database connection pool.
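Items 2 and 3 of the list above boil down to an automated sanity check on each backup. A minimal sketch, with illustrative names and thresholds (the backup path, the 24-hour window, and the 50% size ratio are all assumptions, not GitLab's actual values):

```python
import os
import time

def check_backup(backup_path, db_size_bytes,
                 max_age_seconds=24 * 3600, min_size_ratio=0.5):
    """Return a list of warning strings; an empty list means the backup looks sane."""
    if not os.path.exists(backup_path):
        return ["backup file is missing: %s" % backup_path]
    warnings = []
    stat = os.stat(backup_path)
    # Warn if the newest backup is older than the configured interval.
    if time.time() - stat.st_mtime > max_age_seconds:
        warnings.append("backup is older than the allowed window")
    # Warn if the backup is implausibly small compared to the database.
    if stat.st_size < db_size_bytes * min_size_ratio:
        warnings.append("backup (%d bytes) is suspiciously small for a "
                        "%d-byte database" % (stat.st_size, db_size_bytes))
    return warnings
```

A cron job could run this hourly and page someone whenever the returned list is non-empty.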
What have you learned from the GitLab accident? How did you previously back up your server? How do you ensure data security?
[Walter edited the post at Apr 5, 2017 13:43]
1st Reply#Posted time:Mar 31, 2017 9:11 AM
GitLab was honest about it. Still, with careful operations, more than ninety percent of the data should have been recoverable. Xiachufang also once deleted their disk data, if any of you remember.
Build redundancy into data operations and maintenance, and never delete anything without careful consideration. When writing programs that operate on databases, mark deletions with a `deleted` flag (a soft delete) instead of physically removing rows, so data cannot be destroyed by mistake.
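The soft-delete idea above can be sketched in a few lines. Here SQLite stands in for the real database, and the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT, "
             "deleted INTEGER NOT NULL DEFAULT 0)")
conn.execute("INSERT INTO issues (title) VALUES ('bug report')")

# "Delete" only flips the flag; normal queries filter flagged rows out.
conn.execute("UPDATE issues SET deleted = 1 WHERE id = ?", (1,))
live = conn.execute("SELECT COUNT(*) FROM issues WHERE deleted = 0").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM issues").fetchone()[0]
# live is 0, but total is still 1: the row survives and can be undeleted.
```

The price is that every read query must remember the `WHERE deleted = 0` filter, which is usually hidden behind a view or the ORM layer.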
2nd Reply#Posted time:Apr 1, 2017 9:18 AM
I missed the live stream of the database recovery because I was on a train. From the news coverage, though, the recovery was clearly very difficult, and the most recent recoverable backup was six hours old, which shows how complex restoring the database was. Since I don't work in operations and maintenance, I don't have much experience with database backups; I can only offer my own opinions.
1. Multi-machine backup is necessary, and it is best to have a remote disaster recovery program. When one machine fails, another can readily be used in its place.
2. Operations on databases and other important systems should be restricted to a specified time window, which helps prevent fatigued or careless operations.
3. Check the status of the backup server regularly. When there are problems with the capacity and time of the backup data, they should be repaired promptly, and disaster drills should be performed if necessary.
4. Prohibit some dangerous commands on important servers.
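Point 4 can be enforced with a thin wrapper that checks commands against a denylist before they ever reach the shell. A minimal sketch; the denylist entries are a small illustrative sample, not a complete policy:

```python
import shlex

# Patterns are matched as token prefixes, case-insensitively.
DENYLIST = {
    ("rm", "-rf", "/"),
    ("drop", "database"),
}

def is_dangerous(command_line):
    """Return True if the command line starts with a denied pattern."""
    tokens = tuple(t.lower() for t in shlex.split(command_line))
    for pattern in DENYLIST:
        if tokens[:len(pattern)] == pattern:
            return True
    return False
```

In practice you would put such a check in a deployment tool or a restricted shell rather than trust operators to call it voluntarily.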
These are my personal opinions. If there is anything wrong, please let me know.
3rd Reply#Posted time:Apr 5, 2017 10:04 AM
This accident reminds us of the following warnings:
Do not work when you are fatigued, and do not perform operations when you are drunk, especially not on the database.
It is recommended to set an alias for the rm command, for example one that moves files to a designated directory instead of deleting them.
Backup and recovery verification must actually work. Recovery drills should be performed regularly to verify that the backup data is valid and that the recovery procedure works.
Do not blame anyone during data recovery, and especially not during accident analysis. Accident analysis should focus on finding the root cause and developing improvement measures.
When dealing with an accident, consider whether each remediation step could trigger cascading failures, and think twice before any important operation.
Responding to and repairing an accident takes a long time, spare hardware is often insufficient, and the resulting data loss cannot be tolerated by users, so an emergency plan should be prepared in advance.
Do not add an "online approval by leaders" step to the improvement measures; it would only hurt recovery efficiency.
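The "alias rm" suggestion above amounts to moving files into a trash directory instead of unlinking them. A minimal sketch; the trash location is an illustrative assumption:

```python
import os
import shutil
import time

def safe_remove(path, trash_dir=os.path.expanduser("~/.trash")):
    """Move `path` into the trash directory instead of deleting it.

    A timestamp is appended so repeated removals of the same name
    do not collide. Returns the file's new location.
    """
    os.makedirs(trash_dir, exist_ok=True)
    target = os.path.join(
        trash_dir, "%s.%d" % (os.path.basename(path), int(time.time())))
    shutil.move(path, target)
    return target
```

A periodic job can then purge trash entries older than, say, 30 days, so disk space is eventually reclaimed while recent mistakes stay recoverable.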
4Floor#Posted time:Apr 6, 2017 10:07 AM
Some suggestions for GitLab:
Adjust the PostgreSQL configuration. Because there are too many simultaneous PostgreSQL connections in the production environment, backups may fail. Reduce max_connections from 8,000 to 2,000, or use a database connection pool.
If PostgreSQL supports reserving connections for superusers, that may be a better fix than the above.
Setting a connection limit per role is also supported.
By modifying the PG kernel, you could even reserve connections for replication users.
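For reference, stock PostgreSQL already covers the first two suggestions without kernel changes. A configuration sketch (the role name `app_user` and the specific limits are illustrative; only the 8,000-to-2,000 figure comes from the thread):

```sql
-- postgresql.conf equivalents, set via ALTER SYSTEM (restart required
-- for max_connections to take effect).
ALTER SYSTEM SET max_connections = 2000;
-- Slots kept free so a superuser can always get in, even when the
-- application has exhausted the pool.
ALTER SYSTEM SET superuser_reserved_connections = 10;

-- Per-role connection cap for the application account.
ALTER ROLE app_user CONNECTION LIMIT 1500;
```

This is a config fragment rather than a runnable program; on a real system the numbers must be sized against available memory and the connection pooler in front of the database.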
5Floor#Posted time:Apr 7, 2017 10:09 AM
I'm glad that GitLab made the accident-handling process public, and I've picked up some details from the follow-up discussions.
First, the guy should not work when he is extremely fatigued. In any industry, the operator must work in good conditions, and I think that the accident will slightly change GitLab's working policy.
Second, as this accident showed, all five backup schemes were invalid, which means those schemes had serious problems. Experts have commented that GitLab should test and validate the backup schemes regularly to ensure recovery is actually feasible.
Third, Simon Riggs, CTO of 2ndQuadrant, posted an article on his blog, "Dataloss at GitLab", offering some good advice regarding this accident:
1. Regarding PostgreSQL 9.6's data synchronization hanging: there may be bugs involved, and they are being worked on.
2. It is normal for PostgreSQL to have a 4GB synchronization lag.
3. In the normal shutdown sequence, the slave node is stopped first, which makes the master automatically release its WALSender connections, so the master's max_wal_senders parameter should not need to be changed. However, when a slave node is stopped, its replication connections to the master are not released immediately, and newly started slave nodes consume additional connections. In his opinion, GitLab's setting of 32 was too high; two to four are usually enough.
4. In addition, GitLab's max_connections=8000 was too high; dropping it to 2,000 is reasonable for now.
5. pg_basebackup first creates a checkpoint on the master node and then starts synchronization; this takes about 4 minutes.
6. It is very dangerous to delete a database directory by hand; this should be done by a program. The latest release of repmgr is recommended.
7. Backup recovery is just as important, so it should also be done with proper tooling. Barman (which supports S3) is recommended.
8. Testing backup and recovery is a very important process.
It also appears that GitLab's staff were not very familiar with PostgreSQL, which is a major weakness.
Finally, I think the operation and maintenance department should try to use scripts for these processes rather than handle them manually.
6Floor#Posted time:Apr 10, 2017 9:39 AM
I watched the live broadcast till early morning. The GitLab engineer was very tired when he summed up his experience.
All five backups being invalid is really rare. Even though I am not working in the computer industry and have not dealt with such a large database, honestly, I really think data backup is very important.
phpMyAdmin can be used for basic database backups, and there are also command-line backups and scripted backups that run automatically. But no matter which method is used, data integrity matters most.
As I have limited technology experience and no practical experience with large databases, please let me know if I made any mistakes.
Data integrity verification should be carried out after every backup. If the backup size differs greatly from the actual size of the database, the backup is almost certainly wrong!
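The strongest form of verification goes one step beyond comparing sizes: actually restore the backup and check the data. A sketch of that idea, with SQLite standing in for the real DBMS:

```python
import sqlite3

def backup_and_verify(source_conn, backup_path, table):
    """Back up `source_conn` to `backup_path`, then test-restore it and
    compare row counts against the source. Returns True if they match."""
    dest = sqlite3.connect(backup_path)
    source_conn.backup(dest)          # write the backup
    dest.close()

    restored = sqlite3.connect(backup_path)   # the test-restore
    query = "SELECT COUNT(*) FROM %s" % table
    ok = (restored.execute(query).fetchone()
          == source_conn.execute(query).fetchone())
    restored.close()
    return ok
```

A backup that has never been restored is only a hope; a scheduled job running a check like this turns it into something you can rely on.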
When I make backups in practice, I keep multiple copies for small databases and multi-point backups when the amount of data is large, using automated scripts to copy the data to dedicated backup machines or to store multi-part backups in OSS, with scheduled tasks running the backups regularly.
I am learning how to make incremental database backups. Full backups of a large database cannot be taken every time, so incremental backups are important, and they make recovery easier.
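The core idea of an incremental backup is copying only what changed since the last run. A file-level sketch (directory layout is illustrative, and a real database needs WAL- or binlog-based increments rather than file copies of live data files):

```python
import os
import shutil

def incremental_backup(src_dir, dest_dir, last_backup_time):
    """Copy files under src_dir modified after last_backup_time (a Unix
    timestamp) into dest_dir. Returns the list of copied file names."""
    copied = []
    os.makedirs(dest_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        # Skip unchanged files: that is the "incremental" part.
        if os.path.isfile(src) and os.path.getmtime(src) > last_backup_time:
            shutil.copy2(src, os.path.join(dest_dir, name))
            copied.append(name)
    return copied
```

Recovery then means applying the last full backup plus every increment taken since, in order, which is why each run's timestamp must be recorded.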
If I encounter a large database, I will sum up more practical experience.
7Floor#Posted time:Apr 11, 2017 9:51 AM
Almost all businesses depend on data, so we should take it seriously. My idea is to use two separate databases, Database A and Database B, running on different hosts, fronted by an access interface I whose behavior meets the following requirements:
1. Any query that does not change data accesses Database A by default (provided A is healthy).
2. Any query that changes data accesses Database B.
3. Once one database fails, all queries to it are redirected to the other database, and a warning is raised so the failure can be repaired promptly.
4. If both databases fail, there is nothing we can do...
The two-database model may affect the efficiency of some operations, which can be optimized to reduce the impact.
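The four rules above can be sketched as a small routing layer. The `Database` class here is a stand-in for a real driver, and all names are illustrative:

```python
class Database:
    """Toy stand-in for a real database connection."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def execute(self, sql):
        if not self.healthy:
            raise ConnectionError(self.name + " is down")
        return self.name          # report which database served the query

class Interface:
    """The access interface I: routes reads to A, writes to B (rules 1-2),
    and falls back to the surviving database on failure (rule 3)."""
    def __init__(self, db_a, db_b):
        self.db_a, self.db_b = db_a, db_b

    def query(self, sql, is_write=False):
        primary = self.db_b if is_write else self.db_a
        fallback = self.db_a if is_write else self.db_b
        try:
            return primary.execute(sql)
        except ConnectionError:
            print("warning: %s failed, falling back" % primary.name)
            # Rule 4: if the fallback also raises, nothing more can be done.
            return fallback.execute(sql)
```

Note this sketch sidesteps the hard part, keeping A and B consistent with each other, which is exactly where the efficiency cost mentioned above comes from.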