Understanding and protecting against Hadoop hacking for ransom
Posted: Mar 20, 2017, 13:06
The memory of the MongoDB data loss incident, in which thousands of MongoDB databases were deleted or held for ransom, is still fresh. Recently hackers have been attacking Hadoop as well, and many Hadoop clusters have suffered total data loss. The lost data can amount to terabytes, a huge loss for the victim enterprise. This article describes the problem and the preventive measures that follow from it.
I hope Hadoop administrators who read this article will fix the issue promptly to avoid data loss. You are also welcome to repost this article so that more developers see it.
For convenience, or out of negligence, users often leave HDFS's web port 50070 open to the Internet in their data centers or on the cloud. In that case, an attacker needs only a few simple commands, such as:
curl -i -X DELETE "http://ip:50070/webhdfs/v1/tmp?op=DELETE&recursive=true"
curl -i -X PUT "http://ip:50070/webhdfs/v1/NODATA4U_SECUREYOURSHIT?op=MKDIRS"
With these, hackers can delete the data under the /tmp directory (or any other directory), read extensive log information to obtain sensitive data, or keep writing data until HDFS is full. Some hackers copy the data elsewhere first, delete it from HDFS, and then e-mail the victim directly to demand a ransom.
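Before locking anything down, you can check whether your own cluster answers unauthenticated WebHDFS calls. A minimal sketch: the `webhdfs_url` helper and the IP `203.0.113.5` are placeholders of my own, and LISTSTATUS is used because it is a harmless read-only operation.

```shell
#!/bin/sh
# Build a WebHDFS URL from a host, an HDFS path, and an operation name.
webhdfs_url() {
  host="$1"; path="$2"; op="$3"
  echo "http://${host}:50070/webhdfs/v1${path}?op=${op}"
}

# Read-only probe: if this returns a JSON listing instead of an auth error,
# the cluster is open to anyone who can reach port 50070.
URL="$(webhdfs_url 203.0.113.5 /tmp LISTSTATUS)"
echo "$URL"
# curl -i "$URL"   # uncomment to actually probe (203.0.113.5 is a placeholder IP)
```

If the probe succeeds without credentials, the destructive DELETE and MKDIRS calls shown above will succeed too.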
The resulting interface is shown in the figure. The hacker generally leaves a prompt as follows:
In fact, this is not an HDFS vulnerability but the WebHDFS feature that HDFS provides. Many developers simply fail to realize that data can be deleted this way.
Currently, if you search for port 50070 on https://www.shodan.io, the results look as follows:
In China alone there are more than 8,500 Hadoop clusters with port 50070 open to the public network, all of them at risk in theory. And this is only the figure published by that website; the real number is likely far greater. The security situation is not optimistic.
The best solution is to block Internet access to all of these open ports, port 50070 included. Hadoop opens many ports by default to provide its web UIs; the relevant ones are listed below:
• HDFS
o NameNode default port 50070
o SecondaryNameNode default port 50090
o DataNode default port 50075
o Backup/Checkpoint node default port 50105
• YARN (and the legacy JobTracker)
o ResourceManager default port 8088
o JobTracker default port 50030
o TaskTracker default port 50060
• Hue default port 8080
and so on.
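To see which of the ports above are actually reachable, you can probe them from a machine outside the cluster. A sketch using bash's built-in /dev/tcp pseudo-device; the host argument is a placeholder and defaults to a harmless local check.

```shell
#!/bin/bash
# Hadoop web UI ports from the list above.
PORTS="50070 50090 50075 50105 8088 50030 50060 8080"
HOST="${1:-127.0.0.1}"   # pass your cluster's public address; default is a safe local check

for p in $PORTS; do
  # /dev/tcp/<host>/<port> is a bash redirection target; timeout keeps
  # unreachable hosts from hanging the loop.
  if timeout 2 bash -c "echo > /dev/tcp/$HOST/$p" 2>/dev/null; then
    echo "OPEN   $p"
  else
    echo "closed $p"
  fi
done
```

Any port reported OPEN from outside your network deserves immediate attention.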
If you are unsure which ports you need, follow the principle of least exposure and open as few as possible. If you really must access these ports:
1. In a cloud environment, buy a machine with an interactive environment (Windows, Linux, and so on) to serve as a jump host, and connect it to the Hadoop environment through a minimal, tightly controlled channel.
2. Even if you must expose a port to the public network, use the ECS security group to open it only to your own public IP address. Never open ports to the entire network.
P.S.: Some customers argue that a test environment, unlike production, does not matter if its data is lost. But be aware that if a hacker compromises this machine, it becomes the hacker's bot, and if other machines share its security group, the bot is likely to attack them too.
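On a self-managed gateway where no cloud security group is available, the same minimization can be expressed as firewall rules. A sketch that only prints the iptables rules it would apply, so it can be reviewed before running as root; ADMIN_IP and the port list are placeholders of my own.

```shell
#!/bin/sh
ADMIN_IP="198.51.100.7"   # placeholder: replace with your own public IP
UI_PORTS="50070 8088"     # placeholder: the UI ports you must keep reachable

# Print the rules instead of applying them; review, then run each line as root.
RULES=$(for p in $UI_PORTS; do
  echo "iptables -A INPUT -p tcp --dport $p -s $ADMIN_IP -j ACCEPT"
  echo "iptables -A INPUT -p tcp --dport $p -j DROP"
done)
echo "$RULES"
```

The ACCEPT-then-DROP pairing means only the admin IP reaches each UI port; everyone else is silently dropped.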
• Disable the WebHDFS feature by setting dfs.webhdfs.enabled to false.
• Enable Kerberos or a similar authentication feature (this is complicated and not detailed here).
• Back up important data regularly, for example by backing up on-cloud data to OSS.
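Disabling WebHDFS, as the first bullet suggests, is a one-property change in hdfs-site.xml (restart the NameNode and DataNodes afterwards for it to take effect):

```xml
<!-- hdfs-site.xml: turn off the WebHDFS REST interface -->
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>false</value>
</property>
```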
Currently, after you purchase an E-MapReduce cluster:
• All ports of all big data components are disabled by default.
• The YARN UI, Ganglia, and others are accessed through port 80, protected by an account and password mechanism; there is no need to open port 50070 directly.
• You can set up periodic tasks to back up important data to OSS (running DistCp through the E-MapReduce workflow). Backing up an entire cluster is usually unrealistic, because Hadoop data volumes are huge; define your own backup policy covering core and confidential data.
• The kernel team tracks Hadoop-related vulnerabilities and fixes them promptly at the product and kernel level.
• E-MapReduce scans certain ports periodically to catch customers who have enabled them by mistake.
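The DistCp-to-OSS backup mentioned above can be scripted and scheduled. A sketch under my own assumptions: the bucket name and HDFS path are placeholders, and the actual copy line is commented out so it only runs on a cluster with OSS credentials configured.

```shell
#!/bin/sh
# Nightly backup of one critical HDFS directory to OSS via DistCp (sketch).
SRC="hdfs:///user/warehouse/critical"                      # placeholder HDFS path
DST="oss://my-backup-bucket/hdfs-backup/$(date +%Y%m%d)"   # placeholder bucket

echo "backing up $SRC -> $DST"
# hadoop distcp "$SRC" "$DST"   # uncomment on a cluster with OSS access configured
```

Dating the destination path keeps each run in its own directory, so a ransom-style deletion of HDFS does not touch earlier backups.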
E-MapReduce will continue to explore stronger optional security mechanisms for its customers, such as an E-MapReduce-level account and password mechanism. The E-MapReduce team will always treat data security as a priority.
It deserves particular attention that although E-MapReduce implements a solid security protection mechanism, some customers have opened ports to the public network themselves and suffered data loss as a result. This is very unfortunate.
Once again: I hope Hadoop administrators who read this article will fix the issue promptly to avoid data loss, and you are welcome to repost it so that more developers see it.