
Intercept malicious crawlers

Last Updated: Jan 30, 2018

This document describes the characteristics of malicious crawlers and explains how to use WAF to block them.

Note that professional crawlers constantly change their crawling methods to bypass the anti-crawling policies set by website administrators, so fixed rules cannot provide perfect protection. In addition, effective anti-crawling depends heavily on the characteristics of your own business. Therefore, you must regularly review and update your protection policies to achieve satisfactory results.

Distinguish malicious crawlers

Legitimate crawlers usually identify themselves with a User-Agent such as xxspider. Their requests follow a regular pattern, and the requested URLs and request times are relatively scattered. If you perform a reverse nslookup or tracert on a legitimate crawler's IP address, you can trace it back to a legitimate source address. For example, a Baidu crawler record is shown in the following figure.

[Figure: access log records of the Baiduspider crawler]
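For reference, the reverse-lookup check described above can be sketched in a few lines of Python. This is only an illustration and not part of WAF; the sample IP address is a documentation placeholder, and the baidu.com/baidu.jp host-name suffixes are assumptions that you should verify against Baidu's own guidance.

    import socket

    def is_legitimate_baidu_spider(client_ip: str) -> bool:
        try:
            # Reverse lookup: IP address -> host name (the "reverse nslookup" step).
            hostname, _, _ = socket.gethostbyaddr(client_ip)
        except OSError:
            return False  # No PTR record: the claimed spider identity is suspect.
        if not hostname.endswith((".baidu.com", ".baidu.jp")):
            return False
        try:
            # Forward-confirm: the host name must resolve back to the same IP address.
            return client_ip in socket.gethostbyname_ex(hostname)[2]
        except OSError:
            return False

    # 203.0.113.10 is a reserved documentation address, so this prints False.
    print(is_legitimate_baidu_spider("203.0.113.10"))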

Malicious crawlers, however, may send a large number of requests to a specific URL or API of a domain name within a specific period of time. Such traffic may be an HTTP flood attack disguised as crawler traffic, or a third-party crawler scraping targeted sensitive information. When the request volume is large enough, it typically causes a sharp rise in CPU usage, makes the website unreachable, and interrupts services.

WAF provides risk warnings for malicious crawlers and alerts you to the previous day's crawler requests. Based on your actual business situation, you can configure one or more of the following rules to block the corresponding crawler requests.

Block malicious crawlers

Configure HTTP ACL policy to block specific crawlers

You can configure the HTTP ACL policy to filter out malicious crawler requests based on the User-Agent field, the URL, and other keywords. For example, the following configuration allows only the Baidu crawler and filters out all other crawlers (keyword matching is not case-sensitive).

[Figure: HTTP ACL rule that allows only the Baidu crawler]

Note: Multiple conditions in a rule are combined with logical AND; that is, a request must satisfy all conditions of a rule for the rule to take effect.
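The matching logic of such a rule can be pictured with the short Python sketch below. It only illustrates how the AND-combined, case-insensitive conditions behave; the actual rule is configured in the WAF console, and the keywords are taken from the example above.

    def should_block(user_agent: str) -> bool:
        # Block requests whose User-Agent contains "spider" but not "baidu".
        # Both conditions must hold for the rule to match (logical AND),
        # and matching is case-insensitive, as in the WAF rule.
        ua = user_agent.lower()
        return "spider" in ua and "baidu" not in ua

    print(should_block("Mozilla/5.0 (compatible; Baiduspider/2.0)"))  # False: allowed
    print(should_block("Mozilla/5.0 (compatible; SomeSpider/1.0)"))   # True: blocked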

You can use the following configuration to prevent all crawlers from accessing content under the /userinfo directory.

[Figure: HTTP ACL rule that blocks crawler access to the /userinfo directory]
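Assuming the rule combines a URL condition with a User-Agent condition, its logic can be sketched in the same way; again, this only illustrates the matching behavior, not the console configuration.

    def should_block_userinfo(url_path: str, user_agent: str) -> bool:
        # Block any crawler (User-Agent contains "spider") that requests
        # content under the /userinfo directory.
        return url_path.lower().startswith("/userinfo") and "spider" in user_agent.lower()

    print(should_block_userinfo("/userinfo/profile", "Baiduspider/2.0"))  # True: blocked
    print(should_block_userinfo("/index.html", "Baiduspider/2.0"))        # False: allowed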

Configure custom HTTP flood protection rules to block malicious requests

Note: The WAF Business and Enterprise editions support customizing HTTP flood protection rules.

Custom HTTP flood protection rules allow you to block requests to specific URLs when the access frequency exceeds a threshold that you define.
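Conceptually, such a rule counts the requests that a single source sends to a protected URL within a time window and blocks the source once a threshold is exceeded. The following Python sketch illustrates only that idea; the window length, threshold, and URL are assumptions for this example, and the real rule is configured in the WAF console.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10       # Illustrative values; the real thresholds are
    MAX_REQUESTS = 20         # set in the WAF console rule.
    PROTECTED_URL = "/login"  # Hypothetical URL used for this example.

    _recent = defaultdict(deque)  # source IP -> timestamps of recent requests

    def should_block(source_ip: str, url_path: str) -> bool:
        # Sliding-window count: block the source once it exceeds MAX_REQUESTS
        # requests to PROTECTED_URL within WINDOW_SECONDS.
        if url_path != PROTECTED_URL:
            return False
        now = time.monotonic()
        window = _recent[source_ip]
        window.append(now)
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        return len(window) > MAX_REQUESTS

A frequency check of this kind is one common way to express "access frequency"; choose a window and threshold that match your normal user behavior so that legitimate visitors are not blocked.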
