This topic describes the best practices for blocking malicious crawlers by using WAF.
Background information
Malicious crawlers come in various types. They constantly change their crawling methods to bypass the anti-crawler policies that website administrators configure. As a result, fixed rules cannot block all malicious crawlers. To address this issue, WAF provides the bot management feature. Because the effectiveness of this feature depends closely on the characteristics of your business, it delivers optimal protection only when it is tuned with the help of security experts.
In addition to the bot management feature, you can use the custom protection policy and IP blacklist features to configure crawler-blocking policies based on the following characteristics of malicious crawlers.
Risks and characteristics of malicious crawlers
Legitimate crawlers, such as search engine spiders, typically carry an xxspider keyword in the User-Agent field and share the following characteristics: a low request rate, scattered destination URLs, and a wide time range. You can run a reverse nslookup or a tracert command on the source IP address of such a request to verify where the request comes from. For example, a reverse nslookup on an IP address used by the Baidu spider resolves to a hostname under the baidu.com domain, which confirms that the request originates from Baidu.
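As a minimal illustration of this verification step, the following Python sketch reverse-resolves an IP address and checks the resulting hostname against an expected domain. The IP address in the example is a documentation placeholder, not a real Baidu spider address.

    import socket

    def verify_crawler(ip: str, expected_domain: str) -> bool:
        """Reverse-resolve an IP address and check whether the
        resulting hostname belongs to the expected domain."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except OSError:
            # No reverse DNS record: treat the crawler as unverified.
            return False
        return hostname == expected_domain or hostname.endswith("." + expected_domain)

    # Placeholder IP; a genuine Baidu spider IP would resolve to a *.baidu.com hostname.
    print(verify_crawler("203.0.113.10", "baidu.com"))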

Malicious crawlers may send a large number of requests to a specific URL or port of a domain name during a certain period of time, for example, HTTP flood attacks disguised as crawler traffic, or requests from third parties that masquerade as crawlers to scrape sensitive information. A large number of malicious requests can cause a sharp rise in CPU utilization, website access failures, and service interruptions.
Create a custom protection policy
You can use the custom protection policy feature to combine key fields, such as User-Agent and URL, to filter out malicious crawler requests, as illustrated in the sketch after the following examples. For more information, see Create a custom protection policy.
- Log on to the WAF console. On the Custom Protection Policy page, configure an ACL rule that allows crawler access only from Baidu spiders.
- Log on to the WAF console. On the Custom Protection Policy page, configure an ACL rule that blocks all crawlers from accessing the /userinfo directory.
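The following Python sketch expresses the matching logic of these two rules under stated assumptions: it is not the WAF rule syntax, and detecting crawlers by the spider and bot keywords in the User-Agent field is a simplification for the example.

    # Illustrative matching logic for the two ACL rules above.
    # This is not the WAF console's rule syntax, only an equivalent sketch.

    def is_crawler(user_agent: str) -> bool:
        # Simplified crawler detection based on common User-Agent keywords.
        ua = user_agent.lower()
        return "spider" in ua or "bot" in ua

    def allow_only_baidu(user_agent: str) -> bool:
        # Rule 1: among crawler requests, allow only Baidu spiders.
        if is_crawler(user_agent):
            return "baiduspider" in user_agent.lower()
        return True  # non-crawler traffic passes through

    def allow_userinfo_access(user_agent: str, url: str) -> bool:
        # Rule 2: block every crawler that targets the /userinfo directory.
        return not (is_crawler(user_agent) and url.startswith("/userinfo"))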
For high-frequency malicious crawler requests, you can configure Rate Limiting on the Custom Protection Policy page to block the IP addresses that send requests to a domain name at a rate exceeding the specified threshold. For example, you can block source IP addresses that repeatedly request URLs that contain a specific parameter, such as uid=12345.
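As a rough illustration of this rate-limiting approach, the sketch below counts requests per source IP within a fixed time window. The 60-second window and 100-request threshold are arbitrary assumptions for the example, not WAF defaults.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60   # assumed observation window, not a WAF default
    MAX_REQUESTS = 100    # assumed per-IP threshold, not a WAF default

    _recent = defaultdict(deque)  # source IP -> timestamps of recent requests

    def exceeds_rate_limit(source_ip: str) -> bool:
        """Record one request and report whether the IP crossed the threshold."""
        now = time.monotonic()
        log = _recent[source_ip]
        log.append(now)
        # Discard timestamps that fell out of the observation window.
        while log and now - log[0] > WINDOW_SECONDS:
            log.popleft()
        return len(log) > MAX_REQUESTS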

Configure an IP address blacklist
If a large number of malicious crawler requests originate from a specific region and no legitimate requests come from that region, you can enable IP Blacklist to block all requests from that region. For more information, see Configure the IP blacklist.
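Conceptually, region-based blocking tests each source IP against the address ranges assigned to that region. The Python sketch below uses placeholder CIDR blocks; the real region-to-IP mapping is maintained by WAF, not supplied by you.

    import ipaddress

    # Placeholder CIDR blocks standing in for a region's address ranges;
    # WAF maintains the actual region-to-IP mapping internally.
    BLOCKED_REGION_CIDRS = [
        ipaddress.ip_network("198.51.100.0/24"),
        ipaddress.ip_network("203.0.113.0/24"),
    ]

    def is_blocked_region(source_ip: str) -> bool:
        """Return True if the source IP falls inside a blacklisted range."""
        ip = ipaddress.ip_address(source_ip)
        return any(ip in network for network in BLOCKED_REGION_CIDRS)

    print(is_blocked_region("203.0.113.55"))  # True: inside a blocked range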
