This topic describes the best practices for blocking malicious crawlers by using WAF.

Background information

Malicious crawlers come in many varieties and constantly change their crawling methods to bypass the anti-crawler policies configured by website administrators. As a result, fixed rules alone cannot block all malicious crawlers. To address this, WAF provides the bot management feature. The configuration of this feature depends heavily on the characteristics of your business, so we recommend that you work with security experts to achieve optimal protection.

If you need stronger protection against malicious crawlers or assistance from security experts, we recommend that you use the bot management feature. The feature provides malicious crawler IP libraries and, based on the network-wide threat intelligence of Alibaba Cloud, dynamically updates the IP libraries of various public clouds and data centers in real time. This allows you to block malicious requests from the IP addresses in these libraries. For more information, see Configure the bot management whitelist.
Note: Bot management is a value-added service that must be enabled separately when you purchase or upgrade WAF.

In addition to the bot management feature, you can use the custom protection policy and IP blacklist features to configure crawler-blocking policies that target the characteristics of malicious crawlers described in the following section.

Risks and characteristics of malicious crawlers

Requests from legitimate crawlers typically contain a spider identifier, such as Baiduspider, in the User-Agent field and show the following characteristics: a low request rate, scattered URLs, and a wide time range. You can verify a legitimate crawler by running a reverse nslookup or tracert command on the source IP address of the request. For example, a reverse nslookup on the source IP address of a Baidu crawler resolves to a hostname in the baidu.com domain, which confirms that the request comes from Baidu.
Figure: View origin server information
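As a rough illustration of this check, the following Python sketch reverse-resolves a source IP address and then forward-resolves the returned hostname to confirm the mapping. The sample IP address and hostname suffixes are assumptions for illustration; confirm them against the crawler operator's published documentation.

```python
# Minimal sketch of reverse-plus-forward DNS verification for crawler source IPs.
# The sample IP and the allowed hostname suffixes are illustrative, not authoritative.
import socket

def is_legitimate_crawler(ip, allowed_suffixes=(".baidu.com", ".googlebot.com")):
    """Reverse-resolve the IP, then forward-resolve the hostname to confirm it maps back."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup (PTR record)
    except OSError:
        return False                                         # no PTR record: treat as unverified
    if not hostname.endswith(allowed_suffixes):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward-confirm to defeat a spoofed PTR
    except OSError:
        return False
    return ip in forward_ips

print(is_legitimate_crawler("180.76.15.5"))  # hypothetical Baidu spider address used as an example
```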

Malicious crawlers may send a large number of requests to a specific URL or port of a domain name within a short period of time. Examples include HTTP flood attacks disguised as crawler requests and third-party requests disguised as crawlers to scrape sensitive information. Large volumes of malicious requests can cause a sharp rise in CPU utilization, website access failures, and service interruptions.
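Before you configure blocking rules, it can help to confirm such spikes in your access logs. The following sketch counts requests per IP address and URL from a combined-format access log; the log path, format, and threshold are assumptions for illustration.

```python
# Rough sketch for spotting crawler-like request spikes in an NGINX/Apache
# combined-format access log. Adjust the path and threshold to your environment.
import re
from collections import Counter

LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

counts = Counter()
with open("/var/log/nginx/access.log") as log:   # assumed log location
    for line in log:
        m = LOG_PATTERN.match(line)
        if m:
            ip, path = m.groups()
            counts[(ip, path)] += 1

# IP/URL pairs with abnormally many hits are candidates for a blocking rule.
for (ip, path), n in counts.most_common(10):
    if n > 1000:
        print(f"suspicious: {ip} requested {path} {n} times")
```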

Create a custom protection policy

You can use the custom protection policy function to combine key fields such as User-Agent and URL to filter out malicious crawler requests. For more information, see Create a custom protection policy.

Sample configurations:
  • Log on to the WAF console. On the Custom Protection Policy page, configure the following ACL rule to allow only Baidu spiders.
    Figure: New rule to allow Baidu spiders
  • Log on to the WAF console. On the Custom Protection Policy page, configure the following ACL rule to prevent all crawlers from accessing the /userinfo directory.
    Figure: Block crawlers
Note: Restricting the User-Agent field is ineffective against specially crafted crawler attacks, as the sketch following this note illustrates. For example, an attacker can include the string baidu in the User-Agent field of a malicious crawler request to disguise it as a Baidu crawler, and the preceding ACL rule will not block the request. Similarly, an attacker can hide the crawler identity by removing the string spider from the User-Agent field, so that the ACL rule no longer matches the attack.
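The following minimal sketch illustrates, in Python rather than in WAF itself, the matching logic of the two sample ACL rules and why plain substring checks are easy to evade. The request dictionary and its field names are illustrative assumptions, not a WAF API.

```python
# Sketch of the matching logic behind the two sample ACL rules above.
# The request dict stands in for a real HTTP request; it is not a WAF interface.

def acl_decision(request: dict) -> str:
    ua = request.get("user_agent", "").lower()
    url = request.get("url", "")
    # Rule 2: block every crawler-like User-Agent that targets /userinfo.
    if url.startswith("/userinfo") and "spider" in ua:
        return "block"
    # Rule 1: of all requests that identify as spiders, allow only Baidu's.
    if "spider" in ua and "baidu" not in ua:
        return "block"
    return "allow"

print(acl_decision({"user_agent": "Mozilla/5.0 (compatible; Baiduspider/2.0)", "url": "/index.html"}))  # allow
print(acl_decision({"user_agent": "Mozilla/5.0 (compatible; Baiduspider/2.0)", "url": "/userinfo/1"}))  # block
```

As the note above explains, any request whose User-Agent contains baidu passes the first rule even if it does not come from Baidu's network, which is why reverse DNS verification or the bot management feature is still needed.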

For high-frequency malicious crawler requests, you can configure Rate Limiting on the Custom Protection Policy page to block IP addresses that send requests to a domain name at a rate that exceeds a specified threshold.

You can configure a rule as shown in the following figure: if an IP address sends more than 1,000 requests to any path under the domain name within 30 seconds, the IP address is blocked for 10 hours.
Figure: Rate limiting rule
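The following minimal in-memory sketch mimics the logic of this rule outside WAF: a sliding 30-second window per IP address, with a 10-hour block once the 1,000-request threshold is exceeded. The data structures are illustrative; WAF enforces the real rule on its own infrastructure.

```python
# Sliding-window rate limiter matching the sample rule:
# more than 1,000 requests from one IP within 30 seconds -> block for 10 hours.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 30
THRESHOLD = 1000
BLOCK_SECONDS = 10 * 3600  # 10 hours

request_times = defaultdict(deque)  # ip -> timestamps of requests in the current window
blocked_until = {}                  # ip -> time at which the block expires

def allow_request(ip):
    now = time.time()
    if blocked_until.get(ip, 0) > now:
        return False                          # still within the 10-hour block
    window = request_times[ip]
    window.append(now)
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()                      # discard hits older than 30 seconds
    if len(window) > THRESHOLD:
        blocked_until[ip] = now + BLOCK_SECONDS
        window.clear()
        return False                          # threshold exceeded: start the block
    return True
```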
If you have purchased a WAF Enterprise instance, you can use custom statistical objects in addition to IP addresses and sessions when you configure rate limiting. Blocking by IP address may affect legitimate users who share an IP address behind a NAT gateway. Instead, you can use cookies or custom parameters that identify individual users as statistical objects. In the following example, Cookie is selected as the Statistical Object and Captcha as the Action. Assume that the cookie is in the format uid=12345.
Figure: Cookie-based rate limiting
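The following sketch shows only the idea of keying statistics on a uid cookie instead of an IP address, assuming the uid=12345 cookie format mentioned above. In WAF itself, you select Cookie as the Statistical Object in the console, and the Captcha action is applied by WAF rather than by your code.

```python
# Sketch: derive a per-user statistics key from a uid cookie, falling back to
# the client IP when no uid is present. The uid=12345 format is an assumption.
from http.cookies import SimpleCookie

def statistical_object(cookie_header, client_ip):
    """Prefer the uid cookie so that users behind one NAT gateway count separately."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    if "uid" in cookie:
        return "uid:" + cookie["uid"].value
    return "ip:" + client_ip

print(statistical_object("uid=12345; theme=dark", "203.0.113.7"))  # uid:12345
print(statistical_object("theme=dark", "203.0.113.7"))             # ip:203.0.113.7
```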

Configure an IP address blacklist

If a large number of malicious crawler requests originate from a specific region and no normal requests come from that region, you can use the IP blacklist feature to block all access requests from that region. For more information, see Configure the IP blacklist.

Configuration example: Log on to the WAF console, go to the IP Blacklist page, and then configure the following rule to block access requests from IP addresses outside China.
Figure: Blocked regions
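Conceptually, region-based blocking tests each client IP address against a geolocation data set. The following sketch illustrates the idea with the Python ipaddress module; the two CIDR ranges are placeholders, and a real deployment relies on the complete, continuously updated geolocation data that WAF maintains.

```python
# Sketch of region-based blocking: deny any client IP that is not in a
# maintained list of China CIDR ranges. The two ranges below are placeholders.
import ipaddress

CHINA_CIDRS = [ipaddress.ip_network(c) for c in ("36.0.0.0/10", "180.76.0.0/16")]

def allow_by_region(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CHINA_CIDRS)

print(allow_by_region("180.76.15.5"))   # True: inside a listed range
print(allow_by_region("198.51.100.9"))  # False: treated as outside China and blocked
```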