The bot management module of Web Application Firewall (WAF) is upgraded to provide the scenario-specific configuration feature. This feature allows you to configure specific anti-crawler rules to protect your business from malicious crawlers. This topic describes how to configure anti-crawler rules for websites.

Prerequisites

If you use a subscription WAF instance that runs the Pro, Business, or Enterprise edition, the bot management module must be enabled for the instance.

Background information

The scenario-specific configuration feature allows you to configure anti-crawler rules based on your business requirements. This feature can be used in combination with the intelligent algorithm feature to precisely identify crawler traffic. In addition, this feature can automatically handle the crawler traffic that matches the configured anti-crawler rules. After you configure anti-crawler rules, you can verify the rules in a test environment. This prevents adverse effects on your websites or apps caused by inappropriate rule configurations or compatibility issues. The adverse effects include false positives and undesired protection results.

Configure an anti-crawler rule for a website

  1. Log on to the Web Application Firewall console.
  2. In the top navigation bar, select the resource group and region to which the WAF instance belongs. The region can be Mainland China or International.
  3. In the left-side navigation pane, choose Protection Settings > Website Protection.
  4. In the upper part of the Website Protection page, select the domain name for which you want to configure anti-crawler rules.
  5. Click the Bot Management tab. In the Scenario-specific Configuration section, click Start to create your first anti-crawler rule.
    If you have created an anti-crawler rule, you can skip this step and click Add in the upper-right corner to create a rule.
  6. In the Configure Scenarios step, configure basic information about the domain name that you want to protect and click Next.
    Parameter description:
    • Scenario: Specify the type of scenario in which you want to protect the domain name. Examples: logon, registration, and order placement.
    • Service Type: Select Websites. WAF then protects web pages, HTML5 pages, and HTML5 apps.

      If the domain name that you want to protect is accessed from a different domain name, you must select Use Intermediate Domain Name. Then, select the source domain name from the drop-down list.

    • Traffic Characteristics: Add conditions to identify traffic destined for the domain name that you want to protect. To add a condition, you must specify the matching field, logical operator, and matching item. The matching field is a header field of HTTP requests. For more information about the matching fields, see Fields in match conditions.
  7. In the Configure Protection Rules step, configure detailed settings for the anti-crawler rule and click Next.
    Parameter description:
    • Simple XSS Attack Blocking: If you enable this feature, WAF performs JavaScript validation on clients and blocks traffic from non-browser tools that cannot run JavaScript code. This way, simple script-based crawlers are blocked.
    • Intelligent Protection: If you enable this feature, the intelligent protection engine analyzes and automatically learns access traffic patterns. Then, a blacklist or protection rule is generated based on the analysis results and learned patterns. You can set the Protection Mode parameter to Monitor or Slider CAPTCHA. If you set the Protection Mode parameter to Monitor, the anti-crawler rule allows the traffic that matches the rule and records the traffic in security reports. If you set the Protection Mode parameter to Slider CAPTCHA, clients are required to pass slider CAPTCHA verification before they can access the protected domain name.
    • Bot Threat Intelligence Feed: The threat intelligence library of Alibaba Cloud is used to identify IP addresses that are frequently used to crawl content from Alibaba Cloud users. The clients that use these IP addresses are required to pass slider CAPTCHA verification before they can access the protected domain name.
    • Data Center Blacklist: If you enable this feature, you must select libraries from the drop-down list. This way, WAF blocks access requests from IP addresses in the libraries to the protected domain name. The libraries contain known malicious IP addresses for data centers of Alibaba Cloud and other mainstream cloud providers.
    • IP Address Throttling and Custom Session-based Throttling: If you enable these features, you can configure throttling conditions to filter out high-frequency crawler requests. This also helps mitigate HTTP flood attacks.
      • IP Address Throttling: You can configure a throttling condition for IP addresses. If the number of requests from the same IP address within the specified statistical period exceeds the threshold, WAF applies the specified action to subsequent requests. You can also configure the period during which the action is performed. The action can be Monitor, Block, or Captcha. You can configure a maximum of three conditions. For more information, see Create a custom protection policy.
      • Custom Session-based Throttling: You can configure a throttling condition for sessions. If the number of requests from the same session within the specified statistical period exceeds the threshold, WAF applies the specified action to subsequent requests. You can also configure the period during which the action is performed. The action can be Monitor, Block, or Captcha. For more information, see Create a custom protection policy.
  8. Optional: In the Verify Actions step, test the effectiveness of the anti-crawler rule.
    To skip this step, click Skip in the lower-left corner. If this is the first time that you configure an anti-crawler rule, we recommend that you complete this step before you publish the rule. This helps prevent false positives that are caused by inappropriate configurations or compatibility issues.
    Parameter description:
    1. Step 1: Enter a Public IP Address: Enter the public IP address of your test device, such as a computer or mobile phone. The test of the anti-crawler rule takes effect only for the public IP address. The test does not affect your business.
      Notice: Do not enter the IP address that you obtain by running the ipconfig command, because this command returns an internal IP address. You can click Alibaba Network Diagnose Tool to obtain the public IP address of your test device. You can also use your browser to query the IP address.
    2. Step 2: Select an Action: Test the effectiveness of the protection actions that you specify in the anti-crawler rule. WAF generates a test rule that takes effect only for the specified IP address. The actions are JavaScript Validation, Slider CAPTCHA Verification, and Block Verification.

      After you click Start Test for an action, WAF immediately delivers the test rule to the test device. In the dialog box that appears, WAF provides the test procedure, expected result, and demonstration. We recommend that you carefully read them.

      After the test is complete, you can click I Have Completed Test to go to the next step. If the test result shows exceptions, you can click Go Back to optimize the anti-crawler rule. Then, perform the test again.

      For more information about the exceptions that may occur during a test and the solutions to these exceptions, see FAQ.

  9. In the Preview and Publish Protection Rules step, confirm the content of the anti-crawler rule and click Publish.
    After the anti-crawler rule is published, the rule immediately takes effect.
    Note: If this is the first time that you create an anti-crawler rule, you cannot view the rule ID until the rule is published. The rule ID is displayed on the Bot Management tab of the Security Report page. You can use the ID of an anti-crawler rule to check for requests that match the rule in Log Service for WAF.
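
The IP Address Throttling condition described in step 7 (a threshold, a statistical period, an action, and a period during which the action is performed) can be illustrated with a small sketch. This is not how WAF implements throttling. It is a minimal sliding-window counter, and the class name, parameter names, and the lowercase action labels are hypothetical.

```python
from collections import defaultdict, deque


class IpThrottle:
    """Sliding-window request counter per client IP (illustrative only)."""

    def __init__(self, threshold, period_seconds, action, action_seconds):
        self.threshold = threshold            # requests allowed per statistical period
        self.period_seconds = period_seconds  # length of the statistical period
        self.action = action                  # "monitor", "block", or "captcha" (hypothetical labels)
        self.action_seconds = action_seconds  # how long the action stays in effect
        self._hits = defaultdict(deque)       # ip -> request timestamps in the current period
        self._action_until = {}               # ip -> time at which the action expires

    def check(self, ip, now):
        """Return the action applied to a request from `ip` that arrives at time `now`."""
        if now < self._action_until.get(ip, float("-inf")):
            return self.action
        hits = self._hits[ip]
        # Discard requests that fall outside the statistical period.
        while hits and now - hits[0] >= self.period_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.threshold:
            # Threshold exceeded: subsequent requests receive the action.
            self._action_until[ip] = now + self.action_seconds
            hits.clear()
        return "allow"
```

With a threshold of 3 requests in 10 seconds, the fourth request from one IP address within the period triggers the condition, and later requests from that address receive the action until the action period ends. Requests from other IP addresses are counted separately. The same structure applies to Custom Session-based Throttling if a session identifier is used as the key instead of the IP address.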

FAQ

Error: No valid test requests are detected. See WAF documentation or contact us to analyze the possible causes.
  • Cause: The test request fails to be sent or is not sent to WAF.
    Solution: Verify that the test request is sent to the IP address to which the CNAME provided by WAF resolves.
  • Cause: The header fields in the test request do not match the header fields that you configure for Traffic Characteristics in the anti-crawler rule.
    Solution: Modify the settings of Traffic Characteristics in the anti-crawler rule.
  • Cause: The source IP address of the test request is inconsistent with the public IP address that you enter in the anti-crawler rule.
    Solution: Use the correct public IP address. We recommend that you click Alibaba Network Diagnose Tool to obtain your public IP address.

Error: The test requests failed the verification. See WAF documentation or contact us to analyze the possible causes.
  • Cause: No real user access is simulated. For example, the debugging mode or automation tools are used.
    Solution: Simulate real user access during the test.
  • Cause: An incorrect service type is selected. For example, Websites is selected when you configure an anti-crawler rule for apps.
    Solution: Change the value of the Service Type parameter.
  • Cause: An intermediate domain name is used but is not correctly configured in the anti-crawler rule.
    Solution: Select Use Intermediate Domain Name. Then, select the intermediate domain name from the drop-down list.
  • Cause: Compatibility issues occur in the frontend.
    Solution: Contact customer service in the DingTalk group or submit a ticket.

Error: No verification is triggered. See WAF documentation or contact us to analyze the possible causes.
  • Cause: No test rule is generated.
    Solution: Perform the test several times until the test rule is generated.

Error: No valid test requests are detected or blocked. See WAF documentation or contact us to analyze the possible causes.
  • Cause: The test request fails to be sent or is not sent to WAF.
    Solution: Verify that the test request is sent to the IP address to which the CNAME provided by WAF resolves.
  • Cause: The header fields in the test request do not match the header fields that you configure for Traffic Characteristics in the anti-crawler rule.
    Solution: Modify the settings of Traffic Characteristics in the anti-crawler rule.
  • Cause: The source IP address of the test request is inconsistent with the public IP address that you enter in the anti-crawler rule.
    Solution: Use the correct public IP address. We recommend that you click Alibaba Network Diagnose Tool to obtain your public IP address.
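
Two of the causes listed above can be checked from the test device before you run the test: whether the IP address that you entered is an internal address, and whether the protected domain name resolves to the same address as the CNAME provided by WAF. The following sketch uses only the Python standard library. The function names are hypothetical and not part of any WAF tool, and the DNS comparison is a heuristic because a CNAME can resolve to different addresses over time.

```python
import ipaddress
import socket
from typing import Callable, Set


def is_public_ip(address: str) -> bool:
    """Return True if the address is globally routable.

    Internal addresses, such as those that ipconfig reports for a
    device behind a router, return False.
    """
    return ipaddress.ip_address(address).is_global


def resolve_ipv4(hostname: str) -> Set[str]:
    """Resolve a hostname to its set of IPv4 addresses by using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return {info[4][0] for info in infos}


def resolves_to_waf(
    domain: str,
    waf_cname: str,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> bool:
    """Return True if the domain and the WAF CNAME share at least one resolved address."""
    return bool(resolver(domain) & resolver(waf_cname))
```

For example, `is_public_ip("192.168.1.10")` returns `False`, which indicates that the address must not be entered in Step 1 of the test. To use `resolves_to_waf`, pass the protected domain name and the CNAME that is displayed for the domain name in the WAF console. The `resolver` parameter exists only so that the comparison can be exercised without live DNS queries.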

What to do next

Go to the Bot Management tab of the Security Report page and view the protection results and the details of the requests that match the anti-crawler rule. Then, optimize the anti-crawler rule based on the protection results.