The bot management module of Web Application Firewall (WAF) allows you to configure anti-crawler rules for websites and apps. If your web pages, HTML5 pages, or HTML5 apps are accessible from browsers, you can configure anti-crawler rules for the websites to protect your services from malicious crawlers. This topic describes how to configure anti-crawler rules for websites.

Prerequisites

Create an anti-crawler rule template for websites

  1. Log on to the WAF 3.0 console. In the top navigation bar, select the resource group and the region to which the WAF instance belongs. The region can be Chinese Mainland or Outside Chinese Mainland.
  2. In the left-side navigation pane, choose Protection Configuration > Protection Rules.
  3. Create a template.
    • If no bot management rule template exists, you can click Configure Now in the Bot Management card in the upper part of the Protection Rules page. You can also click Create Template in the Bot Management section in the lower part of the Protection Rules page.
    • If a bot management rule template exists, you can only click Create Template in the Bot Management section in the lower part of the Protection Rules page.
  4. In the Configure Scenarios step, configure the basic information about the website that you want to protect and click Next.
    Parameter Description
    Template Name
    Enter a name for the template. The name can contain letters, digits, and underscores (_).

    Template Description
    Enter a description for the template.

    Service Type
    Select Websites. This way, WAF protects web pages, HTML5 pages, and HTML5 apps.

    Web SDK Integration
    • Automatic Integration (Recommended):

      WAF provides Web SDK for JavaScript to improve the protection effect for websites and prevent incompatibility issues.

      If you enable automatic integration, WAF automatically references the SDK in the HTML pages of the website that you want to protect. Then, the SDK collects information such as browser information, probe signatures, and malicious behaviors. Sensitive information is not collected. WAF detects and blocks malicious crawlers based on the collected information.

      If the domain name of the website that you want to protect is accessed from a different domain name, you must select Use Intermediate Domain Name. Then, select the intermediate domain name from the drop-down list. For example, if you access Domain Name A from Intermediate Domain Name B, you must select Use Intermediate Domain Name and select Intermediate Domain Name B from the drop-down list.

      Automatic integration is supported only for protected objects that are added to WAF in CNAME record mode. Automatic integration is not supported for protected objects that are added to WAF in cloud native mode, including Application Load Balancer (ALB) instances and Microservices Engine (MSE) instances.

    • Manual Integration

      If automatic integration is not supported, you can use manual integration.

    For more information, see Integrate the Web SDK into web applications.

    Traffic Characteristics
    Add match conditions to identify traffic destined for the domain name of the website that you want to protect. To add a condition, you must specify the match field, logical operator, and match content. The match field is a header field of HTTP requests. For more information about the match fields, see Fields in match conditions. You can add up to five match conditions.
    Notice: After you enter an IP address, you must press the Enter key.
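
The way match conditions select traffic can be pictured with a short Python sketch. It is illustrative only: the operator names (`equals`, `contains`, `prefix`), the `MatchCondition` class, and the assumption that all conditions must hold at the same time are not taken from WAF. The fields and operators that the console actually supports are listed in Fields in match conditions.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_CONDITIONS = 5  # the console allows up to five match conditions

# Hypothetical operator names; the real list is in "Fields in match conditions".
OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "contains": lambda actual, expected: expected in actual,
    "prefix": lambda actual, expected: actual.startswith(expected),
}


@dataclass(frozen=True)
class MatchCondition:
    field: str     # HTTP request header name, for example "Host"
    operator: str  # key in OPERATORS
    content: str   # value to compare the header against


def matches(headers: Dict[str, str], conditions: List[MatchCondition]) -> bool:
    """Return True if the headers satisfy every condition (AND is an assumption)."""
    if not 1 <= len(conditions) <= MAX_CONDITIONS:
        raise ValueError("expected between 1 and 5 match conditions")
    # HTTP header names are case-insensitive, so normalize them before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    for cond in conditions:
        actual = normalized.get(cond.field.lower())
        if actual is None or not OPERATORS[cond.operator](actual, cond.content):
            return False
    return True


conditions = [
    MatchCondition("Host", "equals", "www.example.com"),
    MatchCondition("User-Agent", "contains", "Mozilla"),
]
request_headers = {"host": "www.example.com", "user-agent": "Mozilla/5.0"}
print(matches(request_headers, conditions))
```

A request that lacks one of the referenced headers does not match in this sketch, so the template would not apply to it.
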
  5. In the Configure Protection Rules step, configure anti-crawler rules and click Next.
    Parameter Description
    Legitimate Bot Management
    Select Spider Whitelist and select search engines from the drop-down list. The crawler library is dynamically updated and contains the crawler IP addresses of mainstream search engines, including Google, Baidu, Sogou, 360, Bing, and Yandex.

    After you select search engines, requests from the crawler IP addresses of these search engines are forwarded to the origin server, and the bot management module no longer checks these requests.

    Bot Characteristic Detection
    • Script-based Bot Block (JavaScript Validation):

      If you select Script-based Bot Block (JavaScript Validation), WAF performs JavaScript validation on clients. To prevent simple script-based attacks, traffic from non-browser tools that cannot run JavaScript is blocked.

    • Advanced Bot Protection (Dynamic Token-based Authentication):

      If you select Advanced Bot Protection (Dynamic Token-based Authentication), WAF verifies the signature of each request. Requests that fail signature verification are blocked. Signature Verification Exception is selected by default and cannot be cleared. Requests that do not contain signatures or requests that contain invalid signatures are detected. You can also select Signature Timestamp Exception and WebDriver Attack.

    Bot Behavior Detection
    If you select Intelligent Protection, the intelligent protection engine analyzes access traffic and performs machine learning. Then, a blacklist or a protection rule is generated based on the analysis results and learned patterns. You can set the protection mode to Monitor or Slider CAPTCHA.
    • If you set the protection mode to Monitor, the anti-crawler rule allows traffic that matches the rule and records the traffic in security reports.
    • If you set the protection mode to Slider CAPTCHA, clients must pass slider CAPTCHA verification before the clients can access the website that is protected by WAF.
    Custom Throttling
    Configure custom throttling conditions to filter out crawl requests that are frequently initiated. This helps prevent HTTP flood attacks.
    • IP Address Throttling (Default):

      You can configure throttling conditions for IP addresses. If the number of requests from the same IP address within the period specified by Statistical Interval (Seconds) exceeds the value of Threshold (Times), WAF performs the specified action on subsequent requests. To specify the action, select Slider CAPTCHA, Block, or Monitor from the Action drop-down list. You can also set Throttling Interval (Seconds), which specifies the period during which the action is performed. You can configure up to three throttling conditions. For more information, see Configure the custom rule module.

    • Custom Session Throttling

      You can configure throttling conditions for sessions. Set Session Type to specify how a session is identified. If the number of requests from the same session within the period specified by Statistical Interval (Seconds) exceeds the value of Threshold (Times), WAF performs the specified action on subsequent requests. To specify the action, select Slider CAPTCHA, Block, or Monitor from the Action drop-down list. You can also set Throttling Interval (Seconds), which specifies the period during which the action is performed. For more information, see Configure the custom rule module.
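
The relationship between Statistical Interval (Seconds), Threshold (Times), Throttling Interval (Seconds), and Action can be sketched as a sliding-window counter. This is a minimal Python illustration, not WAF's implementation: the `Throttle` class, its method names, and the `"Allow"` return value are hypothetical, and the key can be an IP address or a session identifier.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class Throttle:
    """Sliding-window counter modeled on the Custom Throttling parameters."""

    def __init__(self, statistical_interval: float, threshold: int,
                 throttling_interval: float, action: str = "Block"):
        self.statistical_interval = statistical_interval  # Statistical Interval (Seconds)
        self.threshold = threshold                        # Threshold (Times)
        self.throttling_interval = throttling_interval    # Throttling Interval (Seconds)
        self.action = action                              # Slider CAPTCHA, Block, or Monitor
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests
        self._penalty_until = {}         # key -> time at which the action stops applying

    def check(self, key: str, now: Optional[float] = None) -> str:
        """Return "Allow" or the configured action for one request from `key`."""
        now = time.monotonic() if now is None else now
        # While the throttling interval is running, the action applies to every request.
        if self._penalty_until.get(key, 0.0) > now:
            return self.action
        hits = self._hits[key]
        hits.append(now)
        # Drop timestamps that have fallen out of the statistical interval.
        while hits and hits[0] <= now - self.statistical_interval:
            hits.popleft()
        if len(hits) > self.threshold:
            # Threshold exceeded: the action applies to subsequent requests.
            self._penalty_until[key] = now + self.throttling_interval
            hits.clear()
        return "Allow"


throttle = Throttle(statistical_interval=10, threshold=3,
                    throttling_interval=60, action="Block")
# Five requests from one address, one second apart (timestamps injected for clarity).
for second in range(5):
    print(second, throttle.check("203.0.113.7", now=float(second)))
```

In this run the fourth request exceeds the threshold of three, so the fifth request is the first one that receives the Block action, and the action keeps applying until the 60-second throttling interval ends.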

    Bot Threat Intelligence
    • Bot Threat Intelligence Library:

      The library contains the IP addresses of attackers that have sent multiple requests to crawl content from Alibaba Cloud users over a period of time.

      You can set the protection mode to Monitor or Slider CAPTCHA.

    • Data Center Blacklist

      If you select Data Center Blacklist, the IP addresses in the selected IP address libraries of data centers are blocked. If you use the source IP addresses of public clouds or data centers to access the website that you want to protect, you must add the IP addresses to the whitelist. For example, you must add the callback IP addresses of Alipay or WeChat and the IP addresses of monitoring applications to the whitelist. The data center blacklist supports the following IP address libraries: IP Address Library of Data Center-Alibaba Cloud, IP Address Library of Data Center-21Vianet, IP Address Library of Data Center-Meituan Open Services, IP Address Library of Data Center-Tencent Cloud, and IP Address Library of Data Center-Other.

      You can set the Actions parameter to Monitor, Slider CAPTCHA, or Block.

    • Fake Spider Blocking:

      If you select Fake Spider Blocking, WAF blocks requests whose User-Agent header claims to belong to one of the search engines specified in the Legitimate Bot Management section, unless the client IP address is verified as a valid crawler IP address of that search engine. Requests from verified crawler IP addresses are allowed.
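
The idea behind Fake Spider Blocking, comparing what the User-Agent header claims with where the request comes from, can be sketched in Python. Everything specific here is an assumption made for illustration: the `CRAWLER_NETWORKS` table, the User-Agent tokens, and the `classify` function are hypothetical, and the IP ranges are RFC 5737 documentation ranges rather than real crawler addresses. This topic does not describe how WAF's dynamically updated crawler library works internally.

```python
import ipaddress

# Placeholder ranges from the RFC 5737 documentation blocks, NOT real crawler IPs.
CRAWLER_NETWORKS = {
    "googlebot": [ipaddress.ip_network("192.0.2.0/24")],
    "bingbot": [ipaddress.ip_network("198.51.100.0/24")],
}


def classify(user_agent: str, client_ip: str) -> str:
    """Return "legitimate_spider", "fake_spider", or "not_a_spider"."""
    ua = user_agent.lower()
    address = ipaddress.ip_address(client_ip)
    for token, networks in CRAWLER_NETWORKS.items():
        if token in ua:
            # The User-Agent claims to be a search engine: accept the claim only
            # if the source IP belongs to that engine's known ranges.
            if any(address in network for network in networks):
                return "legitimate_spider"
            return "fake_spider"
    return "not_a_spider"


print(classify("Mozilla/5.0 (compatible; Googlebot/2.1)", "192.0.2.10"))
print(classify("Mozilla/5.0 (compatible; Googlebot/2.1)", "203.0.113.5"))
print(classify("Mozilla/5.0 (Windows NT 10.0)", "203.0.113.5"))
```

Only the second request would be a candidate for blocking in this sketch, because it uses a search engine User-Agent from an address outside that engine's ranges.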

  6. In the Configure Effective Scope step, select the object or object group that you want to protect and click the add icon to move it to the Selected Objects section on the right. Then, click Next.
  7. In the Verify Protection Effect step, test the effectiveness of the anti-crawler rule.
    Before you publish the anti-crawler rule, we recommend that you verify the protection effect to prevent false positives caused by improper rule configurations or compatibility issues. If you are certain that the rule configurations are correct, click Skip to skip this step.
    Test steps:
    1. Step 1: Enter a public IP address. Enter the public IP address of your test device, such as a computer or mobile phone. The test rule takes effect only for this public IP address, so the test does not affect your business.
      Notice: Do not enter the IP address that you obtained by running the ipconfig command. This command returns an internal IP address. If you are not sure about the public IP address of your test device, you can use a tool or website to query the public IP address.
    2. Step 2: Select an action. Test the effectiveness of the protection action that you specified in the Configure Protection Rules step. WAF generates a test rule only for the specified IP address. The action can be JavaScript Validation, Dynamic Token-based Authentication, Slider CAPTCHA Verification, or Block Verification.

      After you click Test for an action, WAF immediately delivers the test rule to the test device. In the dialog box that appears, WAF provides the test procedure, expected result, and demonstration. We recommend that you carefully read them.

      After the test is complete, you can click I Have Completed the Test to go to the next step. If the test result shows exceptions, you can click Go Back to optimize the anti-crawler rule. Then, perform the test again.
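
Before you enter an address in Step 1, a quick local check can rule out internal addresses such as those returned by ipconfig. The following Python sketch uses the standard `ipaddress` module; the function name `is_usable_test_ip` is hypothetical, and 8.8.8.8 appears only as a well-known public address for the example. The check confirms that an address is publicly routable, not that it is the address your test device actually uses to reach WAF.

```python
import ipaddress


def is_usable_test_ip(value: str) -> bool:
    """Return True if `value` parses as a globally routable (public) IP address."""
    try:
        address = ipaddress.ip_address(value.strip())
    except ValueError:
        return False  # not an IP address at all
    # is_global is False for private (RFC 1918), loopback, link-local,
    # and other reserved ranges.
    return address.is_global


for candidate in ["192.168.1.10", "10.0.0.5", "127.0.0.1", "8.8.8.8"]:
    print(candidate, is_usable_test_ip(candidate))
```

Of the four candidates, only 8.8.8.8 passes. The other three are internal or loopback addresses of the kind that ipconfig typically reports.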

FAQ

If an exception occurs during the Verify Protection Effect step, refer to the following table to resolve the issue.

Error: No valid test requests are detected. See WAF documentation or contact us to analyze the possible causes.
  • Cause: The test request failed to send or is not sent to WAF.
    Solution: Make sure that the test request is sent to the IP address to which the CNAME provided by WAF resolves.
  • Cause: The header fields in the test request do not match the header fields that you configured for Traffic Characteristics in the anti-crawler rule.
    Solution: Modify the settings of Traffic Characteristics in the anti-crawler rule.
  • Cause: The originating IP address of the test request is different from the public IP address that you specified in the anti-crawler rule.
    Solution: Use the correct public IP address. We recommend that you click Alibaba Network Diagnose Tool to obtain your public IP address.

Error: The test requests failed the verification. See WAF documentation or contact us to analyze the possible causes.
  • Cause: No real user access is simulated. For example, the debugging mode or automation tools are used.
    Solution: Simulate real user access during the test.
  • Cause: An incorrect service type is selected. For example, Websites is selected when you configure an anti-crawler rule for apps.
    Solution: Change the value of the Service Type parameter.
  • Cause: An intermediate domain name is used, but an incorrect intermediate domain name is selected in the anti-crawler rule.
    Solution: Select Use Intermediate Domain Name. Then, select the correct intermediate domain name from the drop-down list.
  • Cause: Compatibility issues occur in the frontend.
    Solution: Contact customer service in the DingTalk group or submit a ticket.

Error: No verification is triggered. See WAF documentation or contact us to analyze the possible causes.
  • Cause: No test rules are generated.
    Solution: Perform the test several times until a test rule is generated.

Error: No valid test requests are detected or blocked. See WAF documentation or contact us to analyze the possible causes.
  • Cause: The test request failed to send or is not sent to WAF.
    Solution: Make sure that the test request is sent to the IP address to which the CNAME provided by WAF resolves.
  • Cause: The header fields in the test request do not match the header fields that you configured for Traffic Characteristics in the anti-crawler rule.
    Solution: Modify the settings of Traffic Characteristics in the anti-crawler rule.
  • Cause: The originating IP address of the test request is different from the public IP address that you specified in the anti-crawler rule.
    Solution: Use the correct public IP address. We recommend that you click Alibaba Network Diagnose Tool to obtain your public IP address.

What to do next

Go to the Bot Management tab of the Security Reports page and view the protection results and the details of the requests that match the anti-crawler rules. Then, optimize the anti-crawler rules based on the protection results.