How to Build a Crawler Proxy Service?

Background

Anyone who has written a crawler knows there are a great many websites and a great deal of data to capture. If the crawler crawls too fast, it will inevitably trigger the site's anti-crawling mechanism, and the near-universal countermeasure is to block your IP address. There are two solutions:

1. Slow down requests from the same IP (reduce the crawl rate)

2. Access the site through proxy IPs (recommended)

The first option trades time and speed for data, but our time is usually precious: ideally we want the most data in the shortest time, so the second option is recommended. The question is, where do you find that many proxy IPs?


Finding Proxies

When a programmer doesn't know something, they search for it. Open Google or Baidu and enter the keyword "free proxy IP": the first few pages of results are almost all websites offering proxy IPs. Open them one by one and you will find they are mostly list pages, each displaying dozens or hundreds of IP addresses.

Look more closely, though, and you will notice that the free IPs each website offers are limited, and after trying a few you will find that some of them no longer work. Naturally, these sites would much rather you buy their paid proxies; that is how they make money.



A resourceful programmer does not give up at the first obstacle. Think about it: a search engine turns up plenty of websites offering proxies, and each one lists dozens or hundreds of IPs, so ten such websites together yield hundreds to thousands of proxy IPs.

So all you need to do is record these websites and scrape the IP addresses out of them with a program. Easy, right?
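As a rough illustration, a scraper along these lines can do the harvesting. The site URLs below are placeholders, and the regex assumes the pages render "IP:port" pairs directly in the HTML; real proxy-list sites often split the IP and port into separate table cells and need per-site parsing.

# A minimal sketch of harvesting proxy addresses from list pages.
import re
import requests

# Placeholder URLs -- substitute the proxy-list sites you actually found.
PROXY_SITES = [
    "http://example-proxy-site-1.com/free-list",
    "http://example-proxy-site-2.com/free-list",
]

# Matches "IP:port" pairs such as 48.139.133.93:3128 anywhere in the page text.
IP_PORT_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}:\d{2,5}\b")

def collect_proxies():
    found = set()
    for url in PROXY_SITES:
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip sites that are down or blocking us
        found.update(IP_PORT_RE.findall(html))
    return found

if __name__ == "__main__":
    for proxy in sorted(collect_proxies()):
        print(proxy)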


Testing Proxies



Through the approach just described, you should be able to collect hundreds or thousands of proxy IPs.

But wait: with so many IP addresses, is someone really giving them away for free? Of course not. As mentioned earlier, a large share of these proxies are already dead. So what do you do? How do you tell which proxies work and which don't?

Simply route a request through each proxy to some stable website and check whether the page can be fetched normally. The ones that respond are usable; the ones that don't are dead.

The quickest way to test whether a single proxy works is the curl command:

# Use the proxy 48.139.133.93:3128 to fetch the NetEase homepage
curl -x "48.139.133.93:3128" "http://www.163.com"

Of course, curl is only convenient for a demonstration. A better way is:

Use multiple threads to visit a website through the proxies, and output the ones that succeed.

This is the fastest way to find usable proxies.
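A minimal sketch of such a multi-threaded check, assuming the candidate proxies sit in a text file called raw_proxies.txt (one "ip:port" per line) and using NetEase as the test site with a 5-second timeout, both arbitrary choices:

# Check proxies concurrently and keep only the ones that answer in time.
import concurrent.futures
import requests

TEST_URL = "http://www.163.com"  # any stable website works
TIMEOUT = 5                      # seconds before a proxy counts as dead

def check(proxy):
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        resp = requests.get(TEST_URL, proxies=proxies, timeout=TIMEOUT)
        return proxy if resp.status_code == 200 else None
    except requests.RequestException:
        return None

def filter_alive(candidates, workers=50):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return [p for p in pool.map(check, candidates) if p]

if __name__ == "__main__":
    with open("raw_proxies.txt") as f:
        candidates = [line.strip() for line in f if line.strip()]
    for proxy in filter_alive(candidates):
        print(proxy)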

Using Proxies

We can now find usable proxies with the method above. As for applying them in the crawler, there is not much to explain; most of you will have done it before.

For example, write the usable proxies found above into a file, one proxy per line, and then (as sketched in code after this list):

1. Read the proxy file

2. Randomly select a proxy IP and issue the HTTP request
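A minimal sketch of those two steps, using the requests library; the file name proxies.txt and the target URL are illustrative only:

# Load the proxy file and pick a random proxy for each request.
import random
import requests

def load_proxies(path="proxies.txt"):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def fetch(url, proxy_pool):
    proxy = random.choice(proxy_pool)
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    return requests.get(url, proxies=proxies, timeout=10)

if __name__ == "__main__":
    pool = load_proxies()
    resp = fetch("http://www.163.com", pool)
    print(resp.status_code, len(resp.text))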

With a few hundred proxies rotating like this, you can keep crawling a site for quite a while; capturing tens of thousands of pages is not a problem.

But if you want to pull data from a site continuously, or crawl millions or even billions of pages, this is nowhere near enough.

A Continuous Supply of Proxies

The method so far scrapes a few proxy websites once and tests every proxy to produce a list of usable ones. That is a one-off exercise, though, and the resulting pool is usually small, which cannot sustain continuous crawling. So how do you keep finding usable proxies?

1. Find more proxy websites (the raw source of data)

2. Monitor these proxy websites regularly and harvest their proxies

3. After harvesting the proxy IPs, have the program automatically test them and output the usable ones (to a file or database)

4. Have the crawler load the file or database and randomly select a proxy IP for each HTTP request

With this in place, you can write a program that collects proxies automatically, and the crawler simply fetches them from the file or database on a schedule. But there is one small problem: how do you know the quality of each proxy, in other words, how fast it is?

1. Record the request response time while testing each proxy

2. Rank proxies by response time from short to long, and give shorter response times a heavier weight so they are used more often

3. Cap the number of times each proxy may be used within a given period

The earlier steps are just the basics; these three refinements further optimize the proxy program and produce a prioritized proxy list, so the crawler side picks proxies according to their weight and usage cap. The benefit is that high-quality proxies are used preferentially, while no single proxy is used so often that it gets blocked.
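One possible way to turn measured response times into weighted selection with a usage cap is sketched below; the inverse-response-time weighting and the cap of 200 uses are arbitrary illustrations, not something the article prescribes.

# Weighted proxy selection: faster proxies are chosen more often, and every
# proxy is retired once it reaches a usage cap.
import random

class ProxyPool:
    def __init__(self, timed_proxies, max_uses=200):
        # timed_proxies: list of (proxy, response_seconds) pairs from the checker
        self.max_uses = max_uses
        self.uses = {p: 0 for p, _ in timed_proxies}
        # Inverse response time as weight: a 0.5 s proxy weighs 4x a 2 s proxy.
        self.weights = {p: 1.0 / max(t, 0.05) for p, t in timed_proxies}

    def pick(self):
        live = [p for p in self.weights if self.uses[p] < self.max_uses]
        if not live:
            raise RuntimeError("proxy pool exhausted -- refresh it")
        proxy = random.choices(live, weights=[self.weights[p] for p in live])[0]
        self.uses[proxy] += 1
        return proxy

# Example: pool = ProxyPool([("48.139.133.93:3128", 0.8)]); pool.pick()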


Making It a Service

After this series of improvements and optimizations, you have a working proxy service, but it is still only backed by files or a database.

To use these proxies, the crawler has to read the file or database and then pick a proxy by some rule, which is cumbersome. Can we make proxies easier for the crawler to consume? Then proxy access needs to be turned into a service.

Squid, the well-known proxy server software, can do this perfectly with its cache_peer (neighbor proxy) mechanism.

Just write the proxies from your list into squid's configuration file as cache_peer entries, following the required format.
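For illustration, such a configuration fragment might look like the following. The single peer shown is the sample proxy address used earlier, and the peer options (round-robin, weight, connection limits) are common choices rather than required ones; access-control rules are omitted, so adapt the file to your environment.

# Illustrative squid.conf fragment
http_port 3128

# One cache_peer line per upstream proxy; round-robin selection with weight bias
cache_peer 48.139.133.93 parent 3128 0 no-query round-robin weight=10 connect-fail-limit=2 allow-miss max-conn=5

# Never fetch directly -- always go through one of the peers above
never_direct allow all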

Squid itself is proxy server software, normally used like this: suppose the crawler runs on machine A, squid is installed on machine B, the website server to be crawled is machine C, and the proxy IPs are machines D/E/F...

1. Without a proxy: crawler machine A -> website machine C

2. With proxies: crawler machine A -> proxy IP machines D/E/F/... -> website machine C

3. With squid: crawler machine A -> squid (machine B, whose cache_peer mechanism manages and schedules proxies D/E/F) -> website machine C

The advantage is that the crawler side no longer needs to worry about how to load and select usable proxies: hand the proxy list to squid, and it manages and schedules proxy selection according to the rules in its configuration file. Best of all, the crawler only has to talk to squid's service port to use the proxies!
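In other words, the crawler simply treats squid as its only proxy. A minimal sketch, assuming squid listens on port 3128 on machine B at the made-up address 10.0.0.2:

# The crawler only talks to squid; squid forwards through whichever peer it picks.
import requests

SQUID = "http://10.0.0.2:3128"  # machine B's address and squid's http_port

resp = requests.get(
    "http://www.163.com",
    proxies={"http": SQUID, "https": SQUID},
    timeout=10,
)
print(resp.status_code)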

Further Integration

Now that the service is in place, all that remains is to tie the pieces together:

1. Monitor the proxy source websites on a schedule (every 30 minutes or every hour), parse out all the proxy IPs, and store them in a database

2. Take every proxy from the database, visit a fixed test website through it, mark the ones that succeed, and update their availability flag and response time in the database

3. Load all the usable proxies from the database and, from their response times, compute each proxy's usage weight and maximum usage count with whatever algorithm you prefer

4. Write them into the configuration file in squid's cache_peer format (see the sketch after this list)

5. Reload the squid configuration file to refresh the proxy list squid is using

6. The crawler points at squid's service IP and port and just crawls
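Steps 4 and 5 might look roughly like this: render one cache_peer line per usable proxy into a file that squid includes, then ask squid to reload. The file path, the peer options, and the shape of the proxy records are assumptions to adapt to your own setup.

# Generate cache_peer lines from the usable proxies and reload squid.
import subprocess

CACHE_PEER_CONF = "/etc/squid/cache_peer.conf"  # referenced from squid.conf via an include line

LINE = ("cache_peer {ip} parent {port} 0 no-query round-robin "
        "weight={weight} connect-fail-limit=2 allow-miss max-conn=5\n")

def write_peers(proxies):
    # proxies: iterable of (ip, port, weight) tuples loaded from your database
    with open(CACHE_PEER_CONF, "w") as f:
        for ip, port, weight in proxies:
            f.write(LINE.format(ip=ip, port=port, weight=weight))

def reload_squid():
    # Tell the running squid to re-read its configuration
    subprocess.run(["squid", "-k", "reconfigure"], check=True)

if __name__ == "__main__":
    write_peers([("48.139.133.93", 3128, 10)])  # sample entry from earlier
    reload_squid()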

That gives you a complete proxy service that keeps producing high-quality proxies on a schedule. The crawler side no longer cares about collecting or testing proxies at all; it simply crawls through squid's single service entry point.
