# C# crawler: using proxies to boost CSDN article views

1. How to maintain a proxy IP pool?

To use proxy IPs you first need a reasonably large pool of valid proxies. At the learning stage, or if you are just playing around, you can only scrape them from free proxy sites. Without a decent number of proxies, boosting article views is very slow, so the first step is to maintain your own proxy IP pool.

The Xici proxy and 66ip sites I used before are fairly reliable. Xici seems to have some anti-crawling measures; I ran into them once, though I am not sure whether it was a problem with the site itself or an anti-crawling strategy. These two sites yield roughly 2-3 usable proxies per minute, which is decent. Data5u, the fast proxy site, and ip3366 update rarely and their proxies have a low success rate. The fast proxy site requires a User-Agent to be set before its pages can be fetched, and I found that the ports returned afterwards did not match the ports shown on the web page; I suspect the free list is deliberately crippled, otherwise nobody would pay. Of course, even paid proxies are not perfectly stable, but they are certainly much better than free ones.



Maintaining proxy quality

The proxies captured from the web pages must be verified before being put into storage. The simplest way is to send a request through each one and check whether the status code is 200. The free proxy sites I recommend are the two mentioned above, Xici and 66ip, which produce more valid proxies than the other free sources.
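For example, a minimal validity check might look like the following C# sketch: it sends a request through the proxy with a short timeout and treats anything other than HTTP 200 as a failure. The test URL and the five-second timeout are assumptions; adjust them to your own target.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class ProxyChecker
{
    // Returns true if the proxy can fetch a test page with HTTP 200.
    public static async Task<bool> IsProxyAliveAsync(string ip, int port,
        string testUrl = "https://www.csdn.net")
    {
        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy(ip, port),
            UseProxy = true
        };

        using (var client = new HttpClient(handler))
        {
            client.Timeout = TimeSpan.FromSeconds(5); // a slow proxy is as good as a dead one

            try
            {
                var response = await client.GetAsync(testUrl);
                return response.StatusCode == HttpStatusCode.OK;
            }
            catch
            {
                // Timeouts, connection resets, DNS failures: treat the proxy as invalid.
                return false;
            }
        }
    }
}
```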

How proxies are stored

I use Redis to store the validated proxies. A Set is the best data structure here, since it does not allow the same IP to be stored twice. How long a proxy stays valid is unknown: some last tens of seconds, others tens of minutes. When using a proxy, record the IPs that fail repeatedly, and once a failure threshold is reached delete them from the Set. Because a proxy's lifetime cannot be determined, use a timer to periodically fetch fresh proxies from Redis so you are always working with ones that are still likely to be alive.
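A minimal sketch of that storage layer, assuming the StackExchange.Redis client and a local Redis instance, might look like this; the key name `proxy:pool` and the connection string are placeholders.

```csharp
using StackExchange.Redis;

class ProxyStore
{
    // Assumed connection string and key name; adjust to your own Redis setup.
    private static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("localhost:6379");
    private const string ProxySetKey = "proxy:pool";

    // SADD ignores duplicates, so the same ip:port is only stored once.
    public static bool AddProxy(string ipAndPort)
    {
        return Redis.GetDatabase().SetAdd(ProxySetKey, ipAndPort);
    }

    // SRANDMEMBER hands back a random proxy from the pool.
    public static string GetRandomProxy()
    {
        return Redis.GetDatabase().SetRandomMember(ProxySetKey);
    }

    // SREM drops a proxy that has failed too many times.
    public static bool RemoveProxy(string ipAndPort)
    {
        return Redis.GetDatabase().SetRemove(ProxySetKey, ipAndPort);
    }
}
```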



2. What are some common anti-crawler mechanisms?

The principle behind anti-crawler measures is to judge whether the visitor is a real user. Important data is often protected by several mechanisms combined, which raises the cost of crawling or makes it impossible: checks on header fields, IP restrictions, cookies, and so on.

IP Restrictions

To deter crawlers, some websites limit the access frequency of each IP. You can either slow down, for example by calling Thread.Sleep to pause between requests, or spread the requests across many IPs using the free proxies captured earlier.
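As a simple illustration, a random pause between requests keeps a single IP from hitting the site at a suspiciously regular rate; the delay bounds below are arbitrary.

```csharp
using System;
using System.Threading;

class Throttle
{
    private static readonly Random Rand = new Random();

    // Sleep a random interval between requests so one IP does not hit
    // the target at a fixed, high frequency.
    public static void RandomPause(int minMs = 1000, int maxMs = 3000)
    {
        Thread.Sleep(Rand.Next(minMs, maxMs));
    }
}
```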

Restrictions in headers

User-Agent: the user agent string. This one is simple: collect some common browser User-Agent strings and set one at random on each request (see the header sketch after this list).

Referer: the link from which the target page was accessed; it can be used to prevent hotlinking of images. Of course, the Referer can also be forged.

Cookies: after logging in or performing other user operations, the server returns some cookie information. Without a cookie, a request is easily identified as forged. You can set cookies locally via JS based on information returned by the server. In practice it is rarely that simple, and usually involves encryption and decryption, which is a real difficulty for crawlers.
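A rough sketch of forging these headers in C# could look like the following; the User-Agent strings, Referer, and cookie value are illustrative placeholders, not values taken from CSDN.

```csharp
using System;
using System.Net.Http;

class HeaderForger
{
    // A small pool of common browser User-Agent strings (examples only).
    private static readonly string[] UserAgents =
    {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0"
    };

    private static readonly Random Rand = new Random();

    // Builds a request that looks more like a real browser visit.
    public static HttpRequestMessage BuildRequest(string articleUrl)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, articleUrl);
        request.Headers.TryAddWithoutValidation("User-Agent", UserAgents[Rand.Next(UserAgents.Length)]);
        request.Headers.TryAddWithoutValidation("Referer", "https://blog.csdn.net/");
        // Placeholder cookie; if sending through HttpClient, set HttpClientHandler.UseCookies = false
        // so a manually set Cookie header is not ignored.
        request.Headers.TryAddWithoutValidation("Cookie", "placeholder=value");
        return request;
    }
}
```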

3. Using proxy IPs to boost the view count of CSDN articles

Boosting the view count of CSDN articles is relatively easy; the precondition is that you have enough proxies, and without enough of them it is very slow. The previous article, "C# Grab free proxies in batches and verify their effectiveness", already covered scraping proxies from several free proxy sites, so I won't repeat that here; this post reuses those proxies. The approach:

1. Use multithreading to send requests in batches, which is more efficient; each thread is assigned an equal share of the proxies to issue requests with.
2. Fetch proxies from Redis on a timer.
3. Use a ConcurrentDictionary (from the System.Collections.Concurrent namespace) to count failures per proxy; once a proxy fails a certain number of times, delete it from the pool.

The main logic lives in the Main function. The drawback is that there are too few proxies, so the throughput is not high.
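Here is a rough sketch of that loop, assuming the ProxyStore helper from the Redis sketch above; it fires one request per proxy in parallel and uses a ConcurrentDictionary to evict proxies after repeated failures. The failure threshold and timeout are arbitrary choices.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class ViewBooster
{
    // Tracks failures per proxy; a proxy is evicted after MaxFailures.
    private static readonly ConcurrentDictionary<string, int> FailCounts =
        new ConcurrentDictionary<string, int>();
    private const int MaxFailures = 3;

    // Sends one request per proxy ("ip:port") in parallel against the target article.
    public static async Task BoostAsync(string articleUrl, string[] proxies)
    {
        var tasks = new Task[proxies.Length];
        for (int i = 0; i < proxies.Length; i++)
        {
            string proxy = proxies[i];
            tasks[i] = Task.Run(async () =>
            {
                var parts = proxy.Split(':');
                var handler = new HttpClientHandler { Proxy = new WebProxy(parts[0], int.Parse(parts[1])) };
                using (var client = new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(5) })
                {
                    try
                    {
                        var response = await client.GetAsync(articleUrl);
                        if (response.StatusCode == HttpStatusCode.OK)
                        {
                            FailCounts[proxy] = 0; // reset the counter on success
                            return;
                        }
                    }
                    catch { /* treat any exception as a failure */ }

                    // Count the failure; drop the proxy once it misses too often.
                    int fails = FailCounts.AddOrUpdate(proxy, 1, (_, n) => n + 1);
                    if (fails >= MaxFailures)
                    {
                        FailCounts.TryRemove(proxy, out _);
                        // ProxyStore.RemoveProxy(proxy); // also delete it from the Redis set
                    }
                }
            });
        }
        await Task.WhenAll(tasks);
    }
}
```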

I read an article last night whose story struck me; it made me wary of garbage projects that cheat people everywhere under the open-source banner, such as iBase4J. So I found the original blogger's CSDN article exposing Nanchong Bashu Culture, a rogue company in Beijing that did not pay salaries and whose boss is Wanming, and boosted its views. It took a long time, mainly because I had too few proxies.
