Build an anonymous proxy pool with Python

01 Foreword

I often hear people complain that websites block their IP addresses because they crawl too aggressively, so they constantly have to switch between proxy IPs. Most public proxies on the Internet do not work, and applying for a paid VIP proxy service costs money and effort; even then, I was blocked again after much back and forth. This article describes how to build a proxy pool in Python that automatically collects live proxy IPs, reducing the time and effort involved.

02 Operating principle

1. Acquiring proxies from websites

1. Crawl the IP list of a free proxy website and test whether each proxy is usable and whether it is high-anonymity.

2. If both conditions hold, store the proxy in the database; otherwise discard it.

3. Repeat from step 1.
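The acquisition loop above can be sketched roughly as follows. This is a minimal illustration, not the original project's code: the `proxies` table, `init_db`, `collect`, and the `check` callback are all hypothetical names, and `check` stands in for whatever availability and anonymity test is used.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

Proxy = Tuple[str, int]  # (ip, port)


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per proxy, keyed by IP and port.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS proxies ("
        "ip TEXT NOT NULL, port INTEGER NOT NULL, PRIMARY KEY (ip, port))"
    )


def collect(
    conn: sqlite3.Connection,
    candidates: Iterable[Proxy],
    check: Callable[[str, int], bool],
) -> int:
    """Store every candidate that passes `check`; return how many passed."""
    kept = 0
    for ip, port in candidates:
        if check(ip, port):
            # INSERT OR IGNORE keeps a re-crawled proxy from raising on the key.
            conn.execute(
                "INSERT OR IGNORE INTO proxies (ip, port) VALUES (?, ?)",
                (ip, port),
            )
            kept += 1
    conn.commit()
    return kept
```

Candidates that fail the check are simply never written, which corresponds to the "discard" branch of step 2.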

2. Removing invalid proxies from the pool as quickly as possible

1. Get an IP from the crawler database.

2. Test its availability and anonymity.

3. If it is available and anonymous, keep it; otherwise discard it.

4. Repeat step 1
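One pass of this clean-up loop might look like the sketch below. It assumes a SQLite table named `proxies` with `ip` and `port` columns already exists; that table, the `prune` function, and the `check` callback are illustrative assumptions rather than the original implementation.

```python
import sqlite3
from typing import Callable


def prune(conn: sqlite3.Connection, check: Callable[[str, int], bool]) -> int:
    """Delete every stored proxy that fails `check`; return how many were removed."""
    rows = conn.execute("SELECT ip, port FROM proxies").fetchall()
    # Collect failures first so the table is not modified while it is being read.
    dead = [(ip, port) for ip, port in rows if not check(ip, port)]
    conn.executemany("DELETE FROM proxies WHERE ip = ? AND port = ?", dead)
    conn.commit()
    return len(dead)
```

Calling `prune` on a timer, or in a loop with a short sleep between passes, would give the "repeat" behaviour of step 4.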

Note ①: The crawler can be run as a daemon. Readers who need this can look it up themselves, so I will not cover it here.

Note ②: An external interface for serving proxy information can also be built. Whether it is written in Node.js, Flask/Django, or PHP makes no difference, so I will not cover it here.

03 Implementation

Suggested libraries: requests, BeautifulSoup, re, sqlite3.

Of these, requests fetches the proxy website's pages, BeautifulSoup and re extract the proxy information, and sqlite3 stores and retrieves it.
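As a rough illustration of the extraction step, the sketch below pulls IP and port pairs out of a page with re alone. The page layout (an IP cell immediately followed by a port cell) is an assumption, and `ROW_RE` and `extract_proxies` are hypothetical names; a real proxy site will need its own pattern or a BeautifulSoup-based parser.

```python
import re
from typing import List, Tuple

# Matches an IPv4 address and a port in adjacent table cells, for example
# <td>1.2.3.4</td><td>8080</td>. This markup is an assumed layout.
ROW_RE = re.compile(
    r"<td>\s*((?:\d{1,3}\.){3}\d{1,3})\s*</td>\s*<td>\s*(\d{1,5})\s*</td>"
)


def extract_proxies(html: str) -> List[Tuple[str, int]]:
    """Return (ip, port) pairs found in `html`, dropping out-of-range values."""
    found = []
    for ip, port in ROW_RE.findall(html):
        octets_ok = all(0 <= int(part) <= 255 for part in ip.split("."))
        if octets_ok and 0 < int(port) <= 65535:
            found.append((ip, int(port)))
    return found
```

The `html` argument would typically be the `.text` of a response returned by `requests.get(url, timeout=10)`.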

If necessary (for example, when the proxy website uses anti-crawler measures), PhantomJS can be used in place of requests, or a suitable library can be used to clean the data (for example, Base64 decoding).
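For the Base64 case, a data-cleaning helper could be as small as the sketch below. It assumes the site publishes each IP as a Base64 string in place of plain text; `decode_cell` is a hypothetical name, and other obfuscation schemes would need different handling.

```python
import base64
import binascii
from typing import Optional


def decode_cell(value: str) -> Optional[str]:
    """Decode a Base64-obfuscated cell; return None if it is not valid Base64."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        # instead of silently skipping them.
        return base64.b64decode(value, validate=True).decode("ascii")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None
```

Returning `None` for undecodable input lets the caller skip a malformed row without stopping the whole crawl.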

The following briefly shows the code of each part:

04 Reflection

I wrote this project by hand at the beginning of the year as Python practice. Looking back at it with my current level of skill, the logic is not rigorous enough and the functions are too tightly coupled, so many parts need to be rewritten. Because the code ran on a campus network, it had to account for unstable network connections, which tangled the relationships between parts of the code.

The method used to detect proxy anonymity may be effective, but it ignores the X-Forwarded-For HTTP header, so it is quite risky and must be improved.
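One possible improvement is to inspect the request headers that the target server actually receives. The sketch below assumes a header-echo endpoint is available that returns those headers as a mapping. The function name, the header list, and the three labels are illustrative assumptions, not a standard API.

```python
from typing import Mapping

# Headers that proxies commonly add; the exact list is an assumption.
PROXY_HEADERS = ("x-forwarded-for", "via", "x-real-ip", "forwarded", "proxy-connection")


def classify_anonymity(echoed: Mapping[str, str], real_ip: str) -> str:
    """Classify a proxy from the request headers the target server received."""
    headers = {name.lower(): value for name, value in echoed.items()}
    # Substring matching is deliberately strict: a false "transparent" verdict
    # only discards a proxy, while a missed leak would expose the real IP.
    if any(real_ip in value for value in headers.values()):
        return "transparent"
    if any(name in headers for name in PROXY_HEADERS):
        return "anonymous"
    return "elite"
```

A proxy that forwards the client's address in X-Forwarded-For is classified as "transparent", and one that adds proxy headers without the address as "anonymous"; only proxies that leave no such trace count as "elite" (high-anonymity).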

Verifying the proxies in the pool calls for multi-threading; the current solution is too inefficient.
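A thread pool from the standard library is one way to run the checks concurrently. The sketch below is only an assumption about how this could be done: `filter_alive` and `check` are hypothetical names, and `check` would wrap the real network test, for example a `requests.get` call that passes the proxy through the `proxies` argument with a short `timeout`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple

Proxy = Tuple[str, int]  # (ip, port)


def filter_alive(
    proxies: Iterable[Proxy],
    check: Callable[[Proxy], bool],
    workers: int = 20,
) -> List[Proxy]:
    """Run `check` on all proxies concurrently; return those that pass, in input order."""
    items = list(proxies)

    def safe_check(proxy: Proxy) -> bool:
        # A failing proxy usually raises (timeout, refused connection);
        # treat any exception as "not alive" instead of aborting the batch.
        try:
            return bool(check(proxy))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(safe_check, items))
    return [proxy for proxy, ok in zip(items, results) if ok]
```

Because the work is dominated by waiting on the network, threads help here despite the GIL; the `workers` value of 20 is an arbitrary starting point to tune.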
