Get a Free Proxy with XPath and Beautiful Soup

Recently I saw an article, "About crawlers, here is a 'China Anxiety Atlas'", which analyzes crawlers very thoroughly. I suggest you take a look; if you want to read it, you can find it by searching for the title yourself.

If you want to ask me where to search, I will only encourage you to find the answer yourself. In the course of your studies you will run into all kinds of strange problems. Do you always have to ask someone else? At school there are teachers to resolve your doubts, but once we leave school we should learn to solve problems on our own. I often visit Stack Overflow, where people ask questions only after thinking them through independently: they describe their understanding of the problem, list what they have already tried, and only then ask about the part that remains unsolved. I hope we can do the same.

We also know that the most important step of a crawler is not the last two steps, parsing the data and saving the data; the most important step is requesting the data.

When we crawl small-scale pages, where the data volume is small and the pages have no anti-crawling measures, crawling stays fun. But once the pages we want to crawl have strong anti-crawling protection and the amount of data we request is large, our IP address can easily be banned, and once the IP is banned the crawler cannot continue.

At that point, the usual tricks of forging request headers and constructing request parameters can no longer solve our problem.

In the world of crawlers we are always fighting against a site's anti-crawling technology. If our IP address gets blocked, we can disguise it with another one. Today I will show you how to get some free proxy IP addresses.

The site we request is a free Chinese IP proxy website: http://www.xicidaili.com/nn/

We only request one page. Today's code randomly picks one of the proxy IP addresses listed on that site.

Request Web Page

import random

import requests
from bs4 import BeautifulSoup
from lxml import etree


class Spider():
    """Randomly obtain one proxy IP address."""

    def __get_page(self):
        """Request the page, adding a request header."""
        url = 'http://www.xicidaili.com/nn/'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36'
                          ' (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        }
        r = requests.get(url, headers=headers)
        if r.status_code == 200:
            print('crawl succeeded')
            return r.text

Since this proxy website has almost no anti-crawling protection, adding a request header is basically enough.

Parse Web Page

There are two ways to parse the page; in the code below, the Beautiful Soup version is left as a comment, and the XPath version is the one actually used.

    def __parse_page(self, html):
        """Parse the proxy IPs with either method and return a list of IPs."""
        # Method 1: Beautiful Soup
        # soup = BeautifulSoup(html, 'lxml')
        # ips = soup.find_all('tr')   # find the tr nodes first
        # ip_list = []
        # for i in range(1, len(ips)):
        #     ip_info = ips[i]
        #     tds = ip_info.find_all('td')
        #     ip_list.append(tds[1].text + ':' + tds[2].text)
        # return ip_list

        # Method 2: XPath
        data = etree.HTML(html)
        items = data.xpath('//tr[@class="odd"]')
        ip_list = []
        for item in items:
            ips = item.xpath('./td[2]/text()')[0]
            tds = item.xpath('./td[3]/text()')[0]
            ip_list.append(ips + ':' + tds)
        return ip_list

Comparing the two, I prefer XPath: it is simple to write and performs well, whereas Beautiful Soup's performance is not as good.
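To make the comparison concrete, here is a minimal sketch on a tiny made-up table shaped like the proxy list page (the snippet and its values are only illustrative, not the real site); both parsers extract the same IP:port strings:

from bs4 import BeautifulSoup
from lxml import etree

# A tiny made-up table shaped like the proxy list page (illustrative only).
html = '''
<table>
  <tr class="odd"><td>CN</td><td>110.73.8.27</td><td>8123</td></tr>
  <tr class="odd"><td>CN</td><td>121.31.149.230</td><td>8080</td></tr>
</table>
'''

# Beautiful Soup: walk every <tr>, then pick the 2nd and 3rd <td>.
soup = BeautifulSoup(html, 'lxml')
bs_ips = [tr.find_all('td')[1].text + ':' + tr.find_all('td')[2].text
          for tr in soup.find_all('tr')]

# XPath: one expression selects the rows, two more pick out the cells.
tree = etree.HTML(html)
xp_ips = [row.xpath('./td[2]/text()')[0] + ':' + row.xpath('./td[3]/text()')[0]
          for row in tree.xpath('//tr[@class="odd"]')]

print(bs_ips == xp_ips)  # True: both give ['110.73.8.27:8123', '121.31.149.230:8080']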

Process Data

Next is a function that picks one IP address at random and builds the proxies dictionary:

    def __get_random_ip(self, ip_list):
        proxy_list = []
        for ip in ip_list:
            # Traverse the IP list and prepend http://
            proxy_list.append('http://' + ip)
        proxy_ip = random.choice(proxy_list)  # pick one IP at random
        proxies = {'http': proxy_ip}          # build it into a dictionary
        return proxies

    def run(self):
        html = self.__get_page()
        ip_list = self.__parse_page(html)
        proxy = self.__get_random_ip(ip_list)
        print(proxy)


if __name__ == '__main__':
    spider = Spider()
    spider.run()

After the program runs, the console outputs:

{'http': 'http://110.73.8.27:8123'}

Each run randomly produces a proxy dictionary like this, which you can use directly. Of course, you can also drop this class into any program where you need to disguise your IP address.
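For instance, here is a minimal sketch of how the returned dictionary could be passed to requests. The target URL http://httpbin.org/ip is just a convenient echo service chosen for illustration, and the proxy value is the example output from above:

import requests

# The dictionary produced by __get_random_ip(), e.g.:
proxy = {'http': 'http://110.73.8.27:8123'}

# Pass it through the proxies parameter; a short timeout is wise,
# because free proxies are often slow or already dead.
try:
    r = requests.get('http://httpbin.org/ip', proxies=proxy, timeout=5)
    print(r.text)  # should show the proxy's IP instead of your real one
except requests.exceptions.RequestException as e:
    print('proxy failed:', e)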

Summary

After reading the article mentioned at the beginning, combined with some news I have heard recently, I keep thinking about one question: "Do crawlers break the law?"

Crawlers themselves do not break the law; it depends on what data you capture. Search engines such as Baidu and Google crawl data from all kinds of websites every day to show it to you, and that is certainly not against the law.

A veteran crawler developer once said:

If the data is publicly accessible, you can crawl it; that is fine. Crawling after logging in with an account that has special permissions is dangerous. In addition, it is recommended to use proxy IPs: first to guard against IP bans from anti-crawling measures, and second to hide your real IP.
