
Headless Web Scraping in Python with Beautiful Soup 4

In this guide, we will be writing our own basic headless web scraping "bot" in Python with Beautiful Soup 4 on an Alibaba Cloud Elastic Compute Service (ECS) instance with CentOS 7.

By Mark Andrews, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud's incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

Today we will be writing our own basic headless web scraping "bot" in Python with Beautiful Soup. Headless generally means web browsing with no GUI (Graphical User Interface). In this lesson, we will be doing everything through the terminal command line.

We will deploy an Alibaba Cloud Elastic Compute Service (ECS) burstable type t5 nano instance running CentOS 7. We will be utilizing the Requests and Beautiful Soup 4 modules.

Setting Up the CentOS Instance on Alibaba Cloud

You should be familiar with launching an Alibaba Cloud instance running CentOS for this tutorial. If you are not sure how to set up an ECS instance, check out this tutorial. If you have already purchased one, check out this tutorial to configure your server accordingly.

I have deployed a CentOS instance for this lesson because it is a super lightweight OS; for this project, the less bloat the better. Basic terminal command line knowledge is recommended, as we are not going to be using a GUI for this project.

Installing Python 3, PIP3 and Nano from the Terminal Command Line

It's always a good idea to update everything on a particular instance. First, let's update all the packages to the latest versions so we won't run into any issues down the road.

sudo yum update

We will be using Python for our basic web scraping "bot". I admire the language for its relative simplicity and of course the wide variety of modules available to play around with. In particular, we will be using the Requests and Beautiful Soup 4 modules.

CentOS 7 ships with Python 2.7 by default, so we need to install Python 3 and Pip ourselves. First, we are going to install IUS, which stands for Inline with Upstream Stable. IUS is a community project that provides Red Hat Package Manager (RPM) packages for newer versions of select software. Then we can move forward with installing python36u and pip.

sudo yum install https://centos7.iuscommunity.org/ius-release.rpm
sudo yum install python36u
sudo yum install python36u-pip

Pip is a package management system used to install and manage software packages, such as those found in the Python Package Index (PyPI). It is a replacement for the older easy_install tool.

I've run into some headaches in the past from installing the wrong Pip, so be aware that the pip package is for Python 2.7, while python36u-pip is for Python 3.

Nano is a basic text editor that is useful in applications such as this. Let's install Nano.

sudo yum install nano

Installing Our Python Packages through Pip

Now we need to install the Python packages we will be using today: Requests and Beautiful Soup 4. We will install these through Pip.

pip3.6 install requests
pip3.6 install beautifulsoup4

Requests is a Python module that allows us to navigate to a web page with the Requests .get method.

Requests allows you to send HTTP/1.1 requests, all programmatically through a Python script. There's no need to manually add query strings to your URLs or to form-encode your POST data, and keep-alive and HTTP connection pooling are 100% automatic. We will be focusing on the Requests .get method today to grab a web page's source.
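As a quick illustration of the .get method before we build the bot, the hypothetical snippet below fetches a page and inspects the response. The URL example.com is used purely for illustration; any reachable site would do.

```python
import requests

# Fetch a page; a timeout is good practice so the script can't hang forever
r = requests.get("http://example.com", timeout=10)

print(r.status_code)              # 200 on success
print(r.headers["Content-Type"])  # e.g. text/html; charset=UTF-8

# .text holds the decoded HTML source as a plain Python string
page_source = r.text
print(len(page_source))
```

The response object also exposes .content (raw bytes) and .encoding, but for our purposes .text is all we need.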

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

We will be using Beautiful Soup 4 with the standard html.parser in Python to parse and organize the data from the web page source we will be getting with Requests. In this tutorial we will use the Beautiful Soup "prettify" method to organize our data in a more human readable way.
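To see what html.parser and prettify() do before pointing them at a live page, here is a minimal sketch using a tiny made-up HTML string:

```python
from bs4 import BeautifulSoup

# A small hypothetical HTML snippet, just to illustrate parsing
html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

# Parse it with Python's built-in html.parser
soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree by tag name
print(soup.h1.text)    # Hello
print(soup.p.text)     # World

# prettify() returns the same markup re-indented, one tag per line
print(soup.prettify())
```

The prettified output is what our bot will write to disk: the same HTML, but indented so a human can actually follow the tag structure.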

Let's make a folder called "Python_apps", then change our present working directory to it.

mkdir Python_apps
cd Python_apps

Writing Our Headless Scraping Bot in Python

Now comes the fun part! We can write our Python headless scraper bot. We will use Requests to go to a URL and grab the page source, then use Beautiful Soup 4 to parse the HTML into a semi-readable format. After that, we will use the Python open() and write() methods to save the parsed data to a local file on the instance's hard drive. Let's get to work.

Open up Nano (or a text editor of your choice) in the terminal and make a new file named "bot.py". I find Nano to be perfectly adequate for basic text editing like this.

First add our imports.

############################################################ IMPORTS
import requests
from bs4 import BeautifulSoup

The code below defines several global variables:

  1. The URL the user inputs.
  2. The requests.get() call that retrieves the page at that URL.
  3. The response's .text attribute, which saves the page source in a variable.

####### REQUESTS TO GET PAGE : BS4 TO PARSE DATA
#GLOBAL VARS
####### URL FOR SITE TO SCRAPE
url = input("WHAT URL WOULD YOU LIKE TO SCRAPE? ")
####### REQUEST GET METHOD for URL
r = requests.get(url)
####### DATA FROM REQUESTS.GET
data = r.text

Now let's turn that global var "data" into a BS4 object so we can format it with the BS4 prettify method.

####### MAKE DATA VAR BS4 OBJECT
source = BeautifulSoup(data, "html.parser")
####### USE BS4 PRETTIFY METHOD ON SOURCE VAR NEW VAR PRETTY_SOURCE
pretty_source = source.prettify()

Let's print these variables to the terminal first. This shows us what data will be written to the local file before we actually write it.

print(source)

You will get the source as one big chunk of text first. This is very hard for a human to decipher, so we turn to Beautiful Soup for some help with formatting. Calling the prettify() method organizes the data and makes it much more readable. Now we print the source again after the BS4 prettify() method.

print(pretty_source)

After running the code, you should see a prettified version of the HTML source of the inputted page in the terminal.

Now let's save that file to our local hard drive on the Alibaba Cloud ECS instance. For this we need to first open the file in write mode.

To do this we pass the string "w" as the second argument in the open() method.

####### OPEN SOURCE IN WRITE MODE WITH "W" TO VAR LOCAL_FILE
####### MAKE A NEW FILE
local_file = open(url.replace("https://", "").replace("http://", "") + "_scraped.txt", "w")
####### WRITE THE VAR PRETTY_SOURCE TO FILE
local_file.write(pretty_source)
### IF YOU HIT ENCODING ISSUES, OPEN WITH AN EXPLICIT ENCODING ##########
#local_file = open(url.replace("https://", "").replace("http://", "") + "_scraped.txt", "w", encoding="utf-8")
####### CLOSE FILE
local_file.close()

In the above block of code, we create and open a file named after the URL we input earlier, with "_scraped.txt" concatenated on. The first argument of open() is the file name on the local disk. We remove the "https://" and "http://" prefixes with replace(); if we left them in, the file name would be invalid. (Note that Python's str.strip() is the wrong tool for this: it strips any of the given characters from both ends of the string, not a prefix.) The second argument is the mode, in this case "w" for write.

We then write the "pretty_source" variable to the file with the .write() method. If you run into encoding errors when writing, reopen the file with an explicit encoding as shown in the commented-out line. Then we close the local text file.
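As an aside, a more idiomatic way to write the file is a with statement, which closes the file automatically even if write() raises. This is a minimal sketch; the url and pretty_source values are placeholder examples standing in for the variables our bot builds at runtime.

```python
# Placeholder values; in the bot these come from input() and prettify()
url = "https://www.wikipedia.org"
pretty_source = "<html>\n <body>\n </body>\n</html>\n"

# Strip the scheme so the filename is valid on disk
filename = url.replace("https://", "").replace("http://", "") + "_scraped.txt"

# The with block closes the file automatically when it exits
with open(filename, "w", encoding="utf-8") as local_file:
    local_file.write(pretty_source)
```

Either style works for this tutorial; the explicit open()/close() pair just makes each step visible.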

Let's run the code and see what happens.

python3.6 bot.py

You will be asked to enter a URL to scrape. Let's try https://www.wikipedia.org. Give the bot a moment to work its magic. We should now have the nicely formatted source code of the page saved in our local working directory as a .txt file.

The final code for this project should look like this.

print("*" * 30 )
print("""
# 
# SCRIPT TO SCRAPE AND PARSE DATA FROM
# A USER INPUTTED URL. THEN SAVE THE PARSED
# DATA TO THE LOCAL HARD DRIVE.
""")
print("*" * 30 )

############################################################ IMPORTS
import requests
from bs4 import BeautifulSoup

####### REQUESTS TO GET PAGE : BS4 TO PARSE DATA
#GLOBAL VARS
####### URL FOR SITE TO SCRAPE
url = input("ENTER URL TO SCRAPE: ")

####### REQUEST GET METHOD for URL
r = requests.get(url)

####### DATA FROM REQUESTS.GET
data = r.text

####### MAKE DATA VAR BS4 OBJECT 
source = BeautifulSoup(data, "html.parser")


####### USE BS4 PRETTIFY METHOD ON SOURCE VAR NEW VAR PRETTY_SOURCE
pretty_source = source.prettify()

print(source)

print(pretty_source)

####### OPEN SOURCE IN WRITE MODE WITH "W" TO VAR LOCAL_FILE
####### MAKE A NEW FILE
local_file = open(url.replace("https://", "").replace("http://", "") + "_scraped.txt", "w")
####### WRITE THE VAR PRETTY_SOURCE TO FILE
local_file.write(pretty_source)
### IF YOU HIT ENCODING ISSUES, OPEN WITH AN EXPLICIT ENCODING ##########
#local_file = open(url.replace("https://", "").replace("http://", "") + "_scraped.txt", "w", encoding="utf-8")
####### CLOSE FILE
local_file.close()

Summary

We have learned how to build a basic headless web scraping "bot" in Python with Beautiful Soup 4 on an Alibaba Cloud Elastic Compute Service (ECS) instance running CentOS 7. We used Requests to fetch a web page's source code, parsed the data with Beautiful Soup 4, and used its prettify() method to format the text for better human readability. Finally, we saved the scraped page source to a local text file on the instance.
