As a data scientist or data analyst, sooner or later you’ll come to a point where you have to collect large amounts of data. Be it a hobby project or a freelance job, when APIs are just not available, one of your best options is web scraping… And one of the best web scraping tools is Beautiful Soup!
To put it simply, web scraping is the automated collection of data from websites (to be more precise, from the HTML content of websites).
As a programmer, you’ll often need to extract data from websites when no API is available, which makes web scraping a skill worth having – and Python, with the Requests and Beautiful Soup libraries, is a great fit for the job.
In this article you’ll learn the basics of how to pull data out of HTML. You’ll do that by extracting data from Book Depository’s bestsellers page. To accomplish this, you’ll also have to make use of a little bit of your pandas and Python knowledge.
But enough talking now, let’s walk the walk! 🙂
Beautiful Soup is one of the most commonly used Python libraries for web scraping. You can install it in the usual way from the command line:
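Assuming you use pip, the command is:

pip install beautifulsoup4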
Note: if you don’t have a server for your data projects yet, please go through this tutorial first: How to Install Python, SQL, R and Bash (for non-devs)
To get the full Beautiful Soup experience, you’ll also need to install a parser. It is often recommended to use lxml for speed, but if you have other preferences (like Python’s built-in html.parser), then feel free to go with that.
Throughout this article, we’ll use lxml, so let’s install it (also from the command line):
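Again with pip:

pip install lxml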
Nice! One more thing is needed for us to start scraping the web, and it’s the Requests library. With Requests – wait for it – we can request web pages from websites. Let’s install this library, too:
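With pip:

pip install requests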
Now, our setup for web scraping is complete, so let’s scrape our first page, shall we?
In this article, we’ll work with Book Depository’s first page of its bestsellers list. Keep in mind that this list is updated daily, so it is highly likely that you’ll see different books from these:
But don’t lose sleep over it: you’ll fully be able to follow and complete this tutorial regardless of what books you scrape. 🙂
Disclaimer: this article is for educational purposes only and we use Book Depository as an example — because we love their website and their service. Although web scraping isn’t illegal, if you want to do it at scale or to profit from it in any way (e.g. building a startup around a web scraping solution), we recommend speaking to a qualified legal advisor.
Disclaimer 2: We are not affiliated with Book Depository in any way.
So without further ado, let’s fire up a Jupyter Notebook and import the libraries we’ve just installed (except for lxml – it doesn’t have to be imported):
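The imports probably looked something like this (pandas is included here because we’ll lean on it later in the article; bs is the alias the rest of the code refers to):

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd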
Now, we’re ready to request our first web page. It’s nothing complicated – we save the URL we want to scrape to the url variable, then request the URL (requests.get(url)), and save the response to the response variable:
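A minimal sketch of this step – the exact bestsellers URL is an assumption on my part:

url = "https://www.bookdepository.com/bestsellers"  # assumed URL of the bestsellers page
response = requests.get(url)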
By printing response, you can see that the HTTP response status code is 200, which means that the request for the URL was successful:
But we need the HTML content of the requested web page, so as the next step we save the content of response to html:
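In code:

html = response.content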
Again, print out what we’ve got with print(html):
The result is the HTML content of the bestsellers’ page, but it is really hard to read with the human eye… :/
Lucky for us, we’ve got Beautiful Soup and lxml! 🙂
Let’s create a Beautiful Soup object named soup with the following line of code:
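Based on the description that follows, the line is:

soup = bs(html, "lxml")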
Remember, we imported Beautiful Soup as bs – this is the bs() part of the code. The first parameter of the bs() method is html (which was the variable where we saved that hard-to-read HTML content from the fetched bestsellers URL), the second parameter ("lxml") is the parser that is used on the html variable.
The result is soup (a parsed HTML document), which is much more pleasing to the eye:
Not only is it satisfying to look at, but soup also gives us a nested data structure of the original HTML content that we can easily navigate and collect data from.
How?
Continue reading to find out.
First, let’s go over some HTML basics (bear with me, we’ll need this to navigate our soup object).
HTML consists of elements like links, paragraphs, headings, blocks, etc. These elements are wrapped in tags: the content of an element sits between its opening and closing tag.
For instance, a sentence on a web page (in a paragraph (<p>) element) looks like this (only the content is visible to us humans):
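A generic illustration (not the actual markup from the bestsellers page):

<p>This sentence lives inside a paragraph element.</p>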
HTML elements may also have attributes that contain additional information about the element. Attributes are defined in the opening tags with the following syntax: attribute name="attribute value".
An example:
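For instance, a link element with an href attribute (again, just an illustration):

<a href="https://example.com">This is the content of the link element</a>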
Now that we’ve learned to speak some basic HTML, we can finally start to extract data from soup. Just type a tag name after soup and a dot (like soup.title), and watch magic unfold:
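For example:

soup.title   # returns the first (and only) <title> element of the page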
Let’s try another one (soup.h1):
If you don’t need the full element, just the text, you can do that, too, with .get_text():
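For instance, to get only the text of the h1 element:

soup.h1.get_text()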
What if you need only an element’s attribute? No problem:
Or another way to do the same task:
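These two lines are equivalent – both grab the href attribute of the first link on the page (the attribute name here is just an illustration):

soup.a["href"]
soup.a.get("href")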
Whether you noticed or not, the soup.any_tag_name syntax returns only the first element with that tag name. Instead of soup.any_tag_name, you can also use the .find() method, and you’ll get the exact same result:
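For example:

soup.find("h1")   # same result as soup.h1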
Oftentimes you need not just one, but all elements (for example, every link on a page). That’s what the .find_all() method is good for:
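For example, to collect every link element on the page:

all_links = soup.find_all("a")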
(Actually, .find_all() is so popular that there’s a shortcut for it: soup("tag_name"), like soup("a") in our case.)
What you have to know is that while .find() returns only one element, .find_all() returns a list of elements, which means you can iterate over it:
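A quick sketch of such a loop (printing only the first 5 links, as noted below):

for link in soup.find_all("a")[:5]:
    print(link)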
(In this example I intentionally printed only the first 5 links.)
After all this you’re equipped with enough knowledge to get some more serious tasks done with Beautiful Soup. So let’s do just that, and continue to the next section. 🙂
Within our soup object, we already have the parsed HTML content of Book Depository’s bestsellers page. The page contains 30 books with information related to them. Of the available data, we’ll extract the following: the books’ titles, formats, publication dates and prices.
While working with Beautiful Soup, the general flow of extracting data follows a two-step approach: 1) inspecting the HTML element(s) we want to extract in the browser, 2) then finding those element(s) with Beautiful Soup.
Let’s put this approach into practice.
This one will be really easy – right click on one of the books’ titles, and choose Inspect Element in Firefox (or Inspect in Chrome):
As you can see, the book title is the text in an a element within an h3 element with the class="title" attribute. We can translate this into Beautiful Soup "language":
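Based on the explanation below, the code looks something like this:

all_h3 = soup.find_all("h3", class_="title")
for h3 in all_h3:
    print(h3.get_text(strip=True))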
And this is what you get after running the above code:
soup.find_all("h3") finds every h3 element on the web page; with class_="title" we specify that we specifically search for h3 tags that contain the class="title" attribute (important note: the "_" in class_="title" is not a typo, it is required in Beautiful Soup when selecting class attributes).
We save the h3 elements to all_h3, which behaves like a list, so we can loop over it with a for loop. At each iteration we pull out only the text from the h3 element with .get_text(), and with the strip=True parameter we make sure to remove any unnecessary whitespace.
Note: if you print out the book titles, in some cases you’ll see titles that are not present on the bestsellers page. It’s probably a location-based personalization thing. Anyway, don’t worry about it, it’s normal. You won’t always encounter this phenomenon, but I thought it was good for you to know that it can happen from time to time. 🙂
From the previous step we have all book titles from the bestsellers page. But what do we know about their formats – are there more paperback or hardback bestseller books?
Let’s find out by inspecting the book format element:
The format data is a paragraph element with class="format", and it’s inside a div element with the class="item-info" attribute. It’s Beautiful Soup time again:
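A sketch of this step, using the .select() call and the pandas counting described below (the exact pandas usage is my assumption, and it presumes pandas was imported as pd):

formats = soup.select("div.item-info p.format")
formats_series = pd.Series([f.get_text(strip=True) for f in formats])
print(formats_series.value_counts())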
It’ll look like this:
Nice – there are 26 paperbacks and 4 hardbacks on the page. But what is that .select() method, and where did .find_all() disappear to?!
.select() is similar to .find_all(), but with it you can find HTML elements using CSS selectors (a handy thing to learn as well).
Basically, div.item-info p.format means the following: find all paragraph elements with the class of format that are inside of a div element with the class of item-info.
(.select() is a useful little thing to add to your Beautiful Soup arsenal. 😉 )
The rest of the code uses pandas to count the occurrences of each book format. If you need a refresher on pandas, you should definitely click here (after finishing this article, of course).
The same approach applies to extracting the publication dates as well. First, we inspect the element:
Then we write the Beautiful Soup code to collect the data:
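A sketch of the code, following the steps explained below; the "published" class name is an assumption about the page’s markup:

dates = soup.find_all("p", class_="published")      # e.g. <p class="published">18 Feb 2021</p>
years = [date.get_text()[-4:] for date in dates]    # keep only the year, e.g. "2021"
print(pd.Series(years).value_counts())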
Here’s what you’ll get:
As you can see, it is just a basic .find_all(), nothing new you don’t already know. 😉
But perhaps the rest of the code needs some explaining, so here it goes:
- .find_all() creates a ResultSet object, so we can’t immediately get the text from the elements; first we save them to the dates variable,
- with .get_text() we create a list that holds only the content of the paragraph elements (like 18 Feb 2021),
- with .get_text()[-4:] we keep only the last four characters of each date, that is, the publication year (like 2021),
- finally, we turn the years into a pandas series and count the values (.value_counts()).

This one will be a bit trickier, because a book has either only one (selling) price or two prices at the same time (an original and a discounted/selling price):
In either case, we’ll collect only the selling prices, but before we do that, please set the currency to euro at the top right side of the page (the page will automatically refresh itself):
After a quick check, we’ll find the HTML element for the original price:
And the selling/discounted price, too:
To get the selling prices, we’ll need to run the following lines of code:
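Putting together the lines that are explained one by one below, the code reads like this:

final_prices = []
prices = soup.find_all('p', class_='price')
for price in prices:
    original_price = price.find('span', class_='rrp')
    if original_price:
        # the selling price is the text right before the <span class="rrp"> element
        current_price = str(original_price.previousSibling).strip()
        current_price = float(current_price.split('€')[0].replace(',', '.'))
    else:
        # there is only one price in the paragraph
        current_price = float(price.get_text(strip=True).split('€')[0].replace(',', '.'))
    final_prices.append(current_price)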
If you run the above code and print out final_prices, you’ll see the discounted/selling prices of the books:
And here’s what the code does line by line:
- We create an empty list (final_prices = []) to hold the selling prices.
- We collect every price paragraph element: prices = soup.find_all('p', class_='price').
- We loop over prices: for price in prices.
- Inside the loop we look for the original price: original_price = price.find('span', class_='rrp'); it’ll look like this: <span class="rrp">19,10 €</span>.
- If there is an original price (if original_price:) in the HTML element, then the selling price is the text right before it, so we grab it as current_price (which looks like this: '16,88 €'): current_price = str(original_price.previousSibling).strip(). Then we replace "," with ".", remove the " €" part and convert the result to a float: current_price = float(current_price.split('€')[0].replace(',', '.')).
- Otherwise (else:) the paragraph holds only the selling price, so we take its text, replace "," with ".", remove the " €" part, and convert the result to a float type (it’ll look like this: 10.27): current_price = float(price.get_text(strip=True).split('€')[0].replace(',', '.')).
- Finally, we add the price to the final_prices list: final_prices.append(current_price).
Of course, if a book is sold out, there’ll be no HTML element that holds the price of that book, so our code won’t include it in the final_prices list. Here’s what such a book would look like:
Based on final_prices, we can easily create a basic histogram out of our data with the following lines of code:
Note: For this code to work, don’t forget to import numpy and matplotlib… And if you work in Jupyter Notebook, don’t forget to add this line as well.
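A minimal sketch of such a histogram – the bin width is my choice, and the commented-out Jupyter line is most likely the one the note above refers to:

import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline   # add this line if you work in Jupyter Notebook

plt.hist(final_prices, bins=np.arange(min(final_prices), max(final_prices) + 1.5, 1.5))
plt.xlabel("price (€)")
plt.ylabel("number of books")
plt.show()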
And here’s the result:
A quick glance at the histogram, and it is easy to see that the most frequent price of bestsellers is somewhere between 11-12.5 €. Cool, isn’t it? 🙂
Give yourself a big pat on the back, because you’ve just scraped your first web page. 🙂 I hope you had some fun reading this article, but what I hope even more is that you gained skills that you’ll proudly make use of later.
Now, have some well-deserved rest, and let your freshly acquired knowledge sink in.
And when you’re ready – and when the next article in this tutorial series is published – do come back, because we’ll step up your web scraping game even more. No spoilers, but yeah… we will scrape multiple web pages, we will do deeper analyses – it’s gonna be a lot of fun! 😎
Cheers,
Tamas Ujhelyi
Sometimes we need to extract information from websites. We can extract data from websites by using their available APIs. But there are websites where APIs are not available.
Here, Web scraping comes into play!
Python is widely used for web scraping because of the ease it provides in writing the core logic. Whether you are a data scientist, developer, engineer or someone who works with large amounts of data, web scraping with Python is of great help.
Without a direct way to download the data, you are left with web scraping in Python, as it can extract massive quantities of data without any hassle and within a short period of time.
In this tutorial, we shall be looking into scraping using some very powerful Python-based libraries like BeautifulSoup and Selenium.
BeautifulSoup is a Python library for pulling data out of HTML and XML files. But it does not fetch the webpage itself, so here we will use the urllib library to download the webpage.
First, we need to install the Python web scraping package BeautifulSoup4 on our system using the following command:
$ sudo pip install BeautifulSoup4
$ pip install lxml
OR
$ sudo apt-get install python3-bs4
$ sudo apt-get install python-lxml
So here I am going to extract the homepage of the website https://www.botreetechnologies.com.
from urllib.request import urlopen
from bs4 import BeautifulSoup
We import the packages we are going to use in our program. Now we will download our webpage using the following:
response = urlopen('https://www.botreetechnologies.com/case-studies')
Beautiful Soup does not fetch the webpage for us; it works on the content we just downloaded. So we need to parse that content as HTML/XML data.
data = BeautifulSoup(response.read(),'lxml')
Here we parsed our webpage’s HTML content using the lxml parser.
As you can see, there are many case studies available on our web page, and I want to read all of them.
Each case study has a title at the top and some details related to the case below it; I want to extract all that information.
We can extract an element based on its tag, class, id, XPath, etc.
You can get the class of an element by simply right-clicking on that element and selecting Inspect Element.
case_studies = data.find('div', { 'class' : 'content-section' })
If there are multiple elements of this class on our page, it will return only the first one. So if you want to get all the elements having this class, use the findAll() method:
case_studies = data.findAll('div', { 'class' : 'content-section' })
Now we have the div with the class ‘content-section’, containing its child elements. We will get the <h2> tag to extract the ‘TITLE’ and the <ul> tag to get all of its children, the <li> elements.
case_stud.find('h2').find('a').text
case_stud_details = case_stud.find('ul').findAll('li')
Now we have the list of all the children of the ul element.
To get the first element from the children list, simply write:
case_stud_details[0]
We can also extract an element’s attributes or its text; for example, we can get the text of this element by using:
case_stud_details[2].text
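A minimal sketch tying these fragments together (it assumes case_studies was collected with findAll() as shown above, so it is a list of divs):

for case_stud in case_studies:
    title = case_stud.find('h2').find('a').text
    case_stud_details = case_stud.find('ul').findAll('li')
    print(title)
    print(case_stud_details[0].text)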
But here I want to click on the ‘TITLE’ of any case study and open its details page to get all the information.
Since we want to interact with the website to get the dynamic content, we need to imitate the normal user interaction. Such behaviour cannot be achieved using BeautifulSoup or urllib, hence we need a webdriver to do this.
A webdriver basically creates a new browser window which we can control programmatically. It also lets us capture user events like clicks and scrolls.
Selenium is one such webdriver.
The Selenium webdriver accepts commands, sends them to a browser and retrieves the results.
You can install Selenium on your system using the following simple command:
$ sudo pip install selenium
In order to use it, we need to import Selenium in our Python script.
from selenium import webdriver
I am using the Firefox webdriver in this tutorial. Now we are ready to fetch our webpage, and we can do this by using the following:
self.url = 'https://www.botreetechnologies.com/'
self.browser = webdriver.Firefox()
Now we need to click on ‘CASE-STUDIES’ to open that page.
We can click on a Selenium element by using the following piece of code:
self.browser.find_element_by_xpath("//div[contains(@id,'navbar')]/ul[2]/li[1]").click()
Now we are taken to the case-studies page, where all the case studies are listed with some information.
Here, I want to click on each case study and open its details page to extract all the available information.
So, I created a list of links for all the case studies and loaded them one after the other.
To load the previous page, you can use the following piece of code:
self.browser.execute_script('window.history.go(-1)')
The final script for using Selenium will look like this:
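The original script is not reproduced here, but based on the pieces shown above, a hedged sketch of it could look like this (the class structure, method names and link handling are my assumptions):

from bs4 import BeautifulSoup
from selenium import webdriver


class CaseStudyScraper:
    def __init__(self):
        self.url = 'https://www.botreetechnologies.com/'
        self.browser = webdriver.Firefox()

    def scrape(self):
        # open the homepage and navigate to the case studies page
        self.browser.get(self.url)
        self.browser.find_element_by_xpath("//div[contains(@id,'navbar')]/ul[2]/li[1]").click()

        # parse the case studies page with Beautiful Soup
        data = BeautifulSoup(self.browser.page_source, 'lxml')
        case_studies = data.find('div', {'class': 'content-section'})

        # collect the link of every case study listed on the page (assumes relative links)
        links = [a.get('href') for a in case_studies.findAll('a') if a.get('href')]

        for link in links:
            # open the details page, print the title as an example, then go back
            self.browser.get(self.url.rstrip('/') + link)
            details = BeautifulSoup(self.browser.page_source, 'lxml')
            title = details.find('h2')
            if title:
                print(title.text)
            self.browser.execute_script('window.history.go(-1)')

        self.browser.quit()


if __name__ == '__main__':
    CaseStudyScraper().scrape()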
And we are done! Now you can extract static webpages or interact with webpages using the above script.
Today, more than ever, companies are working with huge amounts of data, and learning how to scrape data in Python web scraping projects will take you a long way. In this tutorial, you learned Python web scraping with Beautiful Soup.
Along with that, Python web scraping with Selenium is also a useful skill. Companies need data engineers who can extract data and deliver it to them for gathering useful insights. You have a high chance of success in data extraction if you are working on Python web scraping projects.
If you want to hire Python developers for web scraping, then contact BoTree Technologies. We have a team of engineers who are experts in web scraping. Give us a call today.