Introduction
Because so much data, most of it unstructured, is publicly available on the internet, web scraping is an essential tool for any developer who works with data extraction.
In this tutorial we briefly introduce the concept of web scraping and demonstrate how to apply it with Python and the Beautiful Soup library.
By the end, you will have a basic understanding of how to download a page, parse its content, extract data from it, and save the results.
What is Web Scraping?
Web scraping is the process of extracting data from websites. The technique is widely used in data analysis and data mining. Web scraping tools generally automate data collection and make it easy to navigate and analyze the components of a website.
Prerequisites
- Basic knowledge of your operating system's command line
- Basic knowledge of Python
- Familiarity with the structure of an HTML document
Installing Dependencies
Before starting, we need to install the dependencies: the requests library and Beautiful Soup.
Using pip
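pip is Python's default package manager. To install the dependencies with pip, run the following command (requests and beautifulsoup4 are the package names on PyPI):
    pip install requests beautifulsoup4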
Using the package manager (Ubuntu and Ubuntu-based distributions)
Although you can install the packages system-wide, prefer installing them in a virtual environment (using virtualenv, for example) to isolate each project's dependencies. If you still want the system-wide packages, they can be installed with apt:
    sudo apt install python3-requests python3-bs4
Download the target page
To download the content of a web page, we can use the requests library.
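The download step could look roughly like the sketch below, assuming the requests package is installed. The helper name download_page and the 10-second timeout are illustrative choices, not something the original tutorial prescribes.

```python
import requests


def download_page(url, timeout=10):
    """Return the HTML of `url` as text (download_page is a hypothetical helper)."""
    # A timeout keeps the script from hanging if the server never answers.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    # response.text is the response body decoded to a str.
    return response.text
```

Calling download_page with the address of the page you want to scrape (for example the placeholder https://example.com) returns its HTML as a string, ready to be handed to the parser.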
Parse page content with Beautiful Soup
With the HTML content of the page saved in a variable, we can instantiate Beautiful Soup and start analyzing the page's data.
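As a rough illustration of the parsing step, the sketch below builds a BeautifulSoup object from an HTML string. The sample HTML is made up for this example and stands in for the downloaded page; the parser choice (Python's built-in html.parser) is also an assumption.

```python
from bs4 import BeautifulSoup

# Sample HTML standing in for the downloaded page content (illustrative only).
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Hello, scraping</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, "html.parser")

# Text inside the <title> tag.
page_title = soup.title.string
# Text of the first <h1> element.
heading = soup.find("h1").get_text()
# Text of the first <p> element whose class is "intro".
intro = soup.find("p", class_="intro").get_text()

print(page_title)
print(heading)
print(intro)
```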
Extract data
Beautiful Soup allows simple navigation between HTML elements. For example, we can extract all elements with the <a> tag and display the link in their href attribute.
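A minimal sketch of the link extraction, again using made-up sample HTML in place of a real page. It collects the href of every <a> element and skips anchors that have no href attribute.

```python
from bs4 import BeautifulSoup

# Sample HTML standing in for the downloaded page content (illustrative only).
html = """
<ul>
  <li><a href="https://example.com/first">First</a></li>
  <li><a href="/second">Second</a></li>
  <li><a name="no-link">Anchor without href</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
# find_all("a") returns every <a> element in document order.
for anchor in soup.find_all("a"):
    # .get() returns None when the attribute is missing, instead of raising KeyError.
    href = anchor.get("href")
    if href is not None:
        links.append(href)
        print(href)
```

Note that relative links such as /second are returned exactly as they are written in the page; resolving them against the page address is a separate step.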
Save the extracted data
Extracted data is usually saved to a CSV file or to a database. The extracted links can be written to a CSV file with the csv module from Python's standard library.
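One way to write the links to disk is sketched below. The helper name save_links, the default file name links.csv, and the single "link" header column are assumptions made for this example.

```python
import csv


def save_links(links, path="links.csv"):
    """Write one link per row to a CSV file (save_links is a hypothetical helper)."""
    # newline="" lets the csv module control line endings, as its documentation recommends.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        # Header row naming the single column.
        writer.writerow(["link"])
        for link in links:
            # Each row is a list of column values, so wrap the link in a list.
            writer.writerow([link])
```

Passing the list of extracted links to save_links produces a file with a header line followed by one link per line.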
Conclusion
In this tutorial we briefly introduced the concept of web scraping and demonstrated how it can be done in Python with the Beautiful Soup library. With this knowledge you can dig deeper into the library and prepare yourself to explore and analyze the enormous amount of data available on the internet.
Credits
Post thumbnail: 3d PNGs by Vecteezy