Python web scraping tutorial
Getting started with web scraping is simple, except when it isn't, which is why you are here. Python is one of the easiest languages to get started with: it is object-oriented, and its classes and objects are considerably easier to work with than in many other languages. Additionally, many libraries exist that make building a web scraping tool in Python an absolute breeze. In this web scraping Python tutorial, we will outline everything needed to get started with a simple application. It will acquire text-based data from page sources, store it in a file, and sort the output according to set parameters. Options for more advanced features when using Python for web scraping will be outlined at the very end, with suggestions for implementation. By following the steps in this tutorial, you will understand how to do web scraping.
What do we call web scraping?

Web scraping is an automated process of gathering public data. A webpage scraper automatically extracts large amounts of public data from target websites in seconds. This Python web scraping tutorial will work on all operating systems. There will be slight differences when installing either Python or the development environment, but not in anything else.

Building a web scraper: Python prepwork

Throughout this entire web scraping tutorial, a Python 3.4+ version will be used. Specifically, we used 3.8.3, but any 3.4+ version should work just fine. For Windows installations, make sure to check "PATH installation" when installing Python. PATH installation adds the executables to the default Windows Command Prompt executable search. Windows will then recognize commands like "pip" or "python" without requiring users to point to the directory of the executable (e.g. C:/tools/python/…/python.exe). If you have already installed Python but did not mark the checkbox, just rerun the installation and select "Modify". On the second screen, select "Add to environment variables".

Getting to the libraries

Web scraping with Python is easy due to the large selection of useful libraries available. These web scraping libraries are part of thousands of Python projects in existence – on PyPI alone, there are over 300,000 projects today. Notably, there are several types of Python web scraping libraries from which you can choose; this tutorial covers four of the most common ones: Requests, Beautiful Soup, lxml, and Selenium.
Requests library

Web scraping starts with sending HTTP requests, such as POST or GET, to a website's server, which returns a response containing the needed data. However, standard Python HTTP libraries are difficult to use and, for effectiveness, require bulky lines of code, further compounding an already problematic issue. Unlike other HTTP libraries, the Requests library simplifies the process of making such requests by reducing the lines of code, in effect making the code easier to understand and debug without impacting its effectiveness. The library can be installed from within the terminal using the pip command (pip install requests). The Requests library provides easy methods for sending HTTP GET and POST requests. For example, the function to send an HTTP GET request is aptly named get():
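A minimal sketch of a GET request; the target URL here is an assumption for illustration:

```python
import requests

# Send an HTTP GET request to the target server.
response = requests.get('https://oxylabs.io/blog')

# The response object holds the returned data, e.g. the page's HTML.
print(response.text)
```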
If a form needs to be posted, it can be done easily using the post() method. The form data can be sent as a dictionary as follows:
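A sketch of a POST request; the endpoint and form fields are placeholders:

```python
import requests

# Form data is passed as a dictionary of field names and values.
form_data = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://httpbin.org/post', data=form_data)
print(response.text)
```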
The Requests library also makes it very easy to use proxies that require authentication:
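A sketch using a hypothetical authenticated proxy endpoint (replace the host, port, and credentials with your own):

```python
import requests

# Hypothetical proxy endpoint with username/password authentication.
proxies = {
    'http': 'http://USERNAME:PASSWORD@proxy.example.com:8080',
    'https': 'http://USERNAME:PASSWORD@proxy.example.com:8080',
}
response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.text)
```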
But this library has a limitation: it does not parse the extracted HTML data, i.e., it cannot convert the data into a more readable format for analysis. Also, it cannot be used to scrape websites written purely in JavaScript.

Beautiful Soup

Beautiful Soup is a Python library that works with a parser to extract data from HTML and can turn even invalid markup into a parse tree. However, this library is only designed for parsing and cannot request data from web servers in the form of HTML documents/files. For this reason, it is mostly used alongside the Python Requests library. Note that Beautiful Soup makes it easy to query and navigate the HTML, but it still requires a parser. The following example demonstrates the use of the html.parser module, which is part of the Python Standard Library:
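A sketch combining both parts; the target URL is an assumption:

```python
import requests
from bs4 import BeautifulSoup

# Part 1 – Get the HTML using Requests
url = 'https://oxylabs.io/blog'
response = requests.get(url)

# Part 2 – Find the element
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)
```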
This will print the page's title element as follows (the exact text depends on the target page):
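```
<title>Oxylabs Blog</title>
```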
Due to its simple ways of navigating, searching, and modifying the parse tree, Beautiful Soup is ideal even for beginners and usually saves developers hours of work. For example, to print all the blog titles from this page, the findAll() method can be used. On this page, all the blog titles are in h2 elements with the class attribute set to blog-card__content-title. This information can be supplied to the findAll() method as follows:
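A sketch, reusing the soup object from the previous example; the class name follows the page structure described above:

```python
# Find every h2 element whose class attribute is "blog-card__content-title".
for title in soup.findAll('h2', attrs={'class': 'blog-card__content-title'}):
    print(title.text.strip())
```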
Beautiful Soup also makes it easy to work with CSS selectors. If a developer knows a CSS selector, there is no need to learn the find() or find_all() methods. The following is the same example, but using CSS selectors:
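```python
# select() accepts any CSS selector; here, h2 elements with the given class.
for title in soup.select('h2.blog-card__content-title'):
    print(title.text.strip())
```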
While broken-HTML parsing is one of the main features of this library, it also offers numerous other functions, including page-encoding detection, which further increases the accuracy of the data extracted from the HTML file. What is more, it can be easily configured, with just a few lines of code, to extract any custom publicly available data or to identify specific data types. Our Beautiful Soup tutorial contains more on this and other configurations, as well as how this library works.

lxml

lxml is a parsing library. It is a fast, powerful, and easy-to-use library that works with both HTML and XML files. Additionally, lxml is ideal when extracting data from large datasets. However, unlike Beautiful Soup, this library's parsing capabilities are impeded by poorly designed HTML. The lxml library can be installed from the terminal using the pip command (pip install lxml). This library contains a module called html for working with HTML. However, it needs the HTML string first, which can be retrieved using the Requests library as discussed in the previous section. Once the HTML is available, the tree can be built using the fromstring method as follows:
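A sketch, again assuming the Oxylabs blog as the target URL:

```python
import requests
from lxml import html

url = 'https://oxylabs.io/blog'
response = requests.get(url)

# Build an element tree from the raw HTML string.
tree = html.fromstring(response.text)
```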
This tree object can now be queried using XPath. Continuing the example from the previous section, to get the titles of the blogs, the XPath would be as follows:
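```
//h2[@class="blog-card__content-title"]/text()
```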
This XPath can be given to the tree.xpath() function, which will return all the elements matching it. Notice the text() function at the end of the XPath: it extracts the text within the h2 elements.
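A sketch putting it together, with the same class-name assumption as before:

```python
# Run the XPath query against the tree; text() returns the elements' text content.
blog_titles = tree.xpath('//h2[@class="blog-card__content-title"]/text()')
for title in blog_titles:
    print(title.strip())
```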
Suppose you are looking to learn how to use this library and integrate it into your web scraping efforts, or even to gain more knowledge on top of your existing expertise. In that case, our detailed lxml tutorial is an excellent place to start.

Selenium

As stated, some websites are written using JavaScript, a language that allows developers to populate fields and menus dynamically. This creates a problem for Python libraries that can only extract data from static web pages. In fact, as stated, the Requests library is not an option when it comes to JavaScript. This is where Selenium web scraping comes in and thrives. This Python library is an open-source browser automation tool (web driver) that allows you to automate processes such as logging into a social media platform. Selenium is widely used for the execution of test cases or test scripts on web applications, and it is now extensively used by developers for scraping as well. Its strength during web scraping derives from its ability to initiate the rendering of web pages, just like any browser, by running JavaScript – standard web crawlers cannot run this programming language. Selenium requires three components: a web browser (such as Chrome or Firefox), a driver for that browser, and the selenium package itself.
The selenium package can be installed from the terminal (pip install selenium). After installation, the appropriate class for the browser can be imported. Once imported, an object of the class has to be created. Note that this requires the path of the driver executable. An example for the Chrome browser follows:
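A sketch using the Selenium 4 API; the executable path is a placeholder (recent Selenium versions can also locate the driver automatically):

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder path – point this at your downloaded ChromeDriver executable.
service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service)
```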
Now any page can be loaded in the browser using the get() method:
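For example (URL assumed for illustration):

```python
driver.get('https://oxylabs.io/blog')
```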
Selenium allows the use of CSS selectors and XPath to extract elements. The following example prints all the blog titles using CSS selectors:
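A sketch using the Selenium 4 locator API; the class name follows the page structure described earlier:

```python
from selenium.webdriver.common.by import By

# Find every h2 element with the blog-title class and print its text.
for title in driver.find_elements(By.CSS_SELECTOR, 'h2.blog-card__content-title'):
    print(title.text)
```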
Basically, by running JavaScript, Selenium deals with any content that is displayed dynamically and subsequently makes the webpage's content available for parsing by built-in methods or even Beautiful Soup. Moreover, it can mimic human behavior. The only downside to using Selenium in web scraping is that it slows the process down, because it must first execute the JavaScript code for each page before making it available for parsing. As a result, it is not ideal for large-scale data extraction. But if you wish to extract data at a smaller scale, or the lack of speed is not a drawback, Selenium is a great choice.

Web scraping Python libraries compared
For this Python web scraping tutorial, we'll be using three important libraries – Beautiful Soup v4, pandas, and Selenium. Further steps in this guide assume a successful installation of these libraries. If you receive a "NameError: name * is not defined" error, it is likely that one of these installations has failed.

WebDrivers and browsers

Every web scraper uses a browser, as it needs to connect to the destination URL. For testing purposes, we highly recommend using a regular browser (i.e., not a headless one), especially for newcomers. Seeing how the written code interacts with the application allows simple troubleshooting and debugging, and grants a better understanding of the entire process. Headless browsers can be used later on, as they are more efficient for complex tasks. Throughout this web scraping tutorial, we will be using the Chrome web browser, although the entire process is almost identical with Firefox. To get started, use your preferred search engine to find the "webdriver for Chrome" (or Firefox). Take note of your browser's current version and download the webdriver that matches it. If applicable, select the requisite package, download and unzip it, then copy the driver's executable file to any easily accessible directory. Whether everything was done correctly, we will only be able to find out later on.

Finding a cozy place for our Python web scraper

One final step needs to be taken before we can get to the programming part of this web scraping tutorial: using a good coding environment. There are many options, from a simple text editor, with which simply creating a *.py file and writing the code down directly is enough, to a fully-featured IDE (Integrated Development Environment). If you already have Visual Studio Code installed, picking this IDE would be the simplest option. Otherwise, I'd highly recommend PyCharm for any newcomer, as it has very little barrier to entry and an intuitive UI. We will assume that PyCharm is used for the rest of the web scraping tutorial. In PyCharm, right-click on the project area and select "New -> Python File". Give it a nice name!

Importing and using libraries

Time to put all those pips we installed previously to use:
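A sketch of the imports used throughout the rest of the tutorial:

```python
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
```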
PyCharm might display these imports in grey, as it automatically marks unused libraries. Don't accept its suggestion to remove unused libs (at least not yet). We should begin by defining our browser. Depending on the webdriver we picked back in "WebDrivers and browsers", we should type in:
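For Chrome (the driver path is a placeholder; use the directory you copied the executable to):

```python
from selenium.webdriver.chrome.service import Service

# Placeholder path – point this at your downloaded ChromeDriver executable.
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
```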
Picking a URL

Before performing our first test run, choose a URL. As this web scraping tutorial is intended to create an elementary application, we highly recommend picking a simple, static target URL.
Select the landing page you want to visit and input the URL into the driver.get('URL') parameter. Selenium requires that the connection protocol be provided; as such, it is always necessary to attach "http://" or "https://" to the URL.
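For example (URL assumed for illustration):

```python
driver.get('https://oxylabs.io/blog')
```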
Try doing a test run by clicking the green arrow at the bottom left, or by right-clicking the coding environment and selecting "Run". If you receive an error message stating that a file is missing, double-check whether the path provided to the webdriver matches the location of the webdriver executable. If you receive a message about a version mismatch, redownload the correct webdriver executable.

Defining objects and building lists

Python allows coders to design objects without assigning an exact type. An object can be created by simply typing its title and assigning a value:
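For example, an empty list to store the extracted data in:

```python
# Brackets make the object an empty list; we will store our data here.
results = []
```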
Lists in Python are ordered, mutable, and allow duplicate members. Other collections, such as sets or dictionaries, can be used, but lists are the easiest to work with. Time to make more objects!
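We grab the page source from the browser and hand it to Beautiful Soup:

```python
# Add the page source to the variable "content".
content = driver.page_source

# Load the page source into the BeautifulSoup class, which parses the HTML
# into a nested data structure that can be queried with various selectors.
soup = BeautifulSoup(content, 'html.parser')
```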
Before we go on, let's recap how our code should look so far:
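A sketch of the full script at this point (driver path and URL are placeholders):

```python
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
driver.get('https://oxylabs.io/blog')

results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
```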
Try rerunning the application. There should be no errors displayed; if any arise, a few possible troubleshooting options were outlined in the earlier chapters. We have finally arrived at the fun and difficult part – extracting data out of the HTML file. Since in almost all cases we are taking small sections out of many different parts of the page, and we want to store them in a list, we should process every smaller section and then add it to the list:
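A skeleton of the loop; "list-item" is a placeholder class name that we will replace shortly:

```python
# Loop over all elements whose "class" attribute matches the placeholder value.
# The "attrs" filter limits the results to elements with the given class only.
for element in soup.findAll(attrs={'class': 'list-item'}):
    ...
```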
"soup.findAll" accepts a wide array of arguments. For the purposes of this tutorial, we only use "attrs" (attributes). It allows us to narrow down the search by setting up a statement along the lines of "if the attribute is equal to X, then…". Classes are easy to find and use, therefore we shall use those. Before continuing, let's visit the chosen URL in a real browser. Open the page source by using CTRL+U (Chrome), or right-click and select "View Page Source". Find the "closest" class where the data is nested. Another option is to press F12 to open DevTools and use the Element Picker. For example, the data could be nested as:
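An illustrative snippet of page source (the tag and class names are examples):

```html
<h4 class="title">
    <a href="...">This is a Title</a>
</h4>
```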
Our attribute, "class", would then be "title". If you picked a simple target, in most cases data will be nested in a similar way to the example above. Complex targets might require more effort to get the data out. Let's get back to coding and add the class we found in the source:
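```python
# Change "list-item" to the class found in the page source.
for element in soup.findAll(attrs={'class': 'title'}):
    ...
```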
Our loop will now go through all objects with the class "title" in the page source, and we will process each of them. Let's take a look at how our loop goes through the HTML:
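```python
for element in soup.findAll(attrs={'class': 'title'}):
    # Search within the matched element for the <a> tag holding the text.
    name = element.find('a')
```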
Our first statement (in the loop itself) finds all elements whose "class" attribute contains "title". We then execute another search within that class. Our next search finds all the <a> tags in the element (exact <a> matches are included, while partial matches are not). Finally, the object is assigned to the variable "name". We could then assign the object "name" to our previously created list "results", but doing so would bring the entire <a> tag with the text inside it into one element. In most cases, we would only need the text itself without any additional tags:
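```python
    # .text extracts the text inside the element, omitting the HTML tags.
    results.append(name.text)
```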
Our loop will go through the entire page source, find all the occurrences of the class listed above, and then append the nested data to our list:
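```python
for element in soup.findAll(attrs={'class': 'title'}):
    name = element.find('a')
    results.append(name.text)
```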
Note that the two statements after the "for" line are indented. Loops require indentation to denote nesting. Any consistent indentation will be considered legal. Loops without indentation will output an "IndentationError", with the offending statement pointed out by an "arrow".

Exporting the data to CSV

Even if no syntax or runtime errors appear when running our program, there still might be semantic errors. You should check whether we actually get the data assigned to the right object and moved to the list correctly. One of the simplest ways to check whether the data acquired during the previous steps is being collected correctly is to use "print". Since lists have many values, a simple loop is often used to print each entry on a separate line:
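```python
for x in results:
    print(x)
```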
Both "print" and "for" should be self-explanatory at this point. We are only initiating this loop for quick testing and debugging purposes; it is completely viable to print the results directly with print(results). So far, our code should look like this:
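A sketch of the full script so far (the driver path and URL remain placeholders):

```python
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
driver.get('https://oxylabs.io/blog')

results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

for element in soup.findAll(attrs={'class': 'title'}):
    name = element.find('a')
    results.append(name.text)

for x in results:
    print(x)
```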
Running our program now should display no errors and should show the acquired data in the debugger window. While "print" is great for testing purposes, it isn't all that great for parsing and analyzing data. You might have noticed that "import pandas" is still greyed out. We will finally get to put the library to good use. I recommend removing the "print" loop for now, as we will be doing something similar but moving our data to a CSV file:
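```python
# Turn the list into a one-column table and write it to a CSV file.
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
```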
Our two new statements rely on the pandas library. Our first statement creates a variable "df" and turns its object into a two-dimensional data table. "Names" is the name of our column, while "results" is our list to be printed out. Note that pandas can create multiple columns; we just don't have enough lists to utilize those parameters (yet). Our second statement moves the data of variable "df" to a specific file type (in this case, "csv"). Our first parameter assigns a name and an extension to our soon-to-be file. Adding the extension is necessary, as "pandas" will otherwise output a file without one, and it will have to be changed manually. "index" controls whether row numbers are written to the file. "encoding" is used to save data in a specific format; UTF-8 will be enough in almost all cases.
No imports should now be greyed out, and running our application should output a "names.csv" file into our project directory. Note that a "Guessed At Parser" warning remains; we could remove it by installing a third-party parser, but for the purposes of this Python web scraping tutorial, the default HTML option will do just fine.

Exporting the data to Excel

The pandas library features a function to export data to Excel, which makes it a lot easier to move data to an Excel file in one go:
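A sketch (note: recent pandas versions dropped the encoding argument of to_excel and write UTF-8 by default; writing .xlsx also requires an Excel engine such as openpyxl):

```python
df = pd.DataFrame({'Name': results})

# index=False avoids numbering the rows in the output file.
df.to_excel('names.xlsx', index=False)
```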
The new statement creates a DataFrame – a two-dimensional tabular data structure. The column label is "Name", and the rows hold data from the results list. pandas can span more than one column, though that's not required here, as we only have a single column of data. The second statement transforms the DataFrame into an Excel file (".xlsx"). The first argument to the function specifies the filename, "names.xlsx", followed by the index argument set to False to avoid numbering the rows. In older pandas versions, an encoding argument set to "utf-8" could also be passed to support a broader range of characters; current versions write UTF-8 by default.
To sum up, the code above creates a "names.xlsx" file with a "Name" column that includes all the data we have in the results list so far.

More lists. More!

Many web scraping operations will need to acquire several sets of data. For example, extracting just the titles of items listed on an e-commerce website will rarely be useful. In order to gather meaningful information and to draw conclusions from it, at least two data points are needed. For the purposes of this tutorial, we will try something slightly different. Since acquiring data from the same class would just mean appending it to an additional list, we should attempt to extract data from a different class but, at the same time, maintain the structure of our table. Obviously, we will need another list to store our data in:
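```python
other_results = []
```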
Since we will be extracting an additional data point from a different part of the HTML, we will need an additional loop. If needed, we can also add another "if" conditional to control for duplicate entries:
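A sketch; "otherclass" and "othertitle" are hypothetical class names standing in for wherever the second data point is nested:

```python
for b in soup.findAll(attrs={'class': 'otherclass'}):
    name2 = b.find(attrs={'class': 'othertitle'})
    other_results.append(name2.text)
```

Finally, we need to change how our data table is formed:

```python
df = pd.DataFrame({'Names': results, 'Categories': other_results})
```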
So far, the newest iteration of our code should look something like this:
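A sketch of the full script (placeholders as before):

```python
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
driver.get('https://oxylabs.io/blog')

results = []
other_results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

for element in soup.findAll(attrs={'class': 'title'}):
    name = element.find('a')
    results.append(name.text)

for b in soup.findAll(attrs={'class': 'otherclass'}):
    name2 = b.find(attrs={'class': 'othertitle'})
    other_results.append(name2.text)

df = pd.DataFrame({'Names': results, 'Categories': other_results})
df.to_csv('names.csv', index=False, encoding='utf-8')
```

Running this version will most likely raise a ValueError stating that the arrays (our two lists) must all be of the same length, since the two classes rarely yield equally long lists.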
There are dozens of ways to resolve that error message: from padding the shortest list with "empty" values, to creating dictionaries, to creating two series and listing them out. We shall do the third option:
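```python
# Wrap each list in a pandas Series; unequal lengths are padded with NaN.
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')
```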
Note that the data will not be matched, as the lists are of uneven length, but creating two series is the easiest fix if two data points are needed. Our final code should look something like this:
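A sketch of the final script, with the same placeholder driver path, URL, and class names as before:

```python
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
driver.get('https://oxylabs.io/blog')

results = []
other_results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

for element in soup.findAll(attrs={'class': 'title'}):
    name = element.find('a')
    results.append(name.text)

for b in soup.findAll(attrs={'class': 'otherclass'}):
    name2 = b.find(attrs={'class': 'othertitle'})
    other_results.append(name2.text)

series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')

# Close the browser once done.
driver.quit()
```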
Running it should create a CSV file named "names" with two columns of data.

Web scraping with Python best practices

Our first web scraper should now be fully functional. Of course, it is so basic and simplistic that performing any serious data acquisition would require significant upgrades. Before moving on to greener pastures, I highly recommend experimenting with additional features, such as scraping several URLs in one go, running the browser in headless mode, building matched (equal-length) data lists, and adding proxies to avoid getting blocked.
Conclusion

From here onwards, you are on your own. Building web scrapers in Python, acquiring data, and drawing conclusions from large amounts of information is an inherently interesting and complicated process. If you are interested in our in-house solution, check out Web Scraper API for general-purpose scraping applications. If you want to find out more about how proxies or advanced data acquisition tools work, or about specific web scraping use cases, such as web scraping job postings or building a yellow page scraper, check out our blog. We have enough articles for everyone: a more detailed guide on how to avoid blocks when scraping and tackle pagination, whether web scraping is legal, an in-depth walkthrough on what a proxy is, and many more!
How do I learn web scraping in Python?

Learn web scraping with Python from scratch: install the Python web scraping libraries (Beautiful Soup and Requests), extract URLs from a webpage, scrape text data from a webpage, crawl multiple webpages and scrape data from each of them, and handle navigation links to move to the next pages.

Is web scraping in Python hard?

Scraping with Python and JavaScript can be a very difficult task for someone without any coding knowledge. There is a big learning curve, and it is time-consuming. In case you want a step-by-step guide on the process, here's one.
Is Python best for web scraping?

If you need to start writing code for web scraping, it is definitely worth it to learn Python. The best part is that Python, compared to other programming languages, is easy to learn, clear to read, and simple to write in.
Is web scraping easy?

Web scraping might seem intimidating for some people, especially if you've never done any coding in your life. However, there are much simpler ways to automate your data-gathering process without having to write a single line of code.