Web scraping using Python: requests and lxml

Overview

Teaching: 40 min
Exercises: 30 min
Questions
  • How can scraping a web site be automated?

  • How can I download web pages’ HTML in Python?

  • How can I evaluate XPath or CSS selectors in Python?

  • How can I format scraped data as a spreadsheet?

  • How do I build a scraper that will keep working even if the page structure changes?

Objectives
  • Using requests.get and resolving relative URLs with urljoin

  • Traversing HTML and extracting data from it with lxml

  • Creating a two-step scraper to first extract URLs, visit them, and scrape their contents

  • Anticipating some of the things that can break when scraping

  • Storing the extracted data

Recap

Here is what we have learned so far:

This is quite a toolset already, and it’s probably sufficient for a number of use cases, but there are limitations in using the tools we have seen so far. For example, some data may be structured in ways that are too out of the ordinary for visual scrapers, perhaps requiring items to be processed only in certain conditions. There may also be too much data, or too many pages to visit, to simply run the scraper in a web browser, as some visual scrapers operate. Writing a scraper in code may make it easier to maintain and extend, or to incorporate quality assurance and monitoring mechanisms.

Introducing Requests and lxml

We make use of two tools that are not specifically developed for scraping, but are very useful for that purpose (among others). Both require a Python installation (Python 2.7, or Python 3.4 and higher, although our example code will focus on Python 3), and each library (requests, lxml and cssselect) needs to be installed as described in Setup.

Requests focuses on the task of interacting with web sites. It can download a web page’s HTML given its URL. It can submit data as if filled out in a form on a web page. It can manage cookies, keeping track of a logged-in session. And it helps handle cases where the web site is down or takes a long time to respond.
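As a minimal sketch of that last point, the function below wraps requests.get with a timeout and reports failures instead of crashing. The name fetch_html and the 10-second timeout are illustrative assumptions, not part of the lesson's scraper.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the download failed.

    fetch_html and the default timeout are hypothetical choices made
    for this sketch.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn HTTP error statuses (4xx, 5xx) into exceptions
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors
        print('Could not download {}: {}'.format(url, err))
        return None
    return response.text
```

A caller can then check for None before parsing, rather than letting one unreachable page stop the whole run.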

lxml is a tool for working with HTML and XML documents, represented as an element tree. It evaluates XPath and CSS selectors to find matching elements. It facilitates navigating from one element to another. It facilitates extracting the text, attribute values or HTML for a particular element. It knows how to handle badly-formed HTML (such as an opening tag that is never closed, or a closing tag that is never opened), although it may not handle it identically to a particular web browser. It is also able to construct new well-formed HTML/XML documents, element by element.

To use CSS selectors, the cssselect package must also be installed.

If all are correctly installed, it should be possible to then write the following Python code without an error occurring:

>>> import requests
>>> import lxml
>>> import cssselect

We will be working in Python. Open a text editor or IDE (such as Spyder) to edit a new file, saved as unsc-scraper.py. Check that you can run the file with Python, e.g. by running the following in a terminal:

$ python unsc-scraper.py

If unsc-scraper.py is empty, this should run but not output anything to the terminal.

Downloading a page with requests

Let’s start by downloading the page of UNSC resolutions for 2016. Enter the following in your file and save:

import requests

response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')
print(response.text)

You should see the same text that a web browser’s View Source feature would show (albeit less colourful):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Resolutions adopted by the United Nations Security Council since 1946</title>
...

What’s it doing?

We now have the page content, but as a string of textual characters, not as a tree of elements.

Traversing elements in a page with lxml

The following loads the response HTML into a tree of elements, and illustrates the xpath and cssselect methods provided on an ElementTree (and each Element thereof), as well as other tree traversal. Running the following code:

import requests
import lxml.html

response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')
tree = lxml.html.fromstring(response.text)
title_elem = tree.xpath('//title')[0]
title_elem = tree.cssselect('title')[0]  # equivalent to previous XPath
print("title tag:", title_elem.tag)
print("title text:", title_elem.text_content())
print("title html:", lxml.html.tostring(title_elem))
print("title tag:", title_elem.tag)
print("title's parent's tag:", title_elem.getparent().tag)

produces this output:

title tag: title
title text: Resolutions adopted by the United Nations Security Council in 2016
title html: b'<title>Resolutions adopted by the United Nations Security Council in 2016</title>&#13;\n'
title tag: title
title's parent's tag: head

This code begins by building a tree of Elements from the HTML using lxml.html.fromstring(some_html). It then illustrates some operations on the elements. With some element, elem, or the tree:

In the above example, we extract the first (and only) <title> element from the page, show its text, etc., and do the same for its parent, the <head> node. When we print the text of that parent node, we see that it consists of two blank lines. Why?

Apart from basic features of Python, these are all the tools we should need.
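To experiment with these operations without downloading anything, the same calls can be tried on a small HTML string. The snippet below is a sketch on made-up HTML (the table id "listing" and its contents are invented for illustration); it uses only xpath, so it does not depend on cssselect.

```python
import lxml.html

# A made-up snippet standing in for a downloaded page
html = """
<html>
  <head><title>Example listing</title></head>
  <body>
    <table id="listing">
      <tr><td><a href="2010.shtml">2010</a></td><td>First row</td></tr>
      <tr><td><a href="2011.shtml">2011</a></td><td>Second row</td></tr>
    </table>
  </body>
</html>
"""

tree = lxml.html.fromstring(html)

# xpath returns a list of matching elements
title_elem = tree.xpath('//title')[0]
title_text = title_elem.text_content()
parent_tag = title_elem.getparent().tag

# Attribute values are available through the dict-like .attrib
hrefs = [a.attrib['href'] for a in tree.xpath('//table[@id="listing"]//a')]

# Iterating over an element yields its direct children
first_row = tree.xpath('//tr')[0]
cell_texts = [cell.text_content() for cell in first_row]

print(title_text)
print(parent_tag)
print(hrefs)
print(cell_texts)
```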

UNSC scraper overview

Now that we have some idea of what requests and lxml do, let’s use them to scrape UNSC data. We will modularise our scraper design as follows:

  1. A get_year_urls function will return a list of year URLs to scrape resolutions from.
  2. A function get_resolutions_for_year will return an object like {'date': '1962', 'symbol': 'S/RES/174 (1962)', 'title': 'Admission of new Members to the UN: Jamaica', 'url': 'http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/174(1962)'} for each resolution at the given page URL.
  3. The scraper script will run get_year_urls, and then get_resolutions_for_year for each year and write the resolutions in CSV to the file unsc-resolutions.csv.

Spidering pages of UNSC resolutions

We’ll start by compiling a list of URLs to scrape. We will write a Python function called get_year_urls. Its job is to get the set of URLs listing resolutions, which we will later scrape.

For a start, the following function will extract and return a list of the URLs linked to from the starting page:

def get_year_urls():
    start_url = 'http://www.un.org/en/sc/documents/resolutions/'
    response = requests.get(start_url)
    tree = lxml.html.fromstring(response.text)
    links = tree.cssselect('a')  # or tree.xpath('//a')

    out = []
    for link in links:
        # we use this if just in case some <a> tags lack an href attribute
        if 'href' in link.attrib:
            out.append(link.attrib['href'])
    return out

Calling this function and printing its output should produce something like the following:

print(get_year_urls())
["#mainnav", "#content", "http://www.un.org/en/index.html", "/ar/sc/documents/resolutions/", …, "http://undocs.org/rss/scdocs.xml", "2010.shtml", "2011.shtml", …]

We are faced with two issues:

  1. We only want to get the year-by-year resolutions listings, and should ignore the other links.
  2. Only the URLs starting with http:// can directly be passed into requests.get(url). The others are termed relative URLs and need to be modified to become absolute.

Dealing with relative URLs

Most of the URLs found in href attributes are relative to the page we found them in. We could prefix all those URLs with http://www.un.org/en/sc/documents/resolutions/ to make them absolute, but that doesn’t handle all the cases. Since this is a common need, we can use an existing function, requests.compat.urljoin(base_url, relative_url) which will translate:

From relative URL                  To absolute URL
#mainnav                           http://www.un.org/en/sc/documents/resolutions/#mainnav
2010.shtml                         http://www.un.org/en/sc/documents/resolutions/2010.shtml
/ar/sc/documents/resolutions/      http://www.un.org/ar/sc/documents/resolutions/
http://www.un.org/en/index.html    http://www.un.org/en/index.html (unchanged)
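The translations in this table can be checked directly. The sketch below imports urljoin from Python 3's standard library (requests.compat.urljoin is the same function under another name) and assumes the start page URL as the base.

```python
from urllib.parse import urljoin  # requests.compat.urljoin is this same function

# Assumed base: the page the links were found in
base_url = 'http://www.un.org/en/sc/documents/resolutions/'

relative_urls = [
    '#mainnav',
    '2010.shtml',
    '/ar/sc/documents/resolutions/',
    'http://www.un.org/en/index.html',
]

# Resolve each link against the base and keep the results for inspection
absolute_urls = [urljoin(base_url, relative_url) for relative_url in relative_urls]

for relative_url, absolute_url in zip(relative_urls, absolute_urls):
    print(relative_url, '->', absolute_url)
```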

Here, the base_url is "http://www.un.org/en/sc/documents/resolutions/" (the page the links were found in) and the relative URL is something like "#mainnav". However, beware: the base URL is not always identical to the URL you pass into requests.get(url), for two reasons:

  • When you requested the URL, the server may have redirected you to a different page. URLs are therefore relative to the response URL, stored in response.url, rather than the request URL. For example, requests.get("http://www.un.org/ar/sc/documents/resolutions").url returns "http://www.un.org/ar/sc/documents/resolutions/". Note that this subtly, but importantly, adds a "/" at the end.
  • The HTML on a page can indicate that the base for its relative URLs is something else. (See W3Schools on <base>.) That is, if tree.xpath('//head/base/@href') returns something, you should use its first value as the base URL. This does not apply in our case because there is no <base> tag in the page we are scraping.

(A Python scraping framework, Scrapy, recently introduced a way to avoid some of these pitfalls, using response.follow. This is not applicable when using requests and lxml directly.)
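The two caveats above can be folded into one small helper. This is only a sketch: pick_base_url is a hypothetical name, and it expects the caller to pass in response.url and the (possibly empty) list returned by tree.xpath('//head/base/@href').

```python
from urllib.parse import urljoin  # same function as requests.compat.urljoin


def pick_base_url(response_url, base_hrefs):
    """Choose the URL that relative links should be resolved against.

    response_url: the final URL after redirects (response.url in requests)
    base_hrefs: the list from tree.xpath('//head/base/@href'), possibly empty
    """
    if base_hrefs:
        # A <base href> may itself be relative to the page's URL
        return urljoin(response_url, base_hrefs[0])
    return response_url


# Why the trailing "/" matters: without it, the last path segment is replaced
without_slash = urljoin('http://www.un.org/ar/sc/documents/resolutions', '2010.shtml')
with_slash = urljoin('http://www.un.org/ar/sc/documents/resolutions/', '2010.shtml')
print(without_slash)
print(with_slash)
```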

Challenge: Get absolute URLs for year pages

Complete the get_year_urls function by fixing the two issues listed above: it should resolve the relative URLs and only get URLs corresponding to yearly resolution listings.

Solution

We use a more specific CSS selector, along with urljoin:

def get_year_urls():
    """Return a list of absolute URLs, one per year page
    """
    start_url = 'http://www.un.org/en/sc/documents/resolutions/'
    response = requests.get(start_url)
    tree = lxml.html.fromstring(response.text)
    links = tree.cssselect('#content > table a')

    out = []
    for link in links:
        year_url = requests.compat.urljoin(response.url, link.attrib['href'])
        out.append(year_url)

    return out

The following describe alternative solutions:

  1. Use the same CSS selector as above (a), but filter the URLs for those that look like they end in a number followed by .shtml.
  2. Generate all the URLs without downloading the start page, by simply counting from 1946 to the current year, which can be found with import datetime; datetime.datetime.now().year.
  3. Use a different CSS selector for the same content (e.g. table a or td > a).
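The first two alternatives can be sketched with the standard library alone. The names is_year_page and generate_year_urls are hypothetical, the pattern assumes year pages are named with four digits followed by .shtml, and the second function assumes every year from 1946 onwards has such a page.

```python
import datetime
import re

START_URL = 'http://www.un.org/en/sc/documents/resolutions/'


def is_year_page(url):
    """Alternative 1: does the URL end in four digits followed by .shtml?"""
    return re.search(r'\d{4}\.shtml$', url) is not None


def generate_year_urls(first_year=1946, last_year=None):
    """Alternative 2: build the year URLs without downloading the start page."""
    if last_year is None:
        last_year = datetime.datetime.now().year
    return ['{}{}.shtml'.format(START_URL, year)
            for year in range(first_year, last_year + 1)]


print(is_year_page('2010.shtml'))
print(is_year_page('#mainnav'))
print(generate_year_urls(1946, 1948))
```

Generated URLs are only as good as the assumed naming pattern, so a scraper using them should still cope with a year page that turns out not to exist.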

In the final version of get_year_urls, we make a couple of modifications, to ensure we’re getting what we want, and to return the year number along with the URL (by getting it from link.text_content()):

def get_year_urls():
    """Return a list of (year_url, year) pairs
    """
    start_url = 'http://www.un.org/en/sc/documents/resolutions/'
    response = requests.get(start_url)
    tree = lxml.html.fromstring(response.text)
    tables = tree.cssselect('#content > table')
    # Check you captured something and not more than you expected
    if len(tables) != 1:
        print('Expected exactly 1 table, got {}'.format(len(tables)))
        return []

    table = tables[0]
    links = table.cssselect('a')

    out = []
    for link in links:
        year_url = requests.compat.urljoin(response.url, link.attrib['href'])
        year = link.text_content()
        # TODO: validate that year is actually an appropriate number
        out.append((year_url, year))

    # Check we got something
    if not out:
        print('Expected some year URLs, got none')
    return out

In this implementation, we first extract the <table> element, roughly make sure it’s what we want, and then apply a CSS selector to get the content within it.

It’s a good idea to check you’re getting the kind of data you expect, because:

Here we just use print output as a way to report if something went wrong.

Advanced challenge: Validate year

If the page changes and the year text is not a valid number, we’d like to know about that. Write code that validates the year text as being a four-digit number, and does not add a year with invalid text to out.

Solution

Insert at the TODO above:

if len(year) != 4 or not year.isdigit():
    print("Link text '{}' is not a four-digit year".format(year))
    continue

When the URLs to scrape can’t be listed

Sometimes you can list the pages needed to be scraped in advance. Here we can just generate URLs for all years from 1946 until now. Often building a scraper involves analysing the kinds of URLs on a web site and constructing a list of them programmatically.

On the other hand, sometimes you cannot get all the URLs at once, for instance when you need to click a “next page” link (although sometimes these URLs can also be enumerated by identifying patterns in the next page URLs). This means you can’t design your scraper with distinct “collect URLs” and “scrape each URL” phases. Instead you might add each URL to a queue for later processing. (A scraping framework like Scrapy manages this queuing for you.)
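A queue of this kind might look like the following sketch. The function crawl and its get_next_urls callback are hypothetical: the callback stands for whatever code scrapes one page and returns the URLs it links to, and here it is faked with a dictionary so the example runs without a network connection.

```python
from collections import deque


def crawl(start_url, get_next_urls, max_pages=100):
    """Visit pages in the order they are discovered.

    get_next_urls(url) is a hypothetical callback that scrapes one page
    and returns the URLs it links to (e.g. a "next page" link).
    """
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for next_url in get_next_urls(url):
            if next_url not in seen:  # avoid visiting a page twice
                seen.add(next_url)
                queue.append(next_url)
    return visited


# A fake three-page site: each key links to the pages in its list
fake_site = {'page1': ['page2'], 'page2': ['page3', 'page1'], 'page3': []}
visited_pages = crawl('page1', lambda url: fake_site[url])
print(visited_pages)
```

The max_pages limit is a safety net so that a site with endless "next" links cannot keep the scraper running forever.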

We have a list of year pages to scrape. Now we need to scrape the resolutions off each year page.

Scraping a page of UNSC resolutions

At the heart of get_resolutions_for_year is getting a record (a row in the output CSV) for each resolution that contains its details. Looking at 2016, we want:

However, for earlier years such as 1999, the date column is not provided, and we want:

We have a few choices in how to code this up, too:

  1. Match all the symbols with one CSS selector evaluated over the document; match all the titles with another selector; merge them together.
  2. Match all the symbols’ elements with one CSS selector, then iterate over its subsequent sibling elements to get the other fields.
  3. Match all the row elements with one CSS selector, then use a CSS selector within it to get each field.
  4. Match all the row elements with one CSS selector, then use the element’s .getchildren(...) to get each field’s <td> element.

We will take the last approach. Let’s assume that the code for extracting table is basically the same as in get_year_urls:

import requests
import lxml.html

def get_resolutions_for_year(year_url, year):
    """Return a list of resolutions

    Each should be represented as a dict like::

        {'date': ..., 'symbol': ..., 'url': ..., 'title': ..., }
    """
    response = requests.get(year_url)
    tree = lxml.html.fromstring(response.text)
    tables = tree.cssselect('#content > table')
    # Check you captured something and not more than you expected
    if len(tables) != 1:
        print('Expected exactly 1 table, got {}'.format(len(tables)))
        return []
    table = tables[0]
    out = []

    for row_elem in table.cssselect('tr'):
        resolution = {}
        # TODO: extract data for each resolution
        out.append(resolution)

    # Check we got something
    if not out:
        print('Expected some resolutions, got none')
    return out


# Test get_resolutions_for_year on 2016
resolutions = get_resolutions_for_year("http://www.un.org/en/sc/documents/resolutions/2016.shtml", "2016")
for resolution in resolutions:
    print(resolution)

We added:

Limit the number of URLs to scrape through while debugging

Eventually, we want our scraper to apply its extraction to all pages of UNSC resolutions. But while we’re working through to the final code that will allow us to extract the data we want from those pages, we only want to run it on one or a few pages at a time.

Not only will this run faster and let us iterate more quickly between revisions of our code, it will also avoid burdening the server while we’re debugging. This is probably not such an issue for only tens of pages, but it’s good practice, as it can make a difference for larger scraping projects. If you are planning to scrape a massive website with thousands of pages, it’s better to start small. Other visitors to that site will thank you for respecting their legitimate desire to access it while you’re debugging your scraper…

If you have a list of URLs to scrape, such as the output from get_year_urls(), you might simply slice that list. In Python, lists can be sliced using list[start:end], where start and end are numbers, either of which can be left out:

list[start:end] # items from start through end-1
list[start:]    # items from start through the rest of the list
list[:end]      # items from the beginning through end-1
list[:]         # all items

Thus list[:5] will get the first five elements from list.
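Applied to the scraper, this might look like the sketch below. The year_pages list is a stand-in for the output of get_year_urls(), built locally so that the example does not touch the network.

```python
# Stand-in for the (year_url, year) pairs returned by get_year_urls()
year_pages = [
    ('http://www.un.org/en/sc/documents/resolutions/{}.shtml'.format(year), str(year))
    for year in range(2016, 2009, -1)
]

# Only process the first two pages while debugging
debug_pages = year_pages[:2]
for year_url, year in debug_pages:
    print('Would scrape', year, year_url)
```

Slicing never fails when the list is shorter than requested: year_pages[:100] simply returns every item.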

The TODO above needs to be filled in with code to get the HTML elements corresponding to symbol, date (where present) and title, and extracting their text.

Running that script as it is will download the page and print an empty dict for each resolution:

{}
{}
{}
{}
{}
{}
{}
...

Let’s start filling in the TODO above, extracting the symbol text for each resolution:

        children = row_elem.getchildren()
        resolution['symbol'] = children[0].text_content()

This gets the first child of the row, i.e. its first cell, extracts its text, and places it in the resolution dict with the key "symbol". Run the script and check the output. We see:

{'symbol': 'Resolutions adopted by the Security Council \r\n      in 2016'}
{'symbol': 'S/RES/2336 \r\n      (2016)'}
{'symbol': 'S/RES/2335 \r\n      (2016)'}
{'symbol': 'S/RES/2334 \r\n      (2016)'}
{'symbol': 'S/RES/2333 \r\n      (2016)'}
{'symbol': 'S/RES/2332 \r\n      (2016)'}
{'symbol': 'S/RES/2331 \r\n      (2016)'}

Challenge: Identify two issues in that output

There are two problems in the output above. What are they?

Solution

  1. The header has been included.
  2. The symbols surprisingly have " \r\n      " in them.

Cleaning the symbols

We can exclude the header with:

        if len(children) == 1:
            # Assume that a row with 1 element is the header
            continue

Another approach would be to replace table.cssselect('tr') with table.cssselect('tr')[1:] to ignore the first row returned by the selector.

To clean up the messy symbols, we have to realise that "\r\n" are special in Python: they indicate a line break (like pressing enter) in text. So what we have here is a sequence of white-space characters including "\r", "\n", and " ". In HTML, a sequence of white-space characters is usually interpreted as a single space. The following substitutes a single space for any white-space sequence in retrieving an element’s text.

def clean_text(element):
    all_text = element.text_content()
    cleaned = ' '.join(all_text.split())
    return cleaned

Make these two changes and run the script again to check it’s working better.
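The white-space handling can be tried on a plain string, independently of lxml. In this sketch, normalise_whitespace is a hypothetical helper that does the same joining and splitting as clean_text, applied to one of the messy symbols shown earlier.

```python
def normalise_whitespace(text):
    # str.split() with no arguments splits on any run of white-space
    # (spaces, tabs, "\r", "\n") and ignores it at either end of the string
    return ' '.join(text.split())


messy = 'S/RES/2336 \r\n      (2016)'
cleaned = normalise_whitespace(messy)
print(repr(cleaned))  # → 'S/RES/2336 (2016)'
```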

Extracting other fields

We can handle the fact that a date column may or may not be present with:

        if len(children) == 3:
            # there is a date column
            resolution['date'] = clean_text(children[1])
        elif len(children) == 2:
            # adopt the year for the page
            resolution['date'] = year
        else:
            print('Unexpected number of children in row element: {}'.format(len(children)))
            continue

Run this on 2016 and 1999 to check that the output is sensibly getting ‘symbol’ and ‘date’ whether or not the date column is available.

Challenge: fill in the title and url fields

The URL extraction requires finding the <a> element within the symbol cell and extracting its attribute, as in get_year_urls above.

The title text can be extracted like the other fields, except that it is sometimes the second and sometimes the third (but always the last) column.

Hint: You can get the last element of a Python list with [-1]. The CSS selectors :nth-last-child(1) and :nth-last-of-type(1) fulfill a similar purpose.

Solution

        symbol_links = children[0].cssselect('a')
        if len(symbol_links) != 1:
            print('Expected 1 link in the symbol column, got {}'.format(len(symbol_links)))
            continue
        relative_url = symbol_links[0].attrib['href']
        resolution['url'] = requests.compat.urljoin(response.url, relative_url)
        resolution['title'] = clean_text(children[-1])

Putting it all together

All we appear to need now is to write some code to call get_resolutions_for_year for each year, and use Python’s standard csv module to change our dicts into CSV. This code can replace the “Test get_resolutions_for_year on 2016” code and drive the overall scraper.

import csv
import time

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('unsc-resolutions.csv', 'w', newline='') as out_file:
    writer = csv.DictWriter(out_file, ['date', 'symbol', 'title', 'url'])
    writer.writeheader()

    # Loop over years
    for year_url, year in get_year_urls():
        time.sleep(0.1)  # Wait a moment

        print('Processing:', year_url)
        year_resolutions = get_resolutions_for_year(year_url, year)

        for resolution in year_resolutions:
            writer.writerow(resolution)

Some explanation:

You have a full scraper. But does it perfectly capture the data?
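One way to check the CSV-writing step in isolation is to write a couple of records to an in-memory file. This is a sketch: the first record is the example from the scraper overview, the second is a deliberately incomplete, made-up record, and restval='' is an explicit choice of what to write for missing fields.

```python
import csv
import io

fieldnames = ['date', 'symbol', 'title', 'url']
rows = [
    # The example record from the scraper overview
    {'date': '1962', 'symbol': 'S/RES/174 (1962)',
     'title': 'Admission of new Members to the UN: Jamaica',
     'url': 'http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/174(1962)'},
    # A made-up record with missing fields, to see how gaps are written
    {'symbol': 'S/RES/0000 (0000)'},
]

out_file = io.StringIO()  # in-memory stand-in for the output file
writer = csv.DictWriter(out_file, fieldnames, restval='')
writer.writeheader()
for row in rows:
    writer.writerow(row)

csv_lines = out_file.getvalue().splitlines()
for line in csv_lines:
    print(line)
```

Missing fields come out as empty cells. A dict containing a key that is not in fieldnames raises a ValueError by default, which is a useful early warning that the scraper and the output columns have drifted apart.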

Quirks and quality assurance

Run the above scraper. Do our print statements highlight any quirks in the web site?

Open the output (unsc-resolutions.csv) in a spreadsheet program like Microsoft Excel. Can you identify any other quirks from the data?

Challenge: debug the issues

Your scraper should have reported:

  • “Expected 1 link in the symbol column, got 0” in 2013;
  • “Expected exactly 1 table, got 2” in 1964 and 1960; and
  • “Expected some resolutions, got none” in 1959.

View those pages (you should not need to view the source) to identify the associated issues: how are those pages different from the ones you initially designed your scraper for? Then fix the scraper to get the complete, clean dataset.

Solution

  1. 2013 has a header row above the data. Because our scraper already skips the row when there is no link in it, the data is clean. We could modify our scraper to silence the error in this year:
    if len(symbol_links) != 1:
        if year != '2013':
            print('Expected 1 link in the symbol column, got {}'.format(len(symbol_links)))
        continue
    
  2. 1964 and 1960 have the page duplicated!

    Replace:

    if len(tables) != 1:
        print('Expected exactly 1 table, got {}'.format(len(tables)))
        return []
    

    With:

    if not tables:
        print('Expected 1 table, got none')
        return []
    if len(tables) > 1:
        print('Taking first of {} tables'.format(len(tables)))
    
  3. Our system correctly identifies that there are no resolutions in 1959. We could modify our scraper to silence the error in this year:
    if year != '1959':
        print('Expected some resolutions, got none')
    

In constructing this lesson, we identified several quirks in the data, where one year differed from another in surprising ways (and there may be more we have not identified!). We have discussed many of these:

These quirks are somewhat peculiar to web sites that are manually edited. However, similar things can happen with database-backed web sites. For instance:

Tips for quirk resilience

Here are some tips about how you could ensure that your scraper will work despite variation.

  1. Look at your scraped data. Look at it more closely. Look at random samples collected over time. Perhaps analyse it in a tool like OpenRefine which will show you the number of distinct/duplicated values in each column. If you are scraping data over a long time, keep a dashboard of diagnostic measures to show you how many fields come back blank, for instance.
  2. Think about cases where your scraper might fail, and anticipate them in code. Validate the extractions in your code. When something differs from expectation, output an informative message to a log. Make sure the log includes enough information about the context, e.g. which page or part of the page you are scraping at the time.
  3. Only allow an error to halt your scraping operation if that’s really necessary, by wrapping your main scraper code in an exception handler. For example:
            print('Processing:', year_url)
            try:
                year_resolutions = get_resolutions_for_year(year_url, year)
            except Exception:
                # the exception has been caught instead of Python exiting
                print('ERROR while processing', year, ':')
                import traceback
                traceback.print_exc()  # describe the error and what code triggered it
                continue  # skip to the next year
    

    Consider this a last resort: if an error occurs, any resolutions scraped from the error year will not be output.

  4. Write helper functions to make cleaning and error identification easy for you. clean_text is one example. Another useful helper might be:
    def extract_one(list_of_extractions, default=None):
        if len(list_of_extractions) == 0:
            print('Expected some extractions, but got none')
            return default
        if len(list_of_extractions) > 1:
            print('Expected 1 extraction, but got {}'.format(len(list_of_extractions)))
        return list_of_extractions[0]
    

    As well as alerting you to more than one extraction, this avoids triggering an error if your cssselect or xpath query returns an empty list.
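Tip 2 can be sketched with Python's standard logging module in place of print. The logger name and the helper check_single_table are hypothetical; the point is that every message carries the URL of the page being scraped.

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s %(message)s')
log = logging.getLogger('unsc-scraper')  # the logger name is an arbitrary choice


def check_single_table(tables, year_url):
    """Return the table to scrape, logging the page URL if something is off."""
    if not tables:
        log.warning('%s: expected 1 table, got none', year_url)
        return None
    if len(tables) > 1:
        log.warning('%s: expected 1 table, got %d; taking the first',
                    year_url, len(tables))
    return tables[0]


# Strings stand in for lxml table elements in this demonstration
chosen = check_single_table(['table A', 'table B'], 'http://example.org/1964.shtml')
print(chosen)
```

Unlike print, logging can later be redirected to a file or filtered by severity without changing the scraping code.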

A specialised framework like Scrapy helps manage tasks like logging, diagnostics, and handling empty lists of extractions.
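Without such a framework, a small helper can still provide a basic diagnostic of the kind suggested in tip 1. In this sketch, count_blank_fields is a hypothetical function and the sample records are made up for illustration.

```python
from collections import Counter


def count_blank_fields(records, fields):
    """Count, per field, how many records have it missing or empty."""
    blanks = Counter()
    for record in records:
        for field in fields:
            if not record.get(field):
                blanks[field] += 1
    return blanks


# Made-up records: one with an empty date, one with no date and an empty title
sample = [
    {'date': '', 'symbol': 'a', 'title': 't'},
    {'symbol': 'b', 'title': ''},
]
blank_counts = count_blank_fields(sample, ['date', 'symbol', 'title'])
print(blank_counts)
```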

Using the data

Challenge: Analyse the data

Perform some interesting analysis of the data, for instance:

  • Plot the number of resolutions per year. Are there interesting periods of increase or lull?
  • Count how often each title occurs.
  • Identify which words are most frequent in the titles.
  • Plot only those resolutions that pertain to membership vs those that do not.
  • Plot only those resolutions mentioning some country of choice (e.g. Israel or Pakistan) in their title.
  • Very advanced: look up strings of capitalised words (optionally including lowercase words like “of” and “for”) in the Wikipedia or Wikidata API to associate the names with locations. Plot them on a map!

A Pivot Table will be very useful for performing these analyses in Excel or Google Sheets. Similar functionality is provided in Python by Pandas and its pivot and groupby functionality.
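For small analyses, the standard library may be enough. The sketch below counts resolutions per year and word frequencies in titles. It assumes the year can be read from the parenthesised part of the symbol, as in 'S/RES/174 (1962)'. The function names are hypothetical, and apart from the first record (taken from the scraper overview) the sample rows are made-up placeholders.

```python
import re
from collections import Counter


def year_of(resolution):
    """Take the year from a symbol such as 'S/RES/174 (1962)'."""
    match = re.search(r'\((\d{4})\)', resolution['symbol'])
    return match.group(1) if match else None


def resolutions_per_year(resolutions):
    return Counter(year_of(r) for r in resolutions)


def word_counts(resolutions):
    words = []
    for r in resolutions:
        # Lowercase the title and keep runs of letters only
        words.extend(re.findall(r'[a-z]+', r['title'].lower()))
    return Counter(words)


sample = [
    {'symbol': 'S/RES/174 (1962)', 'title': 'Admission of new Members to the UN: Jamaica'},
    {'symbol': 'S/RES/000 (1962)', 'title': 'Placeholder title about members'},  # made up
    {'symbol': 'S/RES/001 (1963)', 'title': 'Another placeholder title'},  # made up
]
per_year = resolutions_per_year(sample)
words = word_counts(sample)
print(per_year)
print(words.most_common(3))
```

With the real output, the rows would come from csv.DictReader reading unsc-resolutions.csv instead of the sample list.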

Extension challenge: multilingual UNSC resolutions

Run the scraper on resolutions in Arabic (start at http://www.un.org/ar/sc/documents/resolutions/) or Chinese (start at http://www.un.org/zh/sc/documents/resolutions/) and merge the results with English to have columns en_title, ar_title, etc.

Hint: pandas, a tool for tabular data, can read in CSV (pandas.read_csv) and can merge together multiple tables on the basis of some matching keys (pandas.merge).

While here we have extracted data that was already in tables into another tabular format, very often what we’re processing doesn’t look like a table on the web site. But the procedure is the same: identify the elements that you wish to extract, and apply a pattern which selects them from the HTML.

You are now ready to write your own scrapers!

Advanced topics and resources

Aside from ethical questions addressed in the next episode, below are a number of advanced topics related to building a web scraper. Most are features of existing specialised scraping frameworks, such as Scrapy, or commercial scraping tools.

So why didn’t we learn Scrapy?

Scrapy provides a great framework for designing, implementing and managing robust and efficient scrapers. However, we get the sense that people who are not very experienced at programming find the declarative paradigm facilitated by Scrapy very foreign.

On the other hand, writing a more procedural scraper as we have here with the nuts and bolts of requests and lxml helps to motivate some of the issues that Scrapy endeavours to solve or ameliorate.

Reference

Key Points

  • requests is a Python library that helps with downloading web pages, primarily with requests.get.

  • requests.compat.urljoin(response.url, href) may be used to resolve a relative URL href.

  • lxml is a Python library that parses HTML/XML and evaluates XPath/CSS selectors.

  • lxml.html.fromstring(page_source) will produce an element tree from some HTML code.

  • An element tree’s cssselect and xpath methods extract elements of interest.

  • A scraper can be divided into: identifying the set of URLs to scrape; extracting some elements from a page; and transforming them into a useful output format.

  • It is important but challenging to be resilient to variation in page structure: one should automatically validate and manually inspect their extractions.

  • A framework like Scrapy may help to build robust scrapers, but may be harder to learn. See the Scrapy tutorial in Extras.