Introduction to web scraping: Selecting content on a web page with XPath

Before we delve into web scraping proper, we will first spend some time introducing some of the techniques that are required to indicate exactly what should be extracted from the web pages we aim to scrape.

The material in this section was adapted from the XPath and XQuery Tutorial written by Kim Pham (@tolloid) for the July 2016 Library Carpentry workshop in Toronto.

Matching patterns

A key part of web scraping is describing to the computer how it should find the content you seek. Several tools have been designed for succinctly describing patterns that can be matched to document structure so that selected content can be efficiently extracted. The most important for web scraping are:

Regular expressions: These specify portions of strings of characters (e.g. text, a URL). They can be used to identify, for instance, typical forms of a date (yyyy-mm-dd, d/m/yyyy, etc.) or of an email address, or whether a URL is the kind of URL you want to download and scrape.
XPath: These specify parts of a tree-structured document, be it XML or HTML. They can be very specific about which nodes to include or exclude.
CSS selectors: These serve a similar function to XPath, in selecting parts of an HTML document, but were designed for web development (for applying styles such as colour to parts of a document) and so are more commonly known, but also limited in what they can express relative to XPath. Every CSS selector can be translated into an equivalent XPath expression.
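
To make the first of these tools concrete, here is a minimal Python sketch of a regular expression that matches dates written as yyyy-mm-dd. The sample text and the pattern are illustrative assumptions rather than part of the lesson material, and the pattern only checks the shape of the date, not whether it is a valid calendar date.

```python
import re

# Hypothetical sample text containing one date in yyyy-mm-dd form.
text = "This batch was produced on 2019-10-01 and shipped a week later."

# Four digits, a hyphen, two digits, a hyphen, two digits,
# with word boundaries so longer digit runs are not matched.
date_pattern = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

dates = date_pattern.findall(text)
print(dates)
```

Unlike XPath, the pattern knows nothing about document structure: it would match the same characters wherever they appear in the text.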

XPath (which stands for XML Path Language) is an expression language used to specify parts of an XML document. XPath is often used within technologies aimed at manipulating XML documents, such as XSLT and XQuery. Later in this lesson we enter XPath expressions into web scraping tools, specifying parts of an HTML document to scrape.

Markup Languages

When you view a page in your web browser, this usually involves downloading content encoded in HTML. The browser then renders this content visually.

XML and HTML are markup languages. This means that they use a set of tags or rules to organise and provide information about the data they contain. This structure helps to automate the processing, editing, formatting, display and printing of that information.

XML documents store data in plain text format. This provides a software- and hardware-independent way of storing, transporting, and sharing data. XML format is an open format, meant to be software agnostic. You can open an XML document in any text editor and the data it contains will be shown as it is meant to be represented. This allows for exchange between incompatible systems and easier conversion of data.

XML and HTML

Note that HTML and XML have a very similar structure, which is why XPath and CSS selectors can be used almost interchangeably to navigate both HTML and XML documents. In a loose sense, HTML is like a particular dialect of XML.

Structure of a marked-up document

An XML document follows basic syntax rules:

An XML document is structured using nodes, which include element nodes, attribute nodes and text nodes.
XML element nodes must have an opening and closing tag, e.g. <catfood> opening tag and </catfood> closing tag. Everything between those tags is contained within the element.
XML tag names are case sensitive, e.g. <catfood> does not equal <catFood>.
Within an element there may be other child elements. These must be properly nested (every child element that is opened must also be closed):

<catfood type="basic">
  <manufacturer>Purina</manufacturer>
  <contact>
    <address class="USA"> 12 Cat Way, Boise, Idaho, 21341</address>
  </contact>
  <date>2019-10-01</date>
</catfood>

Within an element there may also be text nodes. Purina and 2019-10-01 are both text nodes. Another text node contains the white space between <catfood> and <manufacturer>.
XML attribute nodes (like type in <catfood> above) have a name, and a value that must be quoted.
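
The node types described above can be inspected with a short Python sketch that parses the catfood example using the standard library's xml.etree.ElementTree module. Note one assumption of this particular library: it does not represent attributes and text as separate node objects, but exposes them as the `attrib` and `text` properties of each element.

```python
import xml.etree.ElementTree as ET

# The catfood document from the example above.
xml_source = """<catfood type="basic">
  <manufacturer>Purina</manufacturer>
  <contact>
    <address class="USA"> 12 Cat Way, Boise, Idaho, 21341</address>
  </contact>
  <date>2019-10-01</date>
</catfood>"""

root = ET.fromstring(xml_source)

# Element node: the tag name of the outermost element.
print(root.tag)

# Attribute node: name/value pairs attached to the element.
print(root.attrib)

# Text node inside a child element.
manufacturer = root.find("manufacturer")
print(manufacturer.text)

# The white space between <catfood> and <manufacturer> is also text.
print(repr(root.text))
```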

Note that there may be multiple elements with a particular tag name:

<product>
  <catfood type="basic"> ... </catfood>
  <catfood type="basic"> ... </catfood>
  <catfood type="premium"> ... </catfood>
</product>

Some of these rules are relaxed in HTML:

tag and attribute names are case insensitive (<catfood type="basic"> equals <catFood Type="basic">)
some elements are closed automatically (e.g. <img> cannot contain any other elements or text)
attribute values do not need to be quoted

HTML can nonetheless be represented as a tree of nodes.

Tree structure

A popular way to represent the structure of an XML or HTML document is the node tree, where each rectangle is a node:

XML node tree

We use the terms parent, child and sibling to describe the hierarchical relationships between nodes:

The top node is called the root (or root node).
Every node has exactly one parent, except the root (which has no parent).
An element node (one kind of node) can have zero, one or several children. Attribute and text nodes have no children.
Siblings are nodes with the same parent.
The sequence of connections from node to node is called a path.
A node’s children and its children’s children, etc., are called its descendants. Similarly, a node’s parent and its parent’s parent, etc., are called its ancestors.
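
These relationships can be explored in code. The sketch below uses Python's standard xml.etree.ElementTree module on a shortened version of the catfood example. One caveat that is specific to this library: elements do not store a reference to their parent, so the sketch first builds a lookup table (called `parent_of` here, a name chosen for illustration).

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<catfood><manufacturer>Purina</manufacturer>"
    "<contact><address>12 Cat Way</address></contact>"
    "<date>2019-10-01</date></catfood>"
)

# Map every element to its parent; the root has no entry.
parent_of = {child: parent for parent in root.iter() for child in parent}

contact = root.find("contact")
address = root.find("contact/address")

# Children: elements directly below the root.
children = [child.tag for child in root]

# Descendants: children, their children, and so on, in document order.
descendants = [node.tag for node in root.iter() if node is not root]

# Siblings: other elements that share the same parent.
siblings = [node.tag for node in parent_of[contact] if node is not contact]

# Ancestors: follow parent links until the root is reached.
ancestors = []
node = address
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

print(children, descendants, siblings, ancestors)
```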

Common HTML elements

In HTML, the tag names aren’t usually as specific in their semantics as manufacturer or address. Here are some of the most common HTML elements:

Tag name	What it is used for
`p`	A paragraph of text
`h1`	A top-level heading
`h2`, `h3`, …	A lower-level heading
`li`	An item in a list
`img`	An image
`tr`	A row in a table
`td`	A cell in a table
`a`	A link
`div`	A block of space on the page (generic)
`span`	A portion of text on the page (generic)
`meta`	Information about the page that is not shown

See the Mozilla Developer Network for a full listing.

XPath Expressions

Using XPath is similar to using advanced search in a library catalogue, where the structured nature of bibliographic information allows us to specify which metadata fields to query. For example, if we want to find books about Shakespeare but not works by him, we can limit our search function to the subject field only.

When we use XPath, we do not need to know in advance what the content we want looks like (as we would with regular expressions, where we need to know the pattern of the data). Since XML documents are structured as a network of nodes, XPath makes use of that structure to navigate through the nodes to select the data we want. We just need to know in which nodes within an XML file the content we want to find resides.

An XPath expression is, somewhat like a search query, a short piece of text that describes which nodes are sought. An XPath expression can be evaluated on a document (or on each of many documents): the XPath evaluator follows the instructions implied by the XPath expression, finds the sought nodes in the document and returns them.

Here are some examples of what one can express with CSS selectors, with the equivalent XPath expressions for comparison, based on the document fragments above:

CSS selector	XPath expression	Description
`address`	`//address`	Get every `address` element (and its contents) in the document
`catfood address`	`//catfood//address`	Get every `address` element somewhere inside a `catfood` element
`catfood[type=basic]`	`//catfood[@type='basic']`	Get every `catfood` element that has a `type` attribute with value “basic”
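
As a rough illustration, the three XPath expressions in the table can be tried outside the browser with Python's standard xml.etree.ElementTree module. This is a sketch under stated assumptions: the sample document is hypothetical (two catfood entries plus a separate head-office address), and ElementTree supports only a subset of XPath and expects paths relative to the element being searched, so `//address` is written `.//address`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, loosely based on the fragments above.
product = ET.fromstring("""<product>
  <address>Head office</address>
  <catfood type="basic">
    <contact><address class="USA">12 Cat Way, Boise, Idaho, 21341</address></contact>
  </catfood>
  <catfood type="premium">
    <contact><address class="USA">7 Hypothetical Road</address></contact>
  </catfood>
</product>""")

# //address : every address element in the document.
all_addresses = product.findall(".//address")

# //catfood//address : only addresses somewhere inside a catfood element.
catfood_addresses = product.findall(".//catfood//address")

# //catfood[@type='basic'] : catfood elements whose type attribute is "basic".
basic_catfood = product.findall(".//catfood[@type='basic']")

print(len(all_addresses), len(catfood_addresses), len(basic_catfood))
```

The head-office address is matched by the first expression but not by the second, which is the difference between the two rows of the table.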

Navigating through the HTML node tree using XPath

The following images show a node tree:

HTML Node Tree

XPath is built around describing the relationships between the nodes of a document:

Node relationships

Paths in XPath are defined using slashes (/) to separate the steps in a node connection sequence, much like URLs or Unix directories.

In XPath, all expressions are evaluated relative to a context node. The context node is the node from which a path starts. The default context is the root node, indicated by a single slash (/), as in the example above.

The most useful path expressions are listed below:

Expression	Description
`nodename`	Select all nodes with the name “nodename”
`/`	A leading single slash selects from the root node; subsequent slashes select a child node of the current node
`//`	Select all descendants (direct and indirect children) of the current node; this gives us the ability to “skip levels”
`.`	Select the current context node
`..`	Select the parent of the context node
`@`	Select attributes of the context node
`[@attribute = 'value']`	Select nodes with a particular attribute value
`text()`	Select the text content of a node
`|`	The pipe combines two expressions and returns the nodes matched by either one (think of a set union)
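
The difference between the child step (`/`), the descendant step (`//`) and the parent step (`..`) can be seen in a small Python sketch. It relies on the standard xml.etree.ElementTree module, which supports these three steps but not every expression in the table (for example, it has no `text()` or `|`). The sample document is a hypothetical one made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one address directly under the root,
# and one nested two levels deeper.
product = ET.fromstring(
    "<product>"
    "<address>Head office</address>"
    "<catfood><contact><address>12 Cat Way</address></contact></catfood>"
    "</product>"
)

# Child step: only address elements directly below the context node.
direct = [a.text for a in product.findall("./address")]

# Descendant step: address elements at any depth ("skipping levels").
anywhere = [a.text for a in product.findall(".//address")]

# Parent step: the parent of each address element found.
parents = [p.tag for p in product.findall(".//address/..")]

print(direct, anywhere, parents)
```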

Drill common XPath expressions using the XPath Diner

The XPath Diner is a fun way to practice writing XPath expressions. It shows XML code, with a corresponding display of food, and challenges you to select certain food, and only that food, by writing an appropriate XPath expression. In the pane on the right, it teaches you about expression syntax relevant to the current challenge.

Evaluating XPath in a web browser

We will use the HTML code that describes this very page you are reading as an example. By default, a web browser interprets the HTML code to determine how to present the various elements of a document, and the code is invisible. To make the underlying code visible, all browsers have a function to display the raw HTML content of a web page.

Display the source of this page

Using your favourite browser, display the HTML source code of this page.

Tip: in most browsers, all you have to do is right-click anywhere on the page and select the “View Page Source” option (“Show Page Source” in Safari).

Another tab should open with the raw HTML that makes this page. See if you can locate its various elements, and this challenge box in particular.

Using the Safari browser

If you are using Safari, you must first turn on the “Develop” menu in order to view the page source, and use the functions that we will use later in this section. To do so, navigate to Safari > Preferences and in the Advanced tab select the “Show Develop in menu bar” option.

The HTML structure of the page you are currently reading looks something like this (most text and elements have been removed for clarity):

<!doctype html>
<html lang="en">
  <head>
    (...)
    <title>Selecting content on a web page with XPath</title>
  </head>
  <body>
	 (...)
  </body>
</html>

We can see from the source code that the title of this page is in a title element that is itself inside the head element, which is itself inside an html element that contains the entire content of the page.

If we wanted to tell a web scraper to look for the title of this page, we would use this information to indicate the path the scraper would need to follow as it navigates through the HTML content of the page to reach the title element. XPath allows us to do that.

We can evaluate XPath expressions directly from within all major modern browsers, by enabling the built-in JavaScript console.

Display the console in your browser

In Firefox, use the Tools > Web Developer > Web Console menu item.

In Chrome, use the View > Developer > JavaScript Console menu item.

In Safari, use the Develop > Show Error Console menu item. If your Safari browser doesn’t have a Develop menu, you must first enable this option in the Preferences (see above).

Here is what the console looks like in the Firefox browser:

JavaScript console in Firefox

For now, don’t worry too much about error messages if you see any in the console when you open it. The console should display a prompt with a > character (» in Firefox) inviting you to type commands.

The syntax to evaluate an XPath expression on the current page within the JavaScript console is $x("XPATH_QUERY"). For example:

$x("/html/head/title/text()")

This should return something similar to

<- Array [ #text "Selecting content on a web page with XPath" ]

The output can vary slightly based on the browser you are using. For example in Chrome, you have to “open” the return object by clicking on it in order to view its contents.

Let’s look closer at the XPath query used in the example above: /html/head/title/text(). The first / indicates the root of the document. With that query, we told the browser to

`/`	Start at the root of the document…
`html/`	… navigate to the `html` node …
`head/`	… then to the `head` node that’s inside it…
`title/`	… then to the `title` node that’s inside it…
`text()`	and select the text node contained in that element

Using this syntax, XPath thus allows us to determine the exact path to a node.

Select the “Matching patterns” title

Write an XPath query that selects the “Matching patterns” title above and try running it in the console.

Tip: if a query returns multiple elements, the syntax element[1] can be used. Note that XPath uses one-based indexing, therefore the first element has index 1, the second has index 2 etc.
Solution
$x("/html/body/div/h1[1]")
should produce something similar to
<- Array [ <h1#matching-patterns> ]

Before we look into other ways to reach a specific HTML node using XPath, let’s start by looking closer at how nodes are arranged within a document and what their relationships with each other are.

For example, to select all the blockquote elements visible on this page, we can write

$x("html/body/div/blockquote")

This produces an array of objects:

<- Array [ <blockquote.objectives>, <blockquote.callout>, <blockquote.callout>, <blockquote.challenge>, <blockquote.callout>, <blockquote.callout>, <blockquote.challenge>, <blockquote.challenge>, <blockquote.challenge>, <blockquote.keypoints> ]

This selects all the blockquote elements that are under html/body/div. If we want instead to select all blockquote elements in this document, we can use the // syntax instead:

$x("//blockquote")

This produces a longer array of objects:

<- Array [ <blockquote.objectives>, <blockquote.callout>, <blockquote.callout>, <blockquote.challenge>, <blockquote.callout>, <blockquote.callout>, <blockquote.challenge>, <blockquote.solution>, <blockquote.challenge>, <blockquote.solution>, 3 more… ]

Why is the second array longer?

If you look closely into the array that is returned by the $x("//blockquote") query above, you should see that it contains objects like <blockquote.solution> that were not included in the results of the first query. Why is this so?

Tip: Look at the source code and see how the challenges and solutions elements are organised.

We can use the class attribute of certain elements to filter down results. For example, looking at the list of blockquote elements returned by the previous query, and by looking at this page’s source, we can see that the blockquote elements on this page are of different classes (challenge, solution, callout, etc.).

To refine the above query to get all the blockquote elements of the challenge class, we can type

$x("//blockquote[@class='challenge']")

which returns

<- Array [ <blockquote.challenge>, <blockquote.challenge>, <blockquote.challenge>, <blockquote.challenge> ]

Finding by class name in XPath

In general, an element may have many classes, which are separated by spaces in the attribute. For example, the following element has two classes, author and author-id3451:
<span class="author author-id3451">Jane Bloggs</span>
CSS selectors make selecting by class easy: .author will select any element with the author class; .author-id3451 any element with that class.

XPath is unwieldy for expressing that an attribute should contain a particular whitespace-delimited word. The following is usually sufficient to match any element with an author class:
//*[contains(concat(' ', @class, ' '), ' author ')]
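
The reasoning behind that expression can be sketched in Python. The helper below, `has_class`, is a hypothetical name introduced for this illustration; it applies the same trick as the XPath expression, padding both the attribute value and the class name with spaces so that only whole words match.

```python
def has_class(class_attribute: str, name: str) -> bool:
    """Mirror contains(concat(' ', @class, ' '), ' name ')."""
    # Padding with spaces means "author" can only match a whole
    # space-delimited word, not part of a longer class name.
    return f" {name} " in f" {class_attribute} "


# The class attribute from the example element above.
print(has_class("author author-id3451", "author"))

# A plain substring test would wrongly accept this value,
# because "author" appears inside "author-id3451".
print("author" in "author-id3451")
print(has_class("author-id3451", "author"))
```

Like the XPath expression, this sketch only treats the space character as a separator, which is why the text above describes it as usually, rather than always, sufficient.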

Select the “Matching patterns” title by ID

In a previous challenge, we were able to select the “Matching patterns” title because we knew it was the first h1 element on the page. But what if we didn’t know how many such elements were on the page? In other words, is there a different attribute that allows us to uniquely identify that title element?

Using the path expressions introduced above, rewrite your XPath query to select the “Matching patterns” title without using the [1] index notation.

Tips:

Look at the source of the page or use the “Inspect element” function of your browser to see what other information would enable us to uniquely identify that element.

The syntax for selecting an element like <div id="mytarget"> is div[@id = 'mytarget'].
Solution
$x("/html/body/div/h1[@id='matching-patterns']")
should produce something similar to
<- Array [ <h1#matching-patterns> ]

What makes a good XPath expression?

Essential to web scraping is being able to select the set of elements you want, without selecting any additional elements. Many web scrapers are run periodically on web sites with changing data. One challenge is that the structure of a page can also change, or vary across the items being scraped.

A good XPath expression may be characterised by:

being specific: not capturing things you don’t need

being robust to change: the expression should ideally still work if some basic things about the page structure change (e.g. an image is inserted; a field, such as an author’s name, is absent). If the expression breaks due to a change, it is better for it to extract nothing than to extract the wrong thing, as it is easy to monitor how many XPaths come back empty.

being easy to read: if things are going to break or change, readable selectors help to avoid redesigning the selector from scratch.

All of these characteristics suggest preferring IDs and classes (and perhaps other attributes) where possible, and being judicious in choosing between a child or a descendant operator.

Select this challenge box

Using an XPath query in the JavaScript console of your browser, select the element that contains the text you are currently reading on this page.

Tips:

In principle, id attributes in HTML are unique on a page. This means that if you know the id of the element you are looking for, you should be able to construct an XPath that looks for this value without having to worry about where in the node tree the target element is located.

The syntax for selecting an element like <div id="mytarget"> is div[@id = 'mytarget'].

Remember that XPath queries are relative to a context node, and by default that node is the root node.

Use the // syntax to select for elements regardless of where they are in the tree.

The syntax to select the parent element relative to a context node is .. (two dots).

The $x(...) JavaScript syntax will always return an array of nodes, regardless of the number of nodes returned by the query. Unlike XPath, JavaScript uses zero-based indexing, so the syntax to get the first element of that array is $x(...)[0].

Make sure you select this entire challenge box. If the result of your query displays only the title of this box, have a second look at the HTML structure of the document and try to figure out how to “expand” your selection to the entire challenge box.
Solution

Let’s have a look at the HTML code of this page, around this challenge box (using the “View Source” option in our browser). The code looks something like this:
<!doctype html>
<html lang="en">
  <head>
    (...)
  </head>
  <body>
	<div class="container">
	(...)
	  <blockquote class="challenge">
	    <h2 id="select-this-challenge-box">Select this challenge box</h2>
	    <p>Using an XPath query in the JavaScript console of your browser...</p>
	    (...)
	  </blockquote>
	(...)
	</div>
  </body>
</html>
We know that the id attribute should be unique, so we can use this to select the h2 element inside the challenge box:
$x("//h2[@id = 'select-this-challenge-box']/..")[0]
This should return something like
<- <blockquote class="challenge">
Let’s walk through that syntax:

`$x("`	This function tells the browser we want it to execute an XPath query.
`//`	Look anywhere in the document…
`h2`	… for an h2 element …
`[@id = 'select-this-challenge-box']`	… that has an `id` attribute set to `select-this-challenge-box`…
`/..`	… and select the parent node of that h2 element.
`")`	This is the end of the XPath query.
`[0]`	Select the first element of the resulting array (since `$x()` returns an array of nodes and we are only interested in the first one).
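
The same query can be mirrored outside the browser as a rough Python sketch. It uses the standard xml.etree.ElementTree module on a simplified, well-formed version of the page structure shown above (real HTML pages are often not well-formed XML, so this is an assumption made for the example). ElementTree expects a relative path, hence the leading dot.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the page structure shown in the solution.
page = ET.fromstring("""<html>
  <body>
    <div class="container">
      <blockquote class="challenge">
        <h2 id="select-this-challenge-box">Select this challenge box</h2>
        <p>Using an XPath query in the JavaScript console of your browser...</p>
      </blockquote>
    </div>
  </body>
</html>""")

# Find the h2 by its id anywhere in the tree, then step up to its parent.
matches = page.findall(".//h2[@id='select-this-challenge-box']/..")

# findall returns a Python list, which is zero-based like a JavaScript array.
box = matches[0]
print(box.tag, box.attrib)
```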

By hovering on the object returned by your XPath query in the console, your browser should helpfully highlight that object in the document, enabling you to make sure you got the right one:

Advanced XPath syntax

FIXME: All the content below is from the original XPath lesson. Adapt content to use current example.

Operators

Operators are used to compare nodes. There are mathematical operators and boolean operators. Operators can give you a boolean (true/false) value as a result. Here are some useful ones:

Operator	Explanation
`=`	Equality comparison, can be used for numeric or text values
`!=`	Inequality comparison
`>, >=`	Greater than, greater than or equal to
`<, <=`	Less than, less than or equal to
`or`	Boolean or
`and`	Boolean and
`not()`	Boolean not (written as a function in XPath, e.g. not(@id))

Examples

Path Expression	Expression Result
`html/body/div/h3/@id='exercises-2'`	Does exercise 2 exist?
`html/body/div/h3/@id!='exercises-4'`	Is there an h3 whose id is something other than exercises-4?
`//h1/@id='references' or //h1/@id='introduction'`	Is there an h1 references or introduction?

Predicates

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are always embedded in square brackets, and are meant to provide additional filtering information to bring back nodes. You can filter on a node by using operators or functions.

Examples

Predicate	Explanation
`[1]`	Select the first node
`[last()]`	Select the last node
`[last()-1]`	Select the last but one node (also known as the second last node)
`[position()<3]`	Select the first two nodes; note that positions start at 1, not 0
`[@lang]`	Select nodes that have a ‘lang’ attribute
`[@lang='en']`	Select all the nodes that have a “lang” attribute with a value of “en”
`[price>15.00]`	Select all nodes that have a price child node with a value greater than 15.00
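
Several of these predicates can be tried in a short Python sketch using the standard xml.etree.ElementTree module. The bookstore document is hypothetical, invented for this illustration. ElementTree supports positional and attribute predicates but not numeric comparisons such as `[price>15.00]`, so that last case is approximated with an ordinary Python filter rather than with XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical data used only to exercise the predicates.
shelf = ET.fromstring("""<bookstore>
  <book lang="en"><title>A</title><price>12.50</price></book>
  <book lang="fr"><title>B</title><price>18.00</price></book>
  <book><title>C</title><price>30.00</price></book>
</bookstore>""")


def titles(path):
    """Return the titles of the books selected by an XPath expression."""
    return [book.findtext("title") for book in shelf.findall(path)]


first = titles("./book[1]")               # positions start at 1
last = titles("./book[last()]")
second_last = titles("./book[last()-1]")
with_lang = titles("./book[@lang]")       # attribute is present
english = titles("./book[@lang='en']")    # attribute has a given value

# Stand-in for [price>15.00], filtered in Python instead of XPath.
expensive = [
    book.findtext("title")
    for book in shelf.findall("./book")
    if float(book.findtext("price")) > 15.00
]

print(first, last, second_last, with_lang, english, expensive)
```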

Examples

Path Expression	Expression Result
`//h1[2]`	Select the second h1 (among the h1 children of the same parent)
`//h1[@id='references' or @id='introduction']`	Select h1 references or introduction

Wildcards

XPath wildcards can be used to select unknown XML nodes.

Wildcard	Description
`*`	Matches any element node
`@*`	Matches any attribute node
`node()`	Matches any node of any kind

Examples

Path Expression	Result
`//*[@id="examples-2"]`	Select all elements with id attribute ‘examples-2’
`//*[@class='solution']`	Select all elements with class attribute ‘solution’

In-text search

XPath can do in-text searching using functions such as contains() and starts-with(). XPath 2.0 adds ends-with() and regular expression support through its matches() function, but web browsers implement only XPath 1.0, so those two functions are not available in the browser console. Note: in-text searching is case-sensitive!

Path Expression	Result
`//author[contains(.,"Matt")]`	Matches all author nodes whose text contains “Matt” (case-sensitive)
`//author[starts-with(.,"G")]`	Matches all author nodes whose text starts with “G” (case-sensitive)
`//author[ends-with(.,"w")]`	Matches all author nodes whose text ends with “w” (case-sensitive, XPath 2.0 only)
`//author[matches(.,"Matt.*")]`	Matches all author nodes whose text matches the regular expression Matt.* (XPath 2.0 only)
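
When a tool lacks some of these functions, the same tests can be applied after selecting the nodes. The Python sketch below is a substitute technique, not XPath itself: it selects all author elements with the standard xml.etree.ElementTree module (which has no contains() or matches()) and then filters their text with ordinary string methods and the `re` module. The author names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical author list used only for this illustration.
doc = ET.fromstring(
    "<authors>"
    "<author>Matt Smith</author>"
    "<author>Gail Crow</author>"
    "<author>Ann Mattson</author>"
    "</authors>"
)

names = [author.text for author in doc.iter("author")]

# Rough equivalents of the four expressions in the table (all case-sensitive).
contains_matt = [n for n in names if "Matt" in n]
starts_with_g = [n for n in names if n.startswith("G")]
ends_with_w = [n for n in names if n.endswith("w")]
# re.search matches anywhere in the string.
matches_regex = [n for n in names if re.search(r"Matt.*", n)]

print(contains_matt, starts_with_g, ends_with_w, matches_regex)
```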

Complete syntax: XPath Axes

XPath axes are the fuller syntax of XPath. They provide all of the different ways to specify a path by describing more fully the relationships between nodes and their connections. The XPath specification describes 13 different axes:

self -- the context node itself
child -- the children of the context node
descendant -- all descendants (children+)
parent -- the parent (empty if at the root)
ancestor -- all ancestors from the parent to the root
descendant-or-self -- the union of descendant and self
ancestor-or-self -- the union of ancestor and self
following-sibling -- siblings to the right
preceding-sibling -- siblings to the left
following -- all following nodes in the document, excluding descendants
preceding -- all preceding nodes in the document, excluding ancestors
attribute -- the attributes of the context node
namespace -- the namespace nodes of the context node

Path Expression	Result
`/html/body/div/h1[@id='introduction']/following-sibling::h1`	Select all h1 following siblings of the h1 introduction
`/html/body/div/h1[@id='introduction']/following-sibling::*`	Select all following siblings (of any element type) of the h1 introduction
`//attribute::id`	Select all id attribute nodes
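
To show what the following-sibling axis selects, here is a Python sketch that emulates it by hand. This is a stand-in rather than an XPath evaluation: the standard xml.etree.ElementTree module does not support axes, so the sketch locates the starting element and then takes the children of the same parent that come after it. The document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment with three h1 headings and one paragraph.
div = ET.fromstring(
    "<div>"
    "<h1 id='introduction'>Introduction</h1>"
    "<p>Some text</p>"
    "<h1 id='methods'>Methods</h1>"
    "<h1 id='references'>References</h1>"
    "</div>"
)

children = list(div)
start = div.find("./h1[@id='introduction']")

# Siblings "to the right": children of the same parent after the start node.
following = children[children.index(start) + 1:]

# following-sibling::* keeps every element; following-sibling::h1 keeps only h1.
following_any = [node.tag for node in following]
following_h1 = [node.get("id") for node in following if node.tag == "h1"]

print(following_any, following_h1)
```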

Oftentimes, the elements we are looking for on a page have no ID attribute or other uniquely identifying features, so the next best thing is to aim for neighboring elements that we can identify more easily and then use node relationships to get from those easy-to-identify elements to the target elements.

For example, the node tree image above has no uniquely identifying feature like an ID attribute. However, it is just below the section header “Navigating through the HTML node tree using XPath”. Looking at the source code of the page, we see that that header is an h2 element with the id navigating-through-the-html-node-tree-using-xpath. Starting from that header, we can reach the image with:

$x("//h2[@id='navigating-through-the-html-node-tree-using-xpath']/following-sibling::p[2]/img")

Additions

FIXME: add more XPath functions such as concat() and normalize-space().
FIXME: mention XPath Checker for Firefox.
FIXME: Firefox sometimes cleans up the HTML of a page before displaying it, meaning that the DOM tree we can access through the console might not reflect the actual source code. <tbody> elements are typically not reliable. The Scrapy documentation has more on the topic.

