Introduction to web scraping

Web scraping is the process of extracting data from websites. Some data on the web is published in a format that is easy to collect and reuse, for example as downloadable comma-separated values (CSV) datasets that can be imported into a spreadsheet or loaded into a data analysis script. Often, however, data that is publicly available is not readily available for reuse. For example, it may be contained in a PDF, embedded in a table on a website, or spread across multiple web pages.

There are a variety of ways to scrape a website to extract information for reuse. In its simplest form, this can be achieved by copying and pasting snippets from a web page, but this can be impractical if there is a large amount of data to be extracted or if it is spread over a large number of pages. Instead, specialized tools and techniques can be used to automate this process by defining what sites to visit, what information to look for, and whether data extraction should stop at the end of a page or follow hyperlinks to repeat the process recursively. Automating web scraping also allows for defining whether the process should be run at regular intervals in order to capture changes in the data.
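The automation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the approach taught later in the lesson: it uses only the standard library, and the `PAGES` dictionary, the `PageParser` class, and the `scrape` function are hypothetical names invented for this example. `PAGES` stands in for a real website so that the sketch runs without network access. The three decisions mentioned above map to the start URL (which site to visit), the `<h1>` text collected by the parser (what information to look for), and the `follow_links` flag (stop at the end of a page or follow hyperlinks).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<h1>Index</h1><a href="page2.html">next</a>',
    "http://example.org/page2.html": '<h1>Second page</h1><a href="/">home</a>',
}


class PageParser(HTMLParser):
    """Collects <h1> text (what to look for) and link targets (where to go next)."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


def scrape(start_url, fetch=PAGES.get, follow_links=True):
    """Return {url: [headings]} for the start page and, optionally, linked pages."""
    seen, queue, results = set(), [start_url], {}
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue  # never visit the same page twice
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue  # page not available, skip it
        parser = PageParser()
        parser.feed(html)
        parser.close()
        results[url] = parser.headings
        if follow_links:
            # Resolve relative links against the current page before queueing.
            queue.extend(urljoin(url, link) for link in parser.links)
    return results
```

Calling `scrape("http://example.org/")` visits both pages, while `scrape("http://example.org/", follow_links=False)` stops after the first one. For a real site, `fetch` would be replaced by a function that downloads the page, and the scraper would need to respect the site's terms of use.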

Prerequisites

As web scraping is a technique to extract data from web pages, it requires some understanding of the technologies that are used to display information on the web. This lesson therefore assumes that learners will have some familiarity with HTML and the Document Object Model (DOM).

The first part of this lesson uses browser extensions to introduce the concepts of web scraping and the CSS selector syntax for selecting elements on a web page. It requires no further specific knowledge. The second part introduces specialized libraries for scraping websites with custom computer programs and requires some familiarity with the Python programming language.

Software requirements

Refer to the Setup section to install the software required to follow along with this lesson.

Under development

Please note that the contents of this lesson are still being actively developed. Any feedback is appreciated. Please do not hesitate to contact the maintainers or contribute to the lesson by forking it on GitHub.

Schedule

Setup  Download files required for the lesson
00:00  1. What is web scraping?
       What is web scraping and why is it useful?
       What are typical use cases for web scraping?
00:10  2. Selecting content on a web page with CSS selectors
       How can I select a specific element on a web page?
       What is a CSS selector and how can I use it?
01:00  3. Visual scraping using browser extensions
       How can I get started scraping data off the web?
       How can I use CSS selectors to precisely select what data to scrape?
02:10  4. Web scraping using Python: requests and lxml
       How can scraping a website be automated?
       How can I download web pages’ HTML in Python?
       How can I evaluate XPath or CSS selectors in Python?
       How can I format scraped data as a spreadsheet?
       How do I build a scraper that will keep working even if the page structure changes?
03:20  5. Discussion
       When is web scraping OK and when is it not?
       Is web scraping legal? Can I get into trouble?
       How can I make sure I’m doing the right thing?
       What can I do with the data that I’ve scraped?
03:35  Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.