Introduction to web scraping: Resources

Key Points

What is web scraping?
  • Humans are good at categorizing information, computers not so much.

  • Often, data on a web site is not properly structured, making its extraction difficult.

  • Web scraping is the process of automating the extraction of data from web sites.

Selecting content on a web page with CSS selectors
  • XML and HTML are markup languages. They provide structure to documents.

  • XML and HTML documents are made out of nodes, which form a hierarchy.

  • The hierarchy of nodes inside a document is called the node tree.

  • Relationships between nodes are: parent, child, descendant, sibling.

  • CSS selectors are constructed by specifying properties of the targets combined with properties of their context.

  • IDs, classes and tag names should be preferred as properties for extraction.

  • CSS selectors can be evaluated using the document.querySelectorAll() function in a browser's JavaScript console.

Visual scraping using browser extensions
  • Data that is relatively well structured (in a table) is relatively easy to scrape.

  • More often than not, web scraping tools need to be told what to scrape.

  • CSS selectors can be used to define what information to scrape, and how to structure it.

  • CSS selectors in scrapers need to be designed carefully, as the selector chosen for one page may not work perfectly on another.

  • More advanced data cleaning operations are best done in a subsequent step.
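A subsequent cleaning step might look like the following sketch; the raw values are hypothetical scraper output, and only the standard library is used:

```python
# A separate cleaning step, run after scraping is complete.
import re

# Hypothetical raw scraper output: stray whitespace, a non-breaking
# space, and an empty value.
raw = ["  Ada Lovelace \n", "Grace\u00a0Hopper", ""]

def clean(value: str) -> str:
    # Normalize non-breaking spaces, then collapse runs of whitespace.
    return re.sub(r"\s+", " ", value.replace("\u00a0", " ")).strip()

cleaned = [c for c in (clean(v) for v in raw) if c]  # drop empty values
print(cleaned)  # ['Ada Lovelace', 'Grace Hopper']
```

Keeping this separate from the scraper itself makes both steps easier to test and rerun.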

Web scraping using Python: requests and lxml
  • requests is a Python library for downloading web pages, primarily via requests.get.

  • requests.compat.urljoin(response.url, href) may be used to resolve a relative URL href.

  • lxml is a Python library that parses HTML/XML and evaluates XPath/CSS selectors.

  • lxml.html.fromstring(page_source) will produce an element tree from some HTML code.

  • An element tree’s cssselect and xpath methods extract elements of interest.

  • A scraper can be divided into: identifying the set of URLs to scrape; extracting some elements from a page; and transforming them into a useful output format.

  • It is important but challenging to be resilient to variation in page structure: extractions should be validated automatically and inspected manually.

  • A framework like Scrapy may help to build robust scrapers, but may be harder to learn. See the Scrapy tutorial in Extras.

  • Web scraping is, in general, legal, though a site’s terms of service and local laws still apply.

  • There are a few things to be careful about, notably don’t overwhelm a web server and don’t steal content.

  • Be nice. When in doubt, ask.
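The requests and lxml points above can be sketched end-to-end, politeness delay included. This is a minimal illustration, not a complete scraper: the URL list is a placeholder, and the third-party requests, lxml, and cssselect packages are assumed to be installed.

```python
# An end-to-end sketch: fetch pages, extract links, resolve relative URLs.
import time

import lxml.html
import requests

def parse_links(page_source: str, base_url: str) -> list[str]:
    """Extract all link targets from a page, as absolute URLs."""
    tree = lxml.html.fromstring(page_source)
    hrefs = tree.xpath("//a/@href")  # cssselect("a") would work too
    # Resolve relative URLs against the page's own URL.
    return [requests.compat.urljoin(base_url, href) for href in hrefs]

def scrape(urls: list[str]) -> list[str]:
    """Fetch each URL in turn and collect the links found on it."""
    results = []
    for url in urls:
        response = requests.get(url)
        response.raise_for_status()
        results.extend(parse_links(response.text, response.url))
        time.sleep(1)  # be polite: don't overwhelm the server
    return results

# Offline demonstration of the extraction step with an inline page:
page = '<html><body><a href="/about">About</a></body></html>'
print(parse_links(page, "https://example.org/team/index.html"))
# ['https://example.org/about']
```

Separating the fetching (scrape) from the extraction (parse_links) makes the extraction step testable offline, which helps with the validation point above.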