What is web scraping?
|
Humans are good at categorizing information, computers not so much.
Often, data on a web site is not properly structured, making its extraction difficult.
Web scraping is the process of automating the extraction of data from web sites.
|
Selecting content on a web page with CSS selectors
|
XML and HTML are markup languages. They provide structure to documents.
XML and HTML documents are made out of nodes, which form a hierarchy.
The hierarchy of nodes inside a document is called the node tree.
Relationships between nodes are: parent, child, descendant, sibling.
CSS selectors are constructed by specifying properties of the targets combined with properties of their context.
IDs, classes and tag names should be preferred as properties for extraction.
CSS selectors can be evaluated in the browser console using the document.querySelectorAll() method.
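As a small illustration of how an ID, a class and tag names combine into one selector, here is a minimal sketch that evaluates a selector with lxml (the library used later in this lesson); the HTML snippet and the selector are invented purely for illustration, and in the browser console the equivalent call would be document.querySelectorAll("#authors li.author a").

```python
# Minimal sketch: the HTML snippet and selector are invented for illustration.
# lxml's .cssselect() requires the separate `cssselect` package to be installed.
import lxml.html

html = """
<ul id="authors">
  <li class="author"><a href="/a/1">Ada Lovelace</a></li>
  <li class="author"><a href="/a/2">Charles Babbage</a></li>
</ul>
"""

tree = lxml.html.fromstring(html)

# Combine an ID, a class and tag names, reading from context to target.
for link in tree.cssselect("#authors li.author a"):
    print(link.text, link.get("href"))
# Ada Lovelace /a/1
# Charles Babbage /a/2
```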
|
Visual scraping using browser extensions
|
Data that is relatively well structured (e.g. in a table) is relatively easy to scrape.
More often than not, web scraping tools need to be told what to scrape.
CSS selectors can be used to define what information to scrape, and how to structure it.
CSS selectors in scrapers need to be designed carefully, as a selector chosen for one page may not work on another.
More advanced data cleaning operations are best done in a subsequent step.
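Under the hood, a visual scraping tool essentially records a mapping from output fields to CSS selectors. A hedged sketch of that idea in Python follows; every field name, selector and the sample HTML card are invented for illustration, not taken from any particular tool.

```python
# Field names and selectors are invented; a real configuration would be
# recorded point-and-click by the browser extension.
import lxml.html

FIELDS = {
    "name":   "h2.member-name",
    "party":  "span.party",
    "riding": "span.constituency",
}

def extract_record(card):
    """Apply each field's CSS selector to one lxml element and build a row."""
    record = {}
    for field, selector in FIELDS.items():
        matches = card.cssselect(selector)
        record[field] = matches[0].text_content().strip() if matches else None
    return record

# Invented sample input, just to show the shape of the output.
CARD_HTML = """
<div class="member">
  <h2 class="member-name">A. Member</h2>
  <span class="party">Example Party</span>
  <span class="constituency">Somewhere</span>
</div>
"""
print(extract_record(lxml.html.fromstring(CARD_HTML)))
```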
|
Web scraping using Python: requests and lxml
|
requests is a Python library that helps with downloading web pages, primarily via requests.get.
requests.compat.urljoin(response.url, href) may be used to resolve a relative URL href.
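A minimal sketch of those two calls; the URL and the href value are placeholders rather than anything from the lesson.

```python
import requests

# Placeholder URL, used only to illustrate the calls.
response = requests.get("https://example.org/")
response.raise_for_status()        # fail loudly on HTTP errors
page_source = response.text        # the downloaded HTML as a string

# Resolve a relative link found in the page against the page's own URL.
href = "members/profile/42.html"
absolute_url = requests.compat.urljoin(response.url, href)
print(absolute_url)                # https://example.org/members/profile/42.html
```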
lxml is a Python library that parses HTML/XML and evaluates XPath/CSS selectors.
lxml.html.fromstring(page_source) will produce an element tree from some HTML code.
An element tree’s cssselect and xpath methods extract elements of interest.
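A minimal sketch of parsing a page and querying it both ways; the URL and the selectors are again invented for illustration.

```python
import requests
import lxml.html

page_source = requests.get("https://example.org/").text
tree = lxml.html.fromstring(page_source)

# Two roughly equivalent queries: links inside list items of class "member".
# (.cssselect() requires the separate `cssselect` package.)
links_css = tree.cssselect("li.member a")
links_xpath = tree.xpath('//li[@class="member"]/a')

for a in links_css:
    print(a.text_content().strip(), a.get("href"))
```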
A scraper can be divided into: identifying the set of URLs to scrape; extracting some elements from a page; and transforming them into a useful output format.
It is important but challenging to be resilient to variation in page structure: extractions should be validated automatically and inspected manually.
A framework like Scrapy may help to build robust scrapers, but may be harder to learn. See the Scrapy tutorial in Extras.
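Before reaching for a framework, the three-part structure described above can be written out directly with requests and lxml. This is only a hedged sketch: the start URL, selectors and field names are all invented, and the automatic validation is deliberately crude.

```python
import csv
import requests
import lxml.html

START_URL = "https://example.org/members/"   # invented start page

def member_urls(start_url):
    """Part 1: identify the set of URLs to scrape (links on a listing page)."""
    tree = lxml.html.fromstring(requests.get(start_url).text)
    for a in tree.cssselect("li.member a"):
        yield requests.compat.urljoin(start_url, a.get("href"))

def extract_member(url):
    """Part 2: extract a few elements of interest from one page."""
    tree = lxml.html.fromstring(requests.get(url).text)

    def first(selector):
        found = tree.cssselect(selector)
        return found[0].text_content().strip() if found else None

    record = {"url": url, "name": first("h1.name"), "party": first("span.party")}
    # Crude automatic validation: flag pages whose structure did not match.
    if record["name"] is None:
        print("warning: no name found on", url)
    return record

def main():
    """Part 3: transform the records into a useful output format (CSV)."""
    with open("members.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "name", "party"])
        writer.writeheader()
        for url in member_urls(START_URL):
            writer.writerow(extract_member(url))

if __name__ == "__main__":
    main()
```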
|
Discussion
|
Web scraping is, in general, legal and won’t get you into trouble.
There are a few things to be careful about, notably don’t overwhelm a web server and don’t steal content.
Be nice. If in doubt, ask.
|