Going further:

CLOUD-BASED SCRAPING

Example deploy on ScrapingHub

ScrapingHub also offers Portia, a GUI (work in progress) for defining the elements to extract and building spiders without writing code.

CHECK IF ENOUGH TIME FOR LEGAL ASPECTS

ALTERNATIVE: MORPH.IO

CHECK IF ENOUGH TIME FOR LEGAL ASPECTS

SCRAPE DATA FROM PDFs

Use Tabula (https://github.com/tabulapdf/tabula):
- Free, open source software
- Available for Mac, Windows, Linux
- Runs in browser (much like OpenRefine)

Example: the British Library maintains a list of library codes for its Document Supply service. The codes are published in a PDF file -> UNSTRUCTURED

Source: http://www.bl.uk/reshelp/atyourdesk/docsupply/help/replycodes/dirlibcodes/
Scraper: https://morph.io/ostephens/british_library_directory_of_library_codes

The scraper uses a library that converts the PDF to XML, then does the extraction from the XML.
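The PDF-to-XML approach can be sketched roughly as below. This is not the code of the morph.io scraper. It assumes the PDF has already been converted to XML in the style produced by `pdftohtml -xml` (each piece of text becomes a `<text>` element with `top` and `left` positions), and it rebuilds table rows by grouping elements that share the same `top` value. The sample XML, the codes and the library names in it are made up for illustration, and `extract_rows` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample in the style of `pdftohtml -xml` output.
# These are NOT real British Library codes.
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="60" width="40" height="12" font="0">AA-1</text>
    <text top="100" left="200" width="180" height="12" font="0">Example Town Library</text>
    <text top="120" left="60" width="40" height="12" font="0">BB-2</text>
    <text top="120" left="200" width="180" height="12" font="0"><b>Sample University</b> Library</text>
  </page>
</pdf2xml>"""


def extract_rows(xml_string):
    """Group <text> elements by vertical position to rebuild table rows."""
    root = ET.fromstring(xml_string)
    rows = []
    for page in root.iter("page"):
        lines = defaultdict(list)
        for text in page.iter("text"):
            # itertext() also collects text nested in <b> or <i> tags
            content = "".join(text.itertext()).strip()
            if content:
                top = int(text.get("top"))
                left = int(text.get("left"))
                lines[top].append((left, content))
        # Rows from top to bottom, cells from left to right
        for top in sorted(lines):
            rows.append([content for _, content in sorted(lines[top])])
    return rows


rows = extract_rows(SAMPLE_XML)
for code, name in rows:
    print(code, name)
```

In a real PDF the `top` values of cells in the same row may differ by a pixel or two, so the grouping would probably need a small tolerance rather than an exact match.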