Going further:
CLOUD-BASED SCRAPING
Example: deploying a spider on ScrapingHub
Login/Register on https://app.scrapinghub.com/ (you can use a Google account)
If not done already, run $ pip install shub
$ shub login
Provide your API key, as found on https://app.scrapinghub.com/account/apikey
This writes the API key into a local file (less ~/.scrapinghub.yml), not specific to that repository
$ shub logout removes it
cd to the root of a Scrapy project directory
$ shub deploy
Provide the ID of the recently created project
From the web app, run the spider; arguments can be provided (see the sketch below)
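Arguments supplied in the web app (or on the command line with scrapy crawl <spider> -a name=value) end up as attributes on the spider. A minimal sketch, assuming a hypothetical spider scraping the Scrapy demo site quotes.toscrape.com:

import scrapy

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate how arguments are received.
    name = "quotes"

    def __init__(self, tag=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # An argument named "tag" passed from the ScrapingHub UI (or via
        # `scrapy crawl quotes -a tag=inspirational`) arrives here.
        base = "http://quotes.toscrape.com"
        self.start_urls = [f"{base}/tag/{tag}/"] if tag else [base]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }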
Items are found under the Items tab
Free project: one concurrent spider, max 2.5 GB of data storage, data retention 7 days
Items can be downloaded individually or in batch (big green Items button on top)
Items can also be accessed through API calls (http://doc.scrapinghub.com/api/overview.html), e.g.:
curl -u 1e2490bfc15d4e6089e4b842364b5cd1: "https://storage.scrapinghub.com/items/85589/1/1?format=xml"
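The same request can be made from Python; a minimal sketch mirroring the curl call above (it assumes the requests library and reuses the example key and project/spider/job IDs from the curl line):

import requests

# The API key is the HTTP basic auth username, with an empty password;
# 85589/1/1 are the project, spider and job IDs from the example above.
API_KEY = "1e2490bfc15d4e6089e4b842364b5cd1"
url = "https://storage.scrapinghub.com/items/85589/1/1"

response = requests.get(url, params={"format": "xml"}, auth=(API_KEY, ""))
response.raise_for_status()
print(response.text)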
ScrapingHub also has a GUI (work in progress) to define elements and build spiders: Portia
ALTERNATIVE: MORPH.IO
SCRAPE DATA FROM PDFs
Use Tabula:
- Free, open source software
- Available for Mac, Windows, Linux
- Runs in browser (much like OpenRefine)
https://github.com/tabulapdf/tabula
Example: the British Library maintains a list of library codes for its Document Supply service: http://www.bl.uk/reshelp/atyourdesk/docsupply/help/replycodes/dirlibcodes/
The list of codes is published as a PDF file -> UNSTRUCTURED
A morph.io scraper for it: https://morph.io/ostephens/british_library_directory_of_library_codes
This example uses a library that converts the PDF to XML, then does the extraction (a rough sketch of the approach follows).
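A rough sketch of the download-convert-extract approach. This is not the actual morph.io scraper code: it assumes the pdfminer.six and requests libraries, uses a hypothetical PDF URL, extracts plain text rather than XML to keep things short, and uses a placeholder parsing rule that would need adjusting to the real layout:

import io
import re

import requests
from pdfminer.high_level import extract_text  # pip install pdfminer.six

# Hypothetical URL standing in for the British Library PDF of library codes.
PDF_URL = "http://www.bl.uk/example/directory_of_library_codes.pdf"

pdf_bytes = requests.get(PDF_URL).content
text = extract_text(io.BytesIO(pdf_bytes))

# Placeholder rule: treat the first whitespace-separated token on a line as the
# code and the rest as the library name. The real PDF layout will differ.
for line in text.splitlines():
    match = re.match(r"^(\S+)\s+(.+)$", line.strip())
    if match:
        code, name = match.groups()
        print({"code": code, "library": name})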