Name: archivers-harvesting-tools
Owner: Tools for Government Data Archiving
Description: ARCHIVED--Collection of scripts and code snippets for data harvesting after generating the zip starter
Created: 2017-01-31 21:57:05.0
Updated: 2018-01-03 00:59:17.0
Pushed: 2017-07-11 19:01:32.0
Size: 3774
Language: Python
A collection of scripts and code snippets for data harvesting after generating the zip starter.
We welcome tools written in any language, especially if they cover use cases we haven't described! To add a new tool to this repository, please review our Contributing Guidelines.
Download Zip Starter
Download the zip starter from the detail page related to a URL you have checked out. Copy any tool you use into the zip starter's tools directory, e.g. with: `cp -r harvesting-tools/TOOLNAME/* RESOURCEUUID/tools/`
Each tool in this repo has a fairly specific use case. Your choice will depend on the shape and size of the data you're dealing with. Some datasets will require more creativity and more elaborate tools. If you write a new tool, please add it to the repo.
If you encounter a page that links to lots of data (for example a “downloads” page), this approach may well work. It's important to only use this approach when you encounter actual data files, for example PDFs, .zip archives, .csv datasets, etc.
The tricky part of this approach is generating a list of URLs to download from the page.
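As a starting point, a list of candidate URLs can be scraped from a downloads page with only the standard library. This is a minimal sketch, not one of the repo's tools; the extension list and the example domain are illustrative, and you'd likely extend both for a real dataset.

```python
"""Sketch: collect data-file links from a "downloads" page (stdlib only)."""
from html.parser import HTMLParser
from urllib.parse import urljoin

# Extensions that usually indicate a data file; extend as needed.
DATA_EXTENSIONS = (".pdf", ".zip", ".csv", ".xls", ".xlsx", ".json")


class DataLinkParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(DATA_EXTENSIONS):
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def extract_data_links(html, base_url):
    """Return absolute URLs of data files linked from the given HTML."""
    parser = DataLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Each resulting URL can then be fetched with `urllib.request.urlretrieve`, wget, or curl.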
Government datasets are often stored on FTP; this script will capture FTP directories and subdirectories.
PLEASE NOTE that the Internet Archive has captured over 100 TB of government FTP resources since December 2016. Be sure to check the URL using check-ia/url-check, the Wayback Machine Extension, or your own tool that uses the Wayback Machine's API (example 1, example 2 w/ wildcard). If the FTP directory you're looking at has not been saved to the Internet Archive, be sure that it has also been nominated as a web crawl seed.
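If you'd rather roll your own check than use the tools above, the Wayback Machine exposes an availability endpoint at `archive.org/wayback/available`. This sketch assumes the response shape documented for that API (`archived_snapshots` → `closest`); verify against the current API docs before relying on it.

```python
"""Sketch: check whether a URL has been saved to the Wayback Machine."""
import json
from urllib.parse import quote
from urllib.request import urlopen

API = "https://archive.org/wayback/available?url="


def closest_snapshot(response):
    """Return the closest archived snapshot URL from an availability
    response dict, or None if the URL has no available capture."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_url(url):
    """Query the availability API for url (network call)."""
    with urlopen(API + quote(url, safe="")) as resp:
        return closest_snapshot(json.load(resp))
```

`check_url` returning None is your cue to nominate the URL as a crawl seed.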
Whether it has been saved or not, you may decide to download it for chain-of-custody preservation reasons. If so, this script should do what you need.
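For chain-of-custody downloads, a recursive FTP mirror can be built on the stdlib ftplib. This is a minimal sketch: the host and path in the usage comment are hypothetical, directory detection assumes Unix-style `LIST` output, and a real run would want timeouts and retries.

```python
"""Sketch: recursively mirror an FTP directory with ftplib (stdlib)."""
import os
from ftplib import FTP


def is_dir_line(list_line):
    """True if a Unix-style LIST line describes a directory."""
    return list_line.startswith("d")


def entry_name(list_line):
    """Filename from a Unix-style LIST line (9th whitespace field,
    so names containing spaces survive)."""
    return list_line.split(None, 8)[-1]


def mirror(ftp, remote_dir, local_dir):
    """Download remote_dir and all subdirectories into local_dir."""
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    lines = []
    ftp.retrlines("LIST", lines.append)
    for line in lines:
        name = entry_name(line)
        if is_dir_line(line):
            mirror(ftp, remote_dir + "/" + name, os.path.join(local_dir, name))
            ftp.cwd(remote_dir)  # come back up after recursing
        else:
            with open(os.path.join(local_dir, name), "wb") as fh:
                ftp.retrbinary("RETR " + remote_dir + "/" + name, fh.write)


# Usage (hypothetical server):
# with FTP("ftp.example.gov") as ftp:
#     ftp.login()  # anonymous login
#     mirror(ftp, "/pub/data", "RESOURCEUUID/data")
```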
The last resort for harvesting should be to drive the harvest with a full web browser. It is slower than other approaches such as wget, curl, or a headless browser. Additionally, this implementation is prone to issues where the resulting page is saved before it's done loading. There is a Ruby example in tools/example-hacks/watir.rb.
For search results from large document sets, you may need to do more sophisticated “scraping” and “crawling” – check out tools built at previous events such as the EIS WARC archiver or the EPA Search Utils for ideas on how to proceed.
If you encounter an API, chances are you'll have to build some sort of custom solution (like epa-envirofacts-scraper) or investigate a social angle, for example asking someone with greater access for a database dump. Be sure to include your code in the tools directory of your zipfile, and if there is any likelihood of general application, please add it to this repo.
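A custom API harvester often reduces to the same skeleton: page through the endpoint and append records to a file you can drop into the zip starter's tools output. The endpoint, the `offset`/`limit` parameters, and the record shape here are all hypothetical; `get_json` is injected so the loop can be exercised without a network connection.

```python
"""Sketch: harvest a paginated JSON API into newline-delimited JSON."""
import json
from urllib.parse import urlencode


def page_url(base, offset, limit=100):
    """Build one page request URL (offset/limit names are assumptions)."""
    return base + "?" + urlencode({"offset": offset, "limit": limit})


def harvest(base, get_json, out_path, limit=100):
    """get_json(url) -> list of record dicts; page until an empty page,
    writing one JSON object per line to out_path."""
    offset = 0
    with open(out_path, "w") as out:
        while True:
            records = get_json(page_url(base, offset, limit))
            if not records:
                break
            for rec in records:
                out.write(json.dumps(rec) + "\n")
            offset += limit
```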
The utils directory is for scripts that have been useful in the past but may not have very general application. You still might find something you like!