Name: UtilityScripts
Owner: Transparency Toolkit
Description: Scripts for managing scrapers
Created: 2014-09-23 02:54:31.0
Updated: 2017-12-02 16:42:35.0
Pushed: 2017-07-18 19:31:04.0
Size: 65
Language: Ruby
This is a collection of scripts for managing scrapers.
Summary
jsongen.rb
- Generates JSONs using a schema you specify. Can be used for anything, but it's good for making machine-readable lists of search terms
linkedin.rb
- Runs the LinkedIn scraper on a set of search terms in a JSON
crypto
- Encrypt and decrypt all files in a directory with GPG
config
- Scripts for setup and syncing a scraping machine
documents.rb
- Convert document files to JSON
emails.rb
- Convert email files to JSON
jsongen.rb
Currently this only supports single level JSONs.
To Run:
ruby jsongen.rb
linkedin.rb
To run this, you need a JSON where every item has the following fields:
- Search Term: The phrase you want to search for
- Degrees: The number of degrees you want to go out with “people also viewed”
To Run:
ruby linkedin.rb
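As a rough illustration of the input described above, the following sketch builds a list of search items and prints it as JSON. The field names "Search Term" and "Degrees" come from the description. The top-level array layout, the numeric type of Degrees, the sample terms, and the helper name build_search_items are assumptions made for the example, not something confirmed by linkedin.rb.

```ruby
require 'json'

# Hypothetical helper: pairs every search term with the same number of degrees.
# The field names follow the README; everything else here is an assumption.
def build_search_items(terms, degrees)
  terms.map { |term| { 'Search Term' => term, 'Degrees' => degrees } }
end

# Sample values are invented for illustration.
items = build_search_items(['signals intelligence', 'geospatial analyst'], 2)

# Pretty-print so the list is easy to review by hand before a scraping run.
json_text = JSON.pretty_generate(items)
puts json_text
```

Redirecting the output to a file gives you a list you can check before starting a run.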
crypto/
- Encrypt files with encrypt.rb and decrypt with decrypt.rb.
Encrypting
ruby encrypt.rb
Decrypting
ruby decrypt.rb
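For readers who want to see what encrypting every file in a directory with GPG can look like, here is a minimal sketch. It is not the code in encrypt.rb: it assumes public-key encryption to a single recipient, writes each output next to the original with a .gpg suffix, and only builds the gpg commands rather than running them. The helper names, the directory name, and the recipient address are hypothetical.

```ruby
# Build the gpg argument list for encrypting one file to a recipient.
# Assumption: public-key encryption; the real script may work differently.
def gpg_encrypt_command(path, recipient)
  ['gpg', '--batch', '--yes', '--recipient', recipient,
   '--output', "#{path}.gpg", '--encrypt', path]
end

# Collect one command per regular file directly inside dir,
# skipping files that already carry a .gpg suffix.
def encrypt_commands_for(dir, recipient)
  Dir.children(dir).sort
     .map { |name| File.join(dir, name) }
     .select { |path| File.file?(path) && !path.end_with?('.gpg') }
     .map { |path| gpg_encrypt_command(path, recipient) }
end

# To actually run the commands (needs gpg and the recipient's public key):
#   encrypt_commands_for('scraped_data', 'you@example.com').each { |cmd| system(*cmd) }
```

Passing the command as an argument list to system avoids shell quoting problems with unusual file names.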
config/
- Setup and syncing scripts for a scraping machine
Setup & Sync:
stall.sh
nc.sh
apt-get install build-essential pkg-config curl libcurl3 libcurl3-gnutls
libcurl4-openssl-dev rmagic libmagickwand-dev imagemagick graphicsmagick
poppler-utils poppler-data ghostscript tesseract-ocr pdftk libreoffice
Then, from within the directory, run:
bundle install
By default, documents and images are processed with the GiveMeText tool, which IS NOT GOOD FOR SENSITIVE DOCUMENTS because it sends normal HTTP requests over the internet. However, you can run a custom Tika server to convert documents yourself.
You can process either emails or normal text documents using the following scripts:
Run the script to convert documents to JSON, either with the default converter or with a local Tika instance:
documents.rb path/to/your/files/
documents.rb --tika=http://localhost:9998 /path/to/your/documents
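To give a sense of what talking to a local Tika instance involves, the sketch below builds a request for Tika Server's /tika endpoint, which accepts a document via PUT and returns the extracted text. This is an illustration only, not how documents.rb is implemented. The helper names tika_text_request and extract_text are hypothetical, and the code assumes a plain HTTP server such as the http://localhost:9998 address used above.

```ruby
require 'net/http'
require 'uri'

# Build a PUT request for the /tika endpoint of a Tika server.
# Asking for text/plain returns the extracted text without markup.
def tika_text_request(tika_base, file_bytes)
  uri = URI.join(tika_base, '/tika')
  request = Net::HTTP::Put.new(uri)
  request['Accept'] = 'text/plain'
  request.body = file_bytes
  [uri, request]
end

# Send one file to the server and return the extracted text.
# Only call this when a Tika server is actually running at tika_base.
def extract_text(tika_base, path)
  uri, request = tika_text_request(tika_base, File.binread(path))
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  response.body
end
```

Because the request goes to a server you run yourself, the document contents do not leave your machine.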
Run the email script to convert emails to JSON:
emails.rb /path/to/your/emails
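The general idea of turning an email into JSON can be sketched with the standard library alone. The example below is a deliberately simplified stand-in, not the logic of emails.rb: it splits a raw message into headers and body at the first blank line, joins folded header lines, and ignores MIME parts and attachments. The helper name email_to_hash and the output layout are assumptions.

```ruby
require 'json'

# Split a raw message into headers and body at the first blank line.
# Simplified: no MIME decoding, no attachments, and the last value wins
# when a header appears more than once.
def email_to_hash(raw)
  head, body = raw.gsub("\r\n", "\n").split("\n\n", 2)
  unfolded = head.to_s.gsub(/\n[ \t]+/, ' ') # join folded header lines
  headers = unfolded.lines.each_with_object({}) do |line, acc|
    name, value = line.chomp.split(':', 2)
    acc[name.strip.downcase] = value.strip if value
  end
  { 'headers' => headers, 'body' => body.to_s }
end

# Example with an invented message:
#   puts JSON.pretty_generate(email_to_hash(File.read('message.eml')))
```

A real converter also needs to handle encodings and multipart messages, which is where the attachments described below come from.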
Attachments
If your emails generated an attachments/ folder, run the documents.rb script as described above to convert the attachments into JSON as well:
documents.rb --tika=http://localhost:9998 /path/to/your/emails_output/attachments