Name: ROSIEBot
Owner: Center for Open Science
Description: Robotic Open Science Indexing Engine
Created: 2016-06-02 18:17:08.0
Updated: 2016-08-13 17:11:57.0
Pushed: 2016-08-12 20:25:26.0
Homepage: http://cos.io
Size: 13723
Language: Python
![rosie alt](https://cloud.githubusercontent.com/assets/15851093/16431535/109ac052-3d4f-11e6-9218-e7a457898492.png '“Eight legged wonder: Crawling, enduring, facing, Despite childish fears!” - Unknown')
Visit the COS Github for more innovations in the openness, integrity, and reproducibility of scientific research.
This software requires Python 3.5 for the aiohttp library. If desired, create a virtualenv:
```
pip install virtualenv
pip install virtualenvwrapper
mkvirtualenv rosie --python=python3.5
workon rosie
```
```
git clone https://github.com/zamattiac/ROSIEBOt.git
```

Navigate into the new folder (`cd ROSIEBot`) and run

```
pip install -r requirements.txt
```

to install dependency libraries in the virtualenv.
To enter and exit the virtualenv:

```
workon rosie
deactivate
```
Pages scraped for each category:

| Project | Registration | User | Institution |
|---------------|--------------|---------|-------------|
| Dashboard | Dashboard | Profile | Dashboard |
| Files | Files | | |
| Wiki | Wiki | | |
| Analytics | Analytics | | |
| Registrations | | | |
| Forks | Forks | | |
The Python file `cli.py` is run from the command line inside the rosie virtualenv. This project is optimized for Mac.
Every command consists of `python cli.py` plus the flag for one mode. See `python cli.py --help` for further usage assistance.
`--compile_active`

Make a taskfile of all the currently active pages on the OSF. This is useful primarily for `--delete`, which requires such a file to remove no-longer-existent pages from the mirror.
`--scrape`

Crawl and scrape the site. Must include the date marker `--dm=<DATE>`, where `<DATE>` is the date of the last scrape in the form YYYY-MM-DDTHH:MM:SS.000, e.g. 1970-06-15T00:00:00.000.
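One way to produce a correctly formatted `--dm` value is with `strftime` (the helper name here is mine, not part of ROSIEBot):

```python
from datetime import datetime

def date_marker(dt):
    """Format a datetime as ROSIEBot's expected date marker: YYYY-MM-DDTHH:MM:SS.000."""
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + ".000"

print(date_marker(datetime(1970, 6, 15)))  # 1970-06-15T00:00:00.000
```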
One must specify which categories to scrape; any or all can be added:

- `--nodes` (projects)
- `--registrations`
- `--users`
- `--institutions`
If the nodes flag is used, one must specify which project pages to include:

- `-d`: dashboard
- `-f`: files page
- `-w`: wiki pages
- `-a`: analytics
- `-r`: list of registrations of the project
- `-k`: list of forks of the project

`--resume`

Pick up where a normal process left off in case of an unfortunate halt. The normal process creates and updates a .json task file with its status, and this must be included with the flag `--tf=<FILENAME>`. The filename will be of the form YYYYMMDDHHMM.json and should be visible in the ROSIEBot directory.
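The taskfile name is just the run's start time in that compact format. A small sketch of how such a name can be generated (the helper name is mine, not ROSIEBot's internals):

```python
from datetime import datetime

def taskfile_name(dt):
    """Build a taskfile name of the form YYYYMMDDHHMM.json from a datetime."""
    return dt.strftime("%Y%m%d%H%M") + ".json"

print(taskfile_name(datetime(2016, 8, 12, 20, 25)))  # 201608122025.json
```

A resume invocation would then look like `python cli.py --resume --tf=201608122025.json`.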
`--verify`

Verify the completeness of the mirror. This process also requires a .json task file in the form described in the resume step, and `--rn=<INT>`, where `<INT>` is the desired number of retries. Verification proceeds in three steps:

1. Verify that each URL found by the crawler has a corresponding file on the mirror.
2. Compare the size of each file to the minimum possible size for a complete page.
3. Rescrape failed pages and try again.
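The first two verification steps can be sketched roughly as follows (function name, path mapping, and size threshold are illustrative, not ROSIEBot's actual internals):

```python
import os

MIN_PAGE_SIZE = 500  # illustrative minimum plausible page size, in bytes

def find_failed_pages(url_to_path):
    """Return URLs whose mirror file is missing or smaller than a complete page."""
    failed = []
    for url, path in url_to_path.items():
        if not os.path.isfile(path):
            # Step 1: every crawled URL needs a corresponding file on the mirror.
            failed.append(url)
        elif os.path.getsize(path) < MIN_PAGE_SIZE:
            # Step 2: a file far below the minimum size is treated as incomplete.
            failed.append(url)
    return failed  # Step 3 would rescrape these and check again.
```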
`--delete`

Remove anything inside a category folder that isn't listed on the API. Requires a taskfile produced by `--compile_active`:

```
python cli.py --delete --ctf=<TASKFILE>
```
`--index`

Creates a search engine index.

Note: do not run until the static folder is in place in the archive. Scraped pages require a static folder inside the mirror; please get a fresh copy from the OSF repo and place it directly inside archive/. Once static is in place, run

```
python cli.py --index
```

to set up the search utility. Using search: the search button on each page should be replaced with a link to /search.html.
This option creates a flat copy of the archive without categorical folders. Nginx configuration is required otherwise.
Make sure whatever utilities you desire (e.g. verify, index) have been run before the copy is made.
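Conceptually, the flat copy collapses the archive's category folders into a single directory. A minimal sketch of that idea (function name and collision handling are mine; ROSIEBot's real implementation may differ):

```python
import os
import shutil

def flatten_archive(archive_dir, flat_dir):
    """Copy every file out of archive_dir's nested category folders into flat_dir."""
    os.makedirs(flat_dir, exist_ok=True)
    for root, _dirs, files in os.walk(archive_dir):
        for name in files:
            # Real name collisions would need handling; this sketch just overwrites.
            shutil.copy2(os.path.join(root, name), os.path.join(flat_dir, name))
```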
Run

```
bash scripts/host_locally.sh
```

from the ROSIEBot root. Here is your mirror.
```
zip -r archive.zip archive/
zip -r flat-archive.zip flat-archive/
```
Including the following location lines provides the necessary routing for a non-flat mirror. See "How to set up prerender", step 2, for Nginx information, bearing in mind that some parts do not apply.
```nginx
server {
    listen 80 default_server;
    listen [::]:80 default_server ipv6only=on;

    root /path/to/archive;
    # index index.html index.htm;

    location / {
        # First attempt to serve request as file, then
        # as directory, then fall back to displaying a 404.
        try_files $uri $uri/ /registration/$uri/ /profile/$uri/ /project/$uri/ /project/$uri/home /registration/$uri/home =404;
        # index index.html index.htm;
        # Uncomment to enable naxsi on this location
        # include /etc/nginx/naxsi.rules
    }

    location /static/ {
        alias /path/to/archive/static/;
    }
}
```
(Future)