berkmancenter/py_classifurlr

Name: py_classifurlr

Owner: Berkman Klein Center for Internet & Society

Description: null

Created: 2017-01-11 23:40:02.0

Updated: 2017-10-05 15:27:03.0

Pushed: 2017-05-22 22:32:55.0

Homepage: null

Size: 79

Language: Python

README

Classifurlr

Description

Classifurlr is a tool to automatically determine if a given web page or set of web pages is likely inaccessible.

This tool does not fetch any content from the Internet itself; previously fetched content is fed into it. The given content is moved through a pipeline of classifiers, each of which looks for different signatures of inaccessibility. Each classifier returns how confident it is that the given content is inaccessible. These confidences are then pooled to create a final accessibility verdict.
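The pipeline idea can be illustrated with a minimal sketch. This is not Classifurlr's actual code: the two classifier functions, the `downConfidence` field, and the pooling rule (a plain mean with a 0.5 threshold) are all assumptions made for illustration, since this README does not say how confidences are pooled.

```python
from statistics import mean


def page_length(content):
    # Hypothetical signature: a very short body looks suspicious.
    return 0.9 if len(content["body"]) < 50 else 0.1


def error_count(content):
    # Hypothetical signature: any recorded fetch error looks suspicious.
    return 0.8 if content["errors"] else 0.2


def run_pipeline(content, classifiers):
    # Each classifier reports its confidence that the content is inaccessible.
    constituents = [
        {"classifier": c.__name__, "downConfidence": c(content)}
        for c in classifiers
    ]
    # Assumed pooling rule: average the confidences, threshold at 0.5.
    pooled = mean(r["downConfidence"] for r in constituents)
    return {
        "status": "down" if pooled >= 0.5 else "up",
        "downConfidence": pooled,
        "constituents": constituents,
    }


verdict = run_pipeline({"body": "", "errors": ["timeout"]},
                       [page_length, error_count])
print(verdict["status"])
```

Keeping each classifier as a plain function that sees only previously fetched content makes it easy to add or remove signatures without touching the pooling step.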

Right now, the following classifiers are implemented:

Requirements
Getting Started

After making sure you have the requirements, install the rest of the dependencies with:

    pip install -r requirements.txt

The tool can be used in three ways: as a Python module, as a command line program, and as a web service.

To run Classifurlr as a command line tool, simply run:

    python classifurlr.py <name of data file>

You can see more options by adding the -h flag to the above command.

The data file should be a JSON file with the following structure:


    {
      url: 'http://example.com',
      baseline: false, // 'page_1',
      pageDetail: {
        'page_0': {
          asn: 0,
          screenshot: 'data:image/png;base64,',
          errors: [''],
        },
        ...
      },
      har: {...}
    }

More details and field definitions for this structure are in the wiki.
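The sample above is written in a loose notation (unquoted keys, a comment), which strict JSON parsers reject. As a hedged sketch, a data file could be produced from Python so that it is guaranteed to be valid JSON. The field names mirror the sample as best it can be read, the values are placeholders, and the output path is arbitrary; consult the wiki for the authoritative field definitions.

```python
import json
import os
import tempfile

# Field names follow the sample structure; values are placeholders only.
data = {
    "url": "http://example.com",
    "baseline": False,
    "pageDetail": {
        "page_0": {
            "asn": 0,
            "screenshot": "data:image/png;base64,",
            "errors": [""],
        },
    },
}

# Write the file to a scratch directory (the location is an arbitrary choice).
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)

# Reload it to confirm the file parses as strict JSON.
with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(path)
```

The printed path can then be passed to the command line tool as the name of the data file.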

The tool will return a JSON document that looks like this:


    {
      "status": "down",
      "statusConfidence": 0.52,
      "classifier": "classification_pipeline",
      "constituents": [
        {
          "status": "down",
          "statusConfidence": 0.4,
          "classifier": "page_length"
        },
        ...
      ]
    }
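
A result document of this shape can be consumed with the standard `json` module. The sketch below hard-codes the sample values from this README (with the elided entries left out) and prints the overall verdict followed by each constituent classifier's verdict. The `summarize` helper is a hypothetical name, not part of Classifurlr.

```python
import json

# Sample result taken from this README, minus the elided constituents.
raw = """
{
  "status": "down",
  "statusConfidence": 0.52,
  "classifier": "classification_pipeline",
  "constituents": [
    {"status": "down", "statusConfidence": 0.4, "classifier": "page_length"}
  ]
}
"""


def summarize(result):
    # First line: the pooled verdict; indented lines: each constituent.
    lines = [
        f'{result["classifier"]}: {result["status"]} '
        f'({result["statusConfidence"]:.2f})'
    ]
    for part in result.get("constituents", []):
        lines.append(
            f'  {part["classifier"]}: {part["status"]} '
            f'({part["statusConfidence"]:.2f})'
        )
    return lines


summary = summarize(json.loads(raw))
print("\n".join(summary))
```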

Here are the field definitions:

Classifurlr also minimally complies with the WSGI spec through the provided app function. To run the tool as a web service, run something like the following:

    gunicorn classifurlr:app

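For readers unfamiliar with WSGI, the sketch below shows the general shape of a WSGI `app` callable and how to exercise one without starting a server. It is a hypothetical stand-in, not Classifurlr's real `app` function: the request format (a JSON body with a `url` key) and the `received_url` response field are assumptions made for illustration.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    # Hypothetical handler: reads a JSON request body and echoes its URL.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps({"received_url": payload.get("url")}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# Build a synthetic POST request and call the app directly.
request_body = json.dumps({"url": "http://example.com"}).encode("utf-8")
environ = {
    "REQUEST_METHOD": "POST",
    "CONTENT_LENGTH": str(len(request_body)),
    "wsgi.input": io.BytesIO(request_body),
}
setup_testing_defaults(environ)  # fills in the remaining required WSGI keys

captured = {}


def start_response(status, headers):
    # Record what the app reports instead of sending it over a socket.
    captured["status"] = status
    captured["headers"] = headers


response = b"".join(app(environ, start_response))
print(captured["status"], response.decode("utf-8"))
```

Any WSGI server, such as gunicorn, calls the `app` callable in the same way, which is why the `module:callable` form is all the server needs.
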
Code Repository

Code is hosted on GitHub at https://github.com/berkmancenter/classifurlr

Tested Configurations

This has been tested with Python 3.6.0 running on Ubuntu 16.04.

Running Tests

Classifurlr comes with a minimal test suite. To run it:

    python classifurlr_test.py

Issue Tracker

TODO

Contributors

jdcc

License

Copyright © 2017 President and Fellows of Harvard College

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

