futurice/spice-hate_speech_detection

Name: spice-hate_speech_detection

Owner: Futurice

Description: A SPICE-program funded project where the goal is to detect hate speech in social media.

Created: 2017-03-14 08:07:30.0

Updated: 2018-05-21 23:35:04.0

Pushed: 2017-10-29 20:00:47.0

Homepage: null

Size: 33

Language: Python

README

Automatic hate speech detection

Setup
  1. Install requirements
     - python3
     - python packages: pandas, sklearn, fasttext, sqlalchemy, …
  2. Configure collector
     - Edit hiit_collector.py.example and save it as hiit_collector.py
  3. Configure PostgreSQL
     - Edit postgre_keys.py.example and save it as postgre_keys.py
  4. Get the data
     - FastText model for Finnish trained by Facebook using Finnish Wikipedia: Facebook's trained models

Usage:
Collect new data

usage:

    collector.py [-h] [--user USER] [--password PASSWORD]
                 [--hostname HOSTNAME] [--outdir OUTDIR]
                 [--startdate STARTDATE] [--enddate ENDDATE]

    optional arguments:
      -h, --help            show this help message and exit
      --user USER           Username
      --password PASSWORD   Password
      --hostname HOSTNAME   Hostname
      --outdir OUTDIR       Directory to store data
      --startdate STARTDATE
                            Startdate as YYYY-MM-DD
      --enddate ENDDATE     Enddate as YYYY-MM-DD

Example:

./collector.py --startdate 2017-03-01 --enddate 2017-03-15
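
For reference, a command-line interface with these options could be declared with `argparse` roughly as follows. This is a minimal sketch reconstructed from the usage text only, not the actual contents of collector.py; the helper names `parse_date` and `build_parser` are hypothetical, and parsing the dates into `date` objects is an assumption.

```python
import argparse
from datetime import date, datetime


def parse_date(value: str) -> date:
    """Parse a YYYY-MM-DD string; report bad input as an argparse error."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the options listed in the usage text (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="collector.py")
    parser.add_argument("--user", help="Username")
    parser.add_argument("--password", help="Password")
    parser.add_argument("--hostname", help="Hostname")
    parser.add_argument("--outdir", help="Directory to store data")
    parser.add_argument("--startdate", type=parse_date,
                        help="Startdate as YYYY-MM-DD")
    parser.add_argument("--enddate", type=parse_date,
                        help="Enddate as YYYY-MM-DD")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(
    ["--startdate", "2017-03-01", "--enddate", "2017-03-15"]
)
```

Options that are not given on the command line default to `None`, so the caller can fall back to values from a configuration file.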

Train predictor

Example:

./predict.py --inputdir data/incoming --outdir data/output/ --featurename bow --featurefile data/models/feature_extractor_bow.pkl --predictor data/models/fasttext_svm.pkl

Predict hate speech

Example:

./predict.py --inputdir data/incoming --outdir data/output/ --featurename bow --featurefile data/models/feature_extractor_bow.pkl --predictor data/models/bow_svm.pkl
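
The prediction step pairs a stored feature extractor with a stored classifier. The sketch below shows that idea with scikit-learn, assuming the `.pkl` files are pickled scikit-learn objects (the README does not say so explicitly). The training texts are toy placeholders, `LinearSVC` stands in for whichever SVM variant the project uses, and the function name `predict` is hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy placeholder data, not project data. Label 1 = flagged, 0 = not flagged.
train_texts = [
    "awful nasty insult",
    "nasty awful remark",
    "lovely sunny day",
    "sunny lovely morning",
]
train_labels = [1, 1, 0, 0]

# Fit a bag-of-words extractor and a linear SVM on top of it.
vectorizer = CountVectorizer()
classifier = LinearSVC(random_state=0)
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

# Serialize both objects, as one would when writing them to .pkl files.
feature_blob = pickle.dumps(vectorizer)
predictor_blob = pickle.dumps(classifier)


def predict(texts, feature_blob, predictor_blob):
    """Restore the extractor and classifier, then label the given texts."""
    extractor = pickle.loads(feature_blob)
    predictor = pickle.loads(predictor_blob)
    return predictor.predict(extractor.transform(texts))


predictions = predict(["awful nasty", "lovely sunny"], feature_blob, predictor_blob)
```

The extractor must be the same object that was fitted at training time, because the classifier's weights are tied to that vocabulary. That would explain why the command takes both `--featurefile` and `--predictor`.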

Sync data

Example:

./sync.py --inputdir data/output/

TODO
  1. CNN on Embedding Matrix (cf. Willi)
  2. Stemmings, stop words for BoW
  3. Study SVM factors (with BoW)
  4. Mezadona ? To Models
  5. Plot t-SNE manifolds for the Wikipedia model and the Twitter model
  6. Highlight hatewords

DONE:

  1. Try Naive Bayes classifier with BoW
     - Naive Bayes (Gaussian) performed comparably to RF, but worse than SVM
     - With FastText it performed poorly
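
A comparison of this kind can be set up along the following lines. This is only an illustrative sketch: the texts are toy placeholders, the hyperparameters are arbitrary, and the scores it produces say nothing about the results reported above. One practical detail it shows is that `GaussianNB` needs a dense array, while `CountVectorizer` returns a sparse matrix.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Toy placeholder data, not project data. Label 1 = flagged, 0 = not flagged.
texts = [
    "awful nasty insult", "nasty awful remark",
    "horrible nasty words", "awful horrible insult",
    "lovely sunny day", "sunny lovely morning",
    "pleasant sunny walk", "lovely pleasant day",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words counts, converted to a dense array because GaussianNB
# does not accept sparse input. Fitting the vocabulary on all data
# before cross-validation is a simplification made for brevity.
features = CountVectorizer().fit_transform(texts).toarray()

models = {
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "svm": LinearSVC(random_state=0),
}

# Mean 2-fold cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, features, labels, cv=2).mean()
    for name, model in models.items()
}
```

On real data the vectorizer should be fitted inside each cross-validation fold, for example through a scikit-learn `Pipeline`, so that the test folds do not influence the vocabulary.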
