dssg/repo-scraper

Name: repo-scraper

Owner: Data Science for Social Good

Description: Search for potential passwords/data leaks in a folder or git repo

Created: 2015-11-12 16:04:33.0

Updated: 2018-05-03 18:42:27.0

Pushed: 2015-12-10 20:33:00.0

Homepage:

Size: 74

Language: Python

README

repo_scraper

Check your projects for possible password (or other sensitive data) leaks.

The library exposes two commands:

check-dir: checks the files below a folder
check-repo: checks the history of a git repository

Both commands work almost the same from the user's point of view. Run check-dir --help or check-repo --help for more details.

Example

Check your dummy-project:

check-dir dummy-project

Output:

Checking folder dummy-project...

dummy-project/python_file_with_password.py
ALERT - MATCH ["password = 'qwerty'"]

dummy-project/dangerous_file.json
ALERT - MATCH ['"password": "super-secret-password"']
How does it work?

Briefly, check-dir lists all files below a folder and applies regular expressions to look for passwords/IPs. A blind search could take far too long (for example, if the repo contains a 50MB csv file), so some filters are applied before the regular expressions are matched.
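The idea of filtering before matching can be sketched as follows. This is a minimal illustration and not the library's code: the regular expression, the size limit, and the function name scan_folder are all assumptions made for the example.

```python
import os
import re

# Hypothetical pattern and size limit, chosen for illustration only.
PASSWORD_RE = re.compile(r"""password["']?\s*[:=]\s*["'][^"']+["']""", re.IGNORECASE)
MAX_BYTES = 1_000_000


def scan_folder(root, max_bytes=MAX_BYTES):
    """Return {path: [matches]} for files under root that pass the size filter."""
    findings = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Filter first: skip large files so the regex never reads them.
            if os.path.getsize(path) > max_bytes:
                continue
            with open(path, encoding="utf-8", errors="ignore") as handle:
                matches = PASSWORD_RE.findall(handle.read())
            if matches:
                findings[path] = matches
    return findings
```

Checking the size before opening the file keeps the cost of a skipped file close to zero, which is the point of applying filters ahead of the regular expressions.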

check-repo works in a slightly different way. One obvious way to check git history is to check out each commit and apply check-dir, but that approach would be really slow since the script would check the same files many times. Instead, check-repo checks out the first commit, runs check-dir there, and then moves up one commit at a time, using git diff to get only the difference between each consecutive pair of commits.
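A rough sketch of the commit-by-commit approach, assuming a linear history and a git executable on the PATH. The helper names (git, commits_oldest_first, consecutive_pairs, added_lines, walk_history) are hypothetical and only illustrate the idea; they are not the library's API.

```python
import subprocess


def git(repo, *args):
    """Run a git command inside repo and return its standard output as text."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def commits_oldest_first(repo):
    """List commit hashes reachable from HEAD, oldest first."""
    return git(repo, "rev-list", "--reverse", "HEAD").split()


def consecutive_pairs(commits):
    """Pair each commit with the one that follows it."""
    return list(zip(commits, commits[1:]))


def added_lines(diff_text):
    """Keep only the lines a diff adds, skipping the '+++ b/file' headers."""
    return [
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++ ")
    ]


def walk_history(repo):
    """Yield (commit, added_lines) for every commit after the first."""
    for older, newer in consecutive_pairs(commits_oldest_first(repo)):
        yield newer, added_lines(git(repo, "diff", older, newer))
```

Only the added lines of each diff need to be searched, so a file that never changes is read once (in the first commit) rather than once per commit.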

As in check-dir, the script applies some filters before applying regular expressions to prevent getting stuck on big files. Note that in this case we are not dealing with files but with the git diff output, which prevents us from checking file size directly.
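One way to approximate a size filter on diff output is to split the diff into per-file sections and skip the sections that are too long. The sketch below is an assumption about how such a filter could look, not a description of the filters check-repo actually applies; the character budget and function names are hypothetical.

```python
import re

# Hypothetical budget: diff sections longer than this are skipped, not scanned.
MAX_SECTION_CHARS = 100_000


def split_by_file(diff_text):
    """Split 'git diff' output into one chunk per changed file."""
    # Zero-width split at every line that starts a new per-file header.
    chunks = re.split(r"(?m)^(?=diff --git )", diff_text)
    return [chunk for chunk in chunks if chunk.startswith("diff --git ")]


def scannable_sections(diff_text, max_chars=MAX_SECTION_CHARS):
    """Return the per-file chunks small enough to run regular expressions on."""
    return [c for c in split_by_file(diff_text) if len(c) <= max_chars]
```

The length of a section measures how much changed, not how big the file is, so this is only a stand-in for the file size check used on folders.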

The project has some limitations. See the NOTES file for information on the design of the project and how it limits what the library is able to detect.

Installation
pip install git+git://github.com/dssg/repo-scraper.git -r requirements.txt
Dependencies
Tested with
Usage
cd path/to/your/project
check-dir

See help for more options available:

check-dir --help
Using an IGNORE file with check-dir

Just as with git, you can specify a file to make the program ignore some files/folders. This is especially useful when you have a folder with many log files that you are sure do not contain sensitive data. The library assumes one glob rule per line.

Adding an IGNORE file will make execution faster, since many regular expressions are matched against all files that have certain characteristics.

Important: Even though the format is very similar, you cannot use the same rules as in your .gitignore file. For more details, see this.
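To illustrate the "one glob rule per line" format, here is a small sketch that uses Python's fnmatch module. The use of fnmatch, the example rules, and the names load_rules and is_ignored are assumptions for the example; the library may match rules differently.

```python
import fnmatch


def load_rules(text):
    """Parse IGNORE-style content: one glob rule per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def is_ignored(path, rules):
    """True if path matches at least one glob rule."""
    return any(fnmatch.fnmatchcase(path, rule) for rule in rules)


# Hypothetical IGNORE content: skip everything under logs/ and all csv files.
rules = load_rules("logs/*\n*.csv\n")
print(is_ignored("logs/app.log", rules))
print(is_ignored("src/main.py", rules))
```

Note that in fnmatch the * wildcard also matches path separators, which is one example of how glob rules can behave differently from .gitignore rules.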

What's done
What's missing

TODO


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.