cncf/gitdm

Name: gitdm

Owner: Cloud Native Computing Foundation (CNCF)

Description: Fork for tracking CNCF projects

Created: 2017-04-17 15:45:10.0

Updated: 2018-05-24 08:34:12.0

Pushed: 2018-05-24 08:34:07.0

Homepage: https://cncf.io

Size: 92780

Language: HTML

README

CNCF gitdm

This is the Cloud Native Computing Foundation's fork of Jon Corbet and Greg KH's gitdm tool for calculating contributions based on developers and their companies. Companies and developers can check if they are correctly attributed at the following links:

Company Developers list

Developers affiliations list

If you find any errors in those lists, please submit a pull request with edits. However, only the Developers affiliations list should be edited manually. The Company Developers list is computed from it.

Removing affiliations

If you do not want your email listed here, please read how to remove your email.

Testing changes

You can test any changes locally by cloning this repository and regenerating all data by running ./rerun_data.sh.

Then generate config files by running: ./import_affs.sh.

If those two files are out of sync, the tool will notify you about this.

This tool will generate a new email-map file.

Check that your changes were processed properly, then move the file to cncf-config/email-map, replacing the existing file.

Sync workflow

Please follow the instructions from SYNC.md.

Running

Use *.sh scripts to run analytics (all*.sh for full analysis and rels*.sh for per release stats)

This program assumes that gitdm resides in: ~/dev/cncf/gitdm/ and that kubernetes is in ~/dev/go/src/k8s.io/kubernetes/

Output files are placed in the kubernetes directory.

To regenerate all statistics just run: ./rerun_data.sh

This is an iterative process: run any of the scripts, review its output in the kubernetes directory, and iteratively adjust the mappings to handle more authors.

You can also run via ./debug.sh to halt in the debugger and review the hackers structure and the developers who were not found. See cncfdm.py:DebugUnknowns.

Final report:

Data

Report

Contributing

Pull requests are welcome.

Our mapping is never complete; please see the configuration files in Config files.

The email-map file maps email addresses directly to employers.
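As an illustration of what a direct email-to-employer mapping looks like in code, here is a minimal Python sketch. The line layout (an email address, whitespace, then the employer name, with `#` starting a comment) is an assumption made for this example, and the sample entries are hypothetical; check the real cncf-config/email-map file before relying on it.

```python
def parse_email_map(text):
    """Parse 'email employer' lines into a dict (assumed layout, see lead-in)."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        parts = line.split(None, 1)  # employer names may contain spaces
        if len(parts) != 2:
            continue  # skip lines that do not fit the assumed layout
        email, employer = parts
        mapping[email.lower()] = employer.strip()
    return mapping


# Hypothetical entries, not taken from the real file.
SAMPLE_EMAIL_MAP = (
    "# comment line\n"
    "alice@example.com Example Corp\n"
    "bob@example.org Independent\n"
)

email_map = parse_email_map(SAMPLE_EMAIL_MAP)
print(email_map["alice@example.com"])
```

A dictionary keyed by lowercased email keeps lookups simple; whether the real tooling treats emails case-insensitively is not stated here.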

There is also a long list of unknown emails. For that, scroll to the section called Developers with unknown affiliation: in all.txt

All of those were searched for in various sources but we were not able to find their affiliation.

Detailed Description

Regenerating all data with ./rerun_data.sh means:

After performing those two steps, the cncfdm.py output needs to be analysed. This is done by calling ./analysis_all.sh (analyses all-time results) and then ./analysis_rels.sh (for per-release data).

Data for all 68 repos (currently) that make up the entire Kubernetes project is generated with the ./kubernetes_repos.sh script.

Final files generated by first 2 calls (for single repo kubernetes/kubernetes) are in kubernetes/all_time/*.txt and ./kubernetes/v1.X.0-v1.Y.0/*.txt

All scripts are configured to ignore commits related to files from the vendor and Godeps directories. This is because external sources are placed there and many commits just add external libraries. Accounting for them would make the results less accurate.
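To make the effect of that exclusion concrete, here is a small Python sketch that drops `git log --numstat` entries touching vendor or Godeps. It is only an illustration: the repository's scripts may achieve the same thing through git log arguments, and treating any path component named vendor or Godeps as excluded is an assumption of this example.

```python
EXCLUDED_DIRS = ("vendor", "Godeps")


def is_excluded(path):
    """Return True if any component of the path is an excluded directory."""
    return any(part in EXCLUDED_DIRS for part in path.split("/"))


def filter_numstat(lines):
    """Keep numstat lines ('added<TAB>removed<TAB>path') outside excluded dirs.

    Lines that are not numstat entries (commit headers, messages) pass through.
    """
    kept = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) == 3 and is_excluded(fields[2]):
            continue
        kept.append(line)
    return kept


# Hypothetical numstat entries for illustration.
sample = [
    "10\t2\tpkg/api/types.go",
    "500\t0\tvendor/github.com/foo/bar.go",
    "3\t1\tGodeps/Godeps.json",
]
print(filter_numstat(sample))
```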

All of them use a git log call with specific args piped to cncfdm.py call with specific parameters.

See ./run.sh for an example. All other calls use the same commands git log and cncfdm.py with other parameters.

To get a list of parameters for cncfdm.py, see comments inside of the cncfdm.py file describing all possible options.

For more details about how cncfdm.py tool works refer to its sources and other *.py files.

Those files are analysed by ./analysis_all.sh and ./analysis_rels.sh.

The first one calls: ruby analysis.rb all kubernetes/all_time/first_run_patch.txt kubernetes/all_time/run_no_map_patch.txt kubernetes/all_time/run_with_map_patch.txt

The second calls: ruby analysis.rb v1.0_v1.1 kubernetes/*/output_strict_patch.txt kubernetes/*/output_patch.txt kubernetes/*/output_no_map_patch.txt

This Ruby tool expects three files: the first with no mapping of unknown developers, the second with unknown developers mapped to a domain name, and the third with unknown developers mapped to (Unknown).

The output of the analysis.rb tool goes to project/<prefix>_<key>_<type>.csv files, where:

<prefix>: can be all or v1.X.0-v1.Y.0; it indicates whether the file covers all-time data or a specific release of kubernetes/kubernetes.

<key>: can be changeset, employers, lines or signoffs; it means that the file contains data sorted by this key in descending order.

<type>: can be sum, top or all.
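The naming scheme can be enumerated with a short Python sketch. It assumes that every combination of key and type is produced for a given prefix, which is not confirmed here; the helper name `output_files` is made up for this example.

```python
from itertools import product

KEYS = ("changeset", "employers", "lines", "signoffs")
TYPES = ("sum", "top", "all")


def output_files(prefix):
    """List the CSV paths expected for one prefix (all or v1.X.0-v1.Y.0)."""
    return [
        f"project/{prefix}_{key}_{type_}.csv"
        for key, type_ in product(KEYS, TYPES)
    ]


files = output_files("all")
print(len(files))  # 12 combinations: 4 keys x 3 types
print(files[0])    # project/all_changeset_sum.csv
```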

This data is directly used for “Who writes Kubernetes” report.

./kubernetes_repos.sh script is used to generate all time data for all kubernetes repos.

To use it, you must have all of kubernetes repositories (68 from 3 different organizations) cloned in ~/dev/go/src/k8s/.

Orgs are: kubernetes, kubernetes-incubator, kubernetes-client.

It generates statistics for each single repo via: ./anyrepo.sh ~/dev/go/src/k8s.io/<repo-name> <repo-name>

See details in ./kubernetes_repos.sh. The first argument is the directory where a given kubernetes repository is cloned.

To clone a repository, run cd ~/dev/go/src/k8s/ and then git clone https://github.com/<one-of-3-kubernetes-orgs>/<kubernetes-repo-name>.git

one-of-3-kubernetes-orgs: kubernetes, kubernetes-incubator and kubernetes-client

kubernetes-repo-name: please look up all repo names in all kubernetes orgs on GitHub.

./anyrepo.sh just calls cncfdm.py with appropriate args (such as excluding the vendor directory, numstat, etc.).

There is also ./anyreporange.sh that allows querying a repo for a specific time range (cncfdm.py supports that as well).

Output of this goes to repos/<repo-name>.<ext>, where:

<repo-name>: the repository name ./anyrepo.sh was called with.

<ext>: txt, csv, html or out. txt is the main data file; csv dumps the list of employers in the given repo; html is the same as txt but in HTML format; out holds the cncfdm.py verbose output messages (for debugging).

Finally, ./kubernetes_repos.sh calls: ./multirepo.sh with all 68 repository directories listed.

It gathers the git log of each of them, concatenates all those files and then runs cncfdm.py on the concatenated result (see ./multirepo.sh).

Results are saved to repos/combined.<ext>, where <ext> is the same as for anyrepo.sh.

The typical workflow is re-running ./kubernetes_repos.sh and examining repos/combined.txt for unknown developers.
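The gather-and-concatenate step can be sketched in Python as follows. This is not a port of ./multirepo.sh: the `--numstat` argument is an assumption, and the real script should be consulted for the exact git log arguments and for how the result is handed to cncfdm.py.

```python
import subprocess


def git_log_command(repo_dir):
    """Build the git log invocation for one repository.

    --numstat is an assumption; see ./multirepo.sh for the real arguments.
    """
    return ["git", "-C", repo_dir, "log", "--numstat"]


def combined_log(repo_dirs, run=subprocess.run):
    """Run git log in every repository and concatenate the outputs in order."""
    chunks = []
    for repo_dir in repo_dirs:
        result = run(
            git_log_command(repo_dir),
            capture_output=True,
            text=True,
            check=True,  # fail loudly if a directory is not a git repository
        )
        chunks.append(result.stdout)
    return "".join(chunks)
```

The `run` parameter exists only so the function can be exercised without real repositories. The returned text is what would be piped to cncfdm.py on standard input.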

Research on Google, Clearbit, FullContact, GitHub, LinkedIn, Facebook or any other source, then update cncf-config/<filename> and re-run ./kubernetes_repos.sh. <filename> is usually, in this order: email-map, domain-map, and in very rare cases aliases, gitdm.config-cncf or the group mappings in groups/.

Also, when running data for the single kubernetes/kubernetes repository, for example with ./all.sh, examine the developers found in ./kubernetes/all_time/first_run_patch.txt.

After all this data is generated, ./kubernetes_repos.sh concatenates all single repo data into a single output file: repos/merged.out to allow browsing all the data in a single file.

It also generates developers and companies statistics via a ./topdevs.sh call.

It calls a ruby tool on the combined output of all 68 kubernetes repos (saved as CSV) like so: ruby topdevs.rb repos/combined.csv

That tool generates files as follows:

There are clearbit tools in clearbit_tools/ directory.

Look for any files with the .rb extension. Three rounds of commercial Clearbit requests were performed, and they returned quite a lot of data.

However, those data files are not checked in and are listed in ./.gitignore because we have to pay for that data.

Those tools are used to enrich the cncf-config/email-map mapping.

google_other.txt contains a list of Google developers with an email on a domain other than @google.com. The ./changesets.csv, ./added.csv and ./removed.csv files contain developers sorted by changesets, added lines and removed lines, in descending order.

A new set of tools to get Clearbit and FullContact data is located in the affiliation_finder/ directory. The two tools are described in the 'Tools to help find unknown affiliations' section of this document.

This is used to generate Top N developers in given criteria.

./new_devs.sh (also used by ./rerun_data.sh) is used to generate statistics about new developers between kubernetes/kubernetes releases.

It calls: ruby new_devs.rb kubernetes/v1.X.0-v1.Y.0/output_strict_patch.csv for all X and Y. new_devs.rb simply generates information about developers who were new between each release and file new_devs.csv, which contains a list of companies who introduced most new developers overall (sorted by # of new developers desc).
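The idea behind the new-developer statistics can be sketched in Python. This is not the logic of new_devs.rb itself: the input shape (an ordered list of release ranges, each with a developer-to-company dictionary) and the sample data are hypothetical, and in this sketch everyone in the first range counts as new.

```python
from collections import Counter


def new_developers(releases):
    """Return developers first seen in each release range, plus company totals.

    releases: ordered list of (release_range, {developer: company}) pairs.
    """
    seen = set()
    per_release = {}
    companies = Counter()
    for release, affiliations in releases:
        fresh = sorted(dev for dev in affiliations if dev not in seen)
        per_release[release] = fresh
        companies.update(affiliations[dev] for dev in fresh)
        seen.update(fresh)
    # most_common() sorts companies by number of new developers, descending
    return per_release, companies.most_common()


# Hypothetical data for illustration.
releases = [
    ("v1.0.0-v1.1.0", {"alice": "Example Corp", "bob": "Independent"}),
    ("v1.1.0-v1.2.0", {"bob": "Independent", "carol": "Example Corp"}),
]
per_release, by_company = new_developers(releases)
print(per_release["v1.1.0-v1.2.0"])  # ['carol']
print(by_company)                     # [('Example Corp', 2), ('Independent', 1)]
```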

That covers typical usage and data for the “Who writes Kubernetes” report.

Other tools

Other tools include:

To work on Prometheus contributors before and after joining CNCF:

Prometheus joined CNCF on 2016-05-09.

You need to clone all Prometheus repos into ~/dev/prometheus using ./clone_prometheus.sh

Then you need to get the number of distinct Prometheus contributors before joining CNCF: ./prometheus_repos.sh 2015-05-09 2016-05-08 ~/dev/prometheus/

Result is:

Processed 2721 csets from 230 developers
employers found
A total of 1558445 lines added, 353900 removed (delta 1204545)

Now check the number of distinct contributors after 2016-05-09: ./prometheus_repos.sh 2016-05-09 2017-06-01 ~/dev/prometheus/

Processed 2817 csets from 346 developers
employers found
A total of 2696196 lines added, 771502 removed (delta 1924694)

We have a change from 230 to 346 developers, which is a 50% increase.
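The comparison is a relative increase in the number of distinct developers. A minimal Python sketch of that calculation follows, with a helper that pulls the developer count out of a summary line of the form shown in the outputs. The helper names and the demo lines are hypothetical.

```python
import re


def developer_count(summary_line):
    """Extract the developer count from a '... N csets from M developers' line."""
    match = re.search(r"(\d+) csets from (\d+) developers", summary_line)
    if match is None:
        raise ValueError(f"not a summary line: {summary_line!r}")
    return int(match.group(2))


def percent_increase(before, after):
    """Relative growth from before to after, in percent."""
    return (after - before) / before * 100.0


# Hypothetical summary lines, used only to show the calculation.
before = developer_count("Processed 1000 csets from 200 developers")
after = developer_count("Processed 1500 csets from 250 developers")
print(f"{percent_increase(before, after):.0f}%")  # 25%
```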

Report

Links to data and generated report are here: ./res/links.txt

CNCF Projects join statistics

Typical update of the “Who writes Kubernetes” report

Affiliations of some developers are uncertain despite best efforts. These developers are listed in the uncertain.csv file.

GitHub users can be pulled using the Octokit GitHub API.

To do this, call: ruby ghusers.rb or ./ghusers.sh

Required are:

Tools to help find unknown affiliations

To enhance this json with pre-existing affiliations, call: ./enchance_json.sh

All those tools are automatically called when running the full data regeneration script: ./rerun_data.sh

The first one works with one argument and generates a file clearbit_affiliation_lookup.csv. The argument can be skipped or have a value of 'true' or 'false' (the default). Invocation would be clearbit_affiliation_lookup.rb, clearbit_affiliation_lookup.rb false or clearbit_affiliation_lookup.rb true. The argument determines whether the script's output data should be overwritten (normally data would be appended to the file); when it is 'true', previously looked-up email addresses are also checked again.
The execution environment needs to have a proper value for this:

```
Clearbit.key = ENV['CLEARBIT_KEY']
```

It is a secret API key on a Clearbit account which has been set up for subscription. When the file is generated, open it in a csv editor, sort by the 'chance' field. Visually check and correct data in the 'affiliation_suggestion' column. Replace values such as 'http://www.ghostcloud.cn/' with 'Ghostcloud'. If you find affiliations for other developers manually, just change the 'none' value in the 'chance' column to 'high' and provide a value in the 'affiliation_suggestion' column. Columns to the right of 'affiliation_suggestion' are not required.
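The manual review steps can also be scripted. The Python sketch below sorts rows by the 'chance' column and records a manually found affiliation by setting 'chance' to 'high'. The 'chance' and 'affiliation_suggestion' column names come from the description above; the 'email' column, the sample rows and the plain alphabetical sort order are assumptions of this example.

```python
import csv
import io


def sort_by_chance(csv_text):
    """Read the lookup CSV and return its rows sorted by the 'chance' column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda row: row["chance"])


def set_manual_affiliation(row, employer):
    """Return a copy of the row carrying a manually researched affiliation."""
    updated = dict(row)
    updated["chance"] = "high"
    updated["affiliation_suggestion"] = employer
    return updated


# Hypothetical rows; the real file has more columns.
SAMPLE_LOOKUP = (
    "email,chance,affiliation_suggestion\n"
    "dev1@example.com,none,\n"
    "dev2@example.com,high,http://www.ghostcloud.cn/\n"
)

rows = sort_by_chance(SAMPLE_LOOKUP)
fixed = set_manual_affiliation(rows[0], "Ghostcloud")
print(fixed["affiliation_suggestion"])  # Ghostcloud
```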

The second script reads the 'clearbit_affiliation_lookup.csv' file. Data is processed against the cncf-config/email-map file. When done, the 'email-map' file will have new and updated affiliations. The file will be sorted as well. The lookup file will not be altered.

The first one works with one argument and generates a file fullcontact_affiliation_lookup.csv. The argument can be skipped or have a value of 'true' or 'false' (the default). Invocation would be fullcontact_affiliation_lookup.rb, fullcontact_affiliation_lookup.rb false or fullcontact_affiliation_lookup.rb true. The argument determines whether the script's output data should be overwritten (normally data would be appended to the file); when it is 'true', previously looked-up email addresses are also checked again.
The execution environment needs to have a proper value for this:

```
config.api_key = ENV['FULLCONTACT_KEY']
```

It is a secret API key on a FullContact account which has been set up for subscription. The columns differ in this file compared to that of Clearbit. If you find affiliations for other developers manually, just change the value in the 'org_1' column. The column by default should have 5 pipe-delimited values. If you do not have the values for the other 4, just type 4 pipes. Columns to the right of 'org_1' are not required.
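For a manually found affiliation, the 'org_1' value with its four trailing pipes can be built as in this Python sketch. Placing the organization name first among the five pipe-delimited values is an assumption based on the "just type 4 pipes" instruction; the helper name is made up for this example.

```python
ORG_FIELD_COUNT = 5  # 'org_1' is expected to hold five pipe-delimited values


def org_1_value(employer):
    """Build an 'org_1' cell with the employer first and the other four values empty."""
    return "|".join([employer] + [""] * (ORG_FIELD_COUNT - 1))


value = org_1_value("Ghostcloud")
print(value)  # Ghostcloud||||
```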

The second script reads the 'fullcontact_affiliation_lookup.csv' file. Data is processed against the cncf-config/email-map file. When done, the 'email-map' file will have new and updated affiliations. The file will be sorted as well. The lookup file will not be altered. The merge script also exports developer work history to fullcontact_developer_historical_irganizations.csv.

