src-d / hercules
- Saturday, September 28, 2024 at 00:00:02
Gaining advanced insights from Git repository history.
Fast, insightful and highly customizable Git history analysis.
Hercules is an amazingly fast and highly customizable Git repository analysis engine written in Go. Batteries are included. Powered by go-git.
Notice (November 2020): the main author is back from limbo and is gradually resuming development. See the roadmap.
There are two command-line tools: hercules and labours. The first is a program written in Go which takes a Git repository and executes a Directed Acyclic Graph (DAG) of analysis tasks over the full commit history. The second is a Python script which shows some predefined plots over the collected data. These two tools are normally used together through a pipe. It is possible to write custom analyses using the plugin system. It is also possible to merge several analysis results together - relevant for organizations.
The analyzed commit history includes branches, merges, etc.
Hercules has been successfully used for several internal projects at source{d}. There are blog posts: 1, 2 and a presentation. Please contribute by testing, fixing bugs, adding new analyses, or coding swagger!
The DAG of burndown and couples analyses with UAST diff refining. Generated with hercules --burndown --burndown-people --couples --feature=uast --dry-run --dump-dag doc/dag.dot https://github.com/src-d/hercules
torvalds/linux line burndown (granularity 30, sampling 30, resampled by year). Generated with hercules --burndown --first-parent --pb https://github.com/torvalds/linux | labours -f pb -m burndown-project in 1h 40min.
Grab the hercules binary from the Releases page. labours is installable from PyPI:
pip3 install labours
pip3 is the Python package manager.
Numpy and Scipy can be installed on Windows using http://www.lfd.uci.edu/~gohlke/pythonlibs/
You are going to need Go (>= v1.11) and protoc.
git clone https://github.com/src-d/hercules && cd hercules
make
pip3 install -e ./python
It is possible to run Hercules as a GitHub Action: Hercules on GitHub Marketplace. Please refer to the sample workflow which demonstrates how to set it up.
Contributions are welcome! See CONTRIBUTING and code of conduct.
The most useful and reliably up-to-date command line reference:
hercules --help
Some examples:
# Use "memory" go-git backend and display the burndown plot. "memory" is the fastest but the repository's git data must fit into RAM.
hercules --burndown https://github.com/go-git/go-git | labours -m burndown-project --resample month
# Use "file system" go-git backend and print some basic information about the repository.
hercules /path/to/cloned/go-git
# Use "file system" go-git backend, cache the cloned repository to /tmp/repo-cache, use Protocol Buffers and display the burndown plot without resampling.
hercules --burndown --pb https://github.com/git/git /tmp/repo-cache | labours -m burndown-project -f pb --resample raw
# Now something fun
# Get the linear history from git rev-list, reverse it
# Pipe to hercules, produce burndown snapshots for every 30 days grouped by 30 days
# Save the raw data to cache.yaml, so that it is possible to run labours -i cache.yaml later
# Pipe the raw data to labours, set text font size to 16pt, use Agg matplotlib backend and save the plot to git.png
git rev-list HEAD | tac | hercules --commits - --burndown https://github.com/git/git | tee cache.yaml | labours -m burndown-project --font-size 16 --backend Agg --output git.png
labours -i /path/to/yaml reads the output from hercules which was saved on disk.
It is possible to store the cloned repository on disk. The subsequent analysis can run on the corresponding directory instead of cloning from scratch:
# First time - cache
hercules https://github.com/git/git /tmp/repo-cache
# Second time - use the cache
hercules --some-analysis /tmp/repo-cache
The action produces the artifact named hercules_charts. Since it is currently impossible to pack several files in one artifact, all the charts and Tensorflow Projector files are packed in the inner tar archive. In order to view the embeddings, go to projector.tensorflow.org, click "Load" and choose the two TSVs. Then use UMAP or T-SNE.
docker run --rm srcd/hercules hercules --burndown --pb https://github.com/git/git | docker run --rm -i -v $(pwd):/io srcd/hercules labours -f pb -m burndown-project -o /io/git_git.png
hercules --burndown
labours -m burndown-project
Line burndown statistics for the whole repository. Exactly what git-of-theseus does, but much faster. Blaming is performed efficiently and incrementally using a custom RB tree tracking algorithm, and only the last modification date is recorded while running the analysis.
All burndown analyses depend on the values of granularity and sampling. Granularity is the number of days each band in the stack consists of. Sampling is the frequency with which the burndown state is snapshotted. The smaller the value, the smoother the plot but the more work is done.
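As a rough illustration of how the two parameters trade detail for work, the sketch below estimates how many bands and snapshots a project of a given age would produce. The function name and the rounding are assumptions made for illustration; the exact counting inside hercules may differ.

```python
import math


def burndown_shape(lifetime_days, granularity, sampling):
    """Estimate (bands, snapshots) for a project that is lifetime_days old.

    Hypothetical helper: each band covers `granularity` days and a snapshot
    is taken every `sampling` days, so both counts are rounded up.
    """
    bands = math.ceil(lifetime_days / granularity)
    snapshots = math.ceil(lifetime_days / sampling)
    return bands, snapshots


# One year of history with the defaults used in the Linux example above.
print(burndown_shape(365, 30, 30))
# Same bands, but weekly snapshots: a smoother plot at the cost of more work.
print(burndown_shape(365, 30, 7))
```

Lowering sampling increases only the number of snapshots, while lowering granularity increases the number of bands in the stack.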
There is an option to resample the bands inside labours, so that you can define a very precise distribution and visualize it in different ways. Besides, resampling aligns the bands across periodic boundaries, e.g. months or years. Unresampled bands are not aligned and start from the project's birth date.
hercules --burndown --burndown-files
labours -m burndown-file
Burndown statistics for every file in the repository which is alive in the latest revision.
Note: it will generate a separate graph for every file. You don't want to run it on a repository with many files.
hercules --burndown --burndown-people [--people-dict=/path/to/identities]
labours -m burndown-person
Burndown statistics for the repository's contributors. If --people-dict is not specified, the identities are discovered by the following algorithm:
If --people-dict is specified, it should point to a text file with the custom identities. The format is: every line is a single developer, it contains all the matching emails and names separated by |. The case is ignored.
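A minimal sketch of reading such an identities file, assuming blank lines are skipped and surrounding whitespace is not significant (neither is stated above). The function name and the sample identities are hypothetical, and this is not the parser hercules itself uses.

```python
def parse_people_dict(text):
    """Map every lower-cased alias (name or email) to a developer index.

    Each non-empty line describes one developer; aliases are separated
    by "|" and compared case-insensitively.
    """
    alias_to_dev = {}
    lines = [line for line in text.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        for alias in line.split("|"):
            alias = alias.strip().lower()
            if alias:
                alias_to_dev[alias] = index
    return alias_to_dev


# Hypothetical identities file with two developers.
sample = (
    "Jane Doe|jane@example.com|JDoe@Example.org\n"
    "John Smith|john@example.com\n"
)
identities = parse_people_dict(sample)
print(identities["jdoe@example.org"])  # 0
print(identities["john smith"])  # 1
```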
Wireshark top 20 devs - overwrites matrix
hercules --burndown --burndown-people [--people-dict=/path/to/identities]
labours -m overwrites-matrix
Besides the burndown information, --burndown-people collects the added and deleted line statistics per developer. Thus it can be visualized how many lines written by developer A are removed by developer B. This indicates collaboration between people and defines expertise teams.
The format is the matrix with N rows and (N+2) columns, where N is the number of developers.
(if --people-dict is not specified, it is always 0). The sequence of developers is stored in the people_sequence YAML node.
Ember.js top 20 devs - code ownership
hercules --burndown --burndown-people [--people-dict=/path/to/identities]
labours -m ownership
--burndown-people also allows drawing the code share through time as a stacked area plot. That is, how many lines are alive at the sampled moments in time for each identified developer.
torvalds/linux files' coupling in Tensorflow Projector
hercules --couples [--people-dict=/path/to/identities]
labours -m couples -o <name> [--couples-tmp-dir=/tmp]
Important: it requires Tensorflow to be installed, please follow official instructions.
The files are coupled if they are changed in the same commit. The developers are coupled if they change the same file. hercules records the number of couples throughout the whole commit history and outputs the two corresponding co-occurrence matrices. labours then trains Swivel embeddings - dense vectors which reflect the co-occurrence probability through the Euclidean distance. The training requires a working Tensorflow installation. The intermediate files are stored in the system temporary directory or --couples-tmp-dir if it is specified. The trained embeddings are written to the current working directory with the name depending on -o. The output format is TSV and matches Tensorflow Projector so that the files and people can be visualized with t-SNE implemented in TF Projector.
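The file coupling rule above can be sketched in a few lines. This is only an illustration of counting co-occurrences, with a made-up commit list and a hypothetical function name; it is not how hercules stores its matrices, and it leaves out the Swivel training done by labours.

```python
from collections import Counter
from itertools import combinations


def file_cooccurrence(commits):
    """Count how often each pair of files changes in the same commit.

    `commits` is a list of lists of file paths (hypothetical input format).
    Pairs are stored with the two paths in sorted order.
    """
    counts = Counter()
    for files in commits:
        for a, b in combinations(sorted(set(files)), 2):
            counts[(a, b)] += 1
    return counts


# Made-up history of three commits.
commits = [
    ["README.md", "main.go"],
    ["main.go", "util.go"],
    ["README.md", "main.go", "util.go"],
]
counts = file_cooccurrence(commits)
print(counts[("main.go", "util.go")])  # 2
print(counts[("README.md", "util.go")])  # 1
```

Files with a high pair count end up close to each other once embeddings are trained on such a matrix.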
46 jinja2/compiler.py:visit_Template [FunctionDef]
42 jinja2/compiler.py:visit_For [FunctionDef]
34 jinja2/compiler.py:visit_Output [FunctionDef]
29 jinja2/environment.py:compile [FunctionDef]
27 jinja2/compiler.py:visit_Include [FunctionDef]
22 jinja2/compiler.py:visit_Macro [FunctionDef]
22 jinja2/compiler.py:visit_FromImport [FunctionDef]
21 jinja2/compiler.py:visit_Filter [FunctionDef]
21 jinja2/runtime.py:__call__ [FunctionDef]
20 jinja2/compiler.py:visit_Block [FunctionDef]
Thanks to Babelfish, hercules is able to measure how many times each structural unit has been modified. By default, it looks at functions; refer to Semantic UAST XPath manual to switch to something else.
hercules --shotness [--shotness-xpath-*]
labours -m shotness
Couples analysis automatically loads "shotness" data if available.
hercules --shotness --pb https://github.com/pallets/jinja | labours -m couples -f pb
tensorflow/tensorflow aligned commit series of top 50 developers by commit number.
hercules --devs [--people-dict=/path/to/identities]
labours -m devs -o <name>
We record how many commits were made, as well as lines added, removed and changed per day for each developer. We plot the resulting commit time series using a few tricks to show the temporal grouping. In other words, two adjacent commit series should look similar after normalization.
This plot shows how the development team evolved through time. It also shows "commit flashmobs" such as Hacktoberfest. For example, here are the revealed insights from the tensorflow/tensorflow plot above:
tensorflow/tensorflow added and changed lines through time.
hercules --devs [--people-dict=/path/to/identities]
labours -m old-vs-new -o <name>
--devs from the previous section also allows plotting how many lines were added and how many existing lines were changed (deleted or replaced) through time. This plot is smoothed.
kubernetes/kubernetes efforts through time.
hercules --devs [--people-dict=/path/to/identities]
labours -m devs-efforts -o <name>
Besides, --devs allows plotting how many lines have been changed (added or removed) by each developer. The upper part of the plot is an accumulated (integrated) lower part. It is impossible to have the same scale for both parts, so the lower values are scaled, and hence there are no lower Y axis ticks. There is a difference between the efforts plot and the ownership plot, although changing lines correlates with owning lines.
It can be clearly seen that Django comments were positive/optimistic in the beginning, but later became negative/pessimistic.
hercules --sentiment --pb https://github.com/django/django | labours -m sentiment -f pb
We extract new and changed comments from source code on every commit, apply the BiDiSentiment general purpose sentiment recurrent neural network and plot the results. Requires libtensorflow. E.g. "sadly, we need to hide the rect from the documentation finder for now" is negative and "Theano has a built-in optimization for logsumexp (...) so we can just write the expression directly" is positive. Don't expect too much though - as was written, the sentiment model is general purpose and the code comments have a different nature, so there is no magic (for now).
Hercules must be built with the "tensorflow" tag, which is not enabled by default:
make TAGS=tensorflow
Such a build requires libtensorflow.
hercules --burndown --burndown-files --burndown-people --couples --shotness --devs [--people-dict=/path/to/identities]
labours -m all
Hercules has a plugin system which allows running custom analyses. See PLUGINS.md.
hercules combine is the command which joins several analysis results in Protocol Buffers format together.
hercules --burndown --pb https://github.com/go-git/go-git > go-git.pb
hercules --burndown --pb https://github.com/src-d/hercules > hercules.pb
hercules combine go-git.pb hercules.pb | labours -f pb -m burndown-project --resample M
YAML does not support the whole range of Unicode characters and the parser on labours side may raise exceptions. Filter the output from hercules through fix_yaml_unicode.py to discard such offending characters.
hercules --burndown --burndown-people https://github.com/... | python3 fix_yaml_unicode.py | labours -m people
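To show what such filtering amounts to, here is a minimal sketch that drops every character outside the printable set defined by the YAML specification. It is an illustration written for this text, not the contents of fix_yaml_unicode.py, and the function name is hypothetical.

```python
import re

# Anything outside the YAML "printable" character set: tab, LF, CR,
# printable ASCII, NEL, and the non-surrogate Unicode ranges.
NON_PRINTABLE = re.compile(
    "[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD"
    "\U00010000-\U0010FFFF]"
)


def strip_non_yaml(text):
    """Remove characters that a YAML parser may reject."""
    return NON_PRINTABLE.sub("", text)


# Control characters are discarded, accented letters survive.
sample = "name: caf\u00e9\x00\x1b ok"
print(strip_non_yaml(sample))  # name: café ok
```

The same function can be applied line by line to standard input when used as a filter in a pipe.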
These options affect all plots:
labours [--style=white|black] [--backend=] [--size=Y,X]
--style sets the general style of the plot (see labours --help).
--background changes the plot background to be either white or black.
--backend chooses the Matplotlib backend.
--size sets the size of the figure in inches. The default is 12,9.
(Required on macOS) You can pin the default Matplotlib backend with
echo "backend: TkAgg" > ~/.matplotlib/matplotlibrc
These options are effective in burndown charts only:
labours [--text-size] [--relative]
--text-size changes the font size, --relative activates the stretched burndown layout.
It is possible to output all the information needed to draw the plots in JSON format. Simply append .json to the output (-o) and you are done. The data format is not fully specified and depends on the Python code which generates it. Each JSON file should contain "type" which reflects the plot kind.
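Since only the "type" key is promised, a consumer of these files would do well to read it defensively. The sketch below shows one way to do that; the sample document, its "payload" key and the "burndown-project" value are made up for illustration and say nothing about the real layout.

```python
import json


def plot_kind(json_text):
    """Return the plot kind stored under "type", or "unknown" if absent."""
    data = json.loads(json_text)
    return data.get("type", "unknown")


# Hypothetical file contents; everything except the "type" key is assumed.
sample = '{"type": "burndown-project", "payload": []}'
print(plot_kind(sample))  # burndown-project
```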
--first-parent as a workaround.
hercules' output for the Linux kernel in "couples" mode is 1.5 GB and takes more than an hour / 180GB RAM to be parsed. However, most of the repositories are parsed within a minute. Try using Protocol Buffers instead (hercules --pb and labours -f pb).
# Debian, Ubuntu
apt install libyaml-dev
# macOS
brew install yaml-cpp libyaml
# you might need to re-install pyyaml for the changes to take effect
pip uninstall pyyaml
pip --no-cache-dir install pyyaml
If the analyzed repository is big and extensively uses branching, the burndown stats collection may fail with an OOM. You should try the following:
Use --skip-blacklist to avoid analyzing the unwanted files. It is also possible to constrain the --language.
Use hibernation: --hibernation-distance 10 --burndown-hibernation-threshold=1000. Play with those two numbers to start hibernating right before the OOM.
Hibernate on disk: --burndown-hibernation-disk --burndown-hibernation-dir /path.
--first-parent, you win.
Switch from src-d/go-git to go-git/go-git. Upgrade the codebase to be compatible with the latest Go version.