antlr/codebuff

Name: codebuff

Owner: Antlr Project

Description: Language-agnostic pretty-printing through machine learning (uh, like, is this possible? YES, apparently).

Created: 2016-01-06 19:49:54.0

Updated: 2018-05-14 13:35:20.0

Pushed: 2017-06-15 23:51:08.0

Homepage:

Size: 6222

Language: Java

README

CodeBuff smart formatter

By Terence Parr (primary developer), Fangzhou (Morgan) Zhang (help with initial development), and Jurgen Vinju (co-author of the academic paper; help with empirical results and algorithm discussions).

kaby76 has done a C# port.

Abstract

Code formatting is not particularly exciting, but many researchers would consider it either unsolved or not well solved. The two well-established solutions are:

  1. Build a custom program that formats code for a specific language using ad hoc techniques, typically subject to parameters such as “always put a space between operators”.
  2. Define a set of formal rules that map input patterns to layout instructions such as “line these expressions up vertically”.

Both techniques are painful and finicky.

This repository is a step towards what we hope will be a universal code formatter that uses machine learning to look for patterns in a corpus and to format code using those patterns.

It requires Java 8. See pom.xml for dependencies (e.g., ANTLR 4.x, …).

Whoa! It appears to work. The academic paper, Towards a Universal Code Formatter through Machine Learning, was accepted to SLE 2016. Sample output is in the paper or the next section. There is also a video of Terence's presentation.

Sample output

All input is completely squeezed of whitespace/newlines, so only the output really matters when examining CodeBuff's results. You can check out the output dir for leave-one-out formatting of the various corpora. But here are some sample formatting results.
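For reference, "leave-one-out" here means each corpus file is formatted by a model trained on every other file in the corpus. A minimal sketch of that partitioning (hypothetical file names, not CodeBuff's actual API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LeaveOneOut {
    // For each file in the corpus, the training set is every *other* file;
    // the held-out file is then formatted and compared against its original.
    static List<List<String>> trainingSets(List<String> corpus) {
        List<List<String>> sets = new ArrayList<>();
        for (int i = 0; i < corpus.size(); i++) {
            List<String> train = new ArrayList<>(corpus);
            train.remove(i);              // hold out file i
            sets.add(train);
        }
        return sets;
    }

    public static void main(String[] args) {
        // Hypothetical corpus file names, just to show the partitioning.
        List<String> corpus = Arrays.asList("A.java", "B.java", "C.java");
        System.out.println(trainingSets(corpus));
        // [[B.java, C.java], [A.java, C.java], [A.java, B.java]]
    }
}
```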

SQL
SELECT *
FROM DMartLogging
WHERE DATEPART(day, ErrorDateTime) = DATEPART(day, GetDate())
  AND DATEPART(month, ErrorDateTime) = DATEPART(month, GetDate())
  AND DATEPART(year, ErrorDateTime) = DATEPART(year, GetDate())
ORDER BY ErrorDateTime
DESC

SELECT
CASE WHEN SSISInstanceID IS NULL
    THEN 'Total'
ELSE SSISInstanceID END SSISInstanceID
, SUM(OldStatus4) AS OldStatus4
, SUM(Status0) AS Status0
, SUM(Status1) AS Status1
, SUM(Status2) AS Status2
, SUM(Status3) AS Status3
, SUM(Status4) AS Status4
, SUM(OldStatus4 + Status0 + Status1 + Status2 + Status3 + Status4) AS InstanceTotal

(
    SELECT
        CONVERT(VARCHAR, SSISInstanceID)             AS SSISInstanceID
        , COUNT(CASE WHEN Status = 4 AND
                          CONVERT(DATE, LoadReportDBEndDate) <
                          CONVERT(DATE, GETDATE())
                    THEN Status
                ELSE NULL END)             AS OldStatus4
        , COUNT(CASE WHEN Status = 0
                    THEN Status
                ELSE NULL END)             AS Status0
        , COUNT(CASE WHEN Status = 1
                    THEN Status
                ELSE NULL END)             AS Status1
        , COUNT(CASE WHEN Status = 2
                    THEN Status
                ELSE NULL END)             AS Status2
        , COUNT(CASE WHEN Status = 3
                    THEN Status
                ELSE NULL END)             AS Status3
COUNT ( CASE WHEN Status = 4 THEN Status ELSE NULL END ) AS Status4
        , COUNT(CASE WHEN Status = 4 AND
                          DATEPART(DAY, LoadReportDBEndDate) = DATEPART(DAY, GETDATE())
                    THEN Status
                ELSE NULL END)             AS Status4
    FROM dbo.ClientConnection
    GROUP BY SSISInstanceID
) AS StatusMatrix
GROUP BY SSISInstanceID
Java
public class Interpreter {
...
public static final Set<String> predefinedAnonSubtemplateAttributes = new HashSet<String>() {
                                                                          {
                                                                              add("i");
                                                                              add("i0");
                                                                          }
                                                                      };

public int exec(STWriter out, InstanceScope scope) {
    final ST self = scope.st;
    if ( trace ) System.out.println("exec("+self.getName()+")");
    try {
        setDefaultArguments(out, scope);
        return _exec(out, scope);
    }
    catch (Exception e) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        e.printStackTrace(pw);
        pw.flush();
        errMgr.runTimeError(this,
                            scope,
                            ErrorType.INTERNAL_ERROR,
                            "internal error: "+sw.toString());
        return 0;
    }
}

protected int _exec(STWriter out, InstanceScope scope) {
    final ST self = scope.st;
    int start = out.index(); // track char we're about to write
    int prevOpcode = 0;
    int n = 0; // how many char we write out
    int nargs;
    int nameIndex;
    int addr;
    String name;
    Object o, left, right;
    ST st;
    Object[] options;
    byte[] code = self.impl.instrs;        // which code block are we executing
    int ip = 0;
    while ( ip<self.impl.codeSize ) {
        if ( trace|| debug ) trace(scope, ip);
        short opcode = code[ip];
        //count[opcode]++;
        scope.ip = ip;
        ip++; //jump to next instruction or first byte of operand
        switch ( opcode ) {
            case Bytecode.INSTR_LOAD_STR:
                // just testing...
                load_str(self, ip);
                ip += Bytecode.OPND_SIZE_IN_BYTES;
                break;
            case Bytecode.INSTR_LOAD_ATTR:
                nameIndex = getShort(code, ip);
                ip += Bytecode.OPND_SIZE_IN_BYTES;
                name = self.impl.strings[nameIndex];
                try {
                    o = getAttribute(scope, name);
                    if ( o== ST.EMPTY_ATTR ) o = null;
                    }
                catch (STNoSuchAttributeException nsae) {
                    errMgr.runTimeError(this, scope, ErrorType.NO_SUCH_ATTRIBUTE, name);
                    o = null;
                }
                operands[++sp] = o;
                break;

ANTLR
referenceType : classOrInterfaceType | typeVariable | arrayType ;

classOrInterfaceType
:   (   classType_lfno_classOrInterfaceType
    |   interfaceType_lfno_classOrInterfaceType
    )
    (   classType_lf_classOrInterfaceType
    |   interfaceType_lf_classOrInterfaceType
    )*
;

classModifier
:   annotation
|   'public'
|   'protected'
|   'private'
|   'abstract'
|   'static'
|   'final'
|   'strictfp'
;

typeSpecifier
:   (   'void'
    |   'char'
    |   'short'
    |   'int'
    |   'long'
    |   'float'
    |   'double'
    |   'signed'
    |   'unsigned'
    |   '_Bool'
    |   '_Complex'
    |   '__m128'
    |   '__m128d'
    |   '__m128i'
    )
|   '__extension__' '(' ('__m128' | '__m128d' | '__m128i') ')'
|   atomicTypeSpecifier
|   structOrUnionSpecifier
|   enumSpecifier
|   typedefName
|   '__typeof__' '(' constantExpression ')' // GCC extension
;
Build complete jar

To make a complete jar with all of the dependencies, do this from the repo main directory:

mvn clean compile install

This will leave you with the artifact target/codebuff-1.4.19.jar (or whatever the current version number is) and will also put the jar into the usual local Maven cache.

Formatting files

To use the formatter, you need to use class org.antlr.codebuff.Tool. Command-line usage:

Output goes to standard out unless you use -o.

java -jar target/codebuff-1.4.19.jar  \
   -g org.antlr.codebuff.ANTLRv4 \
   -rule grammarSpec \
   -corpus corpus/antlr4/training \
   -files g4 \
   -indent 4 \
   -comment LINE_COMMENT \
   T.g4

java -jar target/codebuff-1.4.19.jar \
   -g org.antlr.codebuff.Java \
   -rule compilationUnit \
   -corpus corpus/java/training/stringtemplate4 \
   -files java \
   -comment LINE_COMMENT \
   T.java

These examples work for the grammars specified because they are already inside the complete jar. For parsers compiled outside of the jar, you might need to do something like:

 java -cp target/codebuff-1.4.19.jar:$CLASSPATH \
   org.antlr.codebuff.Tool  \
   -g org.antlr.codebuff.ANTLRv4 \
   -rule grammarSpec -corpus corpus/antlr4/training \
   -files g4 -indent 4 -comment LINE_COMMENT T.g4
Grammar requirements

All whitespace should go to the parser on a hidden channel. For example, here is a rule that does that:

WS  :   [ \t\r\n\f]+ -> channel(HIDDEN) ;

Comments should also go to a hidden channel:

BLOCK_COMMENT
:   '/*' .*? ('*/' | EOF)  -> channel(HIDDEN)
;

LINE_COMMENT
:   '//' ~[\r\n]*  -> channel(HIDDEN)
;

You can have line comments match newlines if you want.
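For example, a line-comment rule that also absorbs its trailing newline might look like this (a sketch; none of the corpus grammars necessarily define it this way):

```antlr
LINE_COMMENT
:   '//' ~[\r\n]* '\r'? '\n'  -> channel(HIDDEN)
;
```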

Speed tests

The paper cites some speed tests for training and formatting time.

Here is the relevant machine configuration: the tests shown below were done with 1867 MHz DDR3 RAM, which matters because memory speed seems to make a big difference given how much we have to trawl through memory. We set an initial 4G heap and a 1M stack size. First build everything:

mvn clean compile install

Then you can run the speed tests as shown in following subsections.

ANTLR corpus
java -Xmx4G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.Speed -antlr corpus/antlr4/training/Java8.g4
Loaded 12 files in 172ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 353ms formatting = 340ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 188ms formatting = 161ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 145ms formatting = 153ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 130ms formatting = 129ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 123ms formatting = 113ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 114ms formatting = 116ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 93ms formatting = 90ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 80ms formatting = 90ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 73ms formatting = 88ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 72ms formatting = 71ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 71ms formatting = 69ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 71ms formatting = 73ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 76ms formatting = 63ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 70ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 69ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 73ms formatting = 70ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 68ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 71ms formatting = 66ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 70ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 73ms formatting = 72ms
median of [5:19] training 72ms
median of [5:19] formatting 70ms
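The "median of [5:19]" lines discard the first five runs as JVM/JIT warm-up and take the median of trials 5 through 19 inclusive. A quick check of that computation against the training column above:

```java
import java.util.Arrays;

public class MedianOfTrials {
    // Median of trials[from..to] inclusive, mirroring the "[5:19]" reports:
    // the first runs are discarded as JVM/JIT warm-up.
    static long median(long[] trials, int from, int to) {
        long[] slice = Arrays.copyOfRange(trials, from, to + 1);
        Arrays.sort(slice);
        return slice[slice.length / 2];   // odd-length slice: middle element
    }

    public static void main(String[] args) {
        // Training times (ms) from the 20 ANTLR-corpus runs above.
        long[] training = {353, 188, 145, 130, 123, 114, 93, 80, 73, 72,
                           71, 71, 76, 70, 70, 73, 70, 71, 70, 73};
        System.out.println(median(training, 5, 19));   // prints 72
    }
}
```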
Guava corpus, Java grammar
java -Xms4G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.Speed -java_guava corpus/java/training/guava/cache/LocalCache.java
Loaded 511 files in 1949ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1984ms formatting = 2669ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1747ms formatting = 3166ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1784ms formatting = 2811ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1507ms formatting = 1742ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1499ms formatting = 2832ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1582ms formatting = 2663ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1499ms formatting = 2807ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1561ms formatting = 2815ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1521ms formatting = 2136ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1545ms formatting = 2811ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1501ms formatting = 2800ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1506ms formatting = 2581ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1494ms formatting = 2838ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1494ms formatting = 2789ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1497ms formatting = 2621ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1501ms formatting = 2714ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1506ms formatting = 2816ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1512ms formatting = 2733ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1515ms formatting = 2587ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1508ms formatting = 2430ms
median of [5:19] training 1506ms
median of [5:19] formatting 2733ms
Guava corpus, Java8 grammar

Load time here is very slow (2.5 minutes) because the Java8 grammar is meant to reflect the language specification; it has not been optimized for performance. Once the corpus is loaded, training and formatting times are about the same as for the Java grammar.

java -Xms4G -Xss1M -cp target/codebuff-1.4.19.jar \
   org.antlr.codebuff.validation.Speed \
   -java8_guava corpus/java/training/guava/cache/LocalCache.java
Loaded 511 files in 159947ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 2238ms formatting = 23312ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1913ms formatting = 2368ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1855ms formatting = 2277ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1856ms formatting = 2267ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1868ms formatting = 2348ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1890ms formatting = 2263ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1866ms formatting = 2328ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1855ms formatting = 2247ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1856ms formatting = 2243ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1871ms formatting = 2204ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1863ms formatting = 2244ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1850ms formatting = 2212ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1861ms formatting = 2215ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1877ms formatting = 2257ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1843ms formatting = 2249ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1842ms formatting = 2205ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1869ms formatting = 2343ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1864ms formatting = 2225ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1851ms formatting = 2260ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1871ms formatting = 2200ms
median of [5:19] training 1863ms
median of [5:19] formatting 2244ms
Generating graphs from paper

In the Towards a Universal Code Formatter Through Machine Learning paper, we have three graphs to support our conclusions. This section shows how to reproduce them. (Note that these jobs take many minutes to run; maybe up to 30 minutes for one of them on a fast box.)

The Java code generates Python code that uses matplotlib; running the Python script produces a PDF of the graph (which also pops up in a window).

Box plot with median error rates

To generate, do this:

mvn clean compile install
java -Xms8G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.LeaveOneOutValidator

Paste the generated Python code into python/src/leave_one_out.py, then run:

cd python/src
python leave_one_out.py &
Plot showing effect of corpus size on error rate

To generate, do this:

mvn clean compile install
java -Xms8G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.SubsetValidator

Paste the generated Python code into python/src/subset_validator.py, then run:

cd python/src
python subset_validator.py &
Plot showing effect of varying model parameter k

To generate, do this:

mvn clean compile install
java -Xms8G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.TestK

Paste the generated Python code into python/src/vary_k.py, then run:

cd python/src
python vary_k.py &
