We designed Automata, a tool to holistically grade computer programs. A critical component was determining whether a program was semantically 'close' to a correct answer. This was traditionally done either by evaluating test suites or by solving constraint equations with SAT solvers; the former is an error-prone metric, while the latter is infeasible for real-time systems. We designed a grammar for generating features from the abstract syntax trees of programs. These features captured data and control dependencies between variables, which we showed to provide significant additional information. Supervised models learnt on these features had a correlation of ~0.9 with expert raters, rivaling inter-rater correlations.
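To give a flavour of what AST-derived features look like, here is a minimal sketch, not the actual Automata grammar: it counts control-flow constructs and simple variable-to-variable dependencies using Python's ast module. The feature names and the dependency heuristic are illustrative assumptions; features of this kind would then feed a supervised model trained against expert grades.

```python
# Illustrative sketch only: Automata uses a richer, grammar-defined feature
# space; this shows the flavour of control- and data-dependency features.
import ast
from collections import Counter

def ast_features(source: str) -> Counter:
    """Count simple control-flow constructs and variable dependencies."""
    tree = ast.parse(source)
    feats = Counter()
    for node in ast.walk(tree):
        # Control-dependency proxy: which constructs appear, and how often.
        if isinstance(node, (ast.If, ast.For, ast.While)):
            feats[f"ctrl:{type(node).__name__}"] += 1
        # Data-dependency proxy: variables read while computing an assigned value.
        if isinstance(node, ast.Assign):
            targets = {t.id for t in node.targets if isinstance(t, ast.Name)}
            reads = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            for lhs in targets:
                for rhs in reads:
                    feats[f"dep:{rhs}->{lhs}"] += 1
    return feats

print(ast_features("s = 0\nfor i in range(10):\n    if i % 2 == 0:\n        s = s + i\n"))
```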
A particular bottleneck in our KDD '14 work was building a predictive model for every programming question we designed, a time-consuming and human-intensive effort. Extending that work, we conceived a way to transform program features from each question onto a question-independent space. This space quantified a program's correctness without the help of labeled data, allowing us to compare programs attempting different questions on the same scale. We showed that supervised models learnt on these transformed features predicted as well as those learnt on question-specific features.
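The published transformation is defined in the paper; purely as an illustration of mapping features to a label-free, comparable scale, one could rescale every feature to its percentile rank among all submissions to the same question. The function name and the percentile choice below are assumptions, not our method.

```python
# Sketch under assumptions: per-question percentile ranking makes feature
# values comparable across questions without using any grade labels.
import numpy as np

def to_question_independent(features_by_question):
    """features_by_question: {question_id: array of shape (n_submissions, n_features)}.
    Returns the same mapping with every column mapped to within-question percentiles."""
    transformed = {}
    for qid, X in features_by_question.items():
        X = np.asarray(X, dtype=float)
        ranks = X.argsort(axis=0).argsort(axis=0)        # per-column rank of each submission
        transformed[qid] = ranks / max(len(X) - 1, 1)    # scale to [0, 1] within the question
    return transformed

demo = {"q1": [[3, 10], [5, 2], [9, 7]], "q2": [[100, 1], [40, 4]]}
print(to_question_independent(demo))
```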
Test cases evaluate whether a computer program does what it is supposed to do. There are various ways to generate them: automatically from specifications, say by ensuring code coverage, or by subject matter experts (SMEs) who think through conditions implied by the problem specification. We asked ourselves whether there was something to learn from how student programs responded to test cases. Could this help us design better test cases or find flaws in them? Looking at such responses from a data-driven perspective, we wanted to know whether we could (a) design better test cases, (b) find clusters in the patterns of responses to test cases, and (c) discover the salient concepts needed to solve a particular programming problem, which would then point us to the right pedagogical interventions. We built a cool tool which let us examine statistics on over 2,500 test cases spread across more than fifty programming problems attempted by nearly 18,000 students and job-seekers in a span of four weeks!
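The sketch below assumes one plausible representation of such response data: a binary students-by-test-cases pass/fail matrix, here filled with synthetic values, analysed with off-the-shelf k-means. It only illustrates the kind of questions the data supports: per-test pass rates flag tests that discriminate little, and clusters of response patterns hint at groups of submissions sharing a misconception.

```python
# Assumed representation: responses[i, j] = 1 if student i's program passed test case j.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(200, 12))   # synthetic: 200 students, 12 test cases

# Per-test-case statistics: a test that (almost) everyone passes or fails
# discriminates little and may be redundant or flawed.
pass_rates = responses.mean(axis=0)
print("pass rates:", np.round(pass_rates, 2))

# Cluster the response patterns; each centroid is a 'failure profile'.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(responses)
for k, centroid in enumerate(km.cluster_centers_):
    print(f"cluster {k}: size={np.sum(km.labels_ == k)}, profile={np.round(centroid, 2)}")
```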