Supervised learning with less effort
Manually annotating Stack Overflow posts is tedious work. How could we optimize the human effort spent on this task? Active learning focuses human effort where the classifier is expected to benefit the most. If humans annotate enough such examples, the classifier can continue on its own: self-training! Our main observation is that reaching agreement on non-trivial annotation tasks is difficult, despite carefully developed annotation guidelines, and active learning will amplify the challenge.
This experience report originates in an excellent MSc thesis project within Orion, a project in which we aim to support component selection in software evolution. What to build and what to buy? And which component to choose when there are several alternatives available? We refer to this as supporting architectural decision making. One approach to support such decisions is to crowdsource them, i.e., “using contributions from Internet users to obtain needed services or ideas”. In software engineering, the obvious choice is to run text mining on Stack Overflow; see for example the recent survey by Mao et al. (2017) and the systematic mapping study by Farias et al. (2016). Our idea in Orion is to collect crowdsourced architectural decisions in a knowledge repository.
Crowdsourcing Challenges
To use crowdsourcing on Stack Overflow, the first challenge is to identify the relevant discussion threads. In some cases filtering on the available tags is sufficient, but often it is not. In Orion, we attempted to collect discussion threads related to components’ computational performance. In this case, looking at tags such as “performance” and the name of a specific component (e.g., a lib of some sorts) gives you a lot of false positives; the data needs to be filtered further to be useful. One way to do this is to train a classifier on manually annotated Stack Overflow posts.
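To make this concrete, a classifier of that kind could look roughly like the scikit-learn sketch below; the CSV file, its column names, and the TF-IDF plus linear SVM setup are illustrative assumptions, not the exact pipeline from the paper.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# "annotated_posts.csv" with columns "text" and "label" (1 = performance-related)
# is a hypothetical file standing in for the manually annotated posts.
posts = pd.read_csv("annotated_posts.csv")
X_train, X_test, y_train, y_test = train_test_split(
    posts["text"], posts["label"], test_size=0.2, random_state=0)

vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
clf = LinearSVC()
clf.fit(vectorizer.fit_transform(X_train), y_train)
print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
```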
Large sets of manually annotated data are the key to successful machine learning. However, annotating data is very labor-intensive. A common way to acquire annotated data is to use services such as Amazon Mechanical Turk… However, when annotation requires more expert knowledge, that is not an option.
Active learning as a solution?
Active Learning (AL) is a semi-automated approach to establishing a training set. The idea is to reduce the overall human effort by focusing the annotation on examples that maximize the learning gained, i.e., the examples for which the classifier is the most uncertain. This is referred to as uncertainty sampling: human annotators continuously work on batches of data that are expected to be the most difficult for the classifier. AL has also been used in several studies to create large training sets for speech, but how would it work for text mining Stack Overflow? By the way, thanks also go to Neil Ernst for a recent blog post on AL in software engineering that also discusses our paper.
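In an SVM setting, uncertainty sampling boils down to picking the unlabeled posts closest to the hyperplane. A minimal sketch, assuming a fitted scikit-learn SVM and a vectorized pool of unlabeled posts:

```python
import numpy as np

def query_batch(clf, X_unlabeled, batch_size=100):
    """Return indices of the batch_size unlabeled examples closest to the
    SVM hyperplane, i.e., the ones the classifier is most uncertain about."""
    distances = np.abs(clf.decision_function(X_unlabeled))
    return np.argsort(distances)[:batch_size]

# Usage (assuming clf is a fitted sklearn SVM and X_pool the vectorized pool):
# to_annotate = query_batch(clf, X_pool, batch_size=100)
```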
The goal of AL is illustrated in the bean plots on top of this blog post. In the middle of the figure, there is a horizontal line separating positive and negative examples. To the left, there are loads of data around the line. The idea with AL is to focus the human annotation activity on these very uncertain examples, resulting in the figure to the right: the classifier is more certain about the remaining examples.
Self-training is the second contribution of this paper. If AL leads to the classifier being certain about a multitude of examples not yet annotated by humans, why not pretend they were actually annotated? Let the classifier train on these data as well! Can we reduce the amount of human work needed to reach a useful classification accuracy? This is the idea behind self-training. In the case of SVM classification, treat the examples far from the hyperplane as if they had been assessed by humans – free training!
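A minimal sketch of what such SVM-based self-training could look like, assuming 0/1 labels, sparse TF-IDF features, and placeholder distance thresholds (not the exact implementation in the paper):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.svm import LinearSVC

def self_train(clf, X_labeled, y_labeled, X_unlabeled, neg_thr=1.0, pos_thr=1.0):
    """Pseudo-label unlabeled examples far from the hyperplane and retrain.
    The thresholds (in decision-function units) are placeholder assumptions."""
    d = clf.decision_function(X_unlabeled)
    confident = (d >= pos_thr) | (d <= -neg_thr)      # far from the hyperplane
    pseudo_labels = (d[confident] > 0).astype(int)    # assumes labels coded 0/1
    X_aug = sp.vstack([X_labeled, X_unlabeled[confident]])
    y_aug = np.concatenate([np.asarray(y_labeled), pseudo_labels])
    return LinearSVC().fit(X_aug, y_aug)
```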
A study on active learning and self-training
We designed a study to explore how useful AL and self-training are when classifying Stack Overflow posts related to the computational performance of software components. We used a standard SVM classifier. The figure below shows our overall process: two human annotators started by annotating 970 posts and developing an annotation guideline. We then alternated annotating batches of 100 posts over 16 iterations. After 8 iterations, we evaluated our annotation guidelines by letting a group of researchers try them out on a sample of our already annotated set. After iteration 16, the two annotators re-annotated a common set of posts. Finally, we evaluated self-training.
It turned out that annotation in non-trivial tasks such as ours is not easy at all. Despite having quite elaborate annotation guidelines, there were a lot of disagreements – many more than expected! The group annotation exercise after iteration 8 showed that the interpretation of the guidelines varied a lot. However, what really surprised us was that the pair annotation after 16 iterations yielded even less agreement! Despite discussing the process along the way, we had not aligned our interpretations… We each grew more certain about how to do it, but unfortunately without a shared understanding. It is obvious that the annotation guidelines, and the annotators’ interpretation of them, must carefully evolve during the process.
Results – Did it work?
Since the two annotators disagreed so much, we decided to split the evaluation of AL. The figure below shows the distribution of not-yet-annotated Stack Overflow posts for annotator 1 (left) and annotator 2 (right). Remember that the goal of AL is to reduce the number of examples close to the SVM hyperplane (0 in the figure). For annotator 1, the effect is not really evident, but for annotator 2 the bean plot close to 0 is thinner. We can thus conclude that the intended effect of AL was observable, at least for one of the annotators. But what really matters is: did the classification accuracy improve? Our results did not show any such improvement – the learning curves remained fairly stable. Probably the increase in training data was not big enough.
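For the curious, the kind of comparison behind the bean plots can be approximated as below; the plotting details are our own illustration, not taken from the paper.

```python
import matplotlib.pyplot as plt

def plot_decision_distributions(d_before, d_after):
    """d_before / d_after: decision_function values on the not-yet-annotated
    pool before and after the active learning iterations."""
    fig, ax = plt.subplots()
    ax.violinplot([d_before, d_after], showmedians=True)
    ax.axhline(0.0, linestyle="--")          # the separating hyperplane
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["before AL", "after AL"])
    ax.set_ylabel("distance to SVM hyperplane")
    plt.show()
```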
Regarding self-training, our goal was to find an example where it resulted in an improvement. In the paper, we report one such case, corresponding to adding 50% of the not-yet-annotated negative examples and 5% of the positive examples (this means self-training on negative examples at least 0.88 distance units away from the hyperplane, and at least 1.76 for the positive examples, see the figure above). This choice of self-training resulted in 4.3% better classification accuracy and a 7.9% better F1-score. We thus managed to show that self-training indeed can improve classification accuracy when working with Stack Overflow posts.
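If you want to reproduce that kind of setup, one way to translate “keep 50% of the negatives and 5% of the positives” into distance thresholds is to take percentiles of the decision values on the unlabeled pool; a sketch under those assumptions:

```python
import numpy as np

def fractions_to_thresholds(decision_values, neg_fraction=0.50, pos_fraction=0.05):
    """Find the distance cut-offs that keep the given fraction of the most
    confident negative/positive examples for self-training."""
    neg_dist = -decision_values[decision_values < 0]   # distances, negative side
    pos_dist = decision_values[decision_values > 0]    # distances, positive side
    neg_thr = np.percentile(neg_dist, 100 * (1 - neg_fraction))
    pos_thr = np.percentile(pos_dist, 100 * (1 - pos_fraction))
    return neg_thr, pos_thr   # roughly 0.88 and 1.76 on the data in the paper
```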
Implications for Research
- Annotation guidelines must evolve throughout the active learning process, as well as the annotators’ interpretation of them. Finding agreement might be difficult, and active learning will make it even harder!
- We recommend designing an active learning process with partly overlapping iterations (5-25%) – it is important to detect discrepancies early, e.g., by measuring inter-rater agreement on the shared posts (see the sketch after this list).
- Self-training can be used in combination with active learning to bootstrap SVM classifiers.
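A minimal sketch of such an early warning check, using Cohen's kappa on the overlapping posts (the 0.6 warning threshold is an arbitrary example, not a recommendation from the paper):

```python
from sklearn.metrics import cohen_kappa_score

def check_overlap_agreement(labels_annotator_1, labels_annotator_2, warn_below=0.6):
    """Compare the two annotators' labels on the overlapping posts of an
    iteration; warn_below is an arbitrary example threshold."""
    kappa = cohen_kappa_score(labels_annotator_1, labels_annotator_2)
    if kappa < warn_below:
        print(f"Low agreement (kappa={kappa:.2f}) - revisit the annotation guidelines!")
    return kappa
```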
Markus Borg, Iben Lennerstad, Rasmus Ros, and Elizabeth Bjarnason. On Using Active Learning and Self-Training when Mining Performance Discussions on Stack Overflow, In Proc. of the 21st International Conference on Evaluation and Assessment in Software Engineering (EASE 2017) (link, preprint)
Abstract
Abundant data is the key to successful machine learning. However, supervised learning requires annotated data that are often hard to obtain. In a classification task with limited resources, Active Learning (AL) promises to guide annotators to examples that bring the most value for a classifier. AL can be successfully combined with self-training, i.e., extending a training set with the unlabeled examples for which a classifier is the most certain. We report our experiences on using AL in a systematic manner to train an SVM classifier for Stack Overflow posts discussing performance of software components. We show that the training examples deemed as the most valuable to the classifier are also the most difficult for humans to annotate. Despite carefully evolved annotation criteria, we report low inter-rater agreement, but we also propose mitigation strategies. Finally, based on one annotator's work, we show that self-training can improve the classification accuracy. We conclude the paper by discussing implications for future text miners aspiring to use AL and self-training.