Switching to Git: the Good, the Bad, and the Ugly
Sascha Just, Kim Herzig, Jacek Czerwonka and Brendan Murphy
Since its introduction 10 years ago, Git has taken the world of version control systems (VCS) by storm. Its success is partly due to creating opportunities for new usage patterns that empower developers to work more efficiently. However, the resulting change in both user behavior and the way Git stores changes impacts data mining and data analytics procedures [6], [13]. While some of these unique characteristics can be managed by adjusting mining and analytical techniques, others can lead to severe data loss and the inability to audit code changes, e.g. knowing the full history of changes to code related to security and privacy functionality. Thus, switching to Git comes with challenges to established development process analytics. This paper is based on our experience in attempting to provide continuous process analysis for Microsoft product teams that were switching to Git as their primary VCS. We illustrate how Git’s concepts and usage patterns create a need to change well-established data analytics processes. The goal of this paper is to raise awareness of how certain Git operations may damage or even destroy information about historical code changes that is necessary for continuous development process analytics. To that end, we provide a list of common Git usage patterns together with a description of how these operations impact data mining applications. We also provide examples of how one may counteract the effects of such destructive operations in the future. Finally, we present a new algorithm for detecting integration paths that is specific to distributed version control systems such as Git, allowing us to reconstruct the information that is crucial to most development process analytics.
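To give a concrete sense of what detecting integration paths in a Git commit graph can involve, the sketch below follows first-parent edges to separate a branch's own history from work that arrived through merges. This is a minimal illustration under simplifying assumptions, not the algorithm presented in this paper: the plain-dict commit graph, the helpers first_parent_path and integrated_commits, and the assumption that a merge commit's first parent lies on the target branch are introduced here for illustration only.

```python
# Minimal sketch: traversing a simplified Git history along first-parent
# edges. The commit graph is assumed to be a dict mapping each commit id
# to the list of its parent ids (first parent first), as Git records them.

def first_parent_path(commits, head):
    """Follow first-parent edges from `head` back to the root commit.

    In common Git workflows the first parent of a merge commit lies on
    the target branch, so this path approximates the branch's own history.
    """
    path = []
    current = head
    while current is not None:
        path.append(current)
        parents = commits.get(current, [])
        current = parents[0] if parents else None
    return path


def integrated_commits(commits, head):
    """Commits reachable from `head` but not on its first-parent path,
    i.e. work that was brought in via merges (integration paths)."""
    on_main = set(first_parent_path(commits, head))
    seen, stack, result = set(), [head], set()
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        if commit not in on_main:
            result.add(commit)
        stack.extend(commits.get(commit, []))
    return result


if __name__ == "__main__":
    # Toy history: feature commits f1, f2 are merged into main at m3.
    graph = {
        "m3": ["m2", "f2"],   # merge commit; first parent is on main
        "f2": ["f1"],
        "f1": ["m1"],
        "m2": ["m1"],
        "m1": [],
    }
    print(first_parent_path(graph, "m3"))   # ['m3', 'm2', 'm1']
    print(integrated_commits(graph, "m3"))  # {'f1', 'f2'}
```

Note that history-rewriting operations such as rebasing or squashing before a merge collapse exactly the side branches this traversal relies on, which is one way the destructive operations discussed in this paper undermine such analyses.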