Noshin Tahsin

and 2 more

Atoms of confusion, or simply “atoms,” are pieces of code that lead to misunderstanding when read. Previous research has shown that the presence of atoms affects code readability. Beyond causing misunderstanding in lab settings, atoms of confusion are common and meaningful in open-source C and C++ projects, and are removed by bug-fix commits. However, because of syntactic differences between language paradigms, the prevalence of atoms may differ in projects written in other languages (e.g., Java), which has yet to be explored. This study takes a first step towards investigating the prevalence of 12 different atoms in the 13 most popular open-source Java projects. The relationship between the presence of atoms and aspects of code maintainability is also studied, as are correlations between atoms. Results show that atoms are 4.7 times more prevalent in Java projects than in open-source C/C++ projects, based on occurrences per line. The corpus contains a total of 1,085,223 atoms, which occur once every 4.8 lines. Some atoms are very rare (e.g., the Logic As Control Flow atom, which occurs once in 440,060 lines), while others occur frequently (e.g., the Infix Operator Precedence atom, which occurs once in 6.4 lines). Results also indicate that object-oriented metrics contribute less to atom prevalence, whereas fine-grained code metrics have a relatively better association.
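To make the two atoms named above concrete, the following is a minimal Java sketch. The snippets are illustrative assumptions based on how these atoms are commonly described in the atoms-of-confusion literature; they are not taken from the study's corpus, and the class and method names (`AtomExamples`, `infixConfusing`, and so on) are hypothetical. Each confusing form is paired with a clearer equivalent that computes the same result.

```java
public class AtomExamples {

    // Infix Operator Precedence (confusing form): the reader must recall
    // that * binds tighter than + to predict the result.
    static int infixConfusing(int a, int b, int c) {
        return a + b * c;
    }

    // Clearer form: parentheses make the evaluation order explicit.
    static int infixClear(int a, int b, int c) {
        return a + (b * c);
    }

    // Logic As Control Flow (confusing form): the short-circuit behavior of
    // && decides whether the increment runs at all. Java does not allow a
    // bare `a && b` expression as a statement, so this sketch assigns the
    // result to an otherwise unused boolean.
    static int logicConfusing(int v1, int v2) {
        boolean unused = (v1 > 0) && (++v2 > 0);
        return v2;
    }

    // Clearer form: an explicit if statement expresses the same control flow.
    static int logicClear(int v1, int v2) {
        if (v1 > 0) {
            v2++;
        }
        return v2;
    }

    public static void main(String[] args) {
        System.out.println(infixConfusing(2, 3, 4)); // 2 + (3 * 4) = 14
        System.out.println(infixClear(2, 3, 4));     // 14
        System.out.println(logicConfusing(1, 5));    // increment runs: 6
        System.out.println(logicConfusing(0, 5));    // increment skipped: 5
        System.out.println(logicClear(0, 5));        // 5
    }
}
```

In both pairs the two forms behave identically; the difference lies only in how much implicit language knowledge the reader needs to predict the outcome.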