Saed Alrabaee (Concordia University)
Noman Saleem
Stere Preda
Lingyo Wang
Mourad Debbabi (Concordia University)

Abstract

A critical aspect of malware forensics is authorship analysis. The successful outcome of such analysis is usually determined by the reverse engineer’s skills and by the volume and complexity of the code under analysis. To assist reverse engineers in such a tedious and error-prone task, it is desirable to develop reliable and automated tools for supporting the practice of malware authorship attribution. In a recent work, machine learning was used to rank and select syntax-based features such as n-grams and flow graphs. The experimental results showed that the top ranked features were unique for each author, which was regarded as an evidence that those features capture the author’s programming styles. In this paper, however, we show that the uniqueness of features does not necessarily correspond to authorship. Specifically, our analysis demonstrates that many “unique” features selected using this method are clearly unrelated to the authors’ programming styles, for example, unique IDs or random but unique function names generated by the compiler; furthermore, the overall accuracy is generally unsatisfactory. Motivated by this discovery, we propose a layered Onion Approach for Binary Authorship Attribution called OBA2. The novelty of our approach lies in the three complementary layers: preprocessing, syntax-based attribution, and semantic-based attribution. Experiments show that our method produces results that not only are more accurate but have a meaningful connection to the authors’ styles.