Comparing programs is hard. Comparing executable versions of them is harder still. And when the executables are malicious, the challenges are amplified further. Yet comparing executables is important for combating malware, since most new malware is a relatively simple variation or modification of a previous version. The VILO project aims to develop new methods for comparing, indexing, and organizing executables.
The main theme of research in VILO is to discover and leverage reuse within malware. Most malware programs one may find are derived from existing code: from previous versions or published exploits. While code reuse is an advantage to black hats, the VILO project seeks to turn their advantage into a weakness that can be exploited in defense. It seeks to make fundamental advances in matching code in executables.
VILO identifies shared code in executables by comparing n-perms of their machine instruction opcodes. An n-perm is like a traditional n-gram, except that the order of the elements within it is ignored when matching. The extracted n-perm features are weighted using tf-idf, so that interesting instances of code reuse are identified while matches on universally common code, such as function prologues and epilogues, are disregarded. A metric representing the amount of code shared by two executables is obtained by computing the cosine similarity of their weighted feature vectors. To classify an unknown malware sample into a family, we find its nearest neighbor among the known samples and assign it that neighbor's family label. Using these techniques, we have achieved nearly 90% classification accuracy, even when the system is trained on relatively few (fewer than 20) executables per family.
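The pipeline described above (n-perm extraction, tf-idf weighting, cosine similarity, nearest-neighbor labeling) can be illustrated with a minimal Python sketch. This is not VILO's implementation: the function names, the window size n=3, the particular idf formula log(N/df), and the toy opcode sequences are all assumptions chosen for illustration, and real input would come from a disassembler.

```python
import math
from collections import Counter


def n_perms(opcodes, n=3):
    """Count n-perms: each window of n opcodes with its order discarded."""
    return Counter(
        tuple(sorted(opcodes[i:i + n])) for i in range(len(opcodes) - n + 1)
    )


def idf_weights(corpus_counts):
    """Assumed idf variant: log(N / document frequency) per feature."""
    total = len(corpus_counts)
    doc_freq = Counter()
    for counts in corpus_counts:
        doc_freq.update(counts.keys())
    return {f: math.log(total / df) for f, df in doc_freq.items()}


def weight(counts, idf):
    """tf-idf vector; features unseen in the corpus get weight 0."""
    return {f: tf * idf.get(f, 0.0) for f, tf in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_family(query_vec, labeled_vecs):
    """1-nearest-neighbor: return the label of the most similar known vector."""
    return max(labeled_vecs, key=lambda item: cosine(query_vec, item[1]))[0]


# Hypothetical toy data: both families share the same three-opcode prologue.
training = [
    ("A", ["push", "mov", "sub", "xor", "rol", "add", "jmp"]),
    ("B", ["push", "mov", "sub", "call", "test", "jz", "ret"]),
]
# A variant of family A with two instructions swapped.
query = ["push", "mov", "sub", "rol", "xor", "add", "jmp"]

train_counts = [n_perms(ops) for _, ops in training]
idf = idf_weights(train_counts)
train_vecs = [
    (label, weight(counts, idf))
    for (label, _), counts in zip(training, train_counts)
]
query_vec = weight(n_perms(query), idf)

print(nearest_family(query_vec, train_vecs))
```

In this toy setup the shared prologue appears in every training sample, so its idf weight is zero and it contributes nothing to the similarity, while the swapped instructions in the query still produce n-perms that match family A because order within a window is ignored.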
An additional application of VILO we previously explored was in the construction of malware phylogenies, i.e., graphs of relationships analogous to the “tree of life” in biology. Through VILO, we demonstrated some success in computing accurate evolutionary lineages for families of binary executables.