Malware evolves much as any other software does: new “features” are added, fixes are made, and so on. A distinguishing quality of malware evolution, perhaps, is that many “release” versions differ only slightly from their predecessors, typically to throw off the scanners and filters that look for characteristics of earlier versions. A phylogenic graph of the possible derivations and relationships between samples can help in understanding this evolution. The figure below illustrates part of a phylogeny (from this paper).
Figure: Part of a malware phylogeny (“family tree”).
In the malware phylogeny project we are investigating methods for creating useful phylogenic graphs. A key problem is that the transformations applied between versions obfuscate the provenance of the code, so the comparison process must account for them. For example, new samples may be released in which the order of the code is permuted through function motion, code block reordering, statement reordering, and so on. One approach we have explored (used to generate the figure above) is feature matching that tolerates such permutations of ordering.
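To make the idea of permutation-tolerant matching concrete, here is a minimal sketch. It is not the project's actual feature extractor: it assumes each sample has already been reduced to a sequence of operation mnemonics, and it treats every window of `n` operations as a feature whose internal order is ignored. The function names, the window size, and the toy samples are all hypothetical.

```python
from typing import Sequence, Set, Tuple


def window_features(ops: Sequence[str], n: int = 3,
                    ordered: bool = False) -> Set[Tuple[str, ...]]:
    """Return the set of n-length windows over ops.

    With ordered=False each window is sorted, so two windows holding the
    same operations in a different order count as the same feature.
    """
    windows = (tuple(ops[i:i + n]) for i in range(len(ops) - n + 1))
    if ordered:
        return set(windows)
    return {tuple(sorted(w)) for w in windows}


def similarity(a: Sequence[str], b: Sequence[str], n: int = 3,
               ordered: bool = False) -> float:
    """Jaccard similarity of the two samples' feature sets."""
    fa = window_features(a, n, ordered)
    fb = window_features(b, n, ordered)
    if not fa and not fb:
        return 1.0  # two samples with no features are trivially identical
    return len(fa & fb) / len(fa | fb)


# Toy data (hypothetical): a sample, a variant with adjacent statements
# swapped in two places, and an unrelated sample.
original = ["push", "mov", "call", "xor", "add", "ret"]
reordered = ["mov", "push", "call", "add", "xor", "ret"]
unrelated = ["nop", "jmp", "cmp", "jne", "inc", "dec"]

print(similarity(original, reordered, ordered=True))   # strict n-grams
print(similarity(original, reordered, ordered=False))  # order-insensitive
print(similarity(original, unrelated, ordered=False))
```

On this toy input the strict n-gram comparison finds no shared features between the original and the reordered variant, while the order-insensitive comparison still matches three of the five distinct windows. Pairwise scores of this kind could then feed a clustering step that builds the graph.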
We are also investigating how to evaluate malware phylogeny generation systems. Without a solid evaluation method, it is impossible to know how to improve the state of the art. Matt Hayes’s research project involves constructing two different models for generating artificial evolution histories, which can then be used for systematic testing and comparison of phylogeny models via objective tree distance metrics. We are also trying out phylogeny model generators on tough test cases (see our VB presentation).
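As an illustration of what an objective tree distance can look like, the sketch below compares a reference tree (as an artificial evolution history would supply) with an inferred one by counting the clusters that appear in one tree but not the other, in the spirit of the Robinson-Foulds distance. The metrics actually used in the project are not named here, so this choice is an assumption, as are the nested-tuple tree encoding, the restriction to trees with samples only at the leaves, and the sample names.

```python
from typing import FrozenSet, Set, Tuple, Union

# A tree is either a sample name (leaf) or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clusters(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the set of leaf names found under every internal node."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        under = frozenset().union(*(leaves(child) for child in node))
        found.add(under)
        return under

    leaves(tree)
    return found


def cluster_distance(reference: Tree, inferred: Tree) -> int:
    """Count the clusters present in exactly one of the two trees."""
    return len(clusters(reference) ^ clusters(inferred))


# Hypothetical known history versus a phylogeny that misplaces v2 and v3.
reference = ((("v1", "v2"), "v3"), ("v4", "v5"))
inferred = ((("v1", "v3"), "v2"), ("v4", "v5"))

print(cluster_distance(reference, reference))  # identical trees
print(cluster_distance(reference, inferred))   # one misplaced pair
```

A distance of zero means the inferred tree groups the samples exactly as the reference does. Here the two trees disagree on a single grouping, which contributes one unmatched cluster from each side. Real malware histories can place samples at internal nodes as ancestors, which this leaf-only sketch does not handle.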