To show how machine learning fits into the big picture of malware analysis, we first need to know what that picture is. While there are many different ways of looking at it, I am going to approach it by defining who the various classes of “analysts” are (the reason for the quotes will become apparent shortly), the tasks they perform, and the knowledge they need to generate.
There are three general classes or categories of malware “analysts”: the protector, the signature generator, and the detective.
The Protector
The protector is the front line in defense against malware: the systems and people who detect and respond to intrusions. This can be an AV product on a user's workstation, an intrusion detection system, a sysadmin, or a security officer. The protector has limited need for knowledge generation and only a small number of straightforward tasks to perform.
- Detect intrusions as soon as possible (preferably before they are successful).
- Repair damage from successful intrusions (including, but not limited to, removing malware).
The only knowledge needed by the protector is what an intrusion looks like, how to remove the detected intrusion, and the method to repair the damage caused by the intrusion. This is not knowledge that the protector must generate themselves, however. This knowledge is generated by the signature generator to be utilized by the protector.
The Signature Generator
The signature generator, then, is the one responsible for generating the knowledge needed by the protector. This knowledge usually takes the form of intrusion or malware signatures, rules about what “bad” looks like, and repair scripts for recovering from detected intrusions. The tasks below follow the typical analysis pipeline of the signature generator. While they are listed roughly in the order in which they must be accomplished, not every step will be performed every time.
- Determine if the program to be analyzed has been analyzed before.
- Assign analysis priority. Higher priority programs are analyzed first.
- Compare the analyzed program to known programs. Knowing what the analyzed program is similar to can greatly aid in determining what it is.
- Determine what the program is. This can be as vague as malicious/benign, as specific as “part of the Zeus botnet,” or anything in between.
- Create a signature for detecting the program.
- Create a repair script.
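The first two tasks in the list above (checking whether a program has been analyzed before, and ordering the rest by priority) can be sketched roughly as follows. This is only a minimal illustration: the `AnalysisQueue` class, its method names, and the integer priority scale are all hypothetical, and an exact SHA-256 match only recognizes byte-for-byte identical files, so it will not catch repacked or slightly modified variants.

```python
import hashlib
import heapq


class AnalysisQueue:
    """Hypothetical triage queue: skips known samples, orders the rest by priority."""

    def __init__(self):
        self._seen = set()   # SHA-256 digests of samples already submitted
        self._heap = []      # entries are (negated priority, insertion order, digest)
        self._counter = 0    # tie-breaker so equal priorities stay first-in, first-out

    def submit(self, sample: bytes, priority: int) -> bool:
        """Queue a sample; return False if an identical file was seen before."""
        digest = hashlib.sha256(sample).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, self._counter, digest))
        self._counter += 1
        return True

    def next_sample(self):
        """Return the digest of the highest-priority sample, or None if empty."""
        if not self._heap:
            return None
        _, _, digest = heapq.heappop(self._heap)
        return digest
```

A caller would `submit` each incoming file with whatever priority the triage step assigns and then pull work with `next_sample`. How that priority is chosen is left open here, as it is in the text.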
To accomplish the above tasks, the analyst will need to generate the following knowledge:
- A profile of the analyzed program, what it “looks like”. This usually includes static features of the file itself (hashes of programs, list of system calls present, strings present, etc.) and behavioral features (files opened, registry keys changed, domains contacted).
- A list of actions taken by the malicious program that must be repaired, along with the method of accomplishing the repair.
- The method or signature that is capable of detecting the malicious program without also detecting benign programs in the process (false positives).
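To make the profile and the false-positive concern above a little more concrete, here is a small sketch. It assumes the simplest possible setting: the profile holds only a SHA-256 hash and the printable ASCII strings in the file, and a "signature" is just a string that must not occur in any benign file. The function names and the benign corpus are hypothetical; real signatures are usually far richer than a single string.

```python
import hashlib
import re


def static_profile(data: bytes, min_len: int = 4) -> dict:
    """Build a tiny static profile: file hash plus printable ASCII strings."""
    # Runs of at least min_len printable ASCII bytes (space through tilde).
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    strings = {m.decode("ascii") for m in re.findall(pattern, data)}
    return {"sha256": hashlib.sha256(data).hexdigest(), "strings": strings}


def candidate_signatures(malware: bytes, benign_corpus: list[bytes]) -> set[str]:
    """Strings from the malware sample that occur in no benign sample.

    Any string that also appears in a benign file would cause a false
    positive if used as a signature, so it is discarded.
    """
    return {
        s
        for s in static_profile(malware)["strings"]
        if not any(s.encode("ascii") in benign for benign in benign_corpus)
    }
```

A string that survives this filter is only a candidate. It is free of false positives on the benign files that were checked, which says nothing about benign files outside that corpus.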
The Detective
Finally, we come to the detective. The detective focuses on only a few malware samples of special interest at a time and dives deep into each one, intent on gaining a full understanding of it. Since the detective is focused on gaining knowledge about the analyzed programs, we focus on what this desired knowledge might be. Any tasks performed by the detective are done to gain the desired knowledge.
Attempting to create a list of all possible types of knowledge the detective may wish to gain about malware would be the height of folly as this list would be unbounded. However, I believe I have created a list of the most commonly desired types of knowledge.
- What the protection mechanisms are.
- The intent of the malware.
- What new techniques are in this program.
- Any code sharing between known malware and this malware.
- A profile of the author of the malware.
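As one illustration of the code-sharing item in the list above, a very rough similarity score between two programs can be computed from overlapping byte n-grams. This is a crude proxy and an assumption on my part rather than a prescribed method: the function names are hypothetical, and serious code-sharing analysis usually compares disassembled functions rather than raw bytes, because packing or recompilation changes the raw bytes even when the underlying code is shared.

```python
def byte_ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """Return the set of all length-n byte sequences in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard_similarity(a: bytes, b: bytes, n: int = 4) -> float:
    """Jaccard index of the two n-gram sets: |A & B| / |A | B|, in [0, 1]."""
    grams_a, grams_b = byte_ngrams(a, n), byte_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both inputs are shorter than n bytes, so there is nothing to compare.
        return 0.0
    return len(grams_a & grams_b) / len(union)
```

A score near 1.0 suggests that the two files share most of their content, while a score near 0.0 suggests little overlap at the byte level. A low score does not rule out shared code that has been obfuscated.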
We now have our big picture of malware analysis to fit machine learning into. It is useful to note that the categories I have created are not meant to represent entirely different analysts, but rather different hats a single analyst can wear. It is very common, for example, for the signature generator to act as the detective at times. We can even imagine a human analyst who would fall under all three categories. Take the example of an information security officer at a large corporation. He is the protector, as it is his job to detect and respond to intrusions. When a new intrusion is found, he acts as the detective to learn how big a threat it is, and finally he acts as the signature generator by writing firewall and IDS rules to prevent a repeat intrusion.