Mar 6, 2017

Machine Learning in Cyber Security Domain - 7: IDS/IPS with ML



Intrusion Detection and Intrusion Prevention Systems (IDS/IPS) analyze data packets and determine whether they belong to an attack. Based on the result of that analysis, the system can take precautions. By operational logic, IDS/IPSs fall into two main categories: (1) Signature Based IDS and (2) Anomaly Based IDS.
A Signature Based IDS works with attack signatures, which are created from information about known vulnerabilities. Signatures contain detailed information about attacks. Systems of this type have a high accuracy rate for known attacks, but they cannot detect unknown attacks.

Because of this, a new signature must be created whenever a new attack is discovered, and it must be imported into the system immediately. Signature-based systems are therefore not resistant to 0-Day attacks. An Anomaly Based IDS is able to detect 0-Day attacks, but it also has a high false alarm rate.
The operating logic of a Signature Based IDS is a basic classification problem. Incoming events are compared with the signatures; if a match is found an alert is raised, otherwise no malicious event has been found. A Signature Based IDS therefore has low flexibility and uses low-level machine learning structures. Conversely, an Anomaly Based IDS has high flexibility and uses high-level machine learning structures. For this reason, this chapter concentrates on Anomaly Based IDS and gives more detailed information about its structure.
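The match-or-pass logic described above can be sketched in a few lines of Python. This is only an illustration: the signature names and byte patterns are made up for the example, and a real IDS uses far richer signatures than plain substring checks.

```python
from typing import List

# Hypothetical signature database: name -> byte pattern taken from a known attack.
SIGNATURES = {
    "sql-injection-union": b"UNION SELECT",
    "path-traversal": b"../../",
    "shellshock": b"() { :;};",
}


def match_signatures(payload: bytes) -> List[str]:
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]


def classify(payload: bytes) -> str:
    """Binary decision: alert on any signature match, otherwise pass."""
    return "alert" if match_signatures(payload) else "pass"


if __name__ == "__main__":
    print(classify(b"GET /index.php?id=1 UNION SELECT password FROM users"))
    print(classify(b"GET /index.html HTTP/1.1"))
```

A payload that carries a new attack but contains none of the stored patterns is passed through, which is the low flexibility mentioned above.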
In general, anomalies are divided into three main categories: point anomalies, contextual anomalies, and collective anomalies. In addition, four types of attack are defined in academic research, and each type has its own specific behavior. The figure on the right-hand side shows the characteristic behaviors of these attacks.
For academic purposes, many datasets are publicly available on the internet. KDD99, which was first created in 1998 and last updated in 2008, is one of the best-known datasets in the academic literature. It contains seven weeks of connection-based network traffic. Both supervised and unsupervised approaches can be applied to build anomaly-based detection systems.
To date, a great deal of academic research has been carried out with both supervised and unsupervised techniques. In recent years researchers have also combined these techniques and reported high accuracy rates. These results are debatable, because the dataset used in the training phase, mostly KDD99, is out of date. Attacks discovered after the dataset was created cannot easily be added to it, so researchers cannot determine whether these new attacks would be recognized.
In supervised approaches, the system works with labeled events that occurred in the network. These approaches are similar to a Signature Based IDS but not the same: the only difference is that the attack events used in the training phase are built from network flow data. As mentioned before, a Signature Based IDS/IPS uses attack signatures, whereas an Anomaly Based IDS/IPS uses network flow data. Many supervised techniques have been used in the literature; the best-known algorithms are Support Vector Machine, Bayesian Network, Artificial Neural Network, Decision Tree, and k-Nearest Neighbor. The biggest advantage of these approaches is that they recognize well-known malicious activities with high accuracy and a low false alarm rate. Their disadvantage is a weak ability to recognize 0-Day attacks.
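As a minimal sketch of the supervised idea, the following pure-Python k-Nearest Neighbor classifier labels a flow by majority vote among labeled training flows. The three features and all the numbers are assumptions invented for the example; they are not taken from KDD99 or any real capture.

```python
import math
from collections import Counter

# Hypothetical labeled flow records:
# (duration_s, kilobytes_sent, connections_per_s) -> label.
TRAINING = [
    ((12.0, 40.0, 1.0), "normal"),
    ((8.0, 25.0, 2.0), "normal"),
    ((15.0, 60.0, 1.0), "normal"),
    ((10.0, 35.0, 3.0), "normal"),
    ((0.1, 0.2, 90.0), "attack"),
    ((0.2, 0.1, 120.0), "attack"),
    ((0.1, 0.3, 100.0), "attack"),
]


def knn_predict(sample, training=TRAINING, k=3):
    """Label a flow by majority vote among its k nearest labeled neighbours."""
    nearest = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict((11.0, 38.0, 2.0)))   # resembles the labeled normal flows
    print(knn_predict((0.1, 0.2, 110.0)))   # resembles the labeled attack flows
```

The classifier can only return labels that appear in the training set, so an attack whose flow features resemble normal traffic is labeled normal. This is the weak 0-Day recognition described above.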
In unsupervised approaches, the dataset does not contain any class information. Such approaches rest on two main assumptions: that a user's profile cannot change by a large amount in a short time, and that malicious activity causes an abnormal change in network flow. The operational logic is to cluster the whole of the network activity data, so that the algorithm produces a certain number of classes. Some of these classes have a huge event count, whereas others have a very small one. Under the assumptions above, the classes with huge event counts represent normal user activity such as web browsing or e-mail traffic, whereas the other classes represent malicious activity produced by attackers. The advantage of these approaches is a strong ability to detect 0-Day attacks. One disadvantage is that attackers can craft network traffic intelligently and bypass IDS/IPS systems; another is a high false alarm rate, which means that normal user activity can be recognized as malicious. This problem is very important, and a lot of academic research has been carried out to overcome it.
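The clustering logic can be illustrated with a small sketch. It uses a simple greedy distance-threshold ("leader") clustering rather than any particular algorithm from the literature, and the features, radius, and share threshold are assumptions chosen for the example.

```python
import math

# Hypothetical unlabeled events: (kilobytes_sent, connections_per_s).
EVENTS = [
    (50.0, 2.0), (52.0, 3.0), (48.0, 2.0), (51.0, 1.0), (49.0, 2.0),  # web-like
    (20.0, 1.0), (22.0, 1.0), (19.0, 2.0), (21.0, 1.0),               # mail-like
    (1.0, 100.0), (2.0, 105.0),                                       # rare pattern
]


def leader_cluster(points, radius):
    """Greedy clustering: join the first cluster whose leader is within
    radius, otherwise start a new cluster with this point as leader."""
    clusters = []  # list of (leader_point, member_list)
    for point in points:
        for leader, members in clusters:
            if math.dist(point, leader) <= radius:
                members.append(point)
                break
        else:
            clusters.append((point, [point]))
    return clusters


def flag_small_clusters(points, radius, min_share):
    """Return the points of every cluster holding less than min_share of all events."""
    threshold = min_share * len(points)
    return [
        point
        for _, members in leader_cluster(points, radius)
        if len(members) < threshold
        for point in members
    ]


if __name__ == "__main__":
    print(flag_small_clusters(EVENTS, radius=10.0, min_share=0.25))
```

No labels are needed, so a previously unseen pattern is still flagged as long as it is rare. The same property produces false alarms: appending a single legitimate but unusual event, such as `(900.0, 1.0)` for a large backup transfer, gets it flagged as well.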
Both techniques have advantages and disadvantages. To combine the advantages efficiently and eliminate the disadvantages, hybrid approaches have been developed, in which one part of the detection mechanism works with a supervised algorithm and another part works with an unsupervised algorithm. In recent years most research has focused on hybrid detection approaches.
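One possible shape of such a hybrid detector is sketched here, purely as an illustration. The first stage stands in for the supervised part (centroids of labeled attack classes), and the second stage stands in for the unsupervised part (a baseline estimated from unlabeled traffic). The class names, features, and thresholds are all hypothetical.

```python
import math
import statistics

# Stage 1 input (supervised): hypothetical centroids of labeled attack classes,
# expressed as (kilobytes_sent, connections_per_s).
KNOWN_ATTACKS = {"port-scan": (0.5, 100.0), "flood": (900.0, 400.0)}

# Stage 2 input (unsupervised): unlabeled traffic, assumed to be mostly normal.
UNLABELED = [(50.0, 2.0), (48.0, 3.0), (52.0, 2.0), (47.0, 1.0), (53.0, 2.0)]


def hybrid_detect(sample, match_radius=20.0, z_limit=3.0):
    """Two-stage decision: known attack class first, then deviation from baseline."""
    # Stage 1: label the sample if it lies close to a known attack class.
    for name, centroid in KNOWN_ATTACKS.items():
        if math.dist(sample, centroid) <= match_radius:
            return "known attack: " + name
    # Stage 2: flag any feature that deviates strongly from the unlabeled baseline.
    for axis in range(len(sample)):
        values = [row[axis] for row in UNLABELED]
        mean = statistics.mean(values)
        spread = statistics.pstdev(values)
        if abs(sample[axis] - mean) > z_limit * spread:
            return "anomaly"
    return "normal"


if __name__ == "__main__":
    for flow in [(1.0, 95.0), (51.0, 2.0), (50.0, 40.0)]:
        print(flow, "->", hybrid_detect(flow))
```

The first stage gives a precise label for traffic that resembles a known attack class, and the second stage catches traffic that matches no known class but departs from the observed baseline.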
Snort is a free and open source network intrusion prevention system (NIPS) and network intrusion detection system (NIDS) that is used all around the world. Snort's open source network-based intrusion detection system (NIDS) can perform real-time traffic analysis and packet logging on Internet Protocol (IP) networks. Snort performs protocol analysis, content searching, and matching. These basic services have many purposes, including application-aware triggered quality of service, which de-prioritizes bulk traffic when latency-sensitive applications are in use. Snort can be configured in three main modes: sniffer, packet logger, and network intrusion detection. In sniffer mode, the program reads network packets and displays them on the console. In packet logger mode, the program records packets to disk. In intrusion detection mode, the program monitors network traffic and analyzes it against a rule set defined by the user, then performs a specific action based on what has been identified. (Source: Wikipedia)
A signature is defined as any detection method that relies on distinctive marks or characteristics being present in exploits. Signatures are specifically designed to detect known exploits, because these contain distinctive marks such as ego strings, fixed offsets, debugging information, or any other unique marking that may or may not actually be related to exploiting a vulnerability. In this type of detection system, events are classified only after the first detection, since actual public exploits are necessary for the system to work. Anti-virus companies use this type of technology to protect their customers from virus outbreaks. As we have seen over the years, this type of protection has limited capabilities, since a signature can only be written after a system has been infected by a virus. (Source: Snort)
Rules-based approaches use a different detection methodology, which gives them the advantage of 0-day detection and makes them more capable. Unlike signatures, rules are based on detecting the actual vulnerability, not an exploit or a unique piece of data. Developing a rule requires a strong understanding of how the vulnerability actually works. (Source: Snort)
Traditional signature-based IDS/IPSs use attack signatures to detect attacks. However, detecting only well-known attacks cannot keep systems completely safe. An intelligent IDS/IPS must also detect 0-day attacks.
Attacks change through small variations over time, so the attacks we call new are often not really new: they differ from older attacks, but not by much. Rules provide a flexible definition of attacks, so they can detect 0-day attacks that vary only slightly from older ones.
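The difference between a signature and a rule can be made concrete with a small sketch. Everything in it is hypothetical: the vulnerable service, the 64-byte limit, and the exploit marker are invented for the example, and the code is plain Python rather than Snort rule syntax.

```python
# Hypothetical vulnerability: a service overflows a buffer when the argument
# of its USER command is longer than 64 bytes.
MAX_USER_LENGTH = 64

# Signature: a distinctive byte string from one known public exploit (made up here).
EXPLOIT_SIGNATURE = b"\x90\x90\x90\x90h4x0r"


def signature_detect(packet: bytes) -> bool:
    """Fires only when the exact marker of the known exploit is present."""
    return EXPLOIT_SIGNATURE in packet


def rule_detect(packet: bytes) -> bool:
    """Fires whenever the triggering condition of the vulnerability is met."""
    if not packet.startswith(b"USER "):
        return False
    argument = packet[len(b"USER "):].rstrip(b"\r\n")
    return len(argument) > MAX_USER_LENGTH


known_exploit = b"USER " + b"A" * 80 + EXPLOIT_SIGNATURE + b"\r\n"
variant = b"USER " + b"B" * 120 + b"\r\n"  # same vulnerability, different bytes
benign = b"USER alice\r\n"

if __name__ == "__main__":
    for name, packet in [("known", known_exploit), ("variant", variant), ("benign", benign)]:
        print(name, signature_detect(packet), rule_detect(packet))
```

The signature catches only the known exploit, while the rule also catches the variant because it describes the vulnerable condition rather than the bytes of one particular attack. Writing the rule requires knowing the 64-byte limit, which reflects the point that a rule demands an understanding of how the vulnerability works.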

There is one more thing to say about this topic. Attackers develop new techniques to bypass IDS/IPS systems day by day. Some tools generate malicious network activity that appears to be produced by real users. In response, IDS/IPS systems are being updated to recognize this type of attack.