May 3, 2017

Phishing Domain Detection with Machine Learning

What is Phishing?

Phishing is a form of fraud in which the attacker tries to obtain sensitive information, such as login credentials or account information, by posing as a reputable entity or person in email or other communication channels.
Typically, a victim receives a message that appears to have been sent by a known contact or organization. The message contains malicious software targeting the user's computer or has links that direct victims to malicious websites in order to trick them into divulging personal and financial information, such as passwords, account IDs or credit card details.
Phishing is popular among attackers, since it is easier to trick someone into clicking a malicious link that seems legitimate than to break through a computer's defense systems. The malicious links within the body of the message are designed to appear to lead to the spoofed organization, using that organization's logos and other legitimate content.

This article explains the characteristics of phishing domains (or fraudulent domains), the features that distinguish them from legitimate domains, why it is so important to detect these domains, and how they can be detected using machine learning and natural language processing techniques.


Many users unwittingly click on phishing domains every day. Attackers target both users and companies. According to the 3rd Microsoft Computing Safety Index report, released in February 2014, the annual worldwide impact of phishing could be as high as $5 billion.
What is the reason for this cost?
The main reason is users' lack of awareness. But security defenders must also take precautions to keep users from encountering these harmful sites. Preventing these huge costs starts with raising people's awareness and building strong security mechanisms that are able to detect and block phishing domains.

Characteristics of Phishing Domains


Let's look at the URL structure to understand how attackers think when they create a phishing domain.
A Uniform Resource Locator (URL) is used to address web pages. The figure below shows the relevant parts in the structure of a typical URL.
It begins with the protocol used to access the page. The fully qualified domain name identifies the server that hosts the web page. It consists of a registered domain name (second-level domain) and a suffix, which we refer to as the top-level domain (TLD). The domain name portion is constrained, since it has to be registered with a domain name registrar. A host name consists of a subdomain name and a domain name. A phisher has full control over the subdomain portion and can set it to any value. The URL may also have path and file components which, too, can be changed by the phisher at will. We use the term FreeURL to refer to these phisher-controlled parts of the URL (subdomain name and path) in the rest of the article.
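As a rough illustration of the URL parts described above, the following sketch splits a URL with Python's standard urllib.parse module. The function name split_url and the example URL are hypothetical, and the code assumes a single-label TLD such as "com"; multi-label suffixes such as "co.uk" would need a public suffix list, which is not used here.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into protocol, subdomain, domain, TLD and path.

    Assumption: the TLD is the last label of the host name only.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    tld = labels[-1]
    domain = labels[-2] if len(labels) >= 2 else ""
    subdomain = ".".join(labels[:-2])
    return {
        "protocol": parts.scheme,
        "subdomain": subdomain,
        "domain": domain,
        "tld": tld,
        "path": parts.path,
        # FreeURL: the parts a phisher can change at will.
        "free_url": (subdomain, parts.path),
    }


print(split_url("http://login.example.com/account/index.html"))
```

Only the "domain" and "tld" values require registration; everything collected under "free_url" can be changed without registering anything new.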
  

The attacker can register any domain name that has not been registered before. This part of the URL can be set only once. The phisher can change the FreeURL at any time to create a new URL. The domain is the unique part of the website, which is why security defenders work hard to detect phishing domains. Once a domain is detected as fraudulent, it is easy to block it before a user accesses it.
Some threat intelligence companies detect fraudulent web pages or IPs and publish them as blacklists (for example, Cymon and FireHOL), which makes it easier for others to block these harmful assets.

The attacker must choose the domain name intelligently, because the aim is to convince users, and then set the FreeURL to make detection difficult. Let's analyze the example given below.



Although the real domain name is active-userid.com, the attacker tried to make the domain look like paypal.com by adding a FreeURL. When users see paypal.com at the beginning of the URL, they may trust the site, connect to it, and then share their sensitive information with this fraudulent site. This is a method frequently used by attackers.
Other methods often used by attackers are cybersquatting and typosquatting.
Cybersquatting (also known as domain squatting) is registering, trafficking in, or using a domain name with bad-faith intent to profit from the goodwill of a trademark belonging to someone else. The cybersquatter may offer to sell the domain at an inflated price to the person or company who owns a trademark contained within the name, or may use it for fraudulent purposes such as phishing. For example, suppose the name of your company is “abcompany” and you register abcompany.com. Phishers can then register abcompany.net, abcompany.org, and abcompany.biz and use them for fraudulent purposes.
Typosquatting, also called URL hijacking, is a form of cybersquatting that relies on mistakes such as typographical errors made by Internet users when typing a website address into a web browser, or on misspellings that are hard to notice when reading quickly. URLs created by typosquatting look like a trusted domain. A user may accidentally enter an incorrect website address or click a link that looks like a trusted domain, and in this way visit an alternative website owned by a phisher.
A famous example of typosquatting is goggle.com, an extremely dangerous website. A similar case is yutube.com, which targets YouTube users. Similarly, www.airfrance.com has been typosquatted as www.arifrance.com, diverting users to a website peddling discount travel. Some other examples are paywpal.com, microroft.com, applle.com, and appie.com.
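One simple way to flag look-alike domains such as those listed above is to measure the edit distance to a list of trusted domains. The sketch below uses the Levenshtein distance; the TRUSTED list, the function names, and the max_distance threshold of 2 are assumptions made for illustration, not a description of any particular product.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical list of trusted domains to compare against.
TRUSTED = ["google.com", "youtube.com", "airfrance.com",
           "paypal.com", "microsoft.com", "apple.com"]


def looks_typosquatted(domain, trusted=TRUSTED, max_distance=2):
    """Return the trusted domain that `domain` closely resembles, or None."""
    for candidate in trusted:
        distance = edit_distance(domain, candidate)
        if 0 < distance <= max_distance:  # distance 0 is the trusted domain itself
            return candidate
    return None


for name in ["goggle.com", "yutube.com", "arifrance.com", "google.com"]:
    print(name, "->", looks_typosquatted(name))
```

A swapped pair of letters, as in arifrance.com, counts as two substitutions under this distance, which is why the threshold is set to 2 rather than 1 in this sketch.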

Features Used for Phishing Domain Detection


There are many algorithms and a wide variety of data types for phishing detection in the academic literature and in commercial products. A phishing URL and the corresponding page have several features that distinguish them from a legitimate URL. For example, an attacker can register a long and confusing domain to hide the actual domain name (cybersquatting, typosquatting). In some cases attackers use IP addresses directly instead of a domain name. This case is out of our scope, but it can be used for the same purpose. Attackers can also use short domain names that are unrelated to legitimate brand names and have no FreeURL addition. These websites are also out of our scope, because they are more relevant to fraudulent domains than to phishing domains.
Besides URL-Based Features, academic studies use several other kinds of features in machine learning algorithms for detection. The features collected from academic studies on phishing domain detection with machine learning techniques are grouped as given below.
  1. URL-Based Features
  2. Domain-Based Features
  3. Page-Based Features
  4. Content-Based Features

URL-Based Features


The URL is the first thing to analyze when deciding whether a website is a phishing site or not. As we mentioned before, URLs of phishing domains have some distinctive characteristics. Features related to these characteristics are obtained when the URL is processed. Some URL-Based Features are given below.
  • Digit count in the URL
  • Total length of URL
  • Checking whether the URL is Typosquatted or not. (google.com → goggle.com)
  • Checking whether it includes a legitimate brand name or not (apple-icloud-login.com)
  • Number of subdomains in URL
  • Is the TLD one of the commonly used ones?
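Several of the URL-Based Features listed above can be computed directly from the URL string. The following is a minimal sketch; the COMMON_TLDS and BRAND_NAMES sets are small hypothetical stand-ins for the curated lists a real system would need, and the subdomain count assumes a single-label TLD.

```python
from urllib.parse import urlsplit

# Hypothetical reference lists, for illustration only.
COMMON_TLDS = {"com", "org", "net", "edu", "gov"}
BRAND_NAMES = {"apple", "icloud", "paypal", "google", "microsoft"}


def url_features(url):
    """Compute a few simple URL-based features."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return {
        "digit_count": sum(ch.isdigit() for ch in url),
        "url_length": len(url),
        # Labels in front of the registered domain and the TLD.
        "subdomain_count": max(len(labels) - 2, 0),
        "has_brand_name": any(brand in host for brand in BRAND_NAMES),
        "common_tld": labels[-1] in COMMON_TLDS,
    }


print(url_features("http://apple-icloud-login.com/verify?id=123"))
```

Note that the brand-name check is also true for the brand's own legitimate domain, so in practice it would have to be combined with a list of known legitimate domains.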

Domain-Based Features


The purpose of phishing domain detection is to detect phishing domain names. Therefore, passive queries about the domain name that we want to classify as phishing or not provide useful information. Some useful Domain-Based Features are given below.
  • Is the domain name or its IP address in the blacklists of well-known reputation services?
  • How many days passed since the domain was registered?
  • Is the registrant name hidden?
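The three Domain-Based Features above could be derived roughly as follows. This sketch assumes that the registration date, registrant name, and blacklist entries have already been obtained elsewhere (for example from WHOIS records and published blacklists); the function name, its parameters, and the sample values are hypothetical.

```python
from datetime import date


def domain_features(domain, ip, registered_on, registrant_name,
                    blacklist, today=None):
    """Compute simple domain-based features from pre-collected data.

    Assumption: `blacklist` is a set of domain names and IP addresses,
    and `registrant_name` is None when the registrant is hidden.
    """
    today = today or date.today()
    return {
        "blacklisted": domain in blacklist or ip in blacklist,
        "age_days": (today - registered_on).days,
        "registrant_hidden": registrant_name is None,
    }


features = domain_features(
    "example.com", "203.0.113.7",
    registered_on=date(2017, 4, 1),
    registrant_name=None,
    blacklist={"203.0.113.7"},
    today=date(2017, 5, 3),
)
print(features)
```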

Page-Based Features


Page-Based Features use information about pages that is calculated by reputation ranking services. Some of these features indicate how reliable a website is. Some Page-Based Features are given below.
  • Global Pagerank
  • Country Pagerank
  • Position in the Alexa Top 1 Million sites
Some Page-Based Features give us information about user activity on the target site. Some of these features are given below. Obtaining these types of features is not easy, and there are paid services that provide them.
  • Estimated Number of Visits for the domain on a daily, weekly, or monthly basis
  • Average Pageviews per visit
  • Average Visit Duration
  • Web traffic share per country
  • Count of references from social networks to the given domain
  • Category of the domain
  • Similar websites etc.

Content-Based Features


Obtaining these types of features requires an active scan of the target domain. Page contents are processed to detect whether the target domain is used for phishing or not. Some of the processed information about pages is given below.
  • Page Titles
  • Meta Tags
  • Hidden Text
  • Text in the Body
  • Images etc.
By analysing this information, we can gather details such as:
  • Whether logging in to the website is required
  • Website category
  • Information about audience profile etc.
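As a small illustration of processing page content, the sketch below uses Python's standard html.parser module to pull out the page title and meta tags, and treats the presence of a password field as a hint that the site asks for a login. The class name PageScanner and the password-field heuristic are assumptions made for this example; fetching the page itself is left out.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect the title, named meta tags and a login hint from HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.has_password_field = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "input" and attrs.get("type") == "password":
            # Heuristic: a password input suggests a login form.
            self.has_password_field = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


html = ('<html><head><title>Sign in</title>'
        '<meta name="description" content="Account login"></head>'
        '<body><form><input type="password" name="pw"></form></body></html>')
scanner = PageScanner()
scanner.feed(html)
print(scanner.title, scanner.meta, scanner.has_password_field)
```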


All of the features explained above are useful for phishing domain detection. In some cases, however, some of them may not be practical, so there are limitations on their use. For example, Content-Based Features may not be a sensible choice for a fast detection mechanism that must analyze between 100,000 and 200,000 domains. As another example, Page-Based Features are not very useful if we want to analyze newly registered domains. Therefore, the features used by the detection mechanism depend on its purpose, and they should be selected carefully.

Detection Process

Detecting phishing domains is a classification problem, which means that in the training phase we need labeled data with samples of both phishing domains and legitimate domains. (If the ML terms do not look familiar to you, we recommend reading the Technical Review section of this article.) The dataset used in the training phase is very important for building a successful detection mechanism. We have to use samples whose classes are precisely known: samples labeled as phishing must definitely be phishing, and samples labeled as legitimate must definitely be legitimate. If we use samples that we are not sure about, the system will not work correctly. For this purpose, some public phishing datasets have been created. Some of the well-known ones are PhishTank and TechHelpList. These data sources are commonly used in academic studies.

Collecting legitimate domains is another problem. For this purpose, site reputation services are commonly used. These services analyze and rank available websites. The ranking may be global or country-based, and the ranking mechanism depends on a wide variety of features. Websites with high rank scores are legitimate sites that are used very frequently. One well-known reputation ranking service is Alexa. Researchers use Alexa's top lists as a source of legitimate sites.
Once we have raw data for phishing and legitimate sites, the next step is to process this data and extract meaningful information from it to detect fraudulent domains. The dataset used for machine learning must consist of these features. So we must process the raw data collected from Alexa, PhishTank, or other data sources and create a new dataset to train our system with machine learning algorithms. The features should be selected according to our needs and purposes, and their values should be calculated for every sample.
There are many machine learning algorithms, and each has its own working mechanism. In this article, we explain the Decision Tree algorithm because, in my view, it is simple and powerful.
As we mentioned above, phishing domain detection is a classification problem, so we need labeled instances to build the detection mechanism. In this problem we have two classes: (1) phishing and (2) legitimate.
When we calculate the features that we have selected according to our needs and purposes, our dataset looks like the one in the figure below. In our example, we selected 12 features and calculated them. Thus we generated a dataset to be used in the training phase of the machine learning algorithm.
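To make the shape of such a dataset concrete, here is a minimal sketch that turns two lists of raw URLs into feature rows with class labels. The function names and example URLs are hypothetical, and only two toy features are computed instead of the 12 used in the article's example.

```python
def simple_features(url):
    """Two toy features; a real dataset would contain many more."""
    return {
        "length": len(url),
        "digit_count": sum(ch.isdigit() for ch in url),
    }


def build_dataset(phishing_urls, legitimate_urls, extract_features):
    """Return (rows, labels) for a two-class training set."""
    rows, labels = [], []
    for label, urls in (("phishing", phishing_urls),
                        ("legitimate", legitimate_urls)):
        for url in urls:
            rows.append(extract_features(url))
            labels.append(label)
    return rows, labels


rows, labels = build_dataset(
    ["http://paypal.com.active-userid.com/login"],  # e.g. from a phishing feed
    ["http://example.com/"],                        # e.g. from a top-sites list
    simple_features,
)
for row, label in zip(rows, labels):
    print(row, label)
```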
A Decision Tree can be considered an improved nested if-else structure in which the features are checked one by one. To do this, the Decision Tree algorithm builds a tree model. An example tree model is given below.
The generated tree is the main structure of the detection mechanism. The yellow, elliptical shapes represent features and are called nodes. The green, angular shapes represent classes and are called leaves. When a sample arrives, the “length” feature is checked first, and then the other features are checked according to the result. When the sample's journey through the tree is complete, the class it belongs to becomes clear.
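The nested if-else view of a decision tree can be written out directly. In the sketch below only the root feature, length, comes from the text; the other features, the thresholds, and the resulting classes are hypothetical and stand in for whatever a trained tree would contain.

```python
def classify(sample):
    """A hand-written decision tree with hypothetical thresholds."""
    # Root node: the "length" feature is checked first.
    if sample["length"] > 54:
        # Inner node on the long-URL branch.
        if sample["has_brand_name"]:
            return "phishing"      # leaf
        return "legitimate"        # leaf
    # Inner node on the short-URL branch.
    if sample["subdomain_count"] > 2:
        return "phishing"          # leaf
    return "legitimate"            # leaf


print(classify({"length": 80, "has_brand_name": True, "subdomain_count": 0}))
print(classify({"length": 20, "has_brand_name": False, "subdomain_count": 1}))
```

A learned tree has the same shape; the difference is that the algorithm, rather than a person, chooses which feature is tested at each node.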
Now, the most important question about Decision Trees has not been answered yet: which feature should be placed at the root, and which ones should come after it? Choosing features intelligently directly affects the efficiency and success rate of the algorithm.
So, how does the decision tree algorithm select features?
The Decision Tree uses a measure that indicates how well a given feature separates the training examples according to their target classification. This measure is called Information Gain. The mathematical equation of Information Gain is given below.
A high Gain score means that the feature has a high distinguishing ability. Because of this, the feature with the maximum Gain score is selected as the root. Entropy is a statistical measure from information theory that characterizes the (im)purity of an arbitrary collection S of examples. The mathematical equation of Entropy is given below.
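For reference, the two measures are usually written in the following standard textbook form, as used in ID3-style decision trees. The notation is an assumption of this sketch: c is the number of classes, p_i is the proportion of examples in S that belong to class i, A is a feature, and S_v is the subset of S for which A takes the value v.

```latex
\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

\mathrm{Gain}(S, A) = \mathrm{Entropy}(S)
  - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```

With two classes, the entropy is 0 for a collection containing only one class and 1 for a collection split evenly between the two.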
The Original Entropy is a constant value, while the Relative Entropy changes with the split. A low Relative Entropy score means high purity; likewise, a high Relative Entropy score means low purity. As we move down the tree, we want to increase the purity, because high purity at the leaves implies a high success rate.
In the training phase, the dataset is divided into two parts by comparing feature values. In our example we have 14 samples. The “+” sign represents the phishing class and the “-” sign represents the legitimate class. We divided these samples into two parts according to the length feature: seven of them go to the right and the other seven go to the left. As shown in the figure below, the right part of the tree has high purity, which means a low Entropy score (E); likewise, the left part of the tree has low purity and a high Entropy score (E). All calculations were done according to the equations given above. The Information Gain score for the length feature is 0.151.
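The calculation can be reproduced with a few lines of code. The class counts used below are an assumption, since the figure is not reproduced here: 9 phishing and 5 legitimate samples overall, with 3 phishing and 4 legitimate on the left and 6 phishing and 1 legitimate on the right. They were chosen because they match the 7/7 split described above and give a gain of roughly 0.15, close to the stated score.

```python
from math import log2


def entropy(pos, neg):
    """Entropy of a two-class collection with the given class counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # an empty class contributes nothing
            p = count / total
            result -= p * log2(p)
    return result


def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children.

    `parent` is a (pos, neg) pair, `children` a list of (pos, neg) pairs.
    """
    total = sum(parent)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - weighted


# Assumed counts: (phishing, legitimate).
parent = (9, 5)
left, right = (3, 4), (6, 1)

print("E(left)  =", round(entropy(*left), 3))
print("E(right) =", round(entropy(*right), 3))
print("Gain     =", round(information_gain(parent, [left, right]), 3))
```

The right part, being purer, has the lower entropy, which agrees with the description of the figure.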
The Decision Tree algorithm calculates this information for every feature and selects the features with the maximum Gain scores. To grow the tree, leaves are replaced by nodes that represent a feature. As the tree grows downward, all leaves will have high purity. When the tree is big enough, the training process is complete.
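The selection step described above can be sketched as a function that computes the gain of every candidate feature and returns the best one. The feature names and the four-sample dataset are made up for illustration, and the features are assumed to be categorical (here, boolean).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total)
                for c in Counter(labels).values())


def gain(rows, labels, feature):
    """Information gain of splitting on one categorical feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels)
                  if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """Pick the feature with the maximum gain, e.g. for the root node."""
    return max(features, key=lambda f: gain(rows, labels, f))


# Made-up toy data: "long" separates the classes, "digits" does not.
rows = [
    {"long": True, "digits": True},
    {"long": True, "digits": False},
    {"long": False, "digits": True},
    {"long": False, "digits": False},
]
labels = ["phishing", "phishing", "legitimate", "legitimate"]

print(best_feature(rows, labels, ["long", "digits"]))
```

Growing the full tree amounts to repeating this choice on each subset until the leaves are pure enough or no features remain.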
The tree created by selecting the most distinguishing features represents the model structure for our detection mechanism. Creating a mechanism with a high success rate depends on the training dataset. For the system's success to generalize, the training set must consist of a wide variety of samples taken from a wide variety of data sources. Otherwise, our system may achieve a high success rate on our dataset but fail to work well on real-world data.

How Does Normshield Do It

Normshield analyzes daily registered domains using Natural Language Processing (NLP) and other machine learning techniques. Besides the features described above, we use many more technical features and process them using machine learning algorithms.