Catch phishing with web scraping
Phishing is, unfortunately, profitable, difficult to detect, and relatively easy to engage in. With accelerated digital transformation across the globe, phishing is set to continue its explosive growth.
According to PhishLabs, the number of phishing attempts in the first quarter of 2021 increased by almost 50%. There is no reason to believe that climb will stop either.
This means increased levels of damage and digital risk. To counter such an increase, new phishing detection approaches must be tested or current ones improved. One way to improve on existing approaches is to use web scraping.
Phishers would have a hard time completely replicating the original website. Placing every URL identically, replicating all images, matching the domain age, and so on would take more effort than most are willing to put in.
Also, a perfect spoof would likely have a lower success rate due to the possibility of the target getting lost (by clicking on an unrelated URL). Finally, as with any other scam, tricking everyone is not necessary, so a perfect replica would be wasted effort in most cases.
However, phishers are not stupid. Or at least the ones who succeed are not. They always do their best to create a believable replica with the least amount of effort required. A cheap replica may not fool the tech-savvy, but then even a perfect replica may not fool the distrustful. In short, phishing is all about being "just good enough".
Therefore, due to the nature of the business, there are always one or two glaring holes to be discovered. Two good ways to get a head start are to look for similarities between frequently impersonated websites (e.g. fintech, SaaS, etc.) and suspected phishing websites, or to collect patterns of known attacks and progress from there.
Unfortunately, with the volume of phishing websites popping up daily and intent on targeting less tech-savvy people, solving the problem may not be as straightforward as it first appears. Of course, as is often the case, the answer is automation.
Numerous detection methods have been developed over the years. A 2018 review article published on ScienceDirect lists three broad approaches: URL-based detection, layout recognition, and content-based detection. The first often lags behind phishers, as blocklist databases update more slowly than new websites appear. Layout recognition is based on human heuristics and is therefore more prone to failure. Content-based detection is computationally heavy.
We will pay a little more attention to layout recognition and content-based detection, as these are complicated processes that benefit greatly from web scraping. Some time ago, a group of researchers created a framework for detecting phishing websites called CANTINA. It was a content-based approach that checked signals such as TF-IDF scores, domain age, suspicious URLs, improper use of punctuation marks, and more. However, the study was published in 2007, when the possibilities for automation were limited.
Web scraping can significantly improve such a framework. Instead of manually hunting for outliers, automated apps can crawl websites and download relevant content from them. Important details such as those described above can then be extracted from the content, analyzed, and evaluated.
Build a net
CANTINA, as developed by the researchers, had one drawback: it was only ever used to prove a hypothesis. For that purpose, a database of phishing and legitimate websites had been compiled, so the status of both was known a priori.
Such methods are suitable for proving a hypothesis. They are not so good in practice where we don’t know the status of websites in advance. Practical applications of projects similar to CANTINA would require a significant amount of manual effort. At some point, these apps would no longer be considered “practical”.
Theoretically, however, content-based recognition seems like a strong contender. Phishing websites must reproduce the original content almost identically. Any incongruities, such as misplaced images, misspellings, or missing pieces of text, can trigger suspicion. They can never stray too far from the original, which means metrics like TF-IDF should be similar by necessity.
The downside of content-based recognition has been the slow and expensive manual work. However, web scraping shifts most manual effort to full automation. In other words, it allows us to use existing detection methods on a much larger scale.
First, instead of manually collecting URLs or pulling them from an already existing database, scraping can quickly build its own. Candidate URLs can be harvested from any content that links to the suspected phishing websites, in whatever form.
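As a rough sketch of that harvesting step, the standard-library `HTMLParser` can pull every absolute hyperlink out of a scraped page; the `collect_links` helper and the example URLs are illustrative, not part of any particular tool.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects absolute http(s) hyperlinks from raw HTML for later inspection."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    absolute = urljoin(self.base_url, value)
                    if urlparse(absolute).scheme in ("http", "https"):
                        self.links.add(absolute)

def collect_links(html, base_url):
    """Return the set of candidate URLs found in one scraped page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Fed with pages from forums, emails, or ad networks, the resulting set becomes the crawl frontier for the detection steps below.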
Second, a scraper can crawl through a collection of URLs faster than any human ever could. Manual review has its advantages, such as seeing a website's structure and content as rendered instead of as raw HTML code. Visual representations, however, are of little use when we rely on mathematical detection methods such as link depth and TF-IDF. They can even serve as a distraction, steering us away from important details through human heuristics.
Parsing itself also becomes a detection signal. Parsers frequently break when a website's layout or design changes. If parsing a suspect page produces unusual errors compared to the same process run on the legitimate website, those errors may indicate a phishing attempt.
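One minimal way to operationalize that signal: record a few layout markers from the legitimate site and flag suspect pages that fail to match them. The markers below (an element id, a form action, a stylesheet name) are hypothetical examples, not taken from any real site.

```python
import re

# Hypothetical layout markers captured once from the legitimate site:
# element ids, form actions, stylesheet names, etc.
EXPECTED_MARKERS = [
    r'id="main-nav"',
    r'action="/session/new"',
    r'href="[^"]*brand\.css"',
]

def missing_markers(html):
    """Return the expected layout markers a suspect page fails to match.

    A non-empty result is the 'unusual parsing error' signal: the page
    claims to be the original but its structure does not line up.
    """
    return [marker for marker in EXPECTED_MARKERS if not re.search(marker, html)]
```

A real deployment would derive the markers automatically from periodic scrapes of the genuine site, so that legitimate redesigns update the expectations rather than triggering false alarms.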
Ultimately, web scraping doesn’t produce completely new methods, at least as far as I know, but it does enable older ones. It offers a way to scale methods that might otherwise be too expensive to implement.
Cast a net
With proper web scraping infrastructure, millions of websites can be accessed daily. As a scraper collects source HTML, all the textual content ends up stored wherever we want it. After a little parsing, the plain-text content can be used to calculate TF-IDF. A project would probably start by collecting all the important metrics from popular phishing targets and then move on to detection.
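The TF-IDF step needs nothing beyond the standard library. A minimal sketch, assuming the scraped pages have already been reduced to plain text; the tokenizer and the cosine comparison are simplified for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercase alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf(documents):
    """Compute sparse TF-IDF vectors for a small corpus of plain-text pages."""
    tokenized = [tokenize(doc) for doc in documents]
    n = len(tokenized)
    df = Counter()                      # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse TF-IDF vectors."""
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0
```

A suspect page whose vector sits unusually far from the legitimate page it imitates, or suspiciously close to a known phishing template, is worth a closer look.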
Moreover, there is a lot of interesting information that we can extract from the source. All internal links can be visited and stored in an index to create a representation of the overall link depth.
It is possible to detect phishing attempts by building a website tree through indexing with a web crawler. Most phishing websites will be shallow, for the reasons described earlier, whereas the well-established companies they copy will have considerable link depth. The shallowness itself can be an indicator of a phishing attempt.
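A breadth-first crawl over internal links is enough to measure that depth. In this sketch the fetch-and-parse step is injected as a `get_links` callable so the logic stays self-contained; in production it would wrap an HTTP client plus the link collector described earlier.

```python
from collections import deque
from urllib.parse import urlparse

def max_link_depth(start_url, get_links, limit=500):
    """Breadth-first crawl over same-domain links; returns the deepest level reached.

    `get_links(url)` stands in for a real fetch-and-parse step: it can be
    backed by a dict in tests and by an HTTP client in production.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url: 0}               # url -> depth from the start page
    queue = deque([start_url])
    deepest = 0
    while queue and len(seen) < limit:  # cap the crawl so it always terminates
        url = queue.popleft()
        for link in get_links(url):
            # Skip external links and pages we have already indexed.
            if urlparse(link).netloc != domain or link in seen:
                continue
            seen[link] = seen[url] + 1
            deepest = max(deepest, seen[link])
            queue.append(link)
    return deepest
```

A landing page whose entire internal tree bottoms out one or two clicks in, while the site it imitates goes many levels deep, fits the shallow-replica pattern.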
The collected data can then be used to compare TF-IDF, keywords, link depth, domain age, and other metrics against those of the legitimate website. Any mismatch would be a source of suspicion.
There is one caveat that must be decided on the fly: what margin of difference is cause to investigate? A line in the sand has to be drawn somewhere and, at least initially, it will have to be quite arbitrary.
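One way to draw that line is a weighted suspicion score over the metrics discussed above. Everything here is an assumption to be tuned: the metric names, the weights, and any eventual cut-off are arbitrary starting points, to be refined as labeled examples accumulate.

```python
def suspicion_score(candidate, reference, weights=None):
    """Weighted relative difference between a candidate page's metrics and
    the legitimate site's. Metric names and weights are illustrative."""
    weights = weights or {
        "tfidf_similarity": 2.0,   # content match matters most
        "link_depth": 1.0,
        "domain_age_days": 1.0,
    }
    score = 0.0
    for metric, weight in weights.items():
        ref = reference[metric]
        # Relative difference, capped at 1 so no single metric dominates.
        diff = abs(candidate[metric] - ref) / ref if ref else 1.0
        score += weight * min(diff, 1.0)
    return score
```

Pages scoring above the chosen threshold go to a human analyst; the threshold itself is exactly the arbitrary line in the sand the text describes.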
Additionally, there is an important consideration around IP addresses and locations. Some content on a phishing website may only be visible to IP addresses from a specific geographic location (or hidden from that location). Working around such restrictions is normally difficult, but proxies offer an easy solution.
Since a proxy always has an associated location and IP address, a large enough pool will provide global coverage. Whenever a geographic block is encountered, a simple proxy switch is enough to clear the obstacle.
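The switching logic itself is simple. In this sketch the proxy addresses are placeholders and `fetch(url, proxy)` stands in for a real HTTP request that returns `None` when it hits a geo-block.

```python
def fetch_via_proxies(url, proxies, fetch):
    """Try each proxy in turn until one returns content.

    `fetch(url, proxy)` is a stand-in for a real HTTP request; it should
    return the response body, or None when the request is geo-blocked.
    """
    for proxy in proxies:
        result = fetch(url, proxy)
        if result is not None:
            return proxy, result
    raise RuntimeError("all proxies blocked for " + url)
```

With a pool ordered by region, a geo-blocked request simply falls through to the next location until the page's true content becomes visible.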
Finally, web scraping, by its nature, makes it possible to discover a lot of data on a specific subject. Most of it is unstructured, something usually fixed by parsing, and unlabeled, something usually fixed by humans. Structured and labeled data provides excellent ground for machine learning models.
End the phishing
Building an automated phishing detector through web scraping produces a lot of data to evaluate. Once evaluated, the data would generally lose its value. However, as with recycling, this information can be reused with some tweaking.
Machine learning models have the disadvantage of requiring huge amounts of data to start making predictions of acceptable quality. Yet, if phishing detection algorithms started using web scraping, this amount of data would be produced naturally. Of course, labeling might be required, which would require considerable manual effort.
Regardless of this, the information would already be structured to produce acceptable results. Although all machine learning models are black boxes, they are not entirely opaque. We can predict that data structured and labeled in a certain way will produce certain results.
For clarity, machine learning models could be thought of as the application of mathematics to physics. Some mathematical models adapt exceptionally well to natural phenomena such as gravity. Gravitational attraction can be calculated by multiplying the gravitational constant by the masses of the two objects and dividing the result by the square of the distance between them. However, knowing how to plug in the data gives us no understanding of gravity itself.
Machine learning models are much the same. A certain data structure produces expected results, but how the model arrives at its predictions remains unclear. As long as the outputs at each stage are as expected, though, the "black box" character does little harm outside of marginal cases.
Additionally, machine learning models appear to be among the most effective methods for detecting phishing. Some automated crawlers with ML implementations can achieve 99% accuracy, according to research published on SpringerLink.
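To make the pipeline concrete end to end, here is a deliberately tiny stand-in for such a model: a hand-rolled logistic regression trained on the scraped metrics discussed earlier (TF-IDF similarity, normalized link depth, normalized domain age). The feature layout and training data are invented for illustration; a real system would use a proper ML library and far more data.

```python
import math

def train_logistic(samples, labels, epochs=2000, lr=0.1):
    """Minimal logistic-regression trainer over scraped feature vectors.

    `samples` are rows of numeric features in [0, 1]; `labels` are
    1 for phishing and 0 for legitimate. Plain SGD, no regularization.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                         # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict_phishing(w, b, x):
    """True when the model scores the page above the 0.5 decision threshold."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z)) > 0.5
```

The point is the data flow, not the model: the scraper's metrics arrive already structured as feature vectors, which is precisely what makes the ML stage cheap to bolt on.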
The future of web scraping
Web scraping seems to be the perfect complement to all current phishing solutions. After all, most cybersecurity involves vast arrays of data to make the right protection decisions. Phishing is no different. At least through the lens of cybersecurity.
There seems to be a holy trinity in cybersecurity waiting to be harnessed to its full potential: analytics, web scraping, and machine learning. There have been a few attempts to combine two of the three together. However, I have yet to see all three exploited to their full potential.