Crawling – PortSwigger
The crawl phase of a scan involves navigating around the application, following links, submitting forms, and logging in where necessary, to catalog the application’s content and the navigation paths within it. This seemingly simple task presents a variety of challenges that Burp’s crawler is able to overcome, to create an accurate map of the application.
By default, Burp’s crawler navigates around the target application using Burp’s browser, clicking links and submitting input where possible. It builds a map of the application’s content and functionality in the form of a directed graph, representing the different locations in the application and the links between those locations:
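The crawl map described above can be pictured as a small directed-graph structure. The sketch below is a hypothetical, heavily simplified illustration (the class and method names are invented, not Burp’s internals): locations are keyed by a content-derived fingerprint, and an edge’s destination stays unknown until the link is actually followed.

```python
# Conceptual sketch of a crawl map (not Burp's implementation):
# a directed graph of locations, where an edge with an unknown
# destination represents a link observed but not yet visited.

from dataclasses import dataclass, field


@dataclass
class Location:
    fingerprint: str                            # content-derived identifier
    edges: dict = field(default_factory=dict)   # link label -> destination (None = unvisited)


class CrawlMap:
    def __init__(self):
        self.locations = {}

    def add_location(self, fingerprint):
        if fingerprint not in self.locations:
            self.locations[fingerprint] = Location(fingerprint)
        return self.locations[fingerprint]

    def add_edge(self, src, label, dst=None):
        # Record a link seen at `src`; `dst` stays None until it is followed.
        self.add_location(src).edges[label] = dst
        if dst is not None:
            self.add_location(dst)

    def pending_edges(self):
        # Edges observed but not yet followed - the crawl frontier.
        return [(loc.fingerprint, label)
                for loc in self.locations.values()
                for label, dst in loc.edges.items() if dst is None]
```

A map built this way lets the crawler ask, at any point, which observed transitions still need to be explored.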
The crawler makes no assumptions about the URL structure used by the application. Locations are identified (and later re-identified) based on their content, not the URL that was used to reach them. This allows the crawler to reliably handle modern applications that put ephemeral data, such as CSRF tokens or cache busters, into URLs. Even though the entire URL of each link changes on every occasion, the crawler still constructs an accurate map:
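The idea of identifying locations by content rather than by URL can be illustrated with a toy fingerprint function. This is an assumption-laden sketch (the regex and the `csrf` field name are invented for illustration): it strips an ephemeral token value before hashing, so two responses reached via different token-bearing URLs yield the same fingerprint.

```python
# Sketch of content-based location identity (simplified, hypothetical):
# hash the response after removing values that change on every visit.

import hashlib
import re


def location_fingerprint(response_body: str) -> str:
    # Strip hidden CSRF-style token values before hashing (naive illustration).
    stable = re.sub(r'name="csrf" value="[^"]*"', 'name="csrf"', response_body)
    return hashlib.sha256(stable.encode()).hexdigest()[:16]


# Two visits to the "same" page, each carrying a fresh ephemeral token:
page_a = '<form><input name="csrf" value="a1b2"></form><h1>Checkout</h1>'
page_b = '<form><input name="csrf" value="z9y8"></form><h1>Checkout</h1>'
```

With the token removed, `page_a` and `page_b` fingerprint identically and are treated as one location, while a genuinely different page produces a different fingerprint.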
This approach also allows the crawler to handle applications that use the same URL to reach different locations, depending on the state of the application or the user’s interaction with it:
As the crawler navigates and extends its coverage of the target application, it tracks the edges of the graph that have not yet been completed. These represent links (or other navigation transitions) that have been observed in the application but not yet visited. The crawler never “jumps” to a pending link and visits it out of context. Instead, it either navigates directly from its current location, or it returns to the start location and navigates from there. This replicates as closely as possible the actions of a normal user browsing the application:
Crawling in a way that makes no assumptions about URL structure is highly effective at handling modern web applications, but can potentially lead to problems by seeing “too much” content. Modern websites often contain a mass of superfluous navigation paths (via footers, hamburger menus, etc.), meaning that everything is directly linked to everything else. Burp’s crawler uses various techniques to address this problem: it fingerprints links to previously visited locations to avoid visiting them redundantly; it crawls in a breadth-first order that prioritizes the discovery of new content; and it has configurable cutoffs that limit the extent of the crawl. These measures also make it possible to handle “infinite” applications, such as calendars, correctly.
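The combination of revisit avoidance and exploration cutoffs can be sketched as a breadth-first crawl with a depth threshold. This is a simplified model, not Burp’s implementation; the `links` callable stands in for fetching a page and extracting its links.

```python
# Hedged sketch of breadth-first exploration with a depth cutoff, which
# avoids redundant revisits and terminates "infinite" link structures
# such as calendars. `links(location)` is a hypothetical stand-in for
# fetching a page and parsing out its outbound links.

from collections import deque


def crawl(start, links, max_depth=3):
    seen = {start}
    queue = deque([(start, 0)])
    visited_order = []
    while queue:
        location, depth = queue.popleft()
        visited_order.append(location)
        if depth >= max_depth:
            continue                      # cutoff stops infinite expansion
        for nxt in links(location):
            if nxt not in seen:           # fingerprint check avoids revisits
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited_order
```

Run against an “infinite” calendar where every month links to the next, the depth cutoff bounds the crawl; run against a densely cross-linked site, the `seen` check prevents the crawler from endlessly circling through footer and menu links.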
When Burp’s crawler navigates around a target application using Burp’s browser, it can automatically handle virtually any session management mechanism that modern browsers support. There is no need to record macros, or to set up session handling rules telling Burp how to obtain a session or verify that the current session is valid.
The crawler employs multiple crawler “agents” to parallelize its work. Each agent represents a distinct user of the application navigating with their own browser. Each agent has its own cookie jar, which is updated when the application issues a cookie. When an agent returns to the start location to begin crawling from there, its cookie jar is cleared, to simulate a completely fresh browser session.
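Per-agent session isolation can be modeled in a few lines. This is purely illustrative (the `Agent` class and its methods are invented): each agent holds its own private cookie jar, and clears it whenever it returns to the start location.

```python
# Illustrative sketch (not Burp's code) of per-agent session isolation:
# each agent keeps its own cookie jar, and clears it when returning to
# the start location, simulating a fresh browser session.

class Agent:
    def __init__(self, name):
        self.name = name
        self.cookies = {}                 # this agent's private cookie jar

    def receive_cookie(self, key, value):
        self.cookies[key] = value         # updated when the app issues a cookie

    def return_to_start(self):
        self.cookies.clear()              # fresh session for the new pass
```

Because each agent’s jar is independent, one agent picking up a session cookie has no effect on the sessions of the others.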
The requests that the crawler makes while crawling are constructed dynamically based on the preceding response, so CSRF tokens in URLs or form fields are handled automatically. This allows the crawler to correctly navigate functions that use complex session handling, without any configuration by the user:
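Carrying a per-response CSRF token forward into the next request can be sketched as follows. The token’s field name and the page markup are hypothetical; the point is only that the next request is built from the previous response rather than replayed verbatim.

```python
# Illustrative sketch (not Burp's code): build the next request from the
# previous response, carrying forward a per-page CSRF token automatically.

import re
from urllib.parse import urlencode


def build_form_submission(previous_response: str, fields: dict) -> str:
    # Lift the hidden token out of the form we are about to submit
    # (the "csrf" field name is an assumption for illustration).
    match = re.search(r'name="csrf" value="([^"]+)"', previous_response)
    data = dict(fields)
    if match:
        data["csrf"] = match.group(1)     # token travels with the request
    return urlencode(data)


page = '<form action="/cart"><input type="hidden" name="csrf" value="tok42"></form>'
body = build_form_submission(page, {"productId": "7"})
```

Because the token is re-read from each fresh response, the sketch keeps working even when the application rotates the token on every page load.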
Detecting application state changes
Modern web applications are highly dynamic and it is common for the same application function to return different content on different occasions as a result of actions taken by the user in between. Burp’s crawler is able to detect application state changes that result from actions it performed while crawling.
In the example below, navigating the path B→C causes the application to transition from state 1 to state 2. Link D leads to a logically different location in state 1 than in state 2. So the path A→D reaches the empty cart, while the path A→B→C→D reaches the populated cart. Rather than simply concluding that link D is non-deterministic, the crawler is able to identify the state-changing path that link D depends on. This allows the crawler to reliably reach the location with the populated cart in the future, in order to access the other functions that are available from there:
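The cart example can be modeled as a tiny state machine. The class and link names below are invented to mirror the example: following link D yields a different destination depending on whether the state-changing path B→C was walked first.

```python
# Simplified model of the cart example (hypothetical names, not Burp's
# code): link D leads to different locations depending on whether the
# state-changing path B->C was walked beforehand, so a state-aware
# crawler records that prefix as a prerequisite for reaching the
# populated cart.

class ShopApp:
    def __init__(self):
        self.cart_filled = False

    def follow(self, link):
        if link == "C":
            self.cart_filled = True       # B->C transitions state 1 -> state 2
        if link == "D":
            return "populated-cart" if self.cart_filled else "empty-cart"
        return link


def destination_of_path(path):
    # Replay a navigation path from a fresh session; return the final location.
    app = ShopApp()
    result = None
    for link in path:
        result = app.follow(link)
    return result
```

Replaying the two paths from the text shows why link D alone is not enough to identify the destination, while the full prefix is.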
Logging in to the application
Burp’s crawler begins with an unauthenticated phase, in which no credentials are submitted. Once this is complete, Burp will have discovered any login and self-registration functions within the application.
If the application supports self-registration, Burp will attempt to register a user. You can also configure the crawler to use one or more pre-existing accounts.
The crawler then proceeds to an authenticated phase. It will visit the login function multiple times and submit:
- Self-registered account credentials (if applicable).
- The credentials for each pre-existing account configured.
- Bogus credentials (these can reach interesting functions, such as account recovery).
For each set of credentials that is submitted to the login, Burp then crawls the content that is discovered behind the login. This allows the crawler to capture the different functions that are available to different types of user:
Crawling volatile content
Modern web applications frequently contain volatile content, where the “same” location or function will return responses that differ significantly on different occasions, not necessarily as a result of user action. This behavior may result from factors such as social media channel feeds or user comments, online advertising, or truly random content (post of the day, A/B testing, etc.).
Burp’s crawler is able to identify many instances of volatile content, and correctly re-identify the same location on different visits, despite the differing responses. This allows the crawler to focus its attention on the “core” elements within a set of application responses, which is likely to be what matters most when discovering the main navigation paths to interesting content and functionality within the application:
In some cases, visiting a given link on different occasions returns responses that differ too much to be treated as the “same” location. In this situation, Burp’s crawler captures both versions of the response as two distinct locations, and plots a non-deterministic edge in the graph. Provided the extent of non-determinism within the application is not too great, Burp can still crawl the associated content and reliably find its way to the content behind the non-deterministic link:
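One simple way to picture the distinction between volatile-but-identical locations and genuinely non-deterministic links is a similarity measure over responses. The Jaccard word-set comparison and the 0.7 threshold below are illustrative assumptions, not Burp’s actual matching logic.

```python
# Hedged sketch of volatile-content handling: compare two responses by
# word-set (Jaccard) similarity; above a threshold they are treated as
# the same location, below it the link is marked non-deterministic.

def similarity(resp_a: str, resp_b: str) -> float:
    a, b = set(resp_a.split()), set(resp_b.split())
    return len(a & b) / len(a | b)


def classify(resp_a: str, resp_b: str, threshold=0.7) -> str:
    if similarity(resp_a, resp_b) >= threshold:
        return "same-location"            # volatile fringe, core content shared
    return "non-deterministic"            # too different to unify


# Same page on two visits, differing only in a rotating ad block:
visit1 = "home nav products contact about careers ad-123"
visit2 = "home nav products contact about careers ad-456"
```

Under this toy measure, the two visits with a rotating ad block are unified into one location, while two responses sharing no core content are kept apart as a non-deterministic edge.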
Crawling with Burp’s browser (browser-powered scanning)
By default, provided your machine appears to support it, Burp uses Burp’s browser for all navigation of your target websites and applications. This approach offers several major benefits, enabling Burp Scanner to handle most client-side technologies that modern browsers can use.
If you prefer, you can manually enable or disable browser-powered scanning in your scan configuration. You can find this option under Explore Options > Miscellaneous > Burp Browser Options.