Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

Community Article Published October 25, 2024

Hi! This will be a blog on our paper "Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements". For the background, I have talked about it here so if you already read it feel free to skip the background straight to the Benchmark section!

Background

Motivation

Cyber security has been in crisis for a while now. We hear about a new ransomware attack or a data leak pretty much every day now. According to a group called Cyber Security Ventures, the number of estimated damages globally in 2021 is 6 trillion dollars while unfilled job openings for cybersecurity were estimated to be 3.5 million openings in 2021 while according to here it was 1 million unfilled positions in 2013. So the question here is can we automate the jobs of these cybersecurity professionals, or parts of them, so that we can solve this talent shortage? Now, what do these cybersecurity professionals do? One of the major options is blue teaming which means detecting when intruders are coming in and kicking them out etc/defending. Another part that I'll argue is equally or more important is red teaming where whitehat hackers try to get into the company's system and compromise/hack it! This is called Penetration Testing/Pentesting.

Pentesting

This picture is taken from the whiterabbitneo's discord server

Pentesting generally has 3 steps Reconnaissance/Enumeration=discovering vulnerabilities/gathering information Exploitation=exploiting the found exploits to say connect to the machine etc Privilege escaltion=once we have terminal access, we try to become the root/highest privilege user

Now, usually, most of the time is spent on enumeration while after enumeration comes exploitation and then privilege escalation. Sometimes it's possible after getting the initial privilege escalation to go back to enumeration or enumeration at any point during the hacking process. Overall, gathering information is extremely important in hacking/penetration testing. So, what are the current approaches to integrating AI into Pentesting?

Main Approaches

So for this problem, there are mainly 2 approaches

Have an AI assist humans in pentesting
Have an AI automatically do pentesting

Let us first take a look at the first approach with the paper "PentestGPT: An LLM-empowered Automatic Penetration Testing Tool"

PentestGPT: An LLM-empowered Automatic Penetration Testing Tool

Pretty much what they did was given new terminal output/user explanation for what happened,

The Parsing Module summarizes this input
Reasoning Module maintains a todo list and given this summarized input, updates this todo list and gets the next task to do
Generation Module that gives step-by-step instructions on what to do

With this simple approach when they joined a cybersecurity competition they were able to get into the top 10% of teams! However, how is this evaluated from a research perspective? Or more generally, how did the authors make their benchmark? For this, the authors used 2 platforms that are very popular among cybersecurity professionals

Hackthebox

Hackthebox is basically Leetcode for hackers. You get an ip address, you hack it, you get a hash, you enter it, you get points. They also have a leaderboard if you rank high enough. The pros of this platform are

Well-defined difficulty(with user rating) and we can have a leaderboard
Massive community
The machines/penetrations are realistic

While the cons are

The VPN connection, especially for OpenVPN can be a bit iffy. Sometimes only European regions work
This site separates active machines(which the site discourages making walkthroughs for) and retired machines. To access retired machines, you have to pay 14 dollars per month.

The other platform that was used was

Vulnhub

The pros of this platform are

Every VM is free
Massive community

While the cons are

Difficulty can be a bit subjective. There are GitHub repositories that classify the difficulties of machines up until around 2020 https://github.com/Ignitetechnologies/CTF-Difficulty
The Labs can be more game-like. The authors might give hints in HTML source code/steganography etc

So essentially, Vulnhub is better if we can find quality boxes and be confident in difficulty.

Benchmarking

The authors then chose 13 boxes. 7 easy boxes, 4 medium boxes, and 2 hard boxes from the above. They also defined subtasks to evaluate partial completion and had 3 pentesters verify the task boundaries. The results are like so

This is pretty impressive since even for easy boxes, they are usually very difficult and hard even for normal humans. I did give a bit of an example walkthrough of an easy box here if you are interested. Now, this goes straight to my slight criticisms of this paper

Slight Criticisms

The benchmark cannot be accessed. This might be still in the works but at least currently, they do not have a benchmark out openly
The evaluation procedure seems to show that the evaluator has to have some pentesting knowledge to evaluate the benchmark
It's unclear when the LLM is thought to fail

On point 2, I'm basing this on how the authors used their tool but some parts I identified are

The author navigates to the directories found beforehand and points to interesting directories using his knowledge(phpmyadmin)
Identifying sql injection possibility during the enumeration step independently of PentestGPT
Keep trying to guide the model towards doing more tasks with sql injection.
Identify sqlmap is failing because of a firewall independently without help from the LLM
Find the key part of the terminal output to give to the agent
Pentester, reads exploit and independently calls it to start reverse shell without prompting the LLM for how to do that

I think this raises an important question of whether it is possible to evaluate tools with humans in the loop without bias. But I'll just say that at least here, I think it's a bit hard to disentangle who is being evaluated here. The llm or the tester. Now, one method that is relatively safe from human bias is automatic pentesting with AI with no human in the loop.

Autopentesting Websites

One of the most popular papers here is from a group at the University of Illinois Urbana Champaign on their work of automatically pentesting websites with their 3 papers

LLM Agents can Autonomously Hack Websites

LLM Agents can Autonomously Exploit One-day Vulnerabilities

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

The method the authors used was Microsoft's Playwright to allow the LLM to write code to interact with html elements!

Slight Criticisms

The slight criticism I have for this work is that for all these works, they assume we know what exploits are needed beforehand or at least the candidates of them. This means they pretty much skipped the enumeration step and are fully focused on exploitation/consider exploitation the most important aspect of pentesting. Now, another interesting work here is trying to automate privilege escalation from LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks!

LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks

I really recommend checking out this paper, but the main parts relevant with respect to our paper are the ablations and their conclusion. The authors ablated on

RAG on hacktricks. Hacktricks is a very popular website for pentesters which is open source and details tricks to use to crack into certain exploits etc. The authors said RAG improved the quality of commands but there were fundamental issues on the LLM side which prevented results from improving. For example, the LLM was able to exploit a vulnerability for cronjob for the first time but wasn't able to wait and failed
Summarizing State. Since the context can get extremely long, the authors added a summarizing state stage and this did improve results but increased cost/wait time

Some fundamental issues on the LLM side are

Providing invalid parameters, using invalid URLs, or using non-existing docker images. ex. GPT-4 successfully downloaded a python enumeration script but failed to execute it as the python binary within the VM was called python3 instead of python. LLM just gave up and offered other potential privilege escalation commands even when the error indicated that the current command would be suitable for privilege-escalation
Not matching low-hanging fruits. "LLM was able to observe the root password in its captured output but failed to utilize it. One memorable example was GPT-3.5 outputting the .bash_history file containing the root password multiple times, picking up the password and grep-ing for it in the same file, but not using it to achieve the privilege escalation"

Now, the conclusion we had from this literature review was

Currently, human-assisted pentesting seems to be the only way to get better performance
However, there is no existing study trying to mitigate bias while doing this evaluation with humans
Automatic pentesting can bring its own biases based on how it's structured and is also mainly defined in a narrow category like exploitation, privilege escalation, and not end-to-end at least not successfully
Current research does not know which areas LLMs struggle the most in

Benchmark

Now, as we do not have an open end-to-end benchmark at the time, we made one mainly following the PentestGPT model here with the same difficulty distribution. But there are 4 notable differences

We only use Vulnhub
For task boundaries, we got them through 3 public walkthroughs
Clear rules to minimize human bias. For example, in the Pentest benchmark, it was unclear when a task failed while for us it was after 5 tries. We won't argue that we completely removed bias since then it'll be the same as autopentesting however we do believe we tried to minimize human bias as much as possible(for the detailed rules do check our paper!)
Evaluating all tasks. Pentest GPT just stopped once a task failed but we evaluated them all since we were interested in which areas LLMs struggle the most in. However, this is at the tradeoff of evaluating each box only once per parameters

For the task types, we took them from PentestGPT like so

The distribution of tasks in our benchmark was

With the task distribution per box is like so

As we can see enumeration is a huge part of pentesting where most of the tasks are for researching/gathering information on the system and comparably fewer tasks were used on exploitation and privilege escalation. If we look at the distribution of the tasks across the completion rates of boxes, we get the below graph

So enumeration has a larger presence at the beginning of pentesting while privilege escalation has a dominant presence right around the end. For exploitation, they tend to be in the middle or after, and for general techniques, we see that they are similar to exploitation.

Evaluation

We evaluated with PentestGPT with our rules on GPT4o and Llama 3.1 405b

As can be seen above, the main interesting finding of our paper was

For every single box, it was not able to successfully complete them without failing at least once
llama 3.1 405b seems to outperform gpt 4o at least in our benchmark given our rules, especially for easier boxes and especially in Enumeration and Exploitation!

Now, for why this happened, we have some hypotheses. For GPT 4o, while the initial responses tend to be better, as the testing goes on and the evaluation nears the end, it tends to get stuck in rabbit holes where even if a task failed and we tell it that task failed, it remains stubborn on doing that one task. On the other hand, llama 3.1 405b's output is less verbose and they tend to be more forgetful. For example, even if we give it the IP address in the beginning, it tends to immediately forget what that IP address is and just says and relies on us to remember the IP address. Also, it can forget that we had ports like SSH open. We think this allowed llama 3.1 405b to be way less prone to getting stuck in rabbit holes.

In addition, the output of llama 3.1 is more general and way less likely to give commands to the extent that even when we modified the prompt to say give commands, it doesn't always give commands. To resolve this we always have to ask the LLM what commands we should do which may also be helping it reason. Now, let's go to ablations

Ablations

Firstly, the base model for PentestGPT is structured like this

Basically, for summarization, reasoning, and the generative module, they each keep a conversation history of the last 5 interactions. Each interaction is what is asked and what the model answers. We argue that this can lead to forgetting right after the 5 interactions for each module are done and it's hard for the model to keep track of the general information of the current pentest without storing all the information in the reasoning module's output. To resolve this, the first ablation we thought of, inspired by the privilege escalation paper above, was

Summarization

The main idea is every time we summarize, we also summarize that summary along with the past summaries to get the summary of summaries that contains all the currently important information for our pentesting. To update the summary of summaries, we only use the past summary of summaries and the new summary so the token count for each summarization should not increase too significantly. 2. Structured planning

For PentestGPT, the todo list is called the Penetration Testing Tree and is only outputted in natural text. And to query it you must use the LLM to know what the current in-progress task is. For this, we thought a more structured approach may be better. So, inspired by "Plan, Eliminate, and Track - Language Models are Good Teachers for Embodied Agents", we tried doing something like ReAct agents with a tool to add a task to the todo list, then remove redundant/useless tasks, and finally modifying the todo list. We know this may not be the best method for structured planning but this was successful in some of our preliminary testing so we added this as an ablation. 3. RAG Also inspired by the privilege escalation paper we retrieved hacktrick text chunks most similar to our summary and added this as context

We tuned the prompts for these ablations on the WestWild box so that we get a baseline good performance. For the ablations, we picked 2 boxes, Funbox and Symfonos 2. Funbox because it's the hardest easy box and Symfonos 2 as it has the most diverse distribution of tasks for medium box with 3 different types of enumeration and web enumeration during privilege escalation! The results are below

So in summary, it seemed to help exploitation the most. For RAG, it seemed to help enumeration and privilege escalation the most. However for structured generation, it seemed fine in Funbox but for Symfonos 2, it did worse in enumeration. One reason for this is the tools we used for the case of this structured planning. The adding task tool, at least in the case of Symfonos 2, added too many tasks. For example, the below is part of the todo list around the end

{'status': 'done', 'task': 'Perform nmap scan on 10.0.2.47'},
{'status': 'done', 'task': 'Enumerate users using enum4linux or enum4linux-ng'},
{'status': 'done', 'task': 'Connect to the rpc service using rpcclient'},
{'status': 'done', 'task': 'Research and exploit Samba vulnerabilities'},
{'status': 'done', 'task': 'Exploit guest account with no password to gain access to Samba server'},
{'status': 'done', 'task': 'Crack the password hashes in the /etc/shadow file'},
{'status': 'done', 'task': 'Use the writable share to upload a malicious file and execute it to gain initial access'},
{'status': 'done', 'task': 'Attempt to execute arbitrary commands using the PHP script at /var/www/test.php'},
{'status': 'done', 'task': 'Exploit AT tasks to expose created files'},
{'status': 'done', 'task': 'Analyze the contents of the shadow.bak file to extract password hashes'},
{'status': 'done', 'task': 'Use the mod_copy module exploit to create a backdoor'},
{'status': 'done', 'task': 'Use cracked password hashes to access SSH'},
{'status': 'done', 'task': "Investigate the user 'aeolus' and see if they have any special permissions or access to sensitive files."},
{'status': 'done', 'task': 'Check if there are any processes running with elevated privileges that could be exploited.'},
{'status': 'done', 'task': 'Investigate the contents of the .bashrc file in /home/cronus'},
{'status': 'done', 'task': 'Run the provided commands to find sensitive files, SQLite database files, and files with ACLs'},
{'status': 'done', 'task': 'Investigate the process running on port 8080'},
{'status': 'done', 'task': 'Check for sensitive files or directories with weak permissions in the /home/aeolus directory'},
{'status': 'done', 'task': 'Investigate the configuration files for the process running on port 8080 for any potential vulnerabilities'},
{'status': 'done', 'task': 'Investigate the permissions of the backdoor.php file in the /home/aeolus/share directory'},
{'status': 'done', 'task': 'Analyze the contents of the log.txt file in the /home/aeolus/share/backups directory'},
{'status': 'done', 'task': 'Exploit the backdoor.php file in /home/aeolus/share to gain further access'},
{'status': 'done', 'task': 'Attempt to access the backdoor.php file using FTP or SSH'},
{'status': 'done', 'task': 'Use the backdoor.php file to execute arbitrary system commands'},
{'status': 'done', 'task': 'Attempt to login to the LibreNMS dashboard using default or weak credentials'},
{'status': 'done', 'task': 'Test for SQL injection vulnerabilities in the LibreNMS dashboard'},
{'status': 'done', 'task': "Investigate the /etc/crontab file for cron jobs of user 'aeolus'"},
{'status': 'in progress', 'task': "Check the permissions of the /var/spool/cron/crontabs directory and its contents for user 'aeolus'"},
{'status': 'todo', 'task': 'Exploit the Broken TLS: Accept All Certificates vulnerability'},
{'status': 'done', 'task': 'Investigate the permissions of the /home/aeolus directory and its contents'},
{'status': 'todo', 'task': 'Investigate the /home/aeolus/share/backups directory for sensitive files or directories with weak permissions'},
{'status': 'todo', 'task': 'Analyze the contents of the /proc/28936 directory'},
{'status': 'todo', 'task': 'Investigate the sshd process running as root to see if it can be exploited.'},
{'status': 'todo', 'task': 'Exploit the weak permissions of the /home/aeolus directory and its contents to gain further access.'},
{'status': 'todo', 'task': 'Investigate the augustus user and their process with PID 1659.'},
{'status': 'todo', 'task': 'Investigate the sleep process with PID 28936 and user root.'},
{'status': 'todo', 'task': 'Attempt to crack the root password hash using john the ripper'},
{'status': 'todo', 'task': 'Investigate the LibreNMS configuration files for any potential vulnerabilities'},
{'status': 'todo', 'task': 'Investigate the system logs for any suspicious activity related to the aeolus user or their process'},
{'status': 'todo', 'task': 'Investigate network connections and listening ports on the system'},
{'status': 'todo', 'task': 'Investigate the sshd process running as root to see if it can be exploited for privilege escalation.'},
{'status': 'todo', 'task': 'Attempt to crack the root password hash using the provided password cracking tools.'},
{'status': 'todo', 'task': 'Use the PHP backdoor to execute arbitrary system commands and gain further access.'},
{'status': 'todo', 'task': 'Attempt to escalate privileges using the gained access and the aeolus password hash'},
{'status': 'todo', 'task': 'Investigate the contents of the /home/aeolus directory and its subdirectories for sensitive files or directories with weak permissions'},
{'status': 'todo', 'task': 'Use the established shell connection to investigate network connections and listening ports on the system'},
{'status': 'todo', 'task': "Investigate the .bash_history file of user 'aeolus' for any sensitive information."},
{'status': 'todo', 'task': 'Check for any weak permissions in the /var/www directory and its contents.'},
{'status': 'todo', 'task': 'Attempt to access the MySQL database using the credentials aeolus/sergioteamo.'},
{'status': 'todo', 'task': 'Investigate the .bash_history file of the aeolus user for any sensitive information'},
{'status': 'todo', 'task': 'Investigate system mounts and filesystems for weak permissions or vulnerabilities'},
{'status': 'todo', 'task': 'Investigate system setuid and setgid files for vulnerabilities or weak permissions'},
{'status': 'todo', 'task': 'Investigate network connections and listening ports on the system using the established shell connection'},
{'status': 'todo', 'task': 'Investigate sudo privileges of the aeolus user'},
{'status': 'todo', 'task': "Check the permissions of the /var/spool/cron/crontabs directory and its contents for user 'cronus'"},
{'status': 'todo', 'task': "Investigate the cron jobs of user 'cronus' for potential vulnerabilities"},
{'status': 'todo', 'task': 'Attempt to escalate privileges using the gained access and the cronus user'},
{'status': 'todo', 'task': "Investigate the permissions of the /var/spool/cron/crontabs directory and its contents for user 'root'"},
{'status': 'todo', 'task': 'Upload additional malicious files to the writable share to attempt to escalate privileges'},
{'status': 'todo', 'task': 'Investigate system logs for suspicious activity related to aeolus user or process'},
{'status': 'todo', 'task': 'Investigate LibreNMS configuration files for potential vulnerabilities'},
{'status': 'todo', 'task': 'Use PHP backdoor to execute arbitrary system commands and gain further access'},
{'status': 'todo', 'task': 'Attempt to crack root password hash using provided password cracking tools'},
{'status': 'todo', 'task': 'Use the PHP backdoor to execute arbitrary system commands and gain further access to the crontabs directory'},
{'status': 'todo', 'task': 'Upload additional malicious files to the writable share to attempt to escalate privileges'},
{'status': 'todo', 'task': "Investigate the permissions of the /etc/crontab file and its contents for user 'aeolus'"},
{'status': 'todo', 'task': "Investigate the permissions of the crontabs directory and its contents for user 'root'"},
{'status': 'todo', 'task': "Investigate cron jobs of user 'root' for potential vulnerabilities"},
{'status': 'todo', 'task': 'Investigate LibreNMS configuration files for potential vulnerabilities'},
{'status': 'todo', 'task': 'Use PHP backdoor to execute arbitrary system commands and gain further access to the system'},
{'status': 'todo', 'task': 'Crack root password hash using provided password cracking tools'},
{'status': 'todo', 'task': 'Investigate system logs for suspicious activity related to aeolus user or process'},
{'status': 'todo', 'task': 'Investigate network connections and listening ports on the system using established shell

and it overdominated the token usage. However, if we make the remove task tool too eager to remove tasks, we found that this then, at least in Westwild, makes the LLM delete important tasks. Currently, I'm thinking a strategy of just outputting a JSON rather than using these task tools may be better but that can be a discussion for a future paper.

Where do LLMs struggle the most in Pentesting?

One of the main motivations of this work was to figure out where do LLMs suffer the most in pentesting. Here is a success rate vs each major task type

Thus, enumeration seems to be an easy task while privilege escalation and exploitation seem difficult for the LLMs. However, we found this to be counterintuitive as when testing we thought the LLM struggles a lot in enumeration. When investigating further, we found that the success rate per task drops as testing goes on like so

We don't know the cause for this but our best guess is because the context becomes more and more full and requires more of and more reasoning. As we found before, exploitation and privilege escalation, are mainly present after 50% while Enumeration is mainly present before 50% in the initial tasks. To somewhat remove this discrepancy, we tried looking at the completion rates per task after the 50% mark per category

Now we found that llama finds enumeration to be the hardest while GPT 4o finds exploitation to be the hardest task.

Conclusion

We found LLMs struggle in all task categories but mainly in enumeration and exploitation Not a single box can be completed even with human assistance without failures

Future work

We plan to see if reinforcement learning allows LLMs to get better at pentesting in the above categories, especially focusing on enumeration and exploitation for potential leads to autopentesting.

Note

I think recently there were other cool works published like Cybench and Autopentest benchmark which I plan to add here once I go through them in more depth

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote