---
license: cc-by-nc-4.0
language:
- en
tags:
- cybersecurity
widget:
  - text: >-
      Native API functions such as <mask> may be directly invoked via system
      calls (syscalls). However, these features are also commonly exposed to
      user-mode applications through interfaces and libraries.
    example_title: Native API functions
  - text: >-
      One way to explicitly assign the PPID of a new process is through the
      <mask> API call, which includes a parameter for defining the PPID.
    example_title: Assigning the PPID of a new process
  - text: >-
      Enable Safe DLL Search Mode to ensure that system DLLs in more restricted
      directories (e.g., %<mask>%) are prioritized over DLLs in less secure
      locations such as a user’s home directory.
    example_title: Enable Safe DLL Search Mode
  - text: >-
      GuLoader is a file downloader that has been active since at least December
      2019. It has been used to distribute a variety of <mask>, including
      NETWIRE, Agent Tesla, NanoCore, and FormBook.
    example_title: GuLoader is a file downloader
---
# SecureBERT+
SecureBERT+ is an enhanced version of SecureBERT, trained on a corpus five times larger than its predecessor's and using 8×A100 GPUs. It improves Masked Language Modeling (MLM) performance by an average of 6% over SecureBERT, a notable advance in language understanding and representation within the cybersecurity domain.
## Dataset
SecureBERT+ was trained on a large-scale corpus of cybersecurity-related text, substantially expanding the coverage and depth of the original SecureBERT training data.
## Using SecureBERT+
SecureBERT+ is available on the Hugging Face Hub.
### Load the Model
```python
from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
outputs = model(**inputs)

# Token-level contextual embeddings, shape (batch, seq_len, hidden_size)
last_hidden_states = outputs.last_hidden_state
```
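The snippet above returns token-level hidden states. If you need a single fixed-size vector per input (e.g., for similarity search or clustering), one common convention is mean pooling over the non-padding tokens. The helper below is a minimal sketch of that approach, not an official SecureBERT+ API; the pooling strategy is an assumption on our part:

```python
# Hypothetical helper: mean-pool token embeddings into one sentence vector.
# Mean pooling is a common convention, not something SecureBERT+ prescribes.
def sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)   # (1, seq_len, 1)
    # Average only over real tokens, ignoring padding.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

emb = sentence_embedding("Enable Safe DLL Search Mode.")
print(emb.shape)  # (1, hidden_size); 768 for a RoBERTa-base-sized model
```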
### Masked Language Modeling Example
Use the code below to predict masked words in text:
```python
#!pip install transformers torch tokenizers

import torch
import transformers
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = transformers.RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Predict the top-k candidate words for each <mask> token in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Positions of all mask tokens in the (single) input sequence.
    masked_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()
    words = []
    with torch.no_grad():
        output = model(token_ids)
    for pos in masked_pos:
        logits = output.logits[0, pos]                   # scores over the vocabulary
        top_tokens = torch.topk(logits, k=topk).indices  # top-k token ids
        predictions = [tokenizer.decode(i).strip() for i in top_tokens]
        words.append(predictions)
        if print_results:
            print(f"Mask Predictions: {predictions}")
    return words
```
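For example, you can run the function on one of the widget prompts from this card:

```python
# Predict candidates for the masked API call in the PPID example above.
predict_mask(
    "One way to explicitly assign the PPID of a new process is through the "
    "<mask> API call, which includes a parameter for defining the PPID.",
    tokenizer, model, topk=5)
```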
## Limitations & Risks
- **Domain-Specific Scope:** SecureBERT+ is optimized for cybersecurity text and may not generalize as well to unrelated domains.
- **Bias in Training Data:** The training corpus was collected from online sources and may contain biases, outdated knowledge, or inaccuracies.
- **Potential Misuse:** While designed for defensive research, the model could be misapplied to generate adversarial content or obfuscate malicious behavior.
- **Resource-Intensive:** The larger dataset and model training process require significant compute resources, which may limit reproducibility for smaller research teams.
- **Evolving Threats:** The cybersecurity landscape evolves rapidly. Without regular retraining, the model may not capture emerging threats or terminology.
Users should apply SecureBERT+ responsibly, with appropriate oversight from cybersecurity professionals.
## Reference
```bibtex
@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}
```