---
license: cc-by-nc-4.0
language:
  - en
tags:
  - cybersecurity
widget:
  - text: >-
      Native API functions such as <mask> may be directly invoked via system
      calls (syscalls). However, these features are also commonly exposed to
      user-mode applications through interfaces and libraries.
    example_title: Native API functions
  - text: >-
      One way to explicitly assign the PPID of a new process is through the
      <mask> API call, which includes a parameter for defining the PPID.
    example_title: Assigning the PPID of a new process
  - text: >-
      Enable Safe DLL Search Mode to ensure that system DLLs in more restricted
      directories (e.g., %<mask>%) are prioritized over DLLs in less secure
      locations such as a user’s home directory.
    example_title: Enable Safe DLL Search Mode
  - text: >-
      GuLoader is a file downloader that has been active since at least
      December 2019. It has been used to distribute a variety of <mask>,
      including NETWIRE, Agent Tesla, NanoCore, and FormBook.
    example_title: GuLoader is a file downloader
---

# SecureBERT+

**SecureBERT+** is an enhanced version of [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT), trained on a corpus **five times larger** than its predecessor using **8×A100 GPUs**. It delivers an **average 6% improvement** in Masked Language Modeling (MLM) performance over SecureBERT, a significant advancement in language understanding and representation within the cybersecurity domain.

---

## Dataset

SecureBERT+ was trained on a large-scale corpus of cybersecurity-related text, substantially expanding the coverage and depth of the original SecureBERT training data.

![dataset](https://cdn-uploads.huggingface.co/production/uploads/6340b0bd77fd972573eb2f9b/pO-v6961YI1D0IPcm0027.png)

---

## Using SecureBERT+

SecureBERT+ is available on the [Hugging Face Hub](https://huggingface.co/ehsanaghaei/SecureBERT_Plus).

### Load the Model

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

# Encode a sample sentence and extract token-level embeddings
inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
```

### Masked Language Modeling

Use the code below to predict masked words in text; mark each position to fill with the `<mask>` token:

```python
# pip install transformers torch tokenizers
import torch
from transformers import RobertaTokenizerFast, RobertaForMaskedLM

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Return the top-k candidate tokens for every <mask> in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Positions of all <mask> tokens in the encoded sequence
    masked_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()
    words = []

    with torch.no_grad():
        output = model(token_ids)

    for pos in masked_pos:
        logits = output.logits[0, pos]
        # Decode each of the top-k candidate token ids individually
        top_tokens = torch.topk(logits, k=topk).indices.tolist()
        predictions = [tokenizer.decode([i]).strip() for i in top_tokens]
        words.append(predictions)
        if print_results:
            print(f"Mask predictions: {predictions}")

    return words
```
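As a quick check, `predict_mask` can be applied to one of the widget sentences from the top of this card; the exact candidates returned will vary with the checkpoint and `topk`:

```python
# Recover the masked token in a cybersecurity sentence (predictions will vary).
sent = ("One way to explicitly assign the PPID of a new process is through the "
        "<mask> API call, which includes a parameter for defining the PPID.")
predictions = predict_mask(sent, tokenizer, model, topk=5)
```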
## Limitations & Risks

- **Domain-Specific Scope:** SecureBERT+ is optimized for cybersecurity text and may not generalize as well to unrelated domains.
- **Bias in Training Data:** The training corpus was collected from online sources and may contain biases, outdated knowledge, or inaccuracies.
- **Potential Misuse:** While designed for defensive research, the model could be misapplied to generate adversarial content or obfuscate malicious behavior.
- **Resource-Intensive:** The larger dataset and model training process require significant compute resources, which may limit reproducibility for smaller research teams.
- **Evolving Threats:** The cybersecurity landscape evolves rapidly. Without regular retraining, the model may not capture emerging threats or terminology.

Users should apply SecureBERT+ responsibly, with appropriate oversight from cybersecurity professionals.

## Reference

```
@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}
```