---
license: cc-by-nc-4.0
language:
  - en
tags:
  - cybersecurity
widget:
  - text: >-
      Native API functions such as <mask> may be directly invoked via system
      calls (syscalls). However, these features are also commonly exposed to
      user-mode applications through interfaces and libraries.
    example_title: Native API functions
  - text: >-
      One way to explicitly assign the PPID of a new process is through the
      <mask> API call, which includes a parameter for defining the PPID.
    example_title: Assigning the PPID of a new process
  - text: >-
      Enable Safe DLL Search Mode to ensure that system DLLs in more restricted
      directories (e.g., %<mask>%) are prioritized over DLLs in less secure
      locations such as a user’s home directory.
    example_title: Enable Safe DLL Search Mode
  - text: >-
      GuLoader is a file downloader that has been active since at least December
      2019. It has been used to distribute a variety of <mask>, including
      NETWIRE, Agent Tesla, NanoCore, and FormBook.
    example_title: GuLoader is a file downloader
---

# SecureBERT+

SecureBERT+ is an enhanced version of [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT), trained on a corpus five times larger than its predecessor's using 8×A100 GPUs.

This model delivers an average 6% improvement in Masked Language Modeling (MLM) performance compared to SecureBERT, representing a significant advancement in language understanding and representation within the cybersecurity domain.


## Dataset

SecureBERT+ was trained on a large-scale corpus of cybersecurity-related text, substantially expanding the coverage and depth of the original SecureBERT training data.


## Using SecureBERT+

SecureBERT+ is available on the Hugging Face Hub.

### Load the Model

```python
from transformers import RobertaTokenizer, RobertaModel
import torch

# Load the SecureBERT+ tokenizer and base encoder from the Hugging Face Hub.
tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

# Encode a sample sentence and run a forward pass (no gradients needed for inference).
inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings, shape (batch_size, sequence_length, hidden_size).
last_hidden_states = outputs.last_hidden_state
```
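The base model returns token-level embeddings. If you need a single fixed-size vector per input (e.g., for clustering or similarity search over threat reports), one common approach is attention-masked mean pooling over `last_hidden_state`. The `embed` helper below is a minimal sketch of our own, not an API shipped with the model:

```python
import torch

def embed(texts, tokenizer, model):
    """Mean-pool token embeddings into one vector per input text (hypothetical helper)."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)   # (batch, seq_len, 1)
    # Zero out padding positions, then average over the real tokens only.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["CVE-2021-44228 allows remote code execution."], tokenizer, model)
print(vectors.shape)  # torch.Size([1, 768]) for a RoBERTa-base-sized encoder
```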

### Masked Language Modeling Example

Use the code below to predict masked words in text:

```python
# !pip install transformers torch tokenizers

import torch
from transformers import RobertaTokenizerFast, RobertaForMaskedLM

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Predict the top-k candidate tokens for each <mask> in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Indices of all <mask> tokens in the encoded sequence.
    masked_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()
    words = []

    with torch.no_grad():
        output = model(token_ids)

    for pos in masked_pos:
        logits = output.logits[0, pos]                   # scores over the vocabulary
        top_tokens = torch.topk(logits, k=topk).indices
        predictions = [tokenizer.decode([i]).strip() for i in top_tokens.tolist()]
        words.append(predictions)
        if print_results:
            print(f"Mask Predictions: {predictions}")

    return words
```
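As a quick check, you can run the function on one of the widget prompts above (the exact predictions will vary):

```python
predict_mask(
    "One way to explicitly assign the PPID of a new process is through the "
    "<mask> API call, which includes a parameter for defining the PPID.",
    tokenizer,
    model,
    topk=5,
)
```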

## Limitations & Risks

- **Domain-specific scope:** SecureBERT+ is optimized for cybersecurity text and may not generalize as well to unrelated domains.
- **Bias in training data:** The training corpus was collected from online sources and may contain biases, outdated knowledge, or inaccuracies.
- **Potential misuse:** While designed for defensive research, the model could be misapplied to generate adversarial content or obfuscate malicious behavior.
- **Resource-intensive:** Training on the larger dataset requires significant compute resources, which may limit reproducibility for smaller research teams.
- **Evolving threats:** The cybersecurity landscape evolves rapidly; without regular retraining, the model may not capture emerging threats or terminology.

Users should apply SecureBERT+ responsibly, with appropriate oversight from cybersecurity professionals.

## Reference

```bibtex
@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}
```