ehsanaghaei committed · verified
Commit f8f3361 · 1 Parent(s): 4c48ccd

Update README.md

Files changed (1):
  1. README.md +57 -55
README.md CHANGED
@@ -5,30 +5,47 @@ language:
 tags:
 - cybersecurity
 widget:
- - text: "Native API functions such as <mask>, may be directed invoked via system calls/syscalls, but these features are also often exposed to user-mode applications via interfaces and libraries.."
   example_title: Native API functions
-
- - text: "One way of explicitly assigning the PPID of a new process is via the <mask> API call, which supports a parameter that defines the PPID to use."
   example_title: Assigning the PPID of a new process
-
- - text: "Enable Safe DLL Search Mode to force search for system DLLs in directories with greater restrictions (e.g. %<mask>%) to be used before local directory DLLs (e.g. a user's home directory)"
   example_title: Enable Safe DLL Search Mode
-
- - text: "GuLoader is a file downloader that has been used since at least December 2019 to distribute a variety of <mask>, including NETWIRE, Agent Tesla, NanoCore, and FormBook."
   example_title: GuLoader is a file downloader
 ---
- # SecureBERT+
- This model represents an improved version of the [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT) model, trained on a corpus eight times larger than its predecessor, leveraging the computational power of 8xA100 GPUs. This version, known as SecureBERT+, brings forth an average improvment of 9% in the performance of the Masked Language Model (MLM) task. This advancement signifies a substantial stride towards achieving heightened proficiency in language understanding and representation learning within the cybersecurity domain.
-
- SecureBERT is a domain-specific language model based on RoBERTa which is trained on a huge amount of cybersecurity data and fine-tuned/tweaked to understand/represent cybersecurity textual data.
-
- ## Dataset
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6340b0bd77fd972573eb2f9b/pO-v6961YI1D0IPcm0027.png)
-
- ## Load Model
- SecureBER+T has been uploaded to [Huggingface](https://huggingface.co/ehsanaghaei/SecureBERT_Plus) framework.
 ```python
 from transformers import RobertaTokenizer, RobertaModel
 import torch
@@ -42,76 +59,61 @@ outputs = model(**inputs)
 last_hidden_states = outputs.last_hidden_state
 ```

- ## Fill Mask (MLM)
- Use the code below to predict the masked word within the given sentences:

 ```python
- #!pip install transformers
- #!pip install torch
- #!pip install tokenizers

 import torch
 import transformers
- from transformers import RobertaTokenizer, RobertaTokenizerFast

 tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
 model = transformers.RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

- def predict_mask(sent, tokenizer, model, topk =10, print_results = True):
      token_ids = tokenizer.encode(sent, return_tensors='pt')
-     masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
-     masked_pos = [mask.item() for mask in masked_position]
      words = []
      with torch.no_grad():
          output = model(token_ids)

-     last_hidden_state = output[0].squeeze()
-
-     list_of_list = []
-     for index, mask_index in enumerate(masked_pos):
-         mask_hidden_state = last_hidden_state[mask_index]
-         idx = torch.topk(mask_hidden_state, k=topk, dim=0)[1]
-         words = [tokenizer.decode(i.item()).strip() for i in idx]
-         words = [w.replace(' ','') for w in words]
-         list_of_list.append(words)
          if print_results:
-             print("Mask ", "Predictions: ", words)
-
-     best_guess = ""
-     for j in list_of_list:
-         best_guess = best_guess + "," + j[0]

      return words

- while True:
-     sent = input("Text here: \t")
-     print("SecureBERT: ")
-     predict_mask(sent, tokenizer, model)
-
-     print("===========================\n")
- ```

- Other model variants:

- [SecureGPT](https://huggingface.co/ehsanaghaei/SecureGPT)

- [SecureDeBERTa](https://huggingface.co/ehsanaghaei/SecureDeBERTa)

- [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT)

 # Reference
 @inproceedings{aghaei2023securebert,
   title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
   author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
   booktitle={Security and Privacy in Communication Networks:
-   18th EAI International Conference, SecureComm 2022, Virtual Event,
-   October 2022,
-   Proceedings},
   pages={39--56},
   year={2023},
- organization={Springer} }
 
@@ -5,30 +5,47 @@ language:
 tags:
 - cybersecurity
 widget:
+ - text: >-
+     Native API functions such as <mask> may be directly invoked via system
+     calls (syscalls). However, these features are also commonly exposed to
+     user-mode applications through interfaces and libraries.
   example_title: Native API functions
+ - text: >-
+     One way to explicitly assign the PPID of a new process is through the
+     <mask> API call, which includes a parameter for defining the PPID.
   example_title: Assigning the PPID of a new process
+ - text: >-
+     Enable Safe DLL Search Mode to ensure that system DLLs in more restricted
+     directories (e.g., %<mask>%) are prioritized over DLLs in less secure
+     locations such as a user’s home directory.
   example_title: Enable Safe DLL Search Mode
+ - text: >-
+     GuLoader is a file downloader that has been active since at least December
+     2019. It has been used to distribute a variety of <mask>, including
+     NETWIRE, Agent Tesla, NanoCore, and FormBook.
   example_title: GuLoader is a file downloader
 ---
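
The widget entries above drive the Hub's hosted fill-mask demo. The same predictions can be reproduced locally with the `transformers` fill-mask pipeline; the sketch below is illustrative and simply reuses one of the widget sentences:

```python
from transformers import pipeline

# Load SecureBERT+ behind the standard fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="ehsanaghaei/SecureBERT_Plus")

# One of the widget sentences above; <mask> is RoBERTa's mask token.
sentence = (
    "GuLoader is a file downloader that has been active since at least "
    "December 2019. It has been used to distribute a variety of <mask>, "
    "including NETWIRE, Agent Tesla, NanoCore, and FormBook."
)

# Print the top 5 candidate tokens with their scores.
for pred in fill_mask(sentence, top_k=5):
    print(pred["token_str"].strip(), round(pred["score"], 3))
```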
 
 
+ # SecureBERT+
+
+ **SecureBERT+** is an enhanced version of [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT), trained on a corpus **eight times larger** than its predecessor and leveraging the computational power of **8×A100 GPUs**.
+
+ This model delivers an **average 9% improvement** in Masked Language Modeling (MLM) performance compared to SecureBERT, representing a significant advancement in language understanding and representation within the cybersecurity domain.
+
+ ---
+
+ ## Dataset
+
+ SecureBERT+ was trained on a large-scale corpus of cybersecurity-related text, substantially expanding the coverage and depth of the original SecureBERT training data.
+
+ ![dataset](https://cdn-uploads.huggingface.co/production/uploads/6340b0bd77fd972573eb2f9b/pO-v6961YI1D0IPcm0027.png)
+
+ ---
+
+ ## Using SecureBERT+
+
+ SecureBERT+ is available on the [Hugging Face Hub](https://huggingface.co/ehsanaghaei/SecureBERT_Plus).
+
+ ### Load the Model
 ```python
 from transformers import RobertaTokenizer, RobertaModel
 import torch
 
@@ -42,76 +59,61 @@ outputs = model(**inputs)
 last_hidden_states = outputs.last_hidden_state
 ```
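
The diff elides the middle of the snippet above; a complete, self-contained version of the feature-extraction example would look roughly like this (the input sentence is illustrative):

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

# Load SecureBERT+ as a plain RoBERTa encoder for feature extraction.
tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

text = "Adversaries may use brute force techniques to gain access to accounts."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings, shape (batch_size, sequence_length, hidden_size).
last_hidden_states = outputs.last_hidden_state
```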
 
+ ### Masked Language Modeling Example
+
+ Use the code below to predict masked words in text:
 ```python
+ #!pip install transformers torch tokenizers

 import torch
 import transformers
+ from transformers import RobertaTokenizerFast

 tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
 model = transformers.RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

+ def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
      token_ids = tokenizer.encode(sent, return_tensors='pt')
+     # Flatten the nonzero() result so each entry is a plain int index of a <mask> token.
+     masked_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()
      words = []
+
      with torch.no_grad():
          output = model(token_ids)

+     for pos in masked_pos:
+         logits = output.logits[0, pos]
+         top_tokens = torch.topk(logits, k=topk).indices
+         predictions = [tokenizer.decode(i).strip().replace(" ", "") for i in top_tokens]
+         words.append(predictions)
          if print_results:
+             print(f"Mask Predictions: {predictions}")

      return words
+ ```
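
A quick usage sketch for `predict_mask` (the sentence reuses one of the widget examples above; the actual predictions depend on the model):

```python
sent = (
    "One way to explicitly assign the PPID of a new process is through the "
    "<mask> API call, which includes a parameter for defining the PPID."
)
# Returns one list of top-k candidate tokens per <mask> in the input.
predictions = predict_mask(sent, tokenizer, model, topk=5)
```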
 
+ ## Limitations & Risks
+
+ - **Domain-Specific Scope:** SecureBERT+ is optimized for cybersecurity text and may not generalize as well to unrelated domains.
+ - **Bias in Training Data:** The training corpus was collected from online sources and may contain biases, outdated knowledge, or inaccuracies.
+ - **Potential Misuse:** While designed for defensive research, the model could be misapplied to generate adversarial content or obfuscate malicious behavior.
+ - **Resource-Intensive:** The larger dataset and model training process require significant compute resources, which may limit reproducibility for smaller research teams.
+ - **Evolving Threats:** The cybersecurity landscape evolves rapidly. Without regular retraining, the model may not capture emerging threats or terminology.
+
+ Users should apply SecureBERT+ responsibly, with appropriate oversight from cybersecurity professionals.

 # Reference
+ ```
 @inproceedings{aghaei2023securebert,
   title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
   author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
   booktitle={Security and Privacy in Communication Networks:
+              18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
   pages={39--56},
   year={2023},
+   organization={Springer}
+ }
+ ```