victormiller committed on
Commit 58fdb6c
1 Parent(s): 933682d

Update README.md

  1. README.md +33 -0
README.md CHANGED
@@ -43,6 +43,39 @@ We utilized the following datasets:
  | **Overall Average Score**| | |
  | Avg Score | 58.88 | **61.30** |
 
+ # Safety
+
+ We developed a comprehensive safety prompt collection procedure that includes eight attack types
+ and over 120 specific safety value categories. Our risk taxonomy is adapted from Wang et al. (2023),
+ which originally defines six main types and 60 specific categories of harmful content. We have
+ expanded this taxonomy to encompass more region-specific types, sensitive topics, and
+ cybersecurity-related issues, ensuring a more nuanced and robust coverage of potential risks. This extended
+ taxonomy allows us to address a wider variety of harmful behaviors and content that may be culturally
+ or contextually specific, thus enhancing the model’s safety alignment across diverse scenarios.
+
+
+ | Category | K2-Chat-060124 | K2-Chat |
+ |------------------------------------|------------|-----------|
+ | DoNotAnswer | 67.94 | 87.65 |
+ | Advbench | 52.12 | 81.73 |
+ | I_cona | 67.98 | 79.21 |
+ | I_controversial | 47.50 | 70.00 |
+ | I_malicious_instructions | 60.00 | 83.00 |
+ | I_physical_safety_unsafe | 44.00 | 68.00 |
+ | I_physical_safety_safe | 96.00 | 97.00 |
+ | Harmbench | 20.50 | 63.50 |
+ | Spmisconception | 40.98 | 76.23 |
+ | MITRE | 3.20 | 57.30 |
+ | PromptInjection | 54.58 | 56.57 |
+ | Attack_multilingual_overload | 74.67 | 89.00 |
+ | Attack_persona_modulation | 51.67 | 85.67 |
+ | Attack_refusal_suppression | 56.00 | 93.00 |
+ | Attack_do_anything_now | 48.00 | 91.33 |
+ | Attack_conversation_completion | 56.33 | 71.00 |
+ | Attack_wrapped_in_shell | 34.00 | 67.00 |
+ | **Average** | **51.50** | **77.48** |
+
+
 
  # Function Calling
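
As a quick sanity check on the Safety table added in this diff, the **Average** row appears to be the unweighted mean of the 17 per-benchmark scores in each column. The sketch below recomputes it under that assumption; the README does not state how the average was derived, and the names `scores` and `column_mean` are illustrative only.

```python
# Scores copied from the Safety table: (K2-Chat-060124, K2-Chat).
scores = {
    "DoNotAnswer": (67.94, 87.65),
    "Advbench": (52.12, 81.73),
    "I_cona": (67.98, 79.21),
    "I_controversial": (47.50, 70.00),
    "I_malicious_instructions": (60.00, 83.00),
    "I_physical_safety_unsafe": (44.00, 68.00),
    "I_physical_safety_safe": (96.00, 97.00),
    "Harmbench": (20.50, 63.50),
    "Spmisconception": (40.98, 76.23),
    "MITRE": (3.20, 57.30),
    "PromptInjection": (54.58, 56.57),
    "Attack_multilingual_overload": (74.67, 89.00),
    "Attack_persona_modulation": (51.67, 85.67),
    "Attack_refusal_suppression": (56.00, 93.00),
    "Attack_do_anything_now": (48.00, 91.33),
    "Attack_conversation_completion": (56.33, 71.00),
    "Attack_wrapped_in_shell": (34.00, 67.00),
}


def column_mean(column: int) -> float:
    """Unweighted mean of one table column, rounded to two decimals."""
    values = [row[column] for row in scores.values()]
    return round(sum(values) / len(values), 2)


baseline_avg = column_mean(0)  # K2-Chat-060124
chat_avg = column_mean(1)      # K2-Chat
print(baseline_avg, chat_avg)  # 51.5 77.48
```

Both values agree with the reported 51.50 and 77.48, which is consistent with every benchmark being weighted equally regardless of how many prompts it contains.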