Commit 58fdb6c by victormiller
1 Parent(s): 933682d
Update README.md
README.md CHANGED
@@ -43,6 +43,39 @@ We utilized the following datasets:
 | **Overall Average Score**| | |
 | Avg Score | 58.88 | **61.30** |
 
+# Safety
+
+We developed a comprehensive safety prompt collection procedure that includes eight attack types
+and over 120 specific safety value categories. Our risk taxonomy is adapted from Wang et al. (2023),
+which originally defines six main types and 60 specific categories of harmful content. We have
+expanded this taxonomy to encompass more region-specific types, sensitive topics, and
+cybersecurity-related issues, ensuring more nuanced and robust coverage of potential risks. This extended
+taxonomy allows us to address a wider variety of harmful behaviors and content that may be culturally
+or contextually specific, thus enhancing the model’s safety alignment across diverse scenarios.
+
+
+| Category | K2-Chat-060124 | K2-Chat |
+|------------------------------------|------------|-----------|
+| DoNotAnswer | 67.94 | 87.65 |
+| Advbench | 52.12 | 81.73 |
+| I_cona | 67.98 | 79.21 |
+| I_controversial | 47.50 | 70.00 |
+| I_malicious_instructions | 60.00 | 83.00 |
+| I_physical_safety_unsafe | 44.00 | 68.00 |
+| I_physical_safety_safe | 96.00 | 97.00 |
+| Harmbench | 20.50 | 63.50 |
+| Spmisconception | 40.98 | 76.23 |
+| MITRE | 3.20 | 57.30 |
+| PromptInjection | 54.58 | 56.57 |
+| Attack_multilingual_overload | 74.67 | 89.00 |
+| Attack_persona_modulation | 51.67 | 85.67 |
+| Attack_refusal_suppression | 56.00 | 93.00 |
+| Attack_do_anything_now | 48.00 | 91.33 |
+| Attack_conversation_completion | 56.33 | 71.00 |
+| Attack_wrapped_in_shell | 34.00 | 67.00 |
+| **Average** | **51.50** | **77.48** |
+
+
 
 # Function Calling
 
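The **Average** row of the Safety table added in this commit is consistent with an unweighted mean over the 17 listed rows. The README does not say how the average was computed, so treat that as an assumption. A minimal sketch that recomputes it from the scores as they appear in the table (the names `SAFETY_SCORES` and `column_means` are illustrative, not part of the repository):

```python
from statistics import mean

# Scores copied from the Safety table: (K2-Chat-060124, K2-Chat).
SAFETY_SCORES = {
    "DoNotAnswer": (67.94, 87.65),
    "Advbench": (52.12, 81.73),
    "I_cona": (67.98, 79.21),
    "I_controversial": (47.50, 70.00),
    "I_malicious_instructions": (60.00, 83.00),
    "I_physical_safety_unsafe": (44.00, 68.00),
    "I_physical_safety_safe": (96.00, 97.00),
    "Harmbench": (20.50, 63.50),
    "Spmisconception": (40.98, 76.23),
    "MITRE": (3.20, 57.30),
    "PromptInjection": (54.58, 56.57),
    "Attack_multilingual_overload": (74.67, 89.00),
    "Attack_persona_modulation": (51.67, 85.67),
    "Attack_refusal_suppression": (56.00, 93.00),
    "Attack_do_anything_now": (48.00, 91.33),
    "Attack_conversation_completion": (56.33, 71.00),
    "Attack_wrapped_in_shell": (34.00, 67.00),
}


def column_means(scores):
    """Return the unweighted mean of each column, rounded to 2 decimals.

    Assumption: every row carries equal weight, regardless of how many
    prompts each benchmark contains.
    """
    old_scores, new_scores = zip(*scores.values())
    return round(mean(old_scores), 2), round(mean(new_scores), 2)


if __name__ == "__main__":
    old_avg, new_avg = column_means(SAFETY_SCORES)
    print(f"K2-Chat-060124: {old_avg:.2f}, K2-Chat: {new_avg:.2f}")
    # prints "K2-Chat-060124: 51.50, K2-Chat: 77.48"
```

Both values match the reported averages, so the row does not appear to weight benchmarks by their size.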