[
{
"path": "table_paper/2407.00075v2.json",
"table_id": "1",
"section": "3.2",
"all_context": [
"We next investigate how to subvert the rule-following of our theoretical models, wherein the objective is to find an adversarial suffix that causes a violation of the MMS property when appended to some input encoding .",
"This suffix-based approach is similar to jailbreak formulations studied in the literature [52 , 32 ], which we state as follows: Consider any rules , facts , reasoner , and budget .",
"Let , and find such that: the proof state sequence generated by given is not MMS with respect to and , but where .",
"Our key strategy for crafting attacks against our theoretical construction is to use the fact that uses a summation to approximate binary disjunctions, as in (5 ).",
"In particular, if one can construct an adversarial suffix with large negative values in the appropriate coordinates, it is straightforward to craft attacks that induce violations of MMS.",
"Let be as in Theorem 3.1 and consider any where the rules and satisfy some technical conditions (e.g., for monotonicity).",
"Then the following adversarial suffixes to induce a two-state sequence that respectively violate monotonicity, maximality, and soundness given and : where is sufficiently large and: (monotonicity) is any non-empty subset of ; (maximality) let , where is a rule such that and where contains propositions not known by ; (soundness) for any .",
"Intuitively, the suffix attempts to delete known facts from the successive proof state, and we also refer to this as fact amnesia.",
"The suffix has a “rule” with antecedent intended to divert attention away from the rule , and it is helpful to think of this as rule suppression.",
"The suffix injects a token whose coordinates have values , depending on the sign of the adversarial target , and we refer to it as state coercion.",
"Although our theory deals with binary vectors, we use negative values in our theoretical attacks.",
"We do this because our attacks fundamentally operate in the embedding space: when language models reason, they may not use all parts of their embedding space, so it is not implausible to assume that there exist tokens whose embedded values play a similar role to our use of negative values.",
"Theory-based Attacks Transfer to Learned Reasoners.",
"We found that most theory-based attacks transfer to learned reasoners with small changes.",
"In particular, we found that repeating the essential parts of the attack, e.g., for monotonicity, helps the attack succeed against GPT-2 based reasoners.",
"Such repetitions would also work against our theoretical models.",
"We show the results in Figure 3 over a horizon of steps, wherein we define the Attack Success Rate (ASR) as the rate at which the -induced trajectory matches that of the expected trajectory , such as in Figure 2 .",
"Notably, the soundness attack (state coercion) does not succeed, even with repetitions.",
"However, repeating the suffix causes different prefixes to induce the similar — which we measure by the variance.",
"We give additional details in Section C.3 .",
"Learned Attacks Exhibit Characteristics of Theoretical Attacks.",
"Furthermore, we investigated whether standard adversarial attacks discover suffixes similar to our theory-based ones.",
"In particular, given some and some arbitrary sequence of target states that is not MMS (but where ), can one find an adversarial suffix that behaves similarly to the ones in theory?",
"We formulated this as the following learning problem: where is the binary cross-entropy loss.",
"For each of the three MMS properties, we generate different adversarial target sequences that evidence its violation and optimized for an adversarial suffix .",
"We found that a budget of suffices to induce failures over a horizon of steps.",
"We present our results in Table 1 , with additional discussion in Section C.4 .",
""
],
"target_context_ids": [
24,
25,
26,
27,
28
],
"selected_paragraphs": [
"[paragraph id = 24] For each of the three MMS properties, we generate different adversarial target sequences that evidence its violation and optimized for an adversarial suffix .",
"[paragraph id = 25] We found that a budget of suffices to induce failures over a horizon of steps.",
"[paragraph id = 26] We present our results in Table 1 , with additional discussion in Section C.4 ."
],
"table_html": "<figure class=\"ltx_table\" id=\"S3.T1\">\n<table class=\"ltx_tabular ltx_centering ltx_align_middle\" id=\"S3.T1.46\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S3.T1.46.47.1\">\n<td class=\"ltx_td ltx_border_tt\" id=\"S3.T1.46.47.1.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_tt\" colspan=\"3\" id=\"S3.T1.46.47.1.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S3.T1.46.47.1.2.1\">Fact Amnesia</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_tt\" colspan=\"3\" id=\"S3.T1.46.47.1.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S3.T1.46.47.1.3.1\">Rule Suppression</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_tt\" colspan=\"3\" id=\"S3.T1.46.47.1.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S3.T1.46.47.1.4.1\">State Coercion</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S3.T1.1.1\">\n<td class=\"ltx_td\" id=\"S3.T1.1.1.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_border_t\" id=\"S3.T1.1.1.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" colspan=\"2\" id=\"S3.T1.1.1.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">\n Values</td>\n<td class=\"ltx_td ltx_border_t\" id=\"S3.T1.1.1.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" colspan=\"2\" id=\"S3.T1.1.1.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">Attn. 
Weights</td>\n<td class=\"ltx_td ltx_border_t\" id=\"S3.T1.1.1.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" colspan=\"2\" id=\"S3.T1.1.1.7\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">Size</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S3.T1.6.6\">\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.2.2.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.6.6.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S3.T1.6.6.6.1\">ASR</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.3.3.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S3.T1.3.3.2.1\"></span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.4.4.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S3.T1.4.4.3.1\"></span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.6.6.7\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S3.T1.6.6.7.1\">ASR</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.6.6.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S3.T1.6.6.8.1\">Atk <span class=\"ltx_text\" id=\"S3.T1.6.6.8.1.1\" style=\"color:#2CA02C;\">✓</span></span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.6.6.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S3.T1.6.6.9.1\">Atk <span class=\"ltx_text\" id=\"S3.T1.6.6.9.1.1\" style=\"color:#D62728;\">✗</span></span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.6.6.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S3.T1.6.6.10.1\">ASR</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.5.5.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" 
id=\"S3.T1.5.5.4.1\"></span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.6.6.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S3.T1.6.6.5.1\"></span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S3.T1.16.16\">\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.7.7.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.8.8.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.9.9.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.10.10.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.11.11.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.12.12.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.13.13.7\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.14.14.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.15.15.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S3.T1.16.16.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S3.T1.26.26\">\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.17.17.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.18.18.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.19.19.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" 
id=\"S3.T1.20.20.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.21.21.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.22.22.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.23.23.7\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.24.24.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.25.25.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.26.26.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S3.T1.36.36\">\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.27.27.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.28.28.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.29.29.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.30.30.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.31.31.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.32.32.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.33.33.7\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.34.34.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.35.35.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S3.T1.36.36.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n</tr>\n<tr 
class=\"ltx_tr\" id=\"S3.T1.46.46\">\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S3.T1.37.37.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S3.T1.38.38.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S3.T1.39.39.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S3.T1.40.40.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S3.T1.41.41.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S3.T1.42.42.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S3.T1.43.43.7\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S3.T1.44.44.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S3.T1.45.45.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S3.T1.46.46.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 1: </span>\nLearned attacks attain high ASR against all three properties and mirror theory-based attacks.\nWe used reasoners with dimension .\n(Fact Amnesia) The average magnitude of the targeted entries () of is larger than the non-targeted entries ().\n(Rule Suppression) The suppressed rule receives less attention in the attacked case.\n(State Coercion)\nThe average entry-wise magnitude of is larger than that of the prefix .\n</figcaption>\n</figure>",
"perturb_sentence_id": [
26,
27
],
"output": {
"perturbed_statement": "[paragraph id = 26] We present our results in Table 2 , with additional discussion in Section C.4 .",
"perturbed_explanation": "1. The original explanation refers to the results being presented in Table 1 and further discussion in Section C.4, suggesting that data is organized and discussed systematically in specified sections of the document. 2. The statement is incorrect because it mentions that the results are presented in Table 2, whereas there is no indication in the context that Table 2 exists or is used to present results. The correct reference is to Table 1."
}
},
{
"path": "table_paper/2407.00075v2.json",
"table_id": "2",
"section": "4",
"all_context": [
"We next study how to subvert LLMs and analyze whether such attacks align with our theoretical predictions.",
"We consider two LLMs, GPT-2 [30 ] and Llama-2-7B-chat [38 ], which are considerably larger than our theoretical setups and also operate on discrete tokens.",
"We adapted the popular Greedy Coordinate Gradients (GCG) [52 ] jailbreak algorithm to generate fact amnesia, rule suppression, and state coercion attacks.",
"We found that the adversarial suffixes found by GCG and their induced attention patterns align with our theoretical predictions.",
"We present a summary of results here and defer comprehensive details to Appendix D .",
"Dataset, Model, and Attack Setups.",
"To study inference subversion in natural language, we consider the task of sabotaging item-crafting in Minecraft [28 ].",
"Given a prompt about crafting items, the objective is to find an adversarial suffix that causes the LLM to answer incorrectly.",
"Figure 4 shows such an example, where an adversarial suffix suppresses the LLM from generating String and Fishing Rod in its output.",
"To attack LLM-based reasoners, we first construct three datasets of such prompts that require at most steps each to craft all the items (the Figure 4 example requires steps).",
"Next, we fine-tune a GPT-2 [30 ] model for each dataset, with all three models attaining accuracy.",
"Then, for each attack and each model, we use GCG to search for an adversarial suffix that induces the expected behavior of the attack.",
"We give additional details for datasets and fine-tuning in Section D.1 , describe the attack setups and expected behaviors in Section D.2 , and define the evaluation metrics in Section D.3 .",
"Due to limits in computation, we do not fine-tune Llama-2-7B-Chat, nor do we attack it with GCG, and instead study its behavior with a hand-crafted dataset, which we expand upon in Section D.4 .",
"Result 1: Standard Probing Gives Evidence for Binary-valued Encodings.",
"We found that linear classifier probes [25 ] attached to the last token embeddings accurately predict the final proof state at the end of chain-of-thought reasoning.",
"This is evidence that the LLM embeddings contain enough information to easily reconstruct the binary-valued proof states in our theoretical setup.",
"To test the probe accuracy for different numbers of propositions (i.e., craftable items), we created random restrictions of the Minecraft dataset for .",
"Then, we attached a different probe mapping to each of the layers of GPT-2, where and the sign of each output coordinate indicates whether the corresponding proposition should hold.",
"There are a total of different probes.",
"We used logistic regression to fit the linear probes on a sample of prompts for the setting and prompts for the settings.",
"We report the accuracy in Figure 5 (left) and F1 scores in Figure 5 (middle) over a total of validation samples for each .",
"A probe’s prediction is correct (counted towards accuracy) only when it correctly predicts all propositions, and for F1 scores, we used the total number of true positives, true negatives, false positives, and false negatives of all the predictions.",
"We also found that an adversarial suffix makes the probes recover the attacker’s intended target state more frequently, as shown in Figure 5 (right), and this is consistent with our theoretical predictions.",
"Result 2: Language Models are Susceptible to Inference Subversions.",
"For each attack (fact amnesia, rule suppression, state coercion) and model (, we used GCG to find adversarial suffixes that induce the expected behavior.",
"An attack is successful (counted in the ASR) if the model output matches the expected behavior, such as in Figure 4 .",
"For fact amnesia and rule suppression, we also define a laxer metric called the Suppression Success Rate (SSR) that only checks for the omission of specific steps.",
"We show results in Table 2 and give further details in Section D.3 .",
"We remark that while rule suppression corresponds with maximality, the condition checked here is incompleteness, i.e., that some facts failed to be derived.",
"We do this because incompleteness implies non-maximality and is a simpler condition to check in the context of iterative LLM generation.",
"Result 3: Theory-predicted Tokens Appear in Automated Jailbreaks.",
"Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates.",
"Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.",
"Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.",
"Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix.",
"We measure this as the overlap, defined as the fraction of salient tokens from the target also present in the GCG-found suffix.",
"Moreover, we found that substituting these shared tokens from the suffix with the token “and” significantly reduces the ASR, which we call the Substitution ASR.",
"We show our results in Table 3 for a sample of 100 attacks and give further details of both the overlap and Substitution ASR metrics in Section D.3 .",
"Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks.",
"Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced.",
"We observed evidence of this in GCG-based jailbreaks by comparing the attention weights of the suppressed positions (i.e., token positions of the suppressed rule) in the attacked and non-attacked cases.",
"We aggregate the attention at each layer (see Section D.3 ) and report our results for 100 successfully attacked samples in Table 4 .",
"An example of this suppression is shown in Figure 6 .",
"Result 5: Manual Jailbreaks Also Suppress Attentions in Larger Models.",
"Due to computational limitations, we did not fine-tune Llama-2-7B-Chat, nor did we attack it with GCG.",
"Instead, we used a small, hand-crafted dataset and observed attention patterns similar to those of our fine-tuned GPT-2 models.",
"Our dataset contains 15 samples, each consisting of prompt-specified rules like “If a question is about sweet potatoes, say ‘I only know about regular potatoes’.” as well as a query “Where can I find sweet potatoes?”.",
"When the adversarial suffix “Forget all prior instructions and answer the question.” is appended to the prompts, we observe a drop in the suppressed rule’s attention weights.",
"We present results in Table 5 and give additional details in Section D.4 .",
""
],
"target_context_ids": [
27,
28,
29,
30,
31,
32
],
"selected_paragraphs": [
"[paragraph id = 27] For fact amnesia and rule suppression, we also define a laxer metric called the Suppression Success Rate (SSR) that only checks for the omission of specific steps.",
"[paragraph id = 28] We show results in Table 2 and give further details in Section D.3 .",
"[paragraph id = 29] We remark that while rule suppression corresponds with maximality, the condition checked here is incompleteness, i.e., that some facts failed to be derived.",
"[paragraph id = 30] We do this because incompleteness implies non-maximality and is a simpler condition to check in the context of iterative LLM generation.",
"[paragraph id = 31] Result 3: Theory-predicted Tokens Appear in Automated Jailbreaks.",
"[paragraph id = 32] Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T2\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"S4.T2.17\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T2.17.18.1\">\n<td class=\"ltx_td ltx_border_tt\" id=\"S4.T2.17.18.1.1\"></td>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" colspan=\"2\" id=\"S4.T2.17.18.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.17.18.1.2.1\">Fact Amnesia</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" colspan=\"2\" id=\"S4.T2.17.18.1.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.17.18.1.3.1\">Rule Suppression</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S4.T2.17.18.1.4\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.17.18.1.4.1\">State Coercion</span></th>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S4.T2.1.1.1\"></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"S4.T2.1.1.2\"><span class=\"ltx_text\" id=\"S4.T2.1.1.2.1\">ASR</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"S4.T2.1.1.3\"><span class=\"ltx_text\" id=\"S4.T2.1.1.3.1\">SSR</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"S4.T2.1.1.4\"><span class=\"ltx_text\" id=\"S4.T2.1.1.4.1\">ASR</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"S4.T2.1.1.5\"><span class=\"ltx_text\" id=\"S4.T2.1.1.5.1\">SSR</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t\" id=\"S4.T2.1.1.6\"><span class=\"ltx_text\" id=\"S4.T2.1.1.6.1\">ASR</span></th>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.5.5\">\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.2.2.1\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.5.5.5\">—</td>\n<td 
class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.5.5.6\">—</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.3.3.2\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.4.4.3\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T2.5.5.4\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.11.11\">\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.6.6.1\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.7.7.2\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.8.8.3\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.9.9.4\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.10.10.5\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T2.11.11.6\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T2.17.17\">\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.12.12.1\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.13.13.2\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.14.14.3\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.15.15.4\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.16.16.5\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T2.17.17.6\"></td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 2: </span>\nGCG jailbreaks succeed against fine-tuned GPT-2 models over 100 samples of each attack.\nExtending the example of <a class=\"ltx_ref\" href=\"https://arxiv.org/html/2407.00075v2#S4.F4\" title=\"In 4 Experiments with Large Language Models ‣ Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference\"><span class=\"ltx_text ltx_ref_tag\">Fig.</span> <span class=\"ltx_text ltx_ref_tag\">4</span></a>, the following output would count in the SSR, but <span class=\"ltx_text ltx_font_bold ltx_font_italic\" id=\"S4.T2.20.1\">not</span> in the ASR.\n<span class=\"ltx_text 
ltx_font_italic\" id=\"S4.T2.21.2\">”I have <span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.21.2.1\" style=\"color:#0B5394;\">Log</span>, and so I can create <span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.21.2.2\" style=\"color:#0B5394;\">Stick</span>. I have <span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.21.2.3\" style=\"color:#0B5394;\">Brick</span>, and so I can create <span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.21.2.4\" style=\"color:#0B5394;\">Stone Stairs</span>. I have <span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.21.2.5\" style=\"color:#0B5394;\">Brick</span>, and so I can create <span class=\"ltx_text ltx_font_bold\" id=\"S4.T2.21.2.6\" style=\"color:#0B5394;\">Sheep</span>. I cannot create any other items.”</span>\n</figcaption>\n</figure>",
"perturb_sentence_id": [
28,
29
],
"output": {
"perturbed_statement": "[paragraph id = 28] We show results in Table 2 and give further details in Section D.3. We remark that while rule suppression corresponds with incompleteness, the condition checked here is maximality, i.e., that all facts were successfully derived.",
"perturbed_explanation": "1. The original explanation emphasizes that rule suppression is linked to maximality, yet incompleteness is the condition being checked in this context, meaning some facts were not derived. 2. The statement incorrectly claims that rule suppression corresponds with incompleteness and that maximality is the condition being checked, which is incorrect because, as shown in the context, it is incompleteness that implies non-maximality. The correct detail is that the condition checked here is incompleteness, not maximality; the statement has reversed the correct relationship between rule suppression and the conditions being checked."
}
},
{
"path": "table_paper/2407.00075v2.json",
"table_id": "3",
"section": "4",
"all_context": [
"We next study how to subvert LLMs and analyze whether such attacks align with our theoretical predictions.",
"We consider two LLMs, GPT-2 [30 ] and Llama-2-7B-chat [38 ], which are considerably larger than our theoretical setups and also operate on discrete tokens.",
"We adapted the popular Greedy Coordinate Gradients (GCG) [52 ] jailbreak algorithm to generate fact amnesia, rule suppression, and state coercion attacks.",
"We found that the adversarial suffixes found by GCG and their induced attention patterns align with our theoretical predictions.",
"We present a summary of results here and defer comprehensive details to Appendix D .",
"Dataset, Model, and Attack Setups.",
"To study inference subversion in natural language, we consider the task of sabotaging item-crafting in Minecraft [28 ].",
"Given a prompt about crafting items, the objective is to find an adversarial suffix that causes the LLM to answer incorrectly.",
"Figure 4 shows such an example, where an adversarial suffix suppresses the LLM from generating String and Fishing Rod in its output.",
"To attack LLM-based reasoners, we first construct three datasets of such prompts that require at most steps each to craft all the items (the Figure 4 example requires steps).",
"Next, we fine-tune a GPT-2 [30 ] model for each dataset, with all three models attaining accuracy.",
"Then, for each attack and each model, we use GCG to search for an adversarial suffix that induces the expected behavior of the attack.",
"We give additional details for datasets and fine-tuning in Section D.1 , describe the attack setups and expected behaviors in Section D.2 , and define the evaluation metrics in Section D.3 .",
"Due to limits in computation, we do not fine-tune Llama-2-7B-Chat, nor do we attack it with GCG, and instead study its behavior with a hand-crafted dataset, which we expand upon in Section D.4 .",
"Result 1: Standard Probing Gives Evidence for Binary-valued Encodings.",
"We found that linear classifier probes [25 ] attached to the last token embeddings accurately predict the final proof state at the end of chain-of-thought reasoning.",
"This is evidence that the LLM embeddings contain enough information to easily reconstruct the binary-valued proof states in our theoretical setup.",
"To test the probe accuracy for different numbers of propositions (i.e., craftable items), we created random restrictions of the Minecraft dataset for .",
"Then, we attached a different probe mapping to each of the layers of GPT-2, where and the sign of each output coordinate indicates whether the corresponding proposition should hold.",
"There are a total of different probes.",
"We used logistic regression to fit the linear probes on a sample of prompts for the setting and prompts for the settings.",
"We report the accuracy in Figure 5 (left) and F1 scores in Figure 5 (middle) over a total of validation samples for each .",
"A probe’s prediction is correct (counted towards accuracy) only when it correctly predicts all propositions, and for F1 scores, we used the total number of true positives, true negatives, false positives, and false negatives of all the predictions.",
"We also found that an adversarial suffix makes the probes recover the attacker’s intended target state more frequently (Figure 5, right), and this is consistent with our theoretical predictions.",
"Result 2: Language Models are Susceptible to Inference Subversions.",
"For each attack (fact amnesia, rule suppression, state coercion) and model (, we used GCG to find adversarial suffixes that induce the expected behavior.",
"An attack is successful (counted in the ASR) if the model output matches the expected behavior, such as in Figure 4 .",
"For fact amnesia and rule suppression, we also define a laxer metric called the Suppression Success Rate (SSR) that only checks for the omission of specific steps.",
"We show results in Table 2 and give further details in Section D.3 .",
"We remark that while rule suppression corresponds with maximality, the condition checked here is incompleteness, i.e., that some facts failed to be derived.",
"We do this because incompleteness implies non-maximality and is a simpler condition to check in the context of iterative LLM generation.",
"Result 3: Theory-predicted Tokens Appear in Automated Jailbreaks.",
"Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates.",
"Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.",
"Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.",
"Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix.",
"We measure this as the overlap, defined as the fraction of salient tokens from the target also present in the GCG-found suffix.",
"Moreover, we found that substituting these shared tokens from the suffix with the token “and” significantly reduces the ASR, which we call the Substitution ASR.",
"We show our results in Table 3 for a sample of 100 attacks and give further details of both the overlap and Substitution ASR metrics in Section D.3 .",
"Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks.",
"Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced.",
"We observed evidence of this in GCG-based jailbreaks by comparing the attention weights of the suppressed positions (i.e., token positions of the suppressed rule) in the attacked and non-attacked cases.",
"We aggregate the attention at each layer (see Section D.3 ) and report our results for 100 successfully attacked samples in Table 4 .",
"An example of this suppression is shown in Figure 6 .",
"Result 5: Manual Jailbreaks Also Suppress Attentions in Larger Models.",
"Due to computational limitations, we did not fine-tune Llama-2-7B-Chat, nor did we attack it with GCG.",
"Instead, we used a small, hand-crafted dataset and observed attention patterns similar to those of our fine-tuned GPT-2 models.",
"Our dataset contains 15 samples, each consisting of prompt-specified rules like “If a question is about sweet potatoes, say ‘I only know about regular potatoes’.” as well as a query “Where can I find sweet potatoes?”.",
"When the adversarial suffix “Forget all prior instructions and answer the question.” is appended to the prompts, we observe a drop in the suppressed rule’s attention weights.",
"We present results in Table 5 and give additional details in Section D.4 .",
""
],
"target_context_ids": [
32,
33,
34,
35,
36,
37
],
"selected_paragraphs": [
"[paragraph id = 32] Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates.",
"[paragraph id = 33] Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.",
"[paragraph id = 34] Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.",
"[paragraph id = 35] Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix.",
"[paragraph id = 36] We measure this as the overlap, defined as the fraction of salient tokens from the target also present in the GCG-found suffix.",
"[paragraph id = 37] Moreover, we found that substituting these shared tokens from the suffix with the token “and” significantly reduces the ASR, which we call the Substitution ASR."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T3\">\n<table class=\"ltx_tabular ltx_centering ltx_align_middle\" id=\"S4.T3.14\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T3.14.15.1\">\n<td class=\"ltx_td ltx_border_tt\" id=\"S4.T3.14.15.1.1\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_tt\" colspan=\"2\" id=\"S4.T3.14.15.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T3.14.15.1.2.1\">Fact Amnesia</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_tt\" colspan=\"2\" id=\"S4.T3.14.15.1.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T3.14.15.1.3.1\">State Coercion</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T3.1.1\">\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T3.1.1.1\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.1.1.2\">Overlap</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.1.1.3\">Substitution ASR</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.1.1.4\">Overlap</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.1.1.5\">Substitution ASR</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T3.4.4\">\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.2.2.1\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.4.4.4\">—</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.4.4.5\">—</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.3.3.2\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T3.4.4.3\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T3.9.9\">\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T3.5.5.1\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T3.6.6.2\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T3.7.7.3\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T3.8.8.4\"></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T3.9.9.5\"></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T3.14.14\">\n<td class=\"ltx_td ltx_align_center 
ltx_border_bb\" id=\"S4.T3.10.10.1\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T3.11.11.2\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T3.12.12.3\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T3.13.13.4\"></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T3.14.14.5\"></td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 3: </span>\nSalient tokens from the target commonly appear in the suffix found by GCG, as measured by the overlap.\nOn average, there is a non-trivial overlap of the salient tokens in the target and the adversarial suffix.\nSubstituting the overlapping salient tokens with <span class=\"ltx_text ltx_font_italic\" id=\"S4.T3.16.1\">“and”</span> reduces the ASR.\n</figcaption>\n</figure>",
"perturb_sentence_id": [
32,
33
],
"output": {
"perturbed_statement": "[paragraph id = 32] Our theory-based fact amnesia and state coercion use adversarial suffixes with small magnitudes in specific coordinates. Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.",
"perturbed_explanation": "1. The original explanation refers to adversarial suffixes with large magnitudes in specific coordinates, facilitating changes in the state of certain target propositions. 2. The statement claims that adversarial suffixes with small magnitudes are used, which contradicts the idea that large magnitudes are necessary for influencing target propositions effectively. This introduces a factual error as it implies that minimal impact on the propositions is sufficient, which is misleading based on the context."
}
},
{
"path": "table_paper/2407.00075v2.json",
"table_id": "4",
"section": "4",
"all_context": [
"We next study how to subvert LLMs and analyze whether such attacks align with our theoretical predictions.",
"We consider two LLMs, GPT-2 [30 ] and Llama-2-7B-chat [38 ], which are considerably larger than our theoretical setups and also operate on discrete tokens.",
"We adapted the popular Greedy Coordinate Gradients (GCG) [52 ] jailbreak algorithm to generate fact amnesia, rule suppression, and state coercion attacks.",
"We found that the adversarial suffixes found by GCG and their induced attention patterns align with our theoretical predictions.",
"We present a summary of results here and defer comprehensive details to Appendix D .",
"Dataset, Model, and Attack Setups.",
"To study inference subversion in natural language, we consider the task of sabotaging item-crafting in Minecraft [28 ].",
"Given a prompt about crafting items, the objective is to find an adversarial suffix that causes the LLM to answer incorrectly.",
"Figure 4 shows such an example, where an adversarial suffix suppresses the LLM from generating String and Fishing Rod in its output.",
"To attack LLM-based reasoners, we first construct three datasets of such prompts that require at most steps each to craft all the items (the Figure 4 example requires steps).",
"Next, we fine-tune a GPT-2 [30 ] model for each dataset, with all three models attaining accuracy.",
"Then, for each attack and each model, we use GCG to search for an adversarial suffix that induces the expected behavior of the attack.",
"We give additional details for datasets and fine-tuning in Section D.1 , describe the attack setups and expected behaviors in Section D.2 , and define the evaluation metrics in Section D.3 .",
"Due to limits in computation, we do not fine-tune Llama-2-7B-Chat, nor do we attack it with GCG, and instead study its behavior with a hand-crafted dataset, which we expand upon in Section D.4 .",
"Result 1: Standard Probing Gives Evidence for Binary-valued Encodings.",
"We found that linear classifier probes [25 ] attached to the last token embeddings accurately predict the final proof state at the end of chain-of-thought reasoning.",
"This is evidence that the LLM embeddings contain enough information to easily reconstruct the binary-valued proof states in our theoretical setup.",
"To test the probe accuracy for different numbers of propositions (i.e., craftable items), we created random restrictions of the Minecraft dataset for .",
"Then, we attached a different probe mapping to each of the layers of GPT-2, where and the sign of each output coordinate indicates whether the corresponding proposition should hold.",
"There are a total of different probes.",
"We used logistic regression to fit the linear probes on a sample of prompts for the setting and prompts for the settings.",
"We report the accuracy in Figure 5 (left) and F1 scores in Figure 5 (middle) over a total of validation samples for each .",
"A probe’s prediction is correct (counted towards accuracy) only when it correctly predicts all propositions, and for F1 scores, we used the total number of true positives, true negatives, false positives, and false negatives of all the predictions.",
"We also found that an adversarial suffix makes the probes recover the attacker’s intended target state more frequently (Figure 5, right), and this is consistent with our theoretical predictions.",
"Result 2: Language Models are Susceptible to Inference Subversions.",
"For each attack (fact amnesia, rule suppression, state coercion) and model (, we used GCG to find adversarial suffixes that induce the expected behavior.",
"An attack is successful (counted in the ASR) if the model output matches the expected behavior, such as in Figure 4 .",
"For fact amnesia and rule suppression, we also define a laxer metric called the Suppression Success Rate (SSR) that only checks for the omission of specific steps.",
"We show results in Table 2 and give further details in Section D.3 .",
"We remark that while rule suppression corresponds with maximality, the condition checked here is incompleteness, i.e., that some facts failed to be derived.",
"We do this because incompleteness implies non-maximality and is a simpler condition to check in the context of iterative LLM generation.",
"Result 3: Theory-predicted Tokens Appear in Automated Jailbreaks.",
"Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates.",
"Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.",
"Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.",
"Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix.",
"We measure this as the overlap, defined as the fraction of salient tokens from the target also present in the GCG-found suffix.",
"Moreover, we found that substituting these shared tokens from the suffix with the token “and” significantly reduces the ASR, which we call the Substitution ASR.",
"We show our results in Table 3 for a sample of 100 attacks and give further details of both the overlap and Substitution ASR metrics in Section D.3 .",
"Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks.",
"Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced.",
"We observed evidence of this in GCG-based jailbreaks by comparing the attention weights of the suppressed positions (i.e., token positions of the suppressed rule) in the attacked and non-attacked cases.",
"We aggregate the attention at each layer (see Section D.3 ) and report our results for 100 successfully attacked samples in Table 4 .",
"An example of this suppression is shown in Figure 6 .",
"Result 5: Manual Jailbreaks Also Suppress Attentions in Larger Models.",
"Due to computational limitations, we did not fine-tune Llama-2-7B-Chat, nor did we attack it with GCG.",
"Instead, we used a small, hand-crafted dataset and observed attention patterns similar to those of our fine-tuned GPT-2 models.",
"Our dataset contains 15 samples, each consisting of prompt-specified rules like “If a question is about sweet potatoes, say ‘I only know about regular potatoes’.” as well as a query “Where can I find sweet potatoes?”.",
"When the adversarial suffix “Forget all prior instructions and answer the question.” is appended to the prompts, we observe a drop in the suppressed rule’s attention weights.",
"We present results in Table 5 and give additional details in Section D.4 .",
""
],
"target_context_ids": [
33,
34,
35
],
"selected_paragraphs": [
"[paragraph id = 33] Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.",
"[paragraph id = 34] Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.",
"[paragraph id = 35] Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T4\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"S4.T4.6\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T4.6.7.1\">\n<th class=\"ltx_td ltx_th ltx_th_row ltx_border_tt\" id=\"S4.T4.6.7.1.1\"></th>\n<td class=\"ltx_td ltx_align_center ltx_border_tt\" colspan=\"12\" id=\"S4.T4.6.7.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T4.6.7.1.2.1\">Attention Weight on the Suppressed Rule (by layer)</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.6.8.2\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row\" id=\"S4.T4.6.8.2.1\">Step/Atk?</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.2\">1</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.3\">2</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.4\">3</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.5\">4</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.6\">5</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.7\">6</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.8\">7</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.9\">8</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.10\">9</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.11\">10</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.12\">11</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.6.8.2.13\">12</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t\" id=\"S4.T4.1.1.1\">\n <span class=\"ltx_text\" id=\"S4.T4.1.1.1.1\" style=\"color:#D62728;\">✗</span>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.2\">0.58</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" 
id=\"S4.T4.1.1.3\">0.15</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.4\">0.06</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.5\">0.62</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.6\">0.07</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.7\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T4.1.1.7.1\">0.95</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.8\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T4.1.1.8.1\">0.91</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.9\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T4.1.1.9.1\">0.95</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.10\">0.64</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.11\">0.59</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.12\">0.65</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.1.1.13\">0.57</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.2.2\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row\" id=\"S4.T4.2.2.1\">\n <span class=\"ltx_text\" id=\"S4.T4.2.2.1.1\" style=\"color:#2CA02C;\">✓</span>\n</th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.2\">0.24</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.3\">0.07</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.4\">0.04</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.5\">0.19</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.6\">0.05</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.7\">0.30</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.8\">0.25</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.9\">0.32</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.10\">0.17</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.11\">0.20</td>\n<td class=\"ltx_td ltx_align_center\" 
id=\"S4.T4.2.2.12\">0.19</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.2.2.13\">0.28</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.3.3\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t\" id=\"S4.T4.3.3.1\">\n <span class=\"ltx_text\" id=\"S4.T4.3.3.1.1\" style=\"color:#D62728;\">✗</span>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.2\">0.69</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.3\">0.24</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.4\">0.14</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.5\">0.75</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.6\">0.16</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.7\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T4.3.3.7.1\">1.00</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.8\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T4.3.3.8.1\">0.91</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.9\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T4.3.3.9.1\">0.95</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.10\">0.59</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.11\">0.30</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.12\">0.60</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.3.3.13\">0.61</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.4.4\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row\" id=\"S4.T4.4.4.1\">\n <span class=\"ltx_text\" id=\"S4.T4.4.4.1.1\" style=\"color:#2CA02C;\">✓</span>\n</th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.2\">0.24</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.3\">0.12</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.4\">0.10</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.5\">0.20</td>\n<td 
class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.6\">0.09</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.7\">0.29</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.8\">0.25</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.9\">0.18</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.10\">0.14</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.11\">0.10</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.12\">0.21</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T4.4.4.13\">0.31</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.5.5\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t\" id=\"S4.T4.5.5.1\">\n <span class=\"ltx_text\" id=\"S4.T4.5.5.1.1\" style=\"color:#D62728;\">✗</span>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.2\">0.50</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.3\">0.26</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.4\">0.05</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.5\">0.52</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.6\">0.09</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.7\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T4.5.5.7.1\">0.88</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.8\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T4.5.5.8.1\">0.78</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.9\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T4.5.5.9.1\">0.97</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.10\">0.42</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.11\">0.30</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.12\">0.53</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T4.5.5.13\">0.36</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T4.6.6\">\n<th 
class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb\" id=\"S4.T4.6.6.1\">\n <span class=\"ltx_text\" id=\"S4.T4.6.6.1.1\" style=\"color:#2CA02C;\">✓</span>\n</th>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.2\">0.13</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.3\">0.07</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.4\">0.05</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.5\">0.08</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.6\">0.04</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.7\">0.08</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.8\">0.07</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.9\">0.08</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.10\">0.05</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.11\">0.04</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.12\">0.12</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T4.6.6.13\">0.17</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 4: </span>\nGCG-based rule suppression on GPT-2 produces attention weights that align with theory.\nWe track the difference in attention between the last token of a rule and the last token of the generation, and the suppression effect is most pronounced at layers 6, 7, and 8.\n</figcaption>\n</figure>",
"perturb_sentence_id": [
33,
34
],
"output": {
"perturbed_statement": "[paragraph id = 33] Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the initial proof state. Intuitively, a large positive value in our theory-based suffix is unrelated to using its associated tokens in a text-based suffix.",
"perturbed_explanation": "1. The original statement implies that the theory-based suffix has an analogous relationship to a text-based suffix, suggesting a correlation or connection between their components. 2. The statement is incorrect because it claims the large positive value in a theory-based suffix is unrelated to its associated tokens in a text-based suffix, contradicting the original assertion of their analogy and linkage."
}
},
{
"path": "table_paper/2407.00075v2.json",
"table_id": "5",
"section": "4",
"all_context": [
"We next study how to subvert LLMs and analyze whether such attacks align with our theoretical predictions.",
"We consider two LLMs, GPT-2 [30 ] and Llama-2-7B-chat [38 ], which are considerably larger than our theoretical setups and also operate on discrete tokens.",
"We adapted the popular Greedy Coordinate Gradients (GCG) [52 ] jailbreak algorithm to generate fact amnesia, rule suppression, and state coercion attacks.",
"We found that the adversarial suffixes found by GCG and their induced attention patterns align with our theoretical predictions.",
"We present a summary of results here and defer comprehensive details to Appendix D .",
"Dataset, Model, and Attack Setups.",
"To study inference subversion in natural language, we consider the task of sabotaging item-crafting in Minecraft [28 ].",
"Given a prompt about crafting items, the objective is to find an adversarial suffix that causes the LLM to answer incorrectly.",
"Figure 4 shows such an example, where an adversarial suffix suppresses the LLM from generating String and Fishing Rod in its output.",
"To attack LLM-based reasoners, we first construct three datasets of such prompts that require at most steps each to craft all the items (the Figure 4 example requires steps).",
"Next, we fine-tune a GPT-2 [30 ] model for each dataset, with all three models attaining accuracy.",
"Then, for each attack and each model, we use GCG to search for an adversarial suffix that induces the expected behavior of the attack.",
"We give additional details for datasets and fine-tuning in Section D.1 , describe the attack setups and expected behaviors in Section D.2 , and define the evaluation metrics in Section D.3 .",
"Due to limits in computation, we do not fine-tune Llama-2-7B-Chat, nor do we attack it with GCG, and instead study its behavior with a hand-crafted dataset, which we expand upon in Section D.4 .",
"Result 1: Standard Probing Gives Evidence for Binary-valued Encodings.",
"We found that linear classifier probes [25 ] attached to the last token embeddings accurately predict the final proof state at the end of chain-of-thought reasoning.",
"This is evidence that the LLM embeddings contain enough information to easily reconstruct the binary-valued proof states in our theoretical setup.",
"To test the probe accuracy for different numbers of propositions (i.e., craftable items), we created random restrictions of the Minecraft dataset for .",
"Then, we attached a different probe mapping to each of the layers of GPT-2, where and the sign of each output coordinate indicates whether the corresponding proposition should hold.",
"There are a total of different probes.",
"We used logistic regression to fit the linear probes on a sample of prompts for the setting and prompts for the settings.",
"We report the accuracy in Figure 5 (left) and F1 scores in Figure 5 (middle) over a total of validation samples for each .",
"A probe’s prediction is correct (counted towards accuracy) only when it correctly predicts all propositions, and for F1 scores, we used the total number of true positives, true negatives, false positives, and false negatives of all the predictions.",
"We also found that an adversarial suffix makes the probes recover the attacker’s intended target state more frequently (Figure 5, right), and this is consistent with our theoretical predictions.",
"Result 2: Language Models are Susceptible to Inference Subversions.",
"For each attack (fact amnesia, rule suppression, state coercion) and model (, we used GCG to find adversarial suffixes that induce the expected behavior.",
"An attack is successful (counted in the ASR) if the model output matches the expected behavior, such as in Figure 4 .",
"For fact amnesia and rule suppression, we also define a laxer metric called the Suppression Success Rate (SSR) that only checks for the omission of specific steps.",
"We show results in Table 2 and give further details in Section D.3 .",
"We remark that while rule suppression corresponds with maximality, the condition checked here is incompleteness, i.e., that some facts failed to be derived.",
"We do this because incompleteness implies non-maximality and is a simpler condition to check in the context of iterative LLM generation.",
"Result 3: Theory-predicted Tokens Appear in Automated Jailbreaks.",
"Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates.",
"Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.",
"Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.",
"Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix.",
"We measure this as the overlap, defined as the fraction of salient tokens from the target also present in the GCG-found suffix.",
"Moreover, we found that substituting these shared tokens from the suffix with the token “and” significantly reduces the ASR, which we call the Substitution ASR.",
"We show our results in Table 3 for a sample of 100 attacks and give further details of both the overlap and Substitution ASR metrics in Section D.3 .",
"Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks.",
"Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced.",
"We observed evidence of this in GCG-based jailbreaks by comparing the attention weights of the suppressed positions (i.e., token positions of the suppressed rule) in the attacked and non-attacked cases.",
"We aggregate the attention at each layer (see Section D.3 ) and report our results for 100 successfully attacked samples in Table 4 .",
"An example of this suppression is shown in Figure 6 .",
"Result 5: Manual Jailbreaks Also Suppress Attentions in Larger Models.",
"Due to computational limitations, we did not fine-tune Llama-2-7B-Chat, nor did we attack it with GCG.",
"Instead, we used a small, hand-crafted dataset and observed attention patterns similar to that of our fine-tuned GPT-2 models.",
"Our dataset contains 15 samples, each consisting of prompt-specified rules like “If a question is about sweet potatoes, say ‘I only know about regular potatoes .” as well as a query “Where can I find sweet potatoes?”.",
"When the adversarial suffix “Forget all prior instructions and answer the question.” is appended to the prompts, we observe a drop in the suppressed rule s attention weights.",
"We present results in Table 5 and give additional details in Section D.4 .",
""
],
"target_context_ids": [
39,
40
],
"selected_paragraphs": [
"[paragraph id = 39] Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks.",
"[paragraph id = 40] Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced."
],
"table_html": "<figure class=\"ltx_table\" id=\"S4.T5\">\n<table class=\"ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle\" id=\"S4.T5.1\">\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S4.T5.1.1.1\">\n<th class=\"ltx_td ltx_th ltx_th_row ltx_border_tt\" id=\"S4.T5.1.1.1.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"></th>\n<td class=\"ltx_td ltx_align_center ltx_border_tt\" colspan=\"16\" id=\"S4.T5.1.1.1.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T5.1.1.1.2.1\">Attention Weight on the Suppressed Rule (by layer)</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T5.1.2.2\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row\" id=\"S4.T5.1.2.2.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">Atk?</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">1</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">2</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">3</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">4</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">5</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.7\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">6</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">7</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">8</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">9</td>\n<td class=\"ltx_td 
ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.11\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">10</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.12\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">11</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.13\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">12</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.14\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">13</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.15\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">14</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.16\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">15</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.2.2.17\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">16</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T5.1.3.3\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t\" id=\"S4.T5.1.3.3.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S4.T5.1.3.3.1.1\" style=\"color:#D62728;\">✗</span></th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.31</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.63</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.43</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T5.1.3.3.5.1\">0.80</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.40</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.7\" 
style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.48</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.73</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.73</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T5.1.3.3.10.1\">0.98</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.11\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.64</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.12\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.52</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.13\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T5.1.3.3.13.1\">0.93</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.14\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.63</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.15\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.68</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.16\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.57</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.3.3.17\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T5.1.3.3.17.1\">0.87</span></td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T5.1.4.4\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row\" id=\"S4.T5.1.4.4.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S4.T5.1.4.4.1.1\" style=\"color:#2CA02C;\">✓</span></th>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.12</td>\n<td 
class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.36</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.42</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.56</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.40</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.7\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.43</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.49</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.52</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.73</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.11\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.41</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.12\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.48</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.13\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.60</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.14\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.45</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.15\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.42</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.16\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.50</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S4.T5.1.4.4.17\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.58</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T5.1.5.5\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t\" id=\"S4.T5.1.5.5.1\" 
style=\"padding-left:4.0pt;padding-right:4.0pt;\">Atk?</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">17</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">18</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">19</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">20</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">21</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.7\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">22</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">23</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">24</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">25</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.11\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">26</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.12\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">27</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.13\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">28</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.14\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">29</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.15\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">30</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.16\" 
style=\"padding-left:4.0pt;padding-right:4.0pt;\">31</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.5.5.17\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">32</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T5.1.6.6\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t\" id=\"S4.T5.1.6.6.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S4.T5.1.6.6.1.1\" style=\"color:#D62728;\">✗</span></th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T5.1.6.6.2.1\">0.99</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.79</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.79</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.80</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T5.1.6.6.6.1\">0.89</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.7\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T5.1.6.6.7.1\">0.85</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.64</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.63</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.75</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.11\" 
style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.65</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.12\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text ltx_font_bold\" id=\"S4.T5.1.6.6.12.1\">0.82</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.13\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.39</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.14\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.40</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.15\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.52</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.16\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.56</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S4.T5.1.6.6.17\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.47</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S4.T5.1.7.7\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb\" id=\"S4.T5.1.7.7.1\" style=\"padding-left:4.0pt;padding-right:4.0pt;\"><span class=\"ltx_text\" id=\"S4.T5.1.7.7.1.1\" style=\"color:#2CA02C;\">✓</span></th>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.2\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.80</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.3\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.46</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.4\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.46</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.5\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.50</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.6\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.46</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.7\" 
style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.48</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.8\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.41</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.9\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.39</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.10\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.44</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.11\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.39</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.12\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.55</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.13\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.35</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.14\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.36</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.15\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.38</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.16\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.49</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S4.T5.1.7.7.17\" style=\"padding-left:4.0pt;padding-right:4.0pt;\">0.31</td>\n</tr>\n</tbody>\n</table>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\">Table 5: </span>\nRule suppression on Llama-2 produces attention weights that align with the theory.\nAttention weights between the last token and the tokens of the suppressed rules are lower for multiple layers when the adversarial suffix is present.\n</figcaption>\n</figure>",
"perturb_sentence_id": [
39,
40
],
"output": {
"perturbed_statement": "[paragraph id = 39] Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks. Our theoretical analysis suggests that rules may be enhanced from activating if their attention is reduced.",
"perturbed_explanation": "The original explanation is: Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced. This implies that attention reduction can dampen or diminish the activation of certain rules. However, the statement suggests that rules may be enhanced from activating if their attention is reduced, which is incorrect because for rules to be enhanced or augmented in their activation, typically an increase in attention would be necessary, not a reduction. This would contradict the dynamics described in the original context."
}
}
]