[ { "path": "table_paper/2407.00075v2.json", "table_id": "1", "section": "3.2", "all_context": [ "We next investigate how to subvert the rule-following of our theoretical models, wherein the objective is to find an adversarial suffix that causes a violation of the MMS property when appended to some input encoding .", "This suffix-based approach is similar to jailbreak formulations studied in the literature [52 , 32 ], which we state as follows: Consider any rules , facts , reasoner , and budget .", "Let , and find such that: the proof state sequence generated by given is not MMS with respect to and , but where .", "Our key strategy for crafting attacks against our theoretical construction is to use the fact that uses a summation to approximate binary disjunctions, as in (5 ).", "In particular, if one can construct an adversarial suffix with large negative values in the appropriate coordinates, it is straightforward to craft attacks that induce violations of MMS.", "Let be as in Theorem 3.1 and consider any where the rules and satisfy some technical conditions (e.g., for monotonicity).", "Then the following adversarial suffixes to induce a two-state sequence that respectively violate monotonicity, maximality, and soundness given and : where is sufficiently large and: (monotonicity) is any non-empty subset of ; (maximality) let , where is a rule such that and where contains propositions not known by ; (soundness) for any .", "Intuitively, the suffix attempts to delete known facts from the successive proof state, and we also refer to this as fact amnesia.", "The suffix has a “rule” with antecedent intended to divert attention away from the rule , and it is helpful to think of this as rule suppression.", "The suffix injects a token whose coordinates have values , depending on the sign of the adversarial target , and we refer to it as state coercion.", "Although our theory deals with binary vectors, we use negative values in our theoretical attacks.", "We do this 
because our attacks fundamentally operate in the embedding space: when language models reason, they may not use all parts of their embedding space, so it is not implausible to assume that there exist tokens whose embedded values play a similar role to our use of negative values.", "Theory-based Attacks Transfer to Learned Reasoners.", "We found that most theory-based attacks transfer to learned reasoners with small changes.", "In particular, we found that repeating the essential parts of the attack, e.g., for monotonicity, helps the attack succeed against GPT-2 based reasoners.", "Such repetitions would also work against our theoretical models.", "We show the results in Figure 3 over a horizon of steps, wherein we define the Attack Success Rate (ASR) as the rate at which the -induced trajectory matches that of the expected trajectory , such as in Figure 2 .", "Notably, the soundness attack (state coercion) does not succeed, even with repetitions.", "However, repeating the suffix causes different prefixes to induce the similar — which we measure by the variance.", "We give additional details in Section C.3 .", "Learned Attacks Exhibit Characteristics of Theoretical Attacks.", "Furthermore, we investigated whether standard adversarial attacks discover suffixes similar to our theory-based ones.", "In particular, given some and some arbitrary sequence of target states that is not MMS (but where ) — can one find an adversarial suffix that behaves similar to the ones in theory?", "We formulated this as the following learning problem: where is the binary cross-entropy loss.", "For each of the three MMS properties, we generate different adversarial target sequences that evidence its violation and optimized for an adversarial suffix .", "We found that a budget of suffices to induce failures over a horizon of steps.", "We present our results in Table 1 , with additional discussion in Section C.4 .", "" ], "target_context_ids": [ 24, 25, 26, 27, 28 ], "selected_paragraphs": [ 
"[paragraph id = 24] For each of the three MMS properties, we generate different adversarial target sequences that evidence its violation and optimized for an adversarial suffix .", "[paragraph id = 25] We found that a budget of suffices to induce failures over a horizon of steps.", "[paragraph id = 26] We present our results in Table 1 , with additional discussion in Section C.4 ." ], "table_html": "
Fact Amnesia | Rule Suppression | State Coercion
\n Values | Attn. Weights | Size
ASR | ASR | Atk | Atk | ASR
\n
Table 1: \nLearned attacks attain high ASR against all three properties and mirror theory-based attacks.\nWe used reasoners with dimension .\n(Fact Amnesia) The average magnitude of the targeted entries () of is larger than the non-targeted entries ().\n(Rule Suppression) The suppressed rule receives less attention in the attacked case.\n(State Coercion)\nThe average entry-wise magnitude of is larger than that of the prefix .\n
\n
", "perturb_sentence_id": 26, "output": { "perturbed_statement": "[paragraph id = 26] We present our results in Table 2 , with additional discussion in Section C.5 .", "perturbed_explanation": "1. The original explanation describes that the results are available in Table 1 and further discussed in Section C.4. 2. The statement incorrectly mentions Table 2 and Section C.5, which do not align with the referenced sections. This creates inconsistencies in information referencing." } }, { "path": "table_paper/2407.00075v2.json", "table_id": "2", "section": "4", "all_context": [ "We next study how to subvert LLMs and analyze whether such attacks align with our theoretical predictions.", "We consider two LLMs, GPT-2 [30 ] and Llama-2-7B-chat [38 ], which are considerably larger than our theoretical setups and also operate on discrete tokens.", "We adapted the popular Greedy Coordinate Gradients (GCG) [52 ] jailbreak algorithm to generate fact amnesia, rule suppression, and state coercion attacks.", "We found that the adversarial suffixes found by GCG and their induced attention patterns align with our theoretical predictions.", "We present a summary of results here and defer comprehensive details to Appendix D .", "Dataset, Model, and Attack Setups.", "To study inference subversion in natural language, we consider the task of sabotaging item-crafting in Minecraft [28 ].", "Given a prompt about crafting items, the objective is to find an adversarial suffix that causes the LLM to answer incorrectly.", "Figure 4 shows such an example, where an adversarial suffix suppresses the LLM from generating String and Fishing Rod in its output.", "To attack LLM-based reasoners, we first construct three datasets of such prompts that require at most steps each to craft all the items (the Figure 4 example requires steps).", "Next, we fine-tune a GPT-2 [30 ] model for each dataset, with all three models attaining accuracy.", "Then, for each attack and each model, we use GCG to search for 
an adversarial suffix that induces the expected behavior of the attack.", "We give additional details for datasets and fine-tuning in Section D.1 , describe the attack setups and expected behaviors in Section D.2 , and define the evaluation metrics in Section D.3 .", "Due to limits in computation, we do not fine-tune Llama-2-7B-Chat, nor do we attack it with GCG, and instead study its behavior with a hand-crafted dataset, which we expand upon in Section D.4 .", "Result 1: Standard Probing Gives Evidence for Binary-valued Encodings.", "We found that linear classifier probes [25 ] attached to the last token embeddings accurately predict the final proof state at the end of chain-of-thought reasoning.", "This is evidence that the LLM embeddings contain enough information to easily reconstruct the binary-valued proof states in our theoretical setup.", "To test the probe accuracy for different numbers of propositions (i.e., craftable items), we created random restrictions of the Minecraft dataset for .", "Then, we attached a different probe mapping to each of the layers of GPT-2, where and the sign of each output coordinate whether the corresponding proposition should hold.", "There are a total of different probes.", "We used logistic regression to fit the linear probes on a sample of prompts for the setting and prompts for the settings.", "We report the accuracy in Figure 5 (left) and F1 scores in Figure 5 (middle) over a total of validation samples for each .", "A probe s prediction is correct (counted towards accuracy) only when it correctly predicts all propositions, and for F1 scores, we used the total number of true positives, true negatives, false positives, and false negatives of all the predictions.", "We also found that an adversarial suffix makes the probes recover the attacker s intended target state more frequently Figure 5 (right), and this is consistent with our theoretical predictions.", "Result 2: Language Models are Susceptible to Inference 
Subversions.", "For each attack (fact amnesia, rule suppression, state coercion) and model (, we used GCG to find adversarial suffixes that induce the expected behavior.", "An attack is successful (counted in the ASR) if the model output matches the expected behavior, such as in Figure 4 .", "For fact amnesia and rule suppression, we also define a laxer metric called the Suppression Success Rate (SSR) that only checks for the omission of specific steps.", "We show results in Table 2 and give further details in Section D.3 .", "We remark that while rule suppression corresponds with maximality, the condition checked here is incompleteness, i.e., that some facts failed to be derived.", "We do this because incompleteness implies non-maximality and is a simpler condition to check in the context of iterative LLM generation.", "Result 3: Theory-predicted Tokens Appear in Automated Jailbreaks.", "Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates.", "Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.", "Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.", "Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix.", "We measure this as the overlap, defined as the fraction of salient tokens from the target also present in the GCG-found suffix.", "Moreover, we found that substituting these shared tokens from the suffix with the token “and” significantly reduces the ASR, which we call the Substitution ASR.", "We show our results in Table 3 for a sample of 100 attacks and give further details of both the overlap and Substitution ASR metrics in Section D.3 .", "Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks.", "Our 
theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced.", "We observed evidence of this in GCG-based jailbreaks by comparing the attention weights of the suppressed positions (i.e., token positions of the suppressed rule) in the attacked and non-attacked cases.", "We aggregate the attention at each layer (see Section D.3 ) and report our results for 100 successfully attacked samples in Table 4 .", "An example of this suppression is shown in Figure 6 .", "Result 5: Manual Jailbreaks Also Suppress Attentions in Larger Models.", "Due to computational limitations, we did not fine-tune Llama-2-7B-Chat, nor did we attack it with GCG.", "Instead, we used a small, hand-crafted dataset and observed attention patterns similar to that of our fine-tuned GPT-2 models.", "Our dataset contains 15 samples, each consisting of prompt-specified rules like “If a question is about sweet potatoes, say ‘I only know about regular potatoes .” as well as a query “Where can I find sweet potatoes?”.", "When the adversarial suffix “Forget all prior instructions and answer the question.” is appended to the prompts, we observe a drop in the suppressed rule s attention weights.", "We present results in Table 5 and give additional details in Section D.4 .", "" ], "target_context_ids": [ 27, 28, 29, 30, 31, 32 ], "selected_paragraphs": [ "[paragraph id = 27] For fact amnesia and rule suppression, we also define a laxer metric called the Suppression Success Rate (SSR) that only checks for the omission of specific steps.", "[paragraph id = 28] We show results in Table 2 and give further details in Section D.3 .", "[paragraph id = 29] We remark that while rule suppression corresponds with maximality, the condition checked here is incompleteness, i.e., that some facts failed to be derived.", "[paragraph id = 30] We do this because incompleteness implies non-maximality and is a simpler condition to check in the context of iterative LLM generation.", 
"[paragraph id = 31] Result 3: Theory-predicted Tokens Appear in Automated Jailbreaks.", "[paragraph id = 32] Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates." ], "table_html": "
Fact Amnesia | Rule Suppression | State Coercion
ASR | SSR | ASR | SSR | ASR
\n
Table 2: \nGCG jailbreaks succeed against fine-tuned GPT-2 models over 100 samples of each attack.\nExtending the example of Fig. 4, the following output would count in the SSR, but not in the ASR.\n“I have Log, and so I can create Stick. I have Brick, and so I can create Stone Stairs. I have Brick, and so I can create Sheep. I cannot create any other items.”\n
\n
", "perturb_sentence_id": 28, "output": { "perturbed_statement": "[paragraph id = 28] We show results in Table 2 and give further details in Section D.3 .We remark that while rule suppression corresponds with completeness, the condition checked here is maximality, i.e., that all facts were derived.", "perturbed_explanation": "1. The original explanation outlines that rule suppression is related to maximality but the evaluated condition pertains to incompleteness, indicating some facts might not be derived. 2. The statement alters the association by claiming rule suppression corresponds with completeness and maximality checks all fact derivations, contrary to the specified content, leading to a misrepresentation of the analysis provided in the text." } }, { "path": "table_paper/2407.00075v2.json", "table_id": "3", "section": "4", "all_context": [ "We next study how to subvert LLMs and analyze whether such attacks align with our theoretical predictions.", "We consider two LLMs, GPT-2 [30 ] and Llama-2-7B-chat [38 ], which are considerably larger than our theoretical setups and also operate on discrete tokens.", "We adapted the popular Greedy Coordinate Gradients (GCG) [52 ] jailbreak algorithm to generate fact amnesia, rule suppression, and state coercion attacks.", "We found that the adversarial suffixes found by GCG and their induced attention patterns align with our theoretical predictions.", "We present a summary of results here and defer comprehensive details to Appendix D .", "Dataset, Model, and Attack Setups.", "To study inference subversion in natural language, we consider the task of sabotaging item-crafting in Minecraft [28 ].", "Given a prompt about crafting items, the objective is to find an adversarial suffix that causes the LLM to answer incorrectly.", "Figure 4 shows such an example, where an adversarial suffix suppresses the LLM from generating String and Fishing Rod in its output.", "To attack LLM-based reasoners, we first construct three datasets 
of such prompts that require at most steps each to craft all the items (the Figure 4 example requires steps).", "Next, we fine-tune a GPT-2 [30 ] model for each dataset, with all three models attaining accuracy.", "Then, for each attack and each model, we use GCG to search for an adversarial suffix that induces the expected behavior of the attack.", "We give additional details for datasets and fine-tuning in Section D.1 , describe the attack setups and expected behaviors in Section D.2 , and define the evaluation metrics in Section D.3 .", "Due to limits in computation, we do not fine-tune Llama-2-7B-Chat, nor do we attack it with GCG, and instead study its behavior with a hand-crafted dataset, which we expand upon in Section D.4 .", "Result 1: Standard Probing Gives Evidence for Binary-valued Encodings.", "We found that linear classifier probes [25 ] attached to the last token embeddings accurately predict the final proof state at the end of chain-of-thought reasoning.", "This is evidence that the LLM embeddings contain enough information to easily reconstruct the binary-valued proof states in our theoretical setup.", "To test the probe accuracy for different numbers of propositions (i.e., craftable items), we created random restrictions of the Minecraft dataset for .", "Then, we attached a different probe mapping to each of the layers of GPT-2, where and the sign of each output coordinate whether the corresponding proposition should hold.", "There are a total of different probes.", "We used logistic regression to fit the linear probes on a sample of prompts for the setting and prompts for the settings.", "We report the accuracy in Figure 5 (left) and F1 scores in Figure 5 (middle) over a total of validation samples for each .", "A probe s prediction is correct (counted towards accuracy) only when it correctly predicts all propositions, and for F1 scores, we used the total number of true positives, true negatives, false positives, and false negatives of all the 
predictions.", "We also found that an adversarial suffix makes the probes recover the attacker s intended target state more frequently Figure 5 (right), and this is consistent with our theoretical predictions.", "Result 2: Language Models are Susceptible to Inference Subversions.", "For each attack (fact amnesia, rule suppression, state coercion) and model (, we used GCG to find adversarial suffixes that induce the expected behavior.", "An attack is successful (counted in the ASR) if the model output matches the expected behavior, such as in Figure 4 .", "For fact amnesia and rule suppression, we also define a laxer metric called the Suppression Success Rate (SSR) that only checks for the omission of specific steps.", "We show results in Table 2 and give further details in Section D.3 .", "We remark that while rule suppression corresponds with maximality, the condition checked here is incompleteness, i.e., that some facts failed to be derived.", "We do this because incompleteness implies non-maximality and is a simpler condition to check in the context of iterative LLM generation.", "Result 3: Theory-predicted Tokens Appear in Automated Jailbreaks.", "Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates.", "Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.", "Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.", "Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix.", "We measure this as the overlap, defined as the fraction of salient tokens from the target also present in the GCG-found suffix.", "Moreover, we found that substituting these shared tokens from the suffix with the token “and” significantly reduces the ASR, which we 
call the Substitution ASR.", "We show our results in Table 3 for a sample of 100 attacks and give further details of both the overlap and Substitution ASR metrics in Section D.3 .", "Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks.", "Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced.", "We observed evidence of this in GCG-based jailbreaks by comparing the attention weights of the suppressed positions (i.e., token positions of the suppressed rule) in the attacked and non-attacked cases.", "We aggregate the attention at each layer (see Section D.3 ) and report our results for 100 successfully attacked samples in Table 4 .", "An example of this suppression is shown in Figure 6 .", "Result 5: Manual Jailbreaks Also Suppress Attentions in Larger Models.", "Due to computational limitations, we did not fine-tune Llama-2-7B-Chat, nor did we attack it with GCG.", "Instead, we used a small, hand-crafted dataset and observed attention patterns similar to that of our fine-tuned GPT-2 models.", "Our dataset contains 15 samples, each consisting of prompt-specified rules like “If a question is about sweet potatoes, say ‘I only know about regular potatoes .” as well as a query “Where can I find sweet potatoes?”.", "When the adversarial suffix “Forget all prior instructions and answer the question.” is appended to the prompts, we observe a drop in the suppressed rule s attention weights.", "We present results in Table 5 and give additional details in Section D.4 .", "" ], "target_context_ids": [ 32, 33, 34, 35, 36, 37 ], "selected_paragraphs": [ "[paragraph id = 32] Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates.", "[paragraph id = 33] Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.", "[paragraph id = 34] Intuitively, a 
large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.", "[paragraph id = 35] Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix.", "[paragraph id = 36] We measure this as the overlap, defined as the fraction of salient tokens from the target also present in the GCG-found suffix.", "[paragraph id = 37] Moreover, we found that substituting these shared tokens from the suffix with the token “and” significantly reduces the ASR, which we call the Substitution ASR." ], "table_html": "
Fact Amnesia | State Coercion
Overlap | Substitution ASR | Overlap | Substitution ASR
\n
Table 3: \nSalient tokens from the target commonly appear in the suffix found by GCG, as measured by the overlap.\nOn average, there is a non-trivial overlap of the salient tokens in the target and the adversarial suffix.\nSubstituting the overlapping salient tokens with “and” reduces the ASR.\n
\n
", "perturb_sentence_id": 32, "output": { "perturbed_statement": "[paragraph id = 32] Our theory-based fact amnesia and state coercion use adversarial suffixes with small magnitudes in specific coordinates. Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.", "perturbed_explanation": "Original Explanation: The focus on adversarial suffixes with large magnitudes facilitates their effectiveness in manipulating the proof state. 1. The statement claims that adversarial suffixes exhibit 'small magnitudes,' which contradicts the original emphasis on 'large magnitudes' required for achieving the significant impact outlined in the theory. 2. Utilizing adversarial suffixes with insufficient magnitude might fail to exert the intended influence on coordinate adjustments, undermining the goals of fact amnesia and state coercion." } }, { "path": "table_paper/2407.00075v2.json", "table_id": "4", "section": "4", "all_context": [ "We next study how to subvert LLMs and analyze whether such attacks align with our theoretical predictions.", "We consider two LLMs, GPT-2 [30 ] and Llama-2-7B-chat [38 ], which are considerably larger than our theoretical setups and also operate on discrete tokens.", "We adapted the popular Greedy Coordinate Gradients (GCG) [52 ] jailbreak algorithm to generate fact amnesia, rule suppression, and state coercion attacks.", "We found that the adversarial suffixes found by GCG and their induced attention patterns align with our theoretical predictions.", "We present a summary of results here and defer comprehensive details to Appendix D .", "Dataset, Model, and Attack Setups.", "To study inference subversion in natural language, we consider the task of sabotaging item-crafting in Minecraft [28 ].", "Given a prompt about crafting items, the objective is to find an adversarial suffix that causes the LLM to answer incorrectly.", "Figure 4 shows such an 
example, where an adversarial suffix suppresses the LLM from generating String and Fishing Rod in its output.", "To attack LLM-based reasoners, we first construct three datasets of such prompts that require at most steps each to craft all the items (the Figure 4 example requires steps).", "Next, we fine-tune a GPT-2 [30 ] model for each dataset, with all three models attaining accuracy.", "Then, for each attack and each model, we use GCG to search for an adversarial suffix that induces the expected behavior of the attack.", "We give additional details for datasets and fine-tuning in Section D.1 , describe the attack setups and expected behaviors in Section D.2 , and define the evaluation metrics in Section D.3 .", "Due to limits in computation, we do not fine-tune Llama-2-7B-Chat, nor do we attack it with GCG, and instead study its behavior with a hand-crafted dataset, which we expand upon in Section D.4 .", "Result 1: Standard Probing Gives Evidence for Binary-valued Encodings.", "We found that linear classifier probes [25 ] attached to the last token embeddings accurately predict the final proof state at the end of chain-of-thought reasoning.", "This is evidence that the LLM embeddings contain enough information to easily reconstruct the binary-valued proof states in our theoretical setup.", "To test the probe accuracy for different numbers of propositions (i.e., craftable items), we created random restrictions of the Minecraft dataset for .", "Then, we attached a different probe mapping to each of the layers of GPT-2, where and the sign of each output coordinate whether the corresponding proposition should hold.", "There are a total of different probes.", "We used logistic regression to fit the linear probes on a sample of prompts for the setting and prompts for the settings.", "We report the accuracy in Figure 5 (left) and F1 scores in Figure 5 (middle) over a total of validation samples for each .", "A probe s prediction is correct (counted towards accuracy) 
only when it correctly predicts all propositions, and for F1 scores, we used the total number of true positives, true negatives, false positives, and false negatives of all the predictions.", "We also found that an adversarial suffix makes the probes recover the attacker s intended target state more frequently Figure 5 (right), and this is consistent with our theoretical predictions.", "Result 2: Language Models are Susceptible to Inference Subversions.", "For each attack (fact amnesia, rule suppression, state coercion) and model (, we used GCG to find adversarial suffixes that induce the expected behavior.", "An attack is successful (counted in the ASR) if the model output matches the expected behavior, such as in Figure 4 .", "For fact amnesia and rule suppression, we also define a laxer metric called the Suppression Success Rate (SSR) that only checks for the omission of specific steps.", "We show results in Table 2 and give further details in Section D.3 .", "We remark that while rule suppression corresponds with maximality, the condition checked here is incompleteness, i.e., that some facts failed to be derived.", "We do this because incompleteness implies non-maximality and is a simpler condition to check in the context of iterative LLM generation.", "Result 3: Theory-predicted Tokens Appear in Automated Jailbreaks.", "Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates.", "Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.", "Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.", "Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix.", "We measure this as the overlap, defined as the fraction of salient tokens from the target 
also present in the GCG-found suffix.", "Moreover, we found that substituting these shared tokens from the suffix with the token “and” significantly reduces the ASR, which we call the Substitution ASR.", "We show our results in Table 3 for a sample of 100 attacks and give further details of both the overlap and Substitution ASR metrics in Section D.3 .", "Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks.", "Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced.", "We observed evidence of this in GCG-based jailbreaks by comparing the attention weights of the suppressed positions (i.e., token positions of the suppressed rule) in the attacked and non-attacked cases.", "We aggregate the attention at each layer (see Section D.3 ) and report our results for 100 successfully attacked samples in Table 4 .", "An example of this suppression is shown in Figure 6 .", "Result 5: Manual Jailbreaks Also Suppress Attentions in Larger Models.", "Due to computational limitations, we did not fine-tune Llama-2-7B-Chat, nor did we attack it with GCG.", "Instead, we used a small, hand-crafted dataset and observed attention patterns similar to that of our fine-tuned GPT-2 models.", "Our dataset contains 15 samples, each consisting of prompt-specified rules like “If a question is about sweet potatoes, say ‘I only know about regular potatoes .” as well as a query “Where can I find sweet potatoes?”.", "When the adversarial suffix “Forget all prior instructions and answer the question.” is appended to the prompts, we observe a drop in the suppressed rule s attention weights.", "We present results in Table 5 and give additional details in Section D.4 .", "" ], "target_context_ids": [ 33, 34, 35 ], "selected_paragraphs": [ "[paragraph id = 33] Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.", "[paragraph id = 
34] Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.", "[paragraph id = 35] Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix." ], "table_html": "
Attention Weight on the Suppressed Rule (by layer)
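Step/Atk? | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
\n \n0.58 | 0.15 | 0.06 | 0.62 | 0.07 | 0.95 | 0.91 | 0.95 | 0.64 | 0.59 | 0.65 | 0.57
\n \n0.24 | 0.07 | 0.04 | 0.19 | 0.05 | 0.30 | 0.25 | 0.32 | 0.17 | 0.20 | 0.19 | 0.28
\n \n0.69 | 0.24 | 0.14 | 0.75 | 0.16 | 1.00 | 0.91 | 0.95 | 0.59 | 0.30 | 0.60 | 0.61
\n \n0.24 | 0.12 | 0.10 | 0.20 | 0.09 | 0.29 | 0.25 | 0.18 | 0.14 | 0.10 | 0.21 | 0.31
\n \n0.50 | 0.26 | 0.05 | 0.52 | 0.09 | 0.88 | 0.78 | 0.97 | 0.42 | 0.30 | 0.53 | 0.36
\n \n0.13 | 0.07 | 0.05 | 0.08 | 0.04 | 0.08 | 0.07 | 0.08 | 0.05 | 0.04 | 0.12 | 0.17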
\n
Table 4: \nGCG-based rule suppression on GPT-2 produces attention weights that align with theory.\nWe track the difference in attention between the last token of a rule and the last token of the generation, and the suppression effect is most pronounced at layers 6, 7, and 8.\n
\n
", "perturb_sentence_id": 33, "output": { "perturbed_statement": "[paragraph id = 33] Such a choice of coordinates decreases the values of all target propositions ensuring their absence in the successive proof state, whereas a negative value in our theory-based suffix is linked to avoiding the use of its associated tokens in a text-based suffix.", "perturbed_explanation": "1. The original explanation relates to how coordinates influence the representation of propositions within the proof state. 2. The statement is incorrect because it inaccurately asserts that all target propositions are decreased and omits the nuances regarding the positive or negative associations, diverging from the intended context where proposition values can vary based on specific choices." } }, { "path": "table_paper/2407.00075v2.json", "table_id": "5", "section": "4", "all_context": [ "We next study how to subvert LLMs and analyze whether such attacks align with our theoretical predictions.", "We consider two LLMs, GPT-2 [30 ] and Llama-2-7B-chat [38 ], which are considerably larger than our theoretical setups and also operate on discrete tokens.", "We adapted the popular Greedy Coordinate Gradients (GCG) [52 ] jailbreak algorithm to generate fact amnesia, rule suppression, and state coercion attacks.", "We found that the adversarial suffixes found by GCG and their induced attention patterns align with our theoretical predictions.", "We present a summary of results here and defer comprehensive details to Appendix D .", "Dataset, Model, and Attack Setups.", "To study inference subversion in natural language, we consider the task of sabotaging item-crafting in Minecraft [28 ].", "Given a prompt about crafting items, the objective is to find an adversarial suffix that causes the LLM to answer incorrectly.", "Figure 4 shows such an example, where an adversarial suffix suppresses the LLM from generating String and Fishing Rod in its output.", "To attack LLM-based reasoners, we first construct 
three datasets of such prompts that require at most steps each to craft all the items (the Figure 4 example requires steps).", "Next, we fine-tune a GPT-2 [30] model for each dataset, with all three models attaining accuracy.", "Then, for each attack and each model, we use GCG to search for an adversarial suffix that induces the expected behavior of the attack.", "We give additional details for datasets and fine-tuning in Section D.1, describe the attack setups and expected behaviors in Section D.2, and define the evaluation metrics in Section D.3.", "Due to limits in computation, we do not fine-tune Llama-2-7B-Chat, nor do we attack it with GCG, and instead study its behavior with a hand-crafted dataset, which we expand upon in Section D.4.", "Result 1: Standard Probing Gives Evidence for Binary-valued Encodings.", "We found that linear classifier probes [25] attached to the last token embeddings accurately predict the final proof state at the end of chain-of-thought reasoning.", "This is evidence that the LLM embeddings contain enough information to easily reconstruct the binary-valued proof states in our theoretical setup.", "To test the probe accuracy for different numbers of propositions (i.e., craftable items), we created random restrictions of the Minecraft dataset for .", "Then, we attached a different probe mapping to each of the layers of GPT-2, where and the sign of each output coordinate indicates whether the corresponding proposition should hold.", "There are a total of different probes.", "We used logistic regression to fit the linear probes on a sample of prompts for the setting and prompts for the settings.", "We report the accuracy in Figure 5 (left) and F1 scores in Figure 5 (middle) over a total of validation samples for each .", "A probe’s prediction is correct (counted towards accuracy) only when it correctly predicts all propositions, and for F1 scores, we used the total number of true positives, true negatives, false positives, and false 
negatives of all the predictions.", "We also found that an adversarial suffix makes the probes recover the attacker’s intended target state more frequently (Figure 5, right), and this is consistent with our theoretical predictions.", "Result 2: Language Models are Susceptible to Inference Subversions.", "For each attack (fact amnesia, rule suppression, state coercion) and model (, we used GCG to find adversarial suffixes that induce the expected behavior.", "An attack is successful (counted in the ASR) if the model output matches the expected behavior, such as in Figure 4.", "For fact amnesia and rule suppression, we also define a laxer metric called the Suppression Success Rate (SSR) that only checks for the omission of specific steps.", "We show results in Table 2 and give further details in Section D.3.", "We remark that while rule suppression corresponds with maximality, the condition checked here is incompleteness, i.e., that some facts failed to be derived.", "We do this because incompleteness implies non-maximality and is a simpler condition to check in the context of iterative LLM generation.", "Result 3: Theory-predicted Tokens Appear in Automated Jailbreaks.", "Our theory-based fact amnesia and state coercion use adversarial suffixes with large magnitudes in specific coordinates.", "Such a choice of coordinates increases or decreases the values of some target proposition that is to be present or absent in the successive proof state.", "Intuitively, a large positive value in our theory-based suffix is analogous to using its associated tokens in a text-based suffix.", "Interestingly, we observed this phenomenon for GCG-generated jailbreaks: the targeted propositions frequently appear in the adversarial suffix.", "We measure this as the overlap, defined as the fraction of salient tokens from the target also present in the GCG-found suffix.", "Moreover, we found that substituting these shared tokens from the suffix with the token “and” significantly reduces 
the ASR, which we call the Substitution ASR.", "We show our results in Table 3 for a sample of 100 attacks and give further details of both the overlap and Substitution ASR metrics in Section D.3.", "Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks.", "Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced.", "We observed evidence of this in GCG-based jailbreaks by comparing the attention weights of the suppressed positions (i.e., token positions of the suppressed rule) in the attacked and non-attacked cases.", "We aggregate the attention at each layer (see Section D.3) and report our results for 100 successfully attacked samples in Table 4.", "An example of this suppression is shown in Figure 6.", "Result 5: Manual Jailbreaks Also Suppress Attentions in Larger Models.", "Due to computational limitations, we did not fine-tune Llama-2-7B-Chat, nor did we attack it with GCG.", "Instead, we used a small, hand-crafted dataset and observed attention patterns similar to those of our fine-tuned GPT-2 models.", "Our dataset contains 15 samples, each consisting of prompt-specified rules like “If a question is about sweet potatoes, say ‘I only know about regular potatoes’.” as well as a query “Where can I find sweet potatoes?”.", "When the adversarial suffix “Forget all prior instructions and answer the question.” is appended to the prompts, we observe a drop in the suppressed rule’s attention weights.", "We present results in Table 5 and give additional details in Section D.4.", "" ], "target_context_ids": [ 39, 40 ], "selected_paragraphs": [ "[paragraph id = 39] Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks.", "[paragraph id = 40] Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced." ], "table_html": "
Attention Weight on the Suppressed Rule (by layer)
| Atk? | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| | 0.31 | 0.63 | 0.43 | 0.80 | 0.40 | 0.48 | 0.73 | 0.73 | 0.98 | 0.64 | 0.52 | 0.93 | 0.63 | 0.68 | 0.57 | 0.87 |
| | 0.12 | 0.36 | 0.42 | 0.56 | 0.40 | 0.43 | 0.49 | 0.52 | 0.73 | 0.41 | 0.48 | 0.60 | 0.45 | 0.42 | 0.50 | 0.58 |
| Atk? | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
| | 0.99 | 0.79 | 0.79 | 0.80 | 0.89 | 0.85 | 0.64 | 0.63 | 0.75 | 0.65 | 0.82 | 0.39 | 0.40 | 0.52 | 0.56 | 0.47 |
| | 0.80 | 0.46 | 0.46 | 0.50 | 0.46 | 0.48 | 0.41 | 0.39 | 0.44 | 0.39 | 0.55 | 0.35 | 0.36 | 0.38 | 0.49 | 0.31 |
Table 5: \nRule suppression on Llama-2 produces attention weights that align with the theory.\nAttention weights between the last token and the tokens of the suppressed rules are lower for multiple layers when the adversarial suffix is present.\n
\n
", "perturb_sentence_id": 39, "output": { "perturbed_statement": "[paragraph id = 39] Result 4: Theory-Predicted Attention Patterns Emerge in Automated Jailbreaks. Our theoretical analysis suggests that rules may be enhanced if their attention is reduced.", "perturbed_explanation": "Original Explanation: Our theoretical analysis suggests that rules may be suppressed from activating if their attention is reduced, indicating a correlation between attention levels and rule enforcement. 1. The statement erroneously states that rules may be enhanced if their attention is reduced. 2. This directly contradicts the observation that diminished attention typically leads to suppression, not enhancement." } } ]