sharpenb committed on
Commit
527ee99
1 Parent(s): 996ed97

Upload folder using huggingface_hub (#1)


- 8c636dbd8512b756625d3c722904ebc237732ff11dc7c60fd7ebf21a1d091335 (fa3e26c20813e16a087d58f2dafd2bbefdb63f18)
- 2e11cbeb4c9cf97474f762ef7f915dcb05fb0a0471ba43963edf2ddc8383e162 (083ffdf111ff60b2184024fc23c3ed55e36fa482)

Files changed (7)
  1. README.md +84 -0
  2. config.json +51 -0
  3. generation_config.json +6 -0
  4. model.safetensors +3 -0
  5. modeling_gritlm7b.py +1422 -0
  6. plots.png +0 -0
  7. smash_config.json +27 -0
README.md ADDED
@@ -0,0 +1,84 @@
1
+ ---
2
+ thumbnail: "https://assets-global.website-files.com/646b351987a8d8ce158d1940/64ec9e96b4334c0e1ac41504_Logo%20with%20white%20text.svg"
3
+ metrics:
4
+ - memory_disk
5
+ - memory_inference
6
+ - inference_latency
7
+ - inference_throughput
8
+ - inference_CO2_emissions
9
+ - inference_energy_consumption
10
+ tags:
11
+ - pruna-ai
12
+ ---
13
+ <!-- header start -->
14
+ <!-- 200823 -->
15
+ <div style="width: auto; margin-left: auto; margin-right: auto">
16
+ <a href="https://www.pruna.ai/" target="_blank" rel="noopener noreferrer">
17
+ <img src="https://i.imgur.com/eDAlcgk.png" alt="PrunaAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
18
+ </a>
19
+ </div>
20
+ <!-- header end -->
21
+
22
+ [![Twitter](https://img.shields.io/twitter/follow/PrunaAI?style=social)](https://twitter.com/PrunaAI)
23
+ [![GitHub](https://img.shields.io/github/followers/PrunaAI?label=Follow%20%40PrunaAI&style=social)](https://github.com/PrunaAI)
24
+ [![LinkedIn](https://img.shields.io/badge/LinkedIn-Connect-blue)](https://www.linkedin.com/company/93832878/admin/feed/posts/?feedType=following)
25
+ [![Discord](https://img.shields.io/badge/Discord-Join%20Us-blue?style=social&logo=discord)](https://discord.gg/CP4VSgck)
26
+
27
+ # Simply make AI models cheaper, smaller, faster, and greener!
28
+
29
+ - Give a thumbs up if you like this model!
30
+ - Contact us and tell us which model to compress next [here](https://www.pruna.ai/contact).
31
+ - Request access to easily compress your *own* AI models [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
32
+ - Read the documentation to learn more [here](https://pruna-ai-pruna.readthedocs-hosted.com/en/latest/).
33
+ - Join the Pruna AI community on Discord [here](https://discord.gg/CP4VSgck) to share feedback and suggestions or to get help.
34
+
35
+ ## Results
36
+
37
+ ![image info](./plots.png)
38
+
39
+ **Frequently Asked Questions**
40
+ - ***How does the compression work?*** The model is compressed with bitsandbytes 4-bit quantization (`fp4` with double quantization), as recorded in `config.json`.
41
+ - ***How does the model quality change?*** The quality of the model output might vary compared to the base model.
42
+ - ***How is the model efficiency evaluated?*** These results were obtained on an NVIDIA A100-PCIE-40GB with the configuration described in `model/smash_config.json`, after a hardware warmup. The smashed model is compared directly to the original base model. Efficiency results may vary in other settings (e.g. other hardware, image size, batch size, ...). We recommend running the measurements directly under your use-case conditions to find out whether the smashed model benefits you.
43
+ - ***What is the model format?*** We use safetensors.
44
+ - ***What calibration data has been used?*** If needed by the compression method, we used WikiText as the calibration data.
45
+ - ***What is the naming convention for Pruna Hugging Face models?*** We take the original model name and append "turbo", "tiny", or "green" if the smashed model has a measured inference speed, inference memory, or inference energy consumption that is less than 90% of the original base model's.
46
+ - ***How to compress my own models?*** You can request premium access to more compression methods and tech support for your specific use-cases [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
47
+ - ***What are "first" metrics?*** Results mentioning "first" are obtained after the first run of the model. The first run might take more memory or be slower than the subsequent runs due to CUDA overheads.
48
+ - ***What are "Sync" and "Async" metrics?*** "Sync" metrics are obtained by syncing all GPU processes and stopping the measurement once all of them have finished. "Async" metrics are obtained without syncing all GPU processes; the measurement stops as soon as the model output can be used by the CPU. We provide both metrics since either could be relevant depending on the use case. We recommend testing the efficiency gains directly in your use cases.
49
+
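The naming convention described in the FAQ can be sketched as a small rule table. This is only an illustration, not Pruna's code: the function name `pruna_suffixes`, the metric keys, and the reading of "inference speed" as a latency (lower is better) are all assumptions made for the example.

```python
def pruna_suffixes(base, smashed, threshold=0.9):
    """Return the suffixes a smashed model would earn under the 90% rule.

    `base` and `smashed` are hypothetical dicts of measurements where
    lower is better: latency, inference memory, energy consumption.
    """
    rules = (("latency", "turbo"), ("memory", "tiny"), ("energy", "green"))
    return [
        suffix
        for metric, suffix in rules
        if smashed[metric] < threshold * base[metric]
    ]


# Made-up measurements, for illustration only.
base_metrics = {"latency": 100.0, "memory": 10.0, "energy": 5.0}
smashed_metrics = {"latency": 80.0, "memory": 9.5, "energy": 4.4}
suffixes = pruna_suffixes(base_metrics, smashed_metrics)
print(suffixes)
```

With these made-up numbers the memory saving is too small to earn "tiny", so only the other two suffixes apply.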
50
+ ## Setup
51
+
52
+ You can run the smashed model with these steps:
53
+
54
+ 0. Check that the requirements from the original repo GritLM/GritLM-7B are installed. In particular, check the python, cuda, and transformers versions.
55
+ 1. Make sure that you have installed the quantization-related packages.
56
+ ```bash
57
+ pip install transformers accelerate "bitsandbytes>0.37.0"
58
+ ```
59
+ 2. Load & run the model.
60
+ ```python
61
+ from transformers import AutoModelForCausalLM, AutoTokenizer
62
+
63
+ model = AutoModelForCausalLM.from_pretrained("PrunaAI/GritLM-GritLM-7B-bnb-4bit-smashed",
64
+ trust_remote_code=True)
65
+ tokenizer = AutoTokenizer.from_pretrained("GritLM/GritLM-7B")
66
+
67
+ input_ids = tokenizer("What is the color of prunes?", return_tensors='pt').to(model.device)["input_ids"]
68
+
69
+ outputs = model.generate(input_ids, max_new_tokens=216)
70
+ tokenizer.decode(outputs[0])
71
+ ```
72
+
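Before loading the model, it can help to confirm that the installed `bitsandbytes` really is newer than 0.37.0. The sketch below uses only the standard library; `version_tuple` and `is_newer_than` are hypothetical helper names, and the comparison is deliberately rough (it ignores suffixes such as `.post1` or `rc1`).

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '0.41.1' into (0, 41, 1), stopping at the first non-numeric part."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def is_newer_than(package, minimum):
    """Return True if `package` is installed with a version above `minimum`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) > version_tuple(minimum)


if not is_newer_than("bitsandbytes", "0.37.0"):
    print('bitsandbytes>0.37.0 not found; run: pip install "bitsandbytes>0.37.0"')
```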
73
+ ## Configurations
74
+
75
+ The configuration info is in `smash_config.json`.
76
+
77
+ ## Credits & License
78
+
79
+ The license of the smashed model follows the license of the original model. Please check the license of the original model GritLM/GritLM-7B, which provided the base model, before using this model. The license of the `pruna-engine` is [here](https://pypi.org/project/pruna-engine/) on PyPI.
80
+
81
+ ## Want to compress other models?
82
+
83
+ - Contact us and tell us which model to compress next [here](https://www.pruna.ai/contact).
84
+ - Request access to easily compress your own AI models [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
config.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "_name_or_path": "/tmp/tmp2wct6ej9",
3
+ "architectures": [
4
+ "MistralForCausalLM"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoModel": "GritLM/GritLM-7B--modeling_gritlm7b.MistralModel",
9
+ "AutoModelForCausalLM": "modeling_gritlm7b.MistralForCausalLM",
10
+ "AutoModelForSequenceClassification": "GritLM/GritLM-7B--modeling_gritlm7b.MistralForSequenceClassification"
11
+ },
12
+ "bos_token_id": 1,
13
+ "eos_token_id": 2,
14
+ "hidden_act": "silu",
15
+ "hidden_size": 4096,
16
+ "id2label": {
17
+ "0": "LABEL_0"
18
+ },
19
+ "initializer_range": 0.02,
20
+ "intermediate_size": 14336,
21
+ "label2id": {
22
+ "LABEL_0": 0
23
+ },
24
+ "max_position_embeddings": 32768,
25
+ "model_type": "mistral",
26
+ "num_attention_heads": 32,
27
+ "num_hidden_layers": 32,
28
+ "num_key_value_heads": 8,
29
+ "quantization_config": {
30
+ "bnb_4bit_compute_dtype": "bfloat16",
31
+ "bnb_4bit_quant_type": "fp4",
32
+ "bnb_4bit_use_double_quant": true,
33
+ "llm_int8_enable_fp32_cpu_offload": false,
34
+ "llm_int8_has_fp16_weight": false,
35
+ "llm_int8_skip_modules": [
36
+ "lm_head"
37
+ ],
38
+ "llm_int8_threshold": 6.0,
39
+ "load_in_4bit": true,
40
+ "load_in_8bit": false,
41
+ "quant_method": "bitsandbytes"
42
+ },
43
+ "rms_norm_eps": 1e-05,
44
+ "rope_theta": 10000.0,
45
+ "sliding_window": 4096,
46
+ "tie_word_embeddings": false,
47
+ "torch_dtype": "float16",
48
+ "transformers_version": "4.37.1",
49
+ "use_cache": true,
50
+ "vocab_size": 32000
51
+ }
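As a rough sanity check on this config, the parameter count can be derived from the sizes above. The sketch below assumes the standard Mistral layout visible in `modeling_gritlm7b.py` (bias-free `q/k/v/o` projections, a three-matrix MLP, two RMSNorm weights per layer plus a final norm, untied embeddings and `lm_head`). The size estimate further assumes that only the per-layer linear weights are stored at 4 bits and everything else at 16 bits; it ignores quantization metadata, so it is a lower bound rather than an exact figure.

```python
# Values copied from config.json above.
config = {
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "vocab_size": 32000,
}


def count_parameters(cfg):
    """Count parameters of a Mistral-style decoder described by `cfg`."""
    hidden = cfg["hidden_size"]
    head_dim = hidden // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim  # grouped-query attention
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o and k, v
    mlp = 3 * hidden * cfg["intermediate_size"]  # gate, up, down
    layers = cfg["num_hidden_layers"]
    linear = layers * (attention + mlp)
    norms = layers * 2 * hidden + hidden  # two per layer plus the final norm
    embeddings = 2 * cfg["vocab_size"] * hidden  # embed_tokens and lm_head
    return {"linear": linear, "other": norms + embeddings,
            "total": linear + norms + embeddings}


counts = count_parameters(config)
# Assumption: 4 bits per quantized linear weight, 2 bytes for everything else.
lower_bound_bytes = counts["linear"] // 2 + counts["other"] * 2
print(counts["total"], lower_bound_bytes)
```

The lower bound lands somewhat below the `model.safetensors` size listed in this commit, which is consistent with the file also carrying per-block quantization constants.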
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.37.1"
6
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a674cb4bb36dde94f1487bf15187bc75a007263f44090c66d9f409e068db231b
3
+ size 4125687624
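The three lines above are a Git LFS pointer, not the weights themselves. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for this example, and it only handles the three keys shown here.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:a674cb4bb36dde94f1487bf15187bc75a007263f44090c66d9f409e068db231b
size 4125687624
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line and decode the oid and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
size_gib = pointer["size"] / 2**30
print(f"{pointer['algorithm']} digest, {size_gib:.2f} GiB")
```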
modeling_gritlm7b.py ADDED
@@ -0,0 +1,1422 @@
1
+ # coding=utf-8
2
+ # Copyright 2023 Mistral AI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ PyTorch Mistral model."""
21
+ import inspect
22
+ import math
23
+ import os
24
+ import warnings
25
+ from typing import List, Optional, Tuple, Union
26
+
27
+ import torch
28
+ import torch.nn.functional as F
29
+ import torch.utils.checkpoint
30
+ from torch import nn
31
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
32
+
33
+ from transformers.activations import ACT2FN
34
+ from transformers.cache_utils import Cache, DynamicCache
35
+ from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask, _prepare_4d_causal_attention_mask_for_sdpa, _prepare_4d_attention_mask, _prepare_4d_attention_mask_for_sdpa
36
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
37
+ from transformers.modeling_utils import PreTrainedModel
38
+ from transformers.utils import (
39
+ add_start_docstrings,
40
+ add_start_docstrings_to_model_forward,
41
+ is_flash_attn_2_available,
42
+ is_flash_attn_greater_or_equal_2_10,
43
+ logging,
44
+ replace_return_docstrings,
45
+ )
46
+ from transformers import MistralConfig
47
+
48
+
49
+ # transformers has a bug where it will try to import everything from a custom model file unless there's try/except
50
+ try:
51
+ if is_flash_attn_2_available():
52
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
53
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
54
+
55
+ _flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
56
+ except Exception:
57
+ pass
58
+
59
+ logger = logging.get_logger(__name__)
60
+
61
+ _CONFIG_FOR_DOC = "MistralConfig"
62
+
63
+
64
+ # Copied from transformers.models.llama.modeling_llama._get_unpad_data
65
+ def _get_unpad_data(attention_mask):
66
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
67
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
68
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
69
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
70
+ return (
71
+ indices,
72
+ cu_seqlens,
73
+ max_seqlen_in_batch,
74
+ )
75
+
76
+
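To make the three return values of `_get_unpad_data` concrete, here is a pure-Python sketch of the same bookkeeping on nested lists instead of tensors. The name `unpad_metadata` is an assumption for this example; the logic mirrors the sum, nonzero, max, and zero-padded cumulative sum used above.

```python
from itertools import accumulate


def unpad_metadata(attention_mask):
    """Mirror `_get_unpad_data` for a mask given as a list of 0/1 rows."""
    seqlens = [sum(row) for row in attention_mask]
    flat = [value for row in attention_mask for value in row]
    # Positions of real (non-padding) tokens in the flattened batch.
    indices = [i for i, value in enumerate(flat) if value != 0]
    # Cumulative sequence lengths with a leading zero, as F.pad(..., (1, 0)) does.
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


# Two sequences padded to length 3: the first has 2 real tokens, the second 3.
mask = [[1, 1, 0], [1, 1, 1]]
indices, cu_seqlens, max_seqlen = unpad_metadata(mask)
print(indices, cu_seqlens, max_seqlen)
```

Sequence `i` then occupies the slice `cu_seqlens[i]:cu_seqlens[i + 1]` of the unpadded token list.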
77
+ # Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Mistral
78
+ class MistralRMSNorm(nn.Module):
79
+ def __init__(self, hidden_size, eps=1e-6):
80
+ """
81
+ MistralRMSNorm is equivalent to T5LayerNorm
82
+ """
83
+ super().__init__()
84
+ self.weight = nn.Parameter(torch.ones(hidden_size))
85
+ self.variance_epsilon = eps
86
+
87
+ def forward(self, hidden_states):
88
+ input_dtype = hidden_states.dtype
89
+ hidden_states = hidden_states.to(torch.float32)
90
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
91
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
92
+ return self.weight * hidden_states.to(input_dtype)
93
+
94
+
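The normalization in `MistralRMSNorm.forward` can be checked by hand on a short vector. This is a pure-Python sketch under the assumption of a single vector and a plain list of weights; the helper name `rms_norm` is hypothetical, and the default `eps` is the `rms_norm_eps` value from `config.json`.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Scale `x` by the reciprocal root of its mean square, then by `weight`."""
    variance = sum(value * value for value in x) / len(x)
    scale = 1.0 / math.sqrt(variance + eps)
    return [w * value * scale for w, value in zip(weight, x)]


normalized = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
mean_square = sum(value * value for value in normalized) / len(normalized)
print(normalized, mean_square)
```

With unit weights and `eps=0`, the output has a mean square of 1 up to rounding, which is the property the layer enforces before applying the learned weight.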
95
+ # Copied from transformers.models.llama.modeling_llama.LlamaRotaryEmbedding with Llama->Mistral
96
+ class MistralRotaryEmbedding(nn.Module):
97
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
98
+ super().__init__()
99
+
100
+ self.dim = dim
101
+ self.max_position_embeddings = max_position_embeddings
102
+ self.base = base
103
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
104
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
105
+
106
+ # Build here to make `torch.jit.trace` work.
107
+ self._set_cos_sin_cache(
108
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
109
+ )
110
+
111
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
112
+ self.max_seq_len_cached = seq_len
113
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
114
+
115
+ freqs = torch.outer(t, self.inv_freq)
116
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
117
+ emb = torch.cat((freqs, freqs), dim=-1)
118
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
119
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
120
+
121
+ def forward(self, x, seq_len=None):
122
+ # x: [bs, num_attention_heads, seq_len, head_size]
123
+ if seq_len > self.max_seq_len_cached:
124
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
125
+
126
+ return (
127
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
128
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
129
+ )
130
+
131
+
132
+ # Copied from transformers.models.llama.modeling_llama.rotate_half
133
+ def rotate_half(x):
134
+ """Rotates half the hidden dims of the input."""
135
+ x1 = x[..., : x.shape[-1] // 2]
136
+ x2 = x[..., x.shape[-1] // 2 :]
137
+ return torch.cat((-x2, x1), dim=-1)
138
+
139
+
140
+ # Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
141
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
142
+ """Applies Rotary Position Embedding to the query and key tensors.
143
+
144
+ Args:
145
+ q (`torch.Tensor`): The query tensor.
146
+ k (`torch.Tensor`): The key tensor.
147
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
148
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
149
+ position_ids (`torch.Tensor`):
150
+ The position indices of the tokens corresponding to the query and key tensors. For example, this can be
151
+ used to pass offset position ids when working with a KV-cache.
152
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
153
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
154
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
155
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
156
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
157
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
158
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
159
+ Returns:
160
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
161
+ """
162
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
163
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
164
+ q_embed = (q * cos) + (rotate_half(q) * sin)
165
+ k_embed = (k * cos) + (rotate_half(k) * sin)
166
+ return q_embed, k_embed
167
+
168
+
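A small pure-Python sketch may help in seeing what `rotate_half` and `apply_rotary_pos_emb` compute for one head vector at one position. The function `rope` is a hypothetical single-vector stand-in for the tensor code above; it builds the angles the same way (`inv_freq` over even indices, then the `cat((freqs, freqs))` layout).

```python
import math


def rotate_half(x):
    """Move the second half to the front, negated: [a, b, c, d] -> [-c, -d, a, b]."""
    half = len(x) // 2
    return [-value for value in x[half:]] + x[:half]


def rope(x, position, base=10000.0):
    """Apply rotary position embedding to one vector at one position."""
    dim = len(x)
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    angles = [position * freq for freq in inv_freq] * 2  # cat((freqs, freqs))
    rotated = rotate_half(x)
    return [
        value * math.cos(angle) + rot * math.sin(angle)
        for value, rot, angle in zip(x, rotated, angles)
    ]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The attention score depends only on the relative offset (here 2).
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 2), rope(k, 0))
print(score_a, score_b)
```

Each pair `(x[i], x[i + dim/2])` is rotated by its own angle, so vector norms are preserved and query-key scores depend only on the position difference.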
169
+ class MistralMLP(nn.Module):
170
+ def __init__(self, config):
171
+ super().__init__()
172
+ self.config = config
173
+ self.hidden_size = config.hidden_size
174
+ self.intermediate_size = config.intermediate_size
175
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
176
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
177
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
178
+ self.act_fn = ACT2FN[config.hidden_act]
179
+
180
+ def forward(self, x):
181
+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
182
+
183
+
184
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
185
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
186
+ """
187
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
188
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
189
+ """
190
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
191
+ if n_rep == 1:
192
+ return hidden_states
193
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
194
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
195
+
196
+
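The effect of `repeat_kv` is easiest to see with labels in place of tensors. In this sketch a list stands in for the head dimension, and the re-definition of `repeat_kv` on lists is an illustration only. With this commit's config (32 attention heads, 8 key/value heads) each key/value head is shared by 4 consecutive query heads.

```python
def repeat_kv(kv_heads, n_rep):
    """Repeat each key/value head `n_rep` times, keeping copies adjacent."""
    if n_rep == 1:
        return kv_heads
    return [head for head in kv_heads for _ in range(n_rep)]


num_attention_heads = 32
num_key_value_heads = 8
n_rep = num_attention_heads // num_key_value_heads

# Label each key/value head by its index and expand to one entry per query head.
expanded = repeat_kv(list(range(num_key_value_heads)), n_rep)
print(expanded)
```

After expansion, query head `h` reads the key/value head with index `h // n_rep`, which matches the interleaved layout described in the docstring.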
197
+ class MistralAttention(nn.Module):
198
+ """
199
+ Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
200
+ and "Generating Long Sequences with Sparse Transformers".
201
+ """
202
+
203
+ def __init__(self, config: MistralConfig, layer_idx: Optional[int] = None):
204
+ super().__init__()
205
+ self.config = config
206
+ self.layer_idx = layer_idx
207
+ if layer_idx is None:
208
+ logger.warning_once(
209
+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
210
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
211
+ "when creating this class."
212
+ )
213
+
214
+ self.hidden_size = config.hidden_size
215
+ self.num_heads = config.num_attention_heads
216
+ self.head_dim = self.hidden_size // self.num_heads
217
+ self.num_key_value_heads = config.num_key_value_heads
218
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
219
+ self.max_position_embeddings = config.max_position_embeddings
220
+ self.rope_theta = config.rope_theta
221
+ self.attention_dropout = config.attention_dropout
222
+
223
+ if (self.head_dim * self.num_heads) != self.hidden_size:
224
+ raise ValueError(
225
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
226
+ f" and `num_heads`: {self.num_heads})."
227
+ )
228
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
229
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
230
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
231
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
232
+
233
+ self.rotary_emb = MistralRotaryEmbedding(
234
+ self.head_dim,
235
+ max_position_embeddings=self.max_position_embeddings,
236
+ base=self.rope_theta,
237
+ )
238
+
239
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
240
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
241
+
242
+ def forward(
243
+ self,
244
+ hidden_states: torch.Tensor,
245
+ attention_mask: Optional[torch.Tensor] = None,
246
+ position_ids: Optional[torch.LongTensor] = None,
247
+ past_key_value: Optional[Cache] = None,
248
+ output_attentions: bool = False,
249
+ use_cache: bool = False,
250
+ **kwargs,
251
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
252
+ if "padding_mask" in kwargs:
253
+ warnings.warn(
254
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
255
+ )
256
+ bsz, q_len, _ = hidden_states.size()
257
+
258
+ query_states = self.q_proj(hidden_states)
259
+ key_states = self.k_proj(hidden_states)
260
+ value_states = self.v_proj(hidden_states)
261
+
262
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
263
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
264
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
265
+
266
+ kv_seq_len = key_states.shape[-2]
267
+ if past_key_value is not None:
268
+ if self.layer_idx is None:
269
+ raise ValueError(
270
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
271
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
272
+ "with a layer index."
273
+ )
274
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
275
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
276
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
277
+
278
+ if past_key_value is not None:
279
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
280
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
281
+
282
+ # repeat k/v heads if n_kv_heads < n_heads
283
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
284
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
285
+
286
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
287
+
288
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
289
+ raise ValueError(
290
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
291
+ f" {attn_weights.size()}"
292
+ )
293
+
294
+ if attention_mask is not None:
295
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
296
+ raise ValueError(
297
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
298
+ )
299
+
300
+ attn_weights = attn_weights + attention_mask
301
+
302
+ # upcast attention to fp32
303
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
304
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
305
+ attn_output = torch.matmul(attn_weights, value_states)
306
+
307
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
308
+ raise ValueError(
309
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
310
+ f" {attn_output.size()}"
311
+ )
312
+
313
+ attn_output = attn_output.transpose(1, 2).contiguous()
314
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
315
+
316
+ attn_output = self.o_proj(attn_output)
317
+
318
+ if not output_attentions:
319
+ attn_weights = None
320
+
321
+ return attn_output, attn_weights, past_key_value
322
+
323
+
324
+ class MistralFlashAttention2(MistralAttention):
325
+ """
326
+ Mistral flash attention module. This module inherits from `MistralAttention` as the weights of the module stay
327
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
328
+ flash attention and deal with padding tokens in case the input contains any of them.
329
+ """
330
+
331
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
332
+ def __init__(self, *args, **kwargs):
333
+ super().__init__(*args, **kwargs)
334
+
335
+ # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
336
+ # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
337
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
338
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
339
+
340
+ def forward(
341
+ self,
342
+ hidden_states: torch.Tensor,
343
+ attention_mask: Optional[torch.Tensor] = None,
344
+ position_ids: Optional[torch.LongTensor] = None,
345
+ past_key_value: Optional[Cache] = None,
346
+ output_attentions: bool = False,
347
+ use_cache: bool = False,
348
+ is_causal: bool = True,
349
+ **kwargs,
350
+ ):
351
+ if "padding_mask" in kwargs:
352
+ warnings.warn(
353
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
354
+ )
355
+
356
+ # overwrite attention_mask with padding_mask
357
+ attention_mask = kwargs.pop("padding_mask")
358
+ bsz, q_len, _ = hidden_states.size()
359
+
360
+ query_states = self.q_proj(hidden_states)
361
+ key_states = self.k_proj(hidden_states)
362
+ value_states = self.v_proj(hidden_states)
363
+
364
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
365
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
366
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
367
+
368
+ kv_seq_len = key_states.shape[-2]
369
+ if past_key_value is not None:
370
+ if self.layer_idx is None:
371
+ raise ValueError(
372
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
373
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
374
+ "with a layer index."
375
+ )
376
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
377
+
378
+ # Because the input can be padded, the absolute sequence length depends on the max position id.
379
+ rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
380
+ cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
381
+
382
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
383
+
384
+ use_sliding_windows = (
385
+ _flash_supports_window_size
386
+ and getattr(self.config, "sliding_window", None) is not None
387
+ and kv_seq_len > self.config.sliding_window
388
+ )
389
+
390
+ if not _flash_supports_window_size:
391
+ logger.warning_once(
392
+ "The current flash attention version does not support sliding window attention, for a more memory efficient implementation"
393
+ " make sure to upgrade flash-attn library."
394
+ )
395
+
396
+ if past_key_value is not None:
397
+ # Activate slicing cache only if the config has a valid `sliding_window` attribute
398
+ cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
399
+ if (
400
+ getattr(self.config, "sliding_window", None) is not None
401
+ and kv_seq_len > self.config.sliding_window
402
+ and cache_has_contents
403
+ ):
404
+ slicing_tokens = 1 - self.config.sliding_window
405
+
406
+ past_key = past_key_value[self.layer_idx][0]
407
+ past_value = past_key_value[self.layer_idx][1]
408
+
409
+ past_key = past_key[:, :, slicing_tokens:, :].contiguous()
410
+ past_value = past_value[:, :, slicing_tokens:, :].contiguous()
411
+
412
+ if past_key.shape[-2] != self.config.sliding_window - 1:
413
+ raise ValueError(
414
+ f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got"
415
+ f" {past_key.shape}"
416
+ )
417
+
418
+ if attention_mask is not None:
419
+ attention_mask = attention_mask[:, slicing_tokens:]
420
+ attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
421
+
422
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
423
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
424
+
425
+ # repeat k/v heads if n_kv_heads < n_heads
426
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
427
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
428
+ dropout_rate = 0.0 if not self.training else self.attention_dropout
429
+
430
+ # In PEFT, we usually cast the layer norms to float32 for training stability reasons,
431
+ # so the input hidden states get silently cast to float32. Hence, we need to
432
+ # cast them back to float16 just to be sure everything works as expected.
433
+ input_dtype = query_states.dtype
434
+ if input_dtype == torch.float32:
435
+ if torch.is_autocast_enabled():
436
+ target_dtype = torch.get_autocast_gpu_dtype()
437
+ # Handle the case where the model is quantized
438
+ elif hasattr(self.config, "_pre_quantization_dtype"):
439
+ target_dtype = self.config._pre_quantization_dtype
440
+ else:
441
+ target_dtype = self.q_proj.weight.dtype
442
+
443
+ logger.warning_once(
444
+ f"The input hidden states seem to have been silently cast to float32; this might be related to"
445
+ f" the fact that you have upcast the embedding or layer norm layers to float32. The input will be cast back to"
446
+ f" {target_dtype}."
447
+ )
448
+
449
+ query_states = query_states.to(target_dtype)
450
+ key_states = key_states.to(target_dtype)
451
+ value_states = value_states.to(target_dtype)
452
+
453
+ # Reshape to the expected shape for Flash Attention
454
+ query_states = query_states.transpose(1, 2)
455
+ key_states = key_states.transpose(1, 2)
456
+ value_states = value_states.transpose(1, 2)
457
+
458
+ attn_output = self._flash_attention_forward(
459
+ query_states,
460
+ key_states,
461
+ value_states,
462
+ attention_mask,
463
+ q_len,
464
+ dropout=dropout_rate,
465
+ use_sliding_windows=use_sliding_windows,
466
+ is_causal=is_causal,
467
+ )
468
+
469
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
470
+ attn_output = self.o_proj(attn_output)
471
+
472
+ if not output_attentions:
473
+ attn_weights = None
474
+
475
+ return attn_output, attn_weights, past_key_value
476
+
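Review note: the cache-trimming branch above keeps the last `sliding_window - 1` cached positions, so that once the new token is appended the key/value length equals the window. A minimal sketch of that slicing, with plain Python lists standing in for the sequence axis of the tensors; the helper name `trim_cache` is hypothetical and not part of this file.

```python
def trim_cache(cached_positions, mask, sliding_window):
    """Keep the last `sliding_window - 1` cached entries and extend the mask by one slot."""
    slicing_tokens = 1 - sliding_window  # negative index, e.g. -3 for a window of 4
    kept = cached_positions[slicing_tokens:]
    if len(kept) != sliding_window - 1:
        raise ValueError(f"expected {sliding_window - 1} cached positions, got {len(kept)}")
    # Same mask handling as the branch above: slice, then append a 1 for the new token.
    new_mask = mask[slicing_tokens:] + [1]
    return kept, new_mask


cached = [0, 1, 2, 3, 4, 5]  # six positions already in the cache
mask = [1, 1, 1, 1, 1, 1]
kept, new_mask = trim_cache(cached, mask, sliding_window=4)
print(kept)      # [3, 4, 5]
print(new_mask)  # [1, 1, 1, 1]
```

The sketch only mirrors the index arithmetic; the real branch also calls `.contiguous()` on the sliced tensors.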
477
+ def _flash_attention_forward(
478
+ self,
479
+ query_states,
480
+ key_states,
481
+ value_states,
482
+ attention_mask,
483
+ query_length,
484
+ dropout=0.0,
485
+ softmax_scale=None,
486
+ use_sliding_windows=False,
487
+ is_causal=True,
488
+ ):
489
+ """
490
+ Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
491
+ first unpads the input, then computes the attention scores, and finally re-pads the attention output.
492
+
493
+ Args:
494
+ query_states (`torch.Tensor`):
495
+ Input query states to be passed to Flash Attention API
496
+ key_states (`torch.Tensor`):
497
+ Input key states to be passed to Flash Attention API
498
+ value_states (`torch.Tensor`):
499
+ Input value states to be passed to Flash Attention API
500
+ attention_mask (`torch.Tensor`):
501
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
502
+ position of padding tokens and 1 for the position of non-padding tokens.
503
+ dropout (`float`, *optional*):
504
+ Attention dropout
505
+ softmax_scale (`float`, *optional*):
506
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
507
+ use_sliding_windows (`bool`, *optional*):
508
+ Whether to activate sliding window attention.
509
+ """
510
+ if not self._flash_attn_uses_top_left_mask:
511
+ causal = is_causal
512
+ else:
513
+ # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
514
+ causal = is_causal and query_length != 1
515
+
516
+ # Contains at least one padding token in the sequence
517
+ if attention_mask is not None:
518
+ batch_size = query_states.shape[0]
519
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
520
+ query_states, key_states, value_states, attention_mask, query_length
521
+ )
522
+
523
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
524
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
525
+
526
+ if not use_sliding_windows:
527
+ attn_output_unpad = flash_attn_varlen_func(
528
+ query_states,
529
+ key_states,
530
+ value_states,
531
+ cu_seqlens_q=cu_seqlens_q,
532
+ cu_seqlens_k=cu_seqlens_k,
533
+ max_seqlen_q=max_seqlen_in_batch_q,
534
+ max_seqlen_k=max_seqlen_in_batch_k,
535
+ dropout_p=dropout,
536
+ softmax_scale=softmax_scale,
537
+ causal=causal,
538
+ )
539
+ else:
540
+ attn_output_unpad = flash_attn_varlen_func(
541
+ query_states,
542
+ key_states,
543
+ value_states,
544
+ cu_seqlens_q=cu_seqlens_q,
545
+ cu_seqlens_k=cu_seqlens_k,
546
+ max_seqlen_q=max_seqlen_in_batch_q,
547
+ max_seqlen_k=max_seqlen_in_batch_k,
548
+ dropout_p=dropout,
549
+ softmax_scale=softmax_scale,
550
+ causal=causal,
551
+ window_size=(self.config.sliding_window, self.config.sliding_window),
552
+ )
553
+
554
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
555
+ else:
556
+ if not use_sliding_windows:
557
+ attn_output = flash_attn_func(
558
+ query_states,
559
+ key_states,
560
+ value_states,
561
+ dropout,
562
+ softmax_scale=softmax_scale,
563
+ causal=causal,
564
+ )
565
+ else:
566
+ attn_output = flash_attn_func(
567
+ query_states,
568
+ key_states,
569
+ value_states,
570
+ dropout,
571
+ softmax_scale=softmax_scale,
572
+ causal=causal,
573
+ window_size=(self.config.sliding_window, self.config.sliding_window),
574
+ )
575
+
576
+ return attn_output
577
+
578
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
579
+ batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape
580
+
581
+ # On the first iteration we need to properly re-create the padding mask
582
+ # by slicing it at the proper place
583
+ if kv_seq_len != attention_mask.shape[-1]:
584
+ attention_mask_num_tokens = attention_mask.shape[-1]
585
+ attention_mask = attention_mask[:, attention_mask_num_tokens - kv_seq_len :]
586
+
587
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
588
+
589
+ key_layer = index_first_axis(key_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
590
+ value_layer = index_first_axis(value_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
591
+
592
+ if query_length == kv_seq_len:
593
+ query_layer = index_first_axis(
594
+ query_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k
595
+ )
596
+ cu_seqlens_q = cu_seqlens_k
597
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
598
+ indices_q = indices_k
599
+ elif query_length == 1:
600
+ max_seqlen_in_batch_q = 1
601
+ cu_seqlens_q = torch.arange(
602
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
603
+ ) # There is a memcpy here, that is very bad.
604
+ indices_q = cu_seqlens_q[:-1]
605
+ query_layer = query_layer.squeeze(1)
606
+ else:
607
+ # The -q_len: slice assumes left padding.
608
+ attention_mask = attention_mask[:, -query_length:]
609
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
610
+
611
+ return (
612
+ query_layer,
613
+ key_layer,
614
+ value_layer,
615
+ indices_q,
616
+ (cu_seqlens_q, cu_seqlens_k),
617
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
618
+ )
619
+
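Review note: `_upad_input` relies on `_get_unpad_data`, which is defined outside this hunk. Assuming it follows the usual pattern (flattened indices of the non-padding tokens, cumulative sequence lengths with a leading zero, and the longest sequence length), the bookkeeping can be sketched without torch. The function below is an illustrative stand-in, not the helper used by this file.

```python
from itertools import accumulate


def get_unpad_data(attention_mask):
    """Return (indices, cu_seqlens, max_seqlen) for a 0/1 padding mask given as nested lists."""
    width = len(attention_mask[0])
    seqlens = [sum(row) for row in attention_mask]
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    indices = [
        b * width + t
        for b, row in enumerate(attention_mask)
        for t, keep in enumerate(row)
        if keep
    ]
    # Cumulative lengths with a leading 0, marking where each sequence starts and ends.
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


mask = [[0, 0, 1, 1], [1, 1, 1, 1]]  # left-padded batch of two sequences
indices, cu_seqlens, max_len = get_unpad_data(mask)
print(indices)     # [2, 3, 4, 5, 6, 7]
print(cu_seqlens)  # [0, 2, 6]
print(max_len)     # 4
```

The `cu_seqlens` boundaries are what the variable-length Flash Attention call consumes in place of a dense padding mask.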
620
+
621
+ # Copied from transformers.models.llama.modeling_llama.LlamaSdpaAttention with Llama->Mistral
622
+ class MistralSdpaAttention(MistralAttention):
623
+ """
624
+ Mistral attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
625
+ `MistralAttention` as the weights of the module stay untouched. The only changes are in the forward pass, to adapt to the
626
+ SDPA API.
627
+ """
628
+
629
+ # Adapted from MistralAttention.forward
630
+ def forward(
631
+ self,
632
+ hidden_states: torch.Tensor,
633
+ attention_mask: Optional[torch.Tensor] = None,
634
+ position_ids: Optional[torch.LongTensor] = None,
635
+ past_key_value: Optional[Cache] = None,
636
+ output_attentions: bool = False,
637
+ use_cache: bool = False,
638
+ is_causal: bool = True,
639
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
640
+ if output_attentions:
641
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
642
+ logger.warning_once(
643
+ "MistralModel is using MistralSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
644
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
645
+ )
646
+ return super().forward(
647
+ hidden_states=hidden_states,
648
+ attention_mask=attention_mask,
649
+ position_ids=position_ids,
650
+ past_key_value=past_key_value,
651
+ output_attentions=output_attentions,
652
+ use_cache=use_cache,
653
+ is_causal=is_causal,
654
+ )
655
+
656
+ bsz, q_len, _ = hidden_states.size()
657
+
658
+ query_states = self.q_proj(hidden_states)
659
+ key_states = self.k_proj(hidden_states)
660
+ value_states = self.v_proj(hidden_states)
661
+
662
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
663
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
664
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
665
+
666
+ kv_seq_len = key_states.shape[-2]
667
+ if past_key_value is not None:
668
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
669
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
670
+
671
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
672
+
673
+ if past_key_value is not None:
674
+ cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
675
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
676
+
677
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
678
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
679
+
680
+ if attention_mask is not None:
681
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
682
+ raise ValueError(
683
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
684
+ )
685
+
686
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
687
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
688
+ if query_states.device.type == "cuda" and attention_mask is not None:
689
+ query_states = query_states.contiguous()
690
+ key_states = key_states.contiguous()
691
+ value_states = value_states.contiguous()
692
+
693
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
694
+ query_states,
695
+ key_states,
696
+ value_states,
697
+ attn_mask=attention_mask,
698
+ dropout_p=self.attention_dropout if self.training else 0.0,
699
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
700
+ is_causal=is_causal and attention_mask is None and q_len > 1,
701
+ )
702
+
703
+ attn_output = attn_output.transpose(1, 2).contiguous()
704
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
705
+
706
+ attn_output = self.o_proj(attn_output)
707
+
708
+ return attn_output, None, past_key_value
709
+
710
+
711
+ MISTRAL_ATTENTION_CLASSES = {
712
+ "eager": MistralAttention,
713
+ "flash_attention_2": MistralFlashAttention2,
714
+ "sdpa": MistralSdpaAttention,
715
+ }
716
+
717
+
718
+ class MistralDecoderLayer(nn.Module):
719
+ def __init__(self, config: MistralConfig, layer_idx: int):
720
+ super().__init__()
721
+ self.hidden_size = config.hidden_size
722
+
723
+ self.self_attn = MISTRAL_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
724
+
725
+ self.mlp = MistralMLP(config)
726
+ self.input_layernorm = MistralRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
727
+ self.post_attention_layernorm = MistralRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
728
+
729
+ def forward(
730
+ self,
731
+ hidden_states: torch.Tensor,
732
+ attention_mask: Optional[torch.Tensor] = None,
733
+ position_ids: Optional[torch.LongTensor] = None,
734
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
735
+ output_attentions: Optional[bool] = False,
736
+ use_cache: Optional[bool] = False,
737
+ is_causal: Optional[bool] = True,
738
+ **kwargs,
739
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
740
+ if "padding_mask" in kwargs:
741
+ warnings.warn(
742
+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
743
+ )
744
+ """
745
+ Args:
746
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
747
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
748
+ `(batch, sequence_length)` where padding elements are indicated by 0.
749
+ output_attentions (`bool`, *optional*):
750
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
751
+ returned tensors for more detail.
752
+ use_cache (`bool`, *optional*):
753
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
754
+ (see `past_key_values`).
755
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
756
+ """
757
+
758
+ residual = hidden_states
759
+
760
+ hidden_states = self.input_layernorm(hidden_states)
761
+
762
+ # Self Attention
763
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
764
+ hidden_states=hidden_states,
765
+ attention_mask=attention_mask,
766
+ position_ids=position_ids,
767
+ past_key_value=past_key_value,
768
+ output_attentions=output_attentions,
769
+ use_cache=use_cache,
770
+ is_causal=is_causal,
771
+ )
772
+ hidden_states = residual + hidden_states
773
+
774
+ # Fully Connected
775
+ residual = hidden_states
776
+ hidden_states = self.post_attention_layernorm(hidden_states)
777
+ hidden_states = self.mlp(hidden_states)
778
+ hidden_states = residual + hidden_states
779
+
780
+ outputs = (hidden_states,)
781
+
782
+ if output_attentions:
783
+ outputs += (self_attn_weights,)
784
+
785
+ if use_cache:
786
+ outputs += (present_key_value,)
787
+
788
+ return outputs
789
+
790
+
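Review note: `MistralDecoderLayer.forward` is a pre-norm residual block: each sub-layer sees a normalized copy of the input and its output is added back to the un-normalized residual. A toy sketch on plain lists follows; `rms_norm` here omits the learnable weight of `MistralRMSNorm`, and the `attn` and `mlp` callables are placeholders rather than the real modules.

```python
import math


def rms_norm(x, eps=1e-6):
    # Divide by the root mean square of the vector (learnable weight omitted).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def add(a, b):
    return [u + v for u, v in zip(a, b)]


def decoder_layer(x, attn, mlp):
    # residual + attention(norm(x)), then residual + mlp(norm(h))
    h = add(x, attn(rms_norm(x)))
    return add(h, mlp(rms_norm(h)))


def zero_sublayer(v):
    return [0.0 for _ in v]


# With sub-layers that output zeros, the residual path returns the input unchanged.
out = decoder_layer([3.0, 4.0], attn=zero_sublayer, mlp=zero_sublayer)
print(out)  # [3.0, 4.0]
```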
791
+ MISTRAL_START_DOCSTRING = r"""
792
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
793
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
794
+ etc.)
795
+
796
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
797
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
798
+ and behavior.
799
+
800
+ Parameters:
801
+ config ([`MistralConfig`]):
802
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
803
+ load the weights associated with the model, only the configuration. Check out the
804
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
805
+ """
806
+
807
+
808
+ @add_start_docstrings(
809
+ "The bare Mistral Model outputting raw hidden-states without any specific head on top.",
810
+ MISTRAL_START_DOCSTRING,
811
+ )
812
+ class MistralPreTrainedModel(PreTrainedModel):
813
+ config_class = MistralConfig
814
+ base_model_prefix = "model"
815
+ supports_gradient_checkpointing = True
816
+ _no_split_modules = ["MistralDecoderLayer"]
817
+ _skip_keys_device_placement = "past_key_values"
818
+ _supports_flash_attn_2 = True
819
+ _supports_sdpa = True
820
+ _supports_cache_class = True
821
+
822
+ def _init_weights(self, module):
823
+ std = self.config.initializer_range
824
+ if isinstance(module, nn.Linear):
825
+ module.weight.data.normal_(mean=0.0, std=std)
826
+ if module.bias is not None:
827
+ module.bias.data.zero_()
828
+ elif isinstance(module, nn.Embedding):
829
+ module.weight.data.normal_(mean=0.0, std=std)
830
+ if module.padding_idx is not None:
831
+ module.weight.data[module.padding_idx].zero_()
832
+
833
+
834
+ MISTRAL_INPUTS_DOCSTRING = r"""
835
+ Args:
836
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
837
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
838
+ it.
839
+
840
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
841
+ [`PreTrainedTokenizer.__call__`] for details.
842
+
843
+ [What are input IDs?](../glossary#input-ids)
844
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
845
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
846
+
847
+ - 1 for tokens that are **not masked**,
848
+ - 0 for tokens that are **masked**.
849
+
850
+ [What are attention masks?](../glossary#attention-mask)
851
+
852
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
853
+ [`PreTrainedTokenizer.__call__`] for details.
854
+
855
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
856
+ `past_key_values`).
857
+
858
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
859
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
860
+ information on the default strategy.
861
+
862
+ - 1 indicates the head is **not masked**,
863
+ - 0 indicates the head is **masked**.
864
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
865
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
866
+ config.n_positions - 1]`.
867
+
868
+ [What are position IDs?](../glossary#position-ids)
869
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
870
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
871
+ blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
872
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
873
+
874
+ Two formats are allowed:
875
+ - a [`~cache_utils.Cache`] instance;
876
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
877
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
878
+ cache format.
879
+
880
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
881
+ legacy cache format will be returned.
882
+
883
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
884
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
885
+ of shape `(batch_size, sequence_length)`.
886
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
887
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
888
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
889
+ model's internal embedding lookup matrix.
890
+ use_cache (`bool`, *optional*):
891
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
892
+ `past_key_values`).
893
+ output_attentions (`bool`, *optional*):
894
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
895
+ tensors for more detail.
896
+ output_hidden_states (`bool`, *optional*):
897
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
898
+ more detail.
899
+ return_dict (`bool`, *optional*):
900
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
901
+ """
902
+
903
+
904
+ @add_start_docstrings(
905
+ "The bare Mistral Model outputting raw hidden-states without any specific head on top.",
906
+ MISTRAL_START_DOCSTRING,
907
+ )
908
+ class MistralModel(MistralPreTrainedModel):
909
+ """
910
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`MistralDecoderLayer`]
911
+
912
+ Args:
913
+ config: MistralConfig
914
+ """
915
+
916
+ def __init__(self, config: MistralConfig):
917
+ super().__init__(config)
918
+ self.padding_idx = config.pad_token_id
919
+ self.vocab_size = config.vocab_size
920
+
921
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
922
+ self.layers = nn.ModuleList(
923
+ [MistralDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
924
+ )
925
+ self._attn_implementation = config._attn_implementation
926
+ self.norm = MistralRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
927
+
928
+ self.gradient_checkpointing = False
929
+ # Initialize weights and apply final processing
930
+ self.post_init()
931
+
932
+ def get_input_embeddings(self):
933
+ return self.embed_tokens
934
+
935
+ def set_input_embeddings(self, value):
936
+ self.embed_tokens = value
937
+
938
+ @add_start_docstrings_to_model_forward(MISTRAL_INPUTS_DOCSTRING)
939
+ def forward(
940
+ self,
941
+ input_ids: torch.LongTensor = None,
942
+ attention_mask: Optional[torch.Tensor] = None,
943
+ position_ids: Optional[torch.LongTensor] = None,
944
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
945
+ inputs_embeds: Optional[torch.FloatTensor] = None,
946
+ use_cache: Optional[bool] = None,
947
+ output_attentions: Optional[bool] = None,
948
+ output_hidden_states: Optional[bool] = None,
949
+ return_dict: Optional[bool] = None,
950
+ labels: Optional[torch.LongTensor] = None,
951
+ instruction_lens=None,
952
+ is_causal: Optional[bool] = True,
953
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
954
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
955
+ output_hidden_states = (
956
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
957
+ )
958
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
959
+
960
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
961
+
962
+ # retrieve input_ids and inputs_embeds
963
+ if input_ids is not None and inputs_embeds is not None:
964
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
965
+ elif input_ids is not None:
966
+ batch_size, seq_length = input_ids.shape
967
+ elif inputs_embeds is not None:
968
+ batch_size, seq_length, _ = inputs_embeds.shape
969
+ else:
970
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
971
+
972
+ if self.gradient_checkpointing and self.training:
973
+ if use_cache:
974
+ logger.warning_once(
975
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
976
+ )
977
+ use_cache = False
978
+
979
+ past_key_values_length = 0
980
+
981
+ if use_cache:
982
+ use_legacy_cache = not isinstance(past_key_values, Cache)
983
+ if use_legacy_cache:
984
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
985
+ past_key_values_length = past_key_values.get_usable_length(seq_length)
986
+
987
+ if position_ids is None:
988
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
989
+ position_ids = torch.arange(
990
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
991
+ )
992
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
993
+ else:
994
+ position_ids = position_ids.view(-1, seq_length).long()
995
+
996
+ if inputs_embeds is None:
997
+ inputs_embeds = self.embed_tokens(input_ids)
998
+
999
+ if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
1000
+ is_padding_right = attention_mask[:, -1].sum().item() != batch_size
1001
+ if is_padding_right:
1002
+ raise ValueError(
1003
+ "You are attempting to perform batched generation with padding_side='right'"
1004
+ " this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to "
1005
+ " call `tokenizer.padding_side = 'left'` before tokenizing the input. "
1006
+ )
1007
+
1008
+ if self._attn_implementation == "flash_attention_2":
1009
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
1010
+ elif self._attn_implementation == "sdpa" and not output_attentions:
1011
+ # output_attentions=True can not be supported when using SDPA, and we fall back on
1012
+ # the manual implementation that requires a 4D causal mask in all cases.
1013
+ if is_causal:
1014
+ attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
1015
+ attention_mask,
1016
+ (batch_size, seq_length),
1017
+ inputs_embeds,
1018
+ past_key_values_length,
1019
+ )
1020
+ else:
1021
+ attention_mask = _prepare_4d_attention_mask_for_sdpa(
1022
+ attention_mask, inputs_embeds.dtype
1023
+ )
1024
+ else:
1025
+ # 4d mask is passed through the layers
1026
+ if is_causal:
1027
+ # Causal mask with -3.3895e+38 where no attention should be
1028
+ attention_mask = _prepare_4d_causal_attention_mask(
1029
+ attention_mask,
1030
+ (batch_size, seq_length),
1031
+ inputs_embeds,
1032
+ past_key_values_length,
1033
+ sliding_window=self.config.sliding_window,
1034
+ )
1035
+ else:
1036
+ # Shape: batch_size, 1, query_length, key_value_length
1037
+ attention_mask = _prepare_4d_attention_mask(
1038
+ attention_mask, inputs_embeds.dtype
1039
+ )
1040
+
1041
+ hidden_states = inputs_embeds
1042
+
1043
+ # decoder layers
1044
+ all_hidden_states = () if output_hidden_states else None
1045
+ all_self_attns = () if output_attentions else None
1046
+ next_decoder_cache = None
1047
+
1048
+ for decoder_layer in self.layers:
1049
+ if output_hidden_states:
1050
+ all_hidden_states += (hidden_states,)
1051
+
1052
+ if self.gradient_checkpointing and self.training:
1053
+ layer_outputs = self._gradient_checkpointing_func(
1054
+ decoder_layer.__call__,
1055
+ hidden_states,
1056
+ attention_mask,
1057
+ position_ids,
1058
+ past_key_values,
1059
+ output_attentions,
1060
+ use_cache,
1061
+ is_causal,
1062
+ )
1063
+ else:
1064
+ layer_outputs = decoder_layer(
1065
+ hidden_states,
1066
+ attention_mask=attention_mask,
1067
+ position_ids=position_ids,
1068
+ past_key_value=past_key_values,
1069
+ output_attentions=output_attentions,
1070
+ use_cache=use_cache,
1071
+ is_causal=is_causal,
1072
+ )
1073
+
1074
+ hidden_states = layer_outputs[0]
1075
+
1076
+ if use_cache:
1077
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1078
+
1079
+ if output_attentions:
1080
+ all_self_attns += (layer_outputs[1],)
1081
+
1082
+ hidden_states = self.norm(hidden_states)
1083
+
1084
+ # add hidden states from the last decoder layer
1085
+ if output_hidden_states:
1086
+ all_hidden_states += (hidden_states,)
1087
+
1088
+ next_cache = None
1089
+ if use_cache:
1090
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
1091
+
1092
+ if not return_dict:
1093
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1094
+ return BaseModelOutputWithPast(
1095
+ last_hidden_state=hidden_states,
1096
+ past_key_values=next_cache,
1097
+ hidden_states=all_hidden_states,
1098
+ attentions=all_self_attns,
1099
+ )
1100
+
1101
+
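Review note: the right-padding guard in `MistralModel.forward` only inspects the last column of the 2D attention mask: if any row ends in a 0, the column sum falls below the batch size and the batch is treated as right-padded. A minimal sketch with nested lists in place of tensors; the function name is illustrative.

```python
def is_padding_right(attention_mask):
    """True if at least one sequence in the batch ends with a padding position."""
    batch_size = len(attention_mask)
    last_column_sum = sum(row[-1] for row in attention_mask)
    return last_column_sum != batch_size


left_padded = [[0, 0, 1, 1], [1, 1, 1, 1]]
right_padded = [[1, 1, 0, 0], [1, 1, 1, 1]]
print(is_padding_right(left_padded))   # False
print(is_padding_right(right_padded))  # True
```

In the model, the check is applied only when the Flash Attention 2 implementation is used together with `use_cache`.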
1102
+ class MistralForCausalLM(MistralPreTrainedModel):
1103
+ _tied_weights_keys = ["lm_head.weight"]
1104
+
1105
+ def __init__(self, config):
1106
+ super().__init__(config)
1107
+ self.model = MistralModel(config)
1108
+ self.vocab_size = config.vocab_size
1109
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1110
+
1111
+ # Initialize weights and apply final processing
1112
+ self.post_init()
1113
+
1114
+ def get_input_embeddings(self):
1115
+ return self.model.embed_tokens
1116
+
1117
+ def set_input_embeddings(self, value):
1118
+ self.model.embed_tokens = value
1119
+
1120
+ def get_output_embeddings(self):
1121
+ return self.lm_head
1122
+
1123
+ def set_output_embeddings(self, new_embeddings):
1124
+ self.lm_head = new_embeddings
1125
+
1126
+ def set_decoder(self, decoder):
1127
+ self.model = decoder
1128
+
1129
+ def get_decoder(self):
1130
+ return self.model
1131
+
1132
+ @add_start_docstrings_to_model_forward(MISTRAL_INPUTS_DOCSTRING)
1133
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1134
+ def forward(
1135
+ self,
1136
+ input_ids: torch.LongTensor = None,
1137
+ attention_mask: Optional[torch.Tensor] = None,
1138
+ position_ids: Optional[torch.LongTensor] = None,
1139
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1140
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1141
+ labels: Optional[torch.LongTensor] = None,
1142
+ use_cache: Optional[bool] = None,
1143
+ output_attentions: Optional[bool] = None,
1144
+ output_hidden_states: Optional[bool] = None,
1145
+ return_dict: Optional[bool] = None,
1146
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1147
+ r"""
1148
+ Args:
1149
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1150
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1151
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1152
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1153
+
1154
+ Returns:
1155
+
1156
+ Example:
1157
+
1158
+ ```python
1159
+ >>> from transformers import AutoTokenizer, MistralForCausalLM
1160
+
1161
+ >>> model = MistralForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
1162
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1163
+
1164
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1165
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1166
+
1167
+ >>> # Generate
1168
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1169
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1170
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1171
+ ```"""
1172
+
1173
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1174
+ output_hidden_states = (
1175
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1176
+ )
1177
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1178
+
1179
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
1180
+ outputs = self.model(
1181
+ input_ids=input_ids,
1182
+ attention_mask=attention_mask,
1183
+ position_ids=position_ids,
1184
+ past_key_values=past_key_values,
1185
+ inputs_embeds=inputs_embeds,
1186
+ use_cache=use_cache,
1187
+ output_attentions=output_attentions,
1188
+ output_hidden_states=output_hidden_states,
1189
+ return_dict=return_dict,
1190
+ labels=labels,
1191
+ )
1192
+
1193
+ hidden_states = outputs[0]
1194
+ logits = self.lm_head(hidden_states)
1195
+ logits = logits.float()
1196
+
1197
+ loss = None
1198
+ if (labels is not None) and (logits.shape[1] > 1):
1199
+ # Shift so that tokens < n predict n
1200
+ shift_logits = logits[..., :-1, :].contiguous()
1201
+ shift_labels = labels[..., 1:].contiguous()
1202
+ # Flatten the tokens
1203
+ loss_fct = CrossEntropyLoss()
1204
+ # For deterministic loss w/ gradacc:
1205
+ #loss_fct = CrossEntropyLoss(reduction="none")
1206
+ loss_fct = CrossEntropyLoss(reduction="sum")
1207
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1208
+ shift_labels = shift_labels.view(-1)
1209
+ # Enable model parallelism
1210
+ shift_labels = shift_labels.to(shift_logits.device)
1211
+ loss = loss_fct(shift_logits, shift_labels)
1212
+ # For deterministic loss w/ gradacc:
1213
+ #loss = loss_fct(shift_logits, shift_labels).sum() / input_ids.shape[0]
1214
+ # Problem with below is
1215
+ # e.g. if we have 30 tokens, now we split them in two batches with 20 & 10
1216
+ # Then we get the losses 60 and 40 and average them
1217
+ # We get (3 + 4)/2 = 3.5
1218
+ # Meanwhile if we did it in one we would be doing 100 / 30 = 3.333
1219
+ loss = loss_fct(shift_logits, shift_labels) / attention_mask.sum()
1220
+
1221
+ if not return_dict:
1222
+ output = (logits,) + outputs[1:]
1223
+ return (loss,) + output if loss is not None else output
1224
+
1225
+ return CausalLMOutputWithPast(
1226
+ loss=loss,
1227
+ logits=logits,
1228
+ past_key_values=outputs.past_key_values,
1229
+ hidden_states=outputs.hidden_states,
1230
+ attentions=outputs.attentions,
1231
+ )
1232
+
1233
+    def prepare_inputs_for_generation(
+        self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
+    ):
+        # Omit tokens covered by past_key_values
+        if past_key_values is not None:
+            if isinstance(past_key_values, Cache):
+                cache_length = past_key_values.get_seq_length()
+                past_length = past_key_values.seen_tokens
+                max_cache_length = past_key_values.get_max_length()
+            else:
+                cache_length = past_length = past_key_values[0][0].shape[2]
+                max_cache_length = None
+
+            # Keep only the unprocessed tokens:
+            # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
+            # some of the inputs are exclusively passed as part of the cache (e.g. when passing inputs_embeds as
+            # input)
+            if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
+                input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
+            # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
+            # input_ids based on the past_length.
+            elif past_length < input_ids.shape[1]:
+                input_ids = input_ids[:, past_length:]
+            # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
+
+            # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
+            if (
+                max_cache_length is not None
+                and attention_mask is not None
+                and cache_length + input_ids.shape[1] > max_cache_length
+            ):
+                attention_mask = attention_mask[:, -max_cache_length:]
+
+        position_ids = kwargs.get("position_ids", None)
+        if attention_mask is not None and position_ids is None:
+            # create position_ids on the fly for batch generation
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
+            if past_key_values:
+                position_ids = position_ids[:, -input_ids.shape[1] :]
+
+        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+        if inputs_embeds is not None and past_key_values is None:
+            model_inputs = {"inputs_embeds": inputs_embeds}
+        else:
+            model_inputs = {"input_ids": input_ids}
+
+        model_inputs.update(
+            {
+                "position_ids": position_ids,
+                "past_key_values": past_key_values,
+                "use_cache": kwargs.get("use_cache"),
+                "attention_mask": attention_mask,
+                "labels": kwargs.get("labels"),
+            }
+        )
+        return model_inputs
+
+    @staticmethod
+    def _reorder_cache(past_key_values, beam_idx):
+        reordered_past = ()
+        for layer_past in past_key_values:
+            reordered_past += (
+                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+            )
+        return reordered_past
+
+
+@add_start_docstrings(
+    """
+    The Mistral Model transformer with a sequence classification head on top (linear layer).
+
+    [`MistralForSequenceClassification`] uses the last token in order to do the classification, as other causal models
+    (e.g. GPT-2) do.
+
+    Since it does classification on the last token, it needs to know the position of the last token. If a
+    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
+    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
+    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
+    each row of the batch).
+    """,
+    MISTRAL_START_DOCSTRING,
+)
+# Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Mistral, LLAMA->MISTRAL
+class MistralForSequenceClassification(MistralPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.model = MistralModel(config)
+        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+
+    @add_start_docstrings_to_model_forward(MISTRAL_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        transformer_outputs = self.model(
+            input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        hidden_states = transformer_outputs[0]
+        logits = self.score(hidden_states)
+
+        if input_ids is not None:
+            batch_size = input_ids.shape[0]
+        else:
+            batch_size = inputs_embeds.shape[0]
+
+        if self.config.pad_token_id is None and batch_size != 1:
+            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
+        if self.config.pad_token_id is None:
+            sequence_lengths = -1
+        else:
+            if input_ids is not None:
+                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
+                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
+                sequence_lengths = sequence_lengths % input_ids.shape[-1]
+                sequence_lengths = sequence_lengths.to(logits.device)
+            else:
+                sequence_lengths = -1
+
+        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
+
+        loss = None
+        if labels is not None:
+            labels = labels.to(logits.device)
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(pooled_logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(pooled_logits, labels)
+        if not return_dict:
+            output = (pooled_logits,) + transformer_outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+
+        return SequenceClassifierOutputWithPast(
+            loss=loss,
+            logits=pooled_logits,
+            past_key_values=transformer_outputs.past_key_values,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
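
The gradient-accumulation comment in `MistralForCausalLM.forward` (modeling_gritlm7b.py) works through why normalizing each micro-batch by its own token count differs from normalizing once over all tokens. The sketch below reproduces that arithmetic in plain Python, using the numbers from the comment. The function names `mean_of_batch_means` and `token_weighted_mean` are hypothetical and do not appear in the uploaded file; this is an illustration of the arithmetic, not the model's loss code.

```python
def mean_of_batch_means(loss_sums, token_counts):
    """Normalize each micro-batch by its own token count, then average the results."""
    per_batch = [s / n for s, n in zip(loss_sums, token_counts)]
    return sum(per_batch) / len(per_batch)


def token_weighted_mean(loss_sums, token_counts):
    """Normalize the total summed loss by the total token count (single-batch result)."""
    return sum(loss_sums) / sum(token_counts)


# Numbers from the code comment: 30 tokens split into micro-batches of 20 and 10,
# with summed losses of 60 and 40.
loss_sums = [60.0, 40.0]
token_counts = [20, 10]

naive = mean_of_batch_means(loss_sums, token_counts)   # (3 + 4) / 2
exact = token_weighted_mean(loss_sums, token_counts)   # 100 / 30

print(f"per-batch normalization: {naive:.3f}")
print(f"single-batch normalization: {exact:.3f}")
```

The two values agree only when every micro-batch holds the same number of tokens, which is why the comment flags the per-batch division as non-deterministic under gradient accumulation.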
plots.png ADDED
smash_config.json ADDED
@@ -0,0 +1,27 @@
+{
+    "api_key": null,
+    "verify_url": "http://johnrachwan.pythonanywhere.com",
+    "smash_config": {
+        "pruners": "None",
+        "factorizers": "None",
+        "quantizers": "['llm-int8']",
+        "compilers": "None",
+        "task": "text_text_generation",
+        "device": "cuda",
+        "cache_dir": "/ceph/hdd/staff/charpent/.cache/modelsalvkosko",
+        "batch_size": 1,
+        "model_name": "GritLM/GritLM-7B",
+        "pruning_ratio": 0.0,
+        "n_quantization_bits": 4,
+        "output_deviation": 0.005,
+        "max_batch_size": 1,
+        "qtype_weight": "torch.qint8",
+        "qtype_activation": "torch.quint8",
+        "qobserver": "<class 'torch.ao.quantization.observer.MinMaxObserver'>",
+        "qscheme": "torch.per_tensor_symmetric",
+        "qconfig": "x86",
+        "group_size": 128,
+        "damp_percent": 0.1,
+        "save_load_fn": "bitsandbytes"
+    }
+}
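
Several fields in smash_config.json store Python literals as strings (`"None"`, `"['llm-int8']"`). A minimal sketch of decoding them with the standard library follows, assuming a reader wants real Python values. How Pruna itself reads this file is not shown in this commit, and the helper name `decode_field` is hypothetical; the excerpt holds only fields copied from the file above.

```python
import ast
import json

# Excerpt of smash_config.json from this commit (only a few fields).
RAW = """
{
    "smash_config": {
        "pruners": "None",
        "quantizers": "['llm-int8']",
        "device": "cuda",
        "n_quantization_bits": 4
    }
}
"""


def decode_field(value):
    """Turn stringified literals such as "None" or "['llm-int8']" into Python objects."""
    if not isinstance(value, str):
        return value  # numbers and nulls are already decoded by json
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # ordinary strings such as "cuda" are kept unchanged


config = json.loads(RAW)["smash_config"]
decoded = {key: decode_field(value) for key, value in config.items()}
print(decoded)
```

`ast.literal_eval` only accepts literal syntax, so values such as `"torch.qint8"` or the `qobserver` class string fall through to the `except` branch and stay plain strings.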