Duplicate from Deci/DeciCoder-1b


Co-authored-by: Itay Levy <itay-levy@users.noreply.huggingface.co>

Files changed (11)

.gitattributes +35 -0
README.md +180 -0
config.json +28 -0
configuration_decicoder.py +50 -0
merges.txt +0 -0
model.safetensors +3 -0
modeling_decicoder.py +253 -0
special_tokens_map.json +26 -0
tokenizer.json +0 -0
tokenizer_config.json +30 -0
vocab.json +0 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,180 @@

+---
+pipeline_tag: text-generation
+license: apache-2.0
+tags:
+- text generation
+- Deci AI
+- DeciCoder
+programming_language:
+  - Java
+  - JavaScript
+  - Python
+metrics:
+- code_eval
+inference: true
+widget:
+- text: 'def print_hello_world():'
+  example_title: Hello world
+  group: Python
+model-index:
+- name: DeciCoder-1b
+  results:
+  - task:
+      type: text-generation
+    dataset:
+      type: nuprl/MultiPL-E
+      name: MultiPL-HumanEval (Python)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.191
+      verified: false
+  - task:
+      type: text-generation
+    dataset:
+      type: nuprl/MultiPL-E
+      name: MultiPL-HumanEval (JavaScript)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.184
+      verified: false
+  - task:
+      type: text-generation
+    dataset:
+      type: nuprl/MultiPL-E
+      name: MultiPL-HumanEval (Java)
+    metrics:
+    - name: pass@1
+      type: pass@1
+      value: 0.166
+      verified: false
+datasets:
+- bigcode/starcoderdata
+---
+# Model Card for DeciCoder 1B
+DeciCoder 1B is a 1 billion parameter decoder-only code completion model
+trained on the Python, Java, and JavaScript subsets of the [StarCoder Training Dataset](https://huggingface.co/datasets/bigcode/starcoderdata).
+The model uses Grouped Query Attention and has a context window of 2048
+tokens. It was trained using a Fill-in-the-Middle training objective. The model's
+architecture was generated by Deci's proprietary Neural Architecture
+Search-based technology, AutoNAC.
+## Model Details
+- **Developed by:** Deci
+- **Model type:** DeciCoder is an auto-regressive language model based on the transformer decoder architecture, using Grouped Query Attention.
+- **Language(s):** Python, Java, JavaScript
+- **License:** Model checkpoints are licensed under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
+## Model Architecture
+| Parameters | Layers | Heads  | Sequence Length  | GQA num_key_value_heads  | Hidden Size  |
+|:----------|:----------|:----------|:----------|:----------|:----------|
+| 1.1B    | 20    | 32    | 2048   | 4  | 2048 |
+- **Decoder layer:** Grouped Query Attention [Ainslie et al., 2023](https://arxiv.org/abs/2305.13245)
+- **Position Embeddings:** Rotary Position Embeddings [Su et al., 2021](https://arxiv.org/abs/2104.09864)
+## Uses
+The model is intended for single-line and multi-line code completion from a
+context window of up to 2048 tokens. It is *not* an instruction model,
+and commands like "Write a function that computes the absolute value of
+an integer" won't yield the desired results. A more effective approach
+is to frame instructions in the style of source code comments
+(e.g. `# this function calculates the absolute value of an integer`) or to present
+a function signature and docstring, enabling the model to complete the
+function's body.
+### How to Use
+```python
+# pip install -q transformers
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+checkpoint = "Deci/DeciCoder-1b"
+device = "cuda"  # use "cuda" for GPU or "cpu" for CPU
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device)
+inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
+outputs = model.generate(inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0]))
+```
+### Attribution
+DeciCoder was trained on the StarCoder Training Dataset, filtered for
+Python, Java, and JavaScript code. For additional information, please
+refer to [https://huggingface.co/datasets/bigcode/starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata).
+### Limitations
+The model was trained on Python, Java, and JavaScript source code. While
+the primary natural language in that source is English, other languages
+are also present. The model can produce code snippets given some
+context, but there's no assurance that the resulting code will function
+as expected. It might be suboptimal, contain bugs, or even contain
+exploits.
+## Training Details
+### Training Data
+DeciCoder was trained on the Python, Java, and JavaScript subsets of the [StarCoder Training Dataset](https://huggingface.co/datasets/bigcode/starcoderdata).
+### Training Procedure
+- **Warm-Up Steps**: 9000
+- **Total Training Steps**: 284k
+- **Total Tokens**: 446B
+- **Global Batch Size**: 768
+- **Optimizer**: AdamW
+- **Optimizer Parameters**: beta1=0.9, beta2=0.95
+- **Weight Decay**: 0.1
+- **Learning Rate**: 4e-4
+- **Learning Rate Schedule**: cosine
+## Evaluation
+Below are DeciCoder's pass@1 scores on MultiPL-HumanEval:
+| Python | JavaScript | Java  |
+|:----------|:----------|:----------|
+| 19.1%    | 18.4%    | 16.6%    |
+### Runtime Benchmarks
+|Inference Tool/Hardware | A10 (tokens/sec) |A100 (tokens/sec) |
+|:----------|:----------|:----------|
+| PyTorch  | 1,364.2 | 3,244.4 |
+| Infery LLM | 3,889.3   | 11,676.8  |
+- Throughput (tokens/sec) was measured with the optimal batch size for each GPU: 128 on A10 and 512 on A100.
+## Documentation
+- [Notebook](https://colab.research.google.com/drive/1JCxvBsWCZKHfIcHSMVf7GZCs3ClMQPjs)
+- Blog post: [Introducing DeciCoder: The New Gold Standard in Efficient and Accurate Code Generation](https://deci.ai/blog/decicoder-efficient-and-accurate-code-generation-llm/)
+- Questions: Feel free to contact us via our [Discord Community!](https://discord.com/invite/p9ecgRhDR8/)
+## How to Cite
+Please cite this model using this format.
+```bibtex
+@misc{DeciFoundationModels,
+title = {DeciCoder},
+author = {DeciAI Research Team},
+year = {2023},
+url = {https://huggingface.co/deci/decicoder-1b},
+}
+```
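
Reviewer note on the README's training procedure: the quoted step count, global batch size and total token count can be cross-checked with simple arithmetic. The sketch below assumes every sequence in a batch is packed to the full 2048-token context window, which the README does not state explicitly.

```python
# Figures quoted in the README's "Training Procedure" and architecture table.
global_batch_size = 768   # sequences per optimizer step
sequence_length = 2048    # tokens per sequence (context window)
total_steps = 284_000     # "284k"

# Assumption: each sequence is packed to the full context length.
tokens_per_step = global_batch_size * sequence_length
total_tokens = tokens_per_step * total_steps

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens / 1e9:.1f}B")
```

Under that assumption the product lands within one billion tokens of the 446B figure in the README, so the three numbers are mutually consistent.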

config.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "architectures": [
+    "DeciCoderForCausalLM"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_decicoder.DeciCoderConfig",
+    "AutoModelForCausalLM": "modeling_decicoder.DeciCoderForCausalLM"
+  },
+  "bos_token_id": 0,
+  "eos_token_id": 0,
+  "hidden_act": "silu",
+  "hidden_size": 2048,
+  "initializer_range": 0.02,
+  "intermediate_size": 5888,
+  "max_position_embeddings": 2048,
+  "num_attention_heads": 32,
+  "num_hidden_layers": 20,
+  "num_key_value_heads": 4,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": null,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "use_bfloat16": true,
+  "transformers_version": "4.31.0.dev0",
+  "use_cache": true,
+  "vocab_size": 49152
+}
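
Reviewer note on config.json: the values above are enough to estimate the parameter count and the per-token KV-cache footprint. This is a rough sketch that assumes the Llama-style layer layout used by `modeling_decicoder.py` (bias-free q/k/v/o projections, a three-matrix gated MLP, two RMSNorm weights per layer) and 2 bytes per value for bfloat16; the variable names are illustrative only.

```python
# Values copied from config.json.
hidden_size = 2048
intermediate_size = 5888
num_hidden_layers = 20
num_attention_heads = 32
num_key_value_heads = 4
vocab_size = 49152

head_dim = hidden_size // num_attention_heads   # 64
kv_dim = num_key_value_heads * head_dim         # width of the k/v projections

# Per-layer weights (no biases anywhere).
attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim  # q, o + k, v
mlp = 3 * hidden_size * intermediate_size                             # gate, up, down
norms = 2 * hidden_size                                               # two RMSNorm weights
per_layer = attention + mlp + norms

# Untied input embedding and LM head, plus the final RMSNorm.
total_params = num_hidden_layers * per_layer + 2 * vocab_size * hidden_size + hidden_size

# KV cache per generated token in bf16: keys and values for every layer.
bytes_per_value = 2
kv_bytes_gqa = 2 * num_hidden_layers * kv_dim * bytes_per_value
kv_bytes_mha = 2 * num_hidden_layers * hidden_size * bytes_per_value  # if all 32 heads had k/v

print(f"parameters: {total_params / 1e9:.2f}B")
print(f"KV cache per token: {kv_bytes_gqa} bytes (vs {kv_bytes_mha} without GQA)")
```

The estimate agrees with the "1.1B" in the README's architecture table, and with 4 key/value heads for 32 query heads the cache is 8 times smaller than a full multi-head layout would need.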

configuration_decicoder.py ADDED

	@@ -0,0 +1,50 @@

+from packaging import version
+import transformers
+if version.parse(transformers.__version__) < version.parse("4.31.0"):
+    raise ImportError(
+        f"You are using transformers=={transformers.__version__}, but transformers>=4.31.0 is required to use DeciCoder. Please upgrade transformers."
+    )
+from transformers.models.llama.configuration_llama import LlamaConfig
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+class DeciCoderConfig(LlamaConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`DeciCoderModel`]. It extends [`LlamaConfig`]
+    with flags that select the attention implementation used during prefill and decode.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        naive_attention_prefill (`bool`, *optional*, defaults to False):
+            Whether to use naive matmul or scaled dot product attention during prefill.
+        naive_attention_decode_batched (`bool`, *optional*, defaults to True):
+            Whether to use naive matmul or scaled dot product attention during decode for batch_size > 1.
+        naive_attention_decode_single (`bool`, *optional*, defaults to False):
+            Whether to use naive matmul or scaled dot product attention during decode for batch_size == 1.
+    """
+    keys_to_ignore_at_inference = ["past_key_values"]
+    def __init__(
+        self,
+        naive_attention_prefill: bool = False,
+        naive_attention_decode_batched: bool = True,
+        naive_attention_decode_single: bool = False,
+        **kwargs,
+    ):
+        self.naive_attention_prefill = naive_attention_prefill
+        self.naive_attention_decode_batched = naive_attention_decode_batched
+        self.naive_attention_decode_single = naive_attention_decode_single
+        super().__init__(**kwargs,)
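
Reviewer note on the three `naive_attention_*` flags: their interaction with batch size and the prefill/decode phase is easier to see in isolation. The helper below is a hypothetical stand-in (`select_attention_path` is not part of the commit) that mirrors the branch conditions in `DeciCoderAttention.forward`, with the same defaults as `DeciCoderConfig`.

```python
def select_attention_path(
    is_decode: bool,
    batch_size: int,
    naive_attention_prefill: bool = False,
    naive_attention_decode_batched: bool = True,
    naive_attention_decode_single: bool = False,
) -> str:
    """Return "naive" (explicit matmul + softmax) or "sdpa" (scaled_dot_product_attention)."""
    if is_decode:
        use_naive = (naive_attention_decode_batched and batch_size > 1) or (
            naive_attention_decode_single and batch_size == 1
        )
    else:
        use_naive = naive_attention_prefill
    return "naive" if use_naive else "sdpa"


for phase, bsz in [(False, 1), (True, 1), (True, 8)]:
    label = "decode" if phase else "prefill"
    print(label, bsz, select_attention_path(is_decode=phase, batch_size=bsz))
```

With the default flags, only batched decode takes the naive path. This matters for padding: the model's mask validation rejects any attention mask that is not all ones unless the naive flag for the active phase is set.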

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:510256faa3d388cad1dcbc30c39d32f9289410a399f4c0435bec27ec135c6f0f
+size 2227364400
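
Reviewer note on the LFS pointer: the three lines above are a Git LFS pointer, not the weights themselves. A minimal sketch of reading it follows; the division by two assumes bfloat16 storage (per `torch_dtype` in config.json) and ignores the small safetensors header, so the result is only an approximation.

```python
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:510256faa3d388cad1dcbc30c39d32f9289410a399f4c0435bec27ec135c6f0f
size 2227364400
"""

# Each pointer line is "<key> <value>".
fields = dict(line.split(" ", 1) for line in pointer_text.splitlines())
size_bytes = int(fields["size"])

# Assumption: 2 bytes per parameter (bfloat16), header overhead ignored.
approx_params = size_bytes // 2

print(f"file size: {size_bytes / 2**30:.2f} GiB")
print(f"approx. parameters: {approx_params / 1e9:.2f}B")
```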

modeling_decicoder.py ADDED

	@@ -0,0 +1,253 @@

+# coding=utf-8
+# Copyright and license here
+""" PyTorch DeciCoder model."""
+import math
+from typing import Optional, Tuple
+import torch
+import torch.nn.functional as F
+import torch.utils.checkpoint
+from torch import nn
+from packaging import version
+import transformers
+if version.parse(transformers.__version__) < version.parse("4.31.0"):
+    raise ImportError(
+        f"You are using transformers=={transformers.__version__}, but transformers>=4.31.0 is required to use DeciCoder. Please upgrade transformers."
+    )
+from transformers.models.llama.modeling_llama import LlamaMLP, LlamaRMSNorm, LlamaAttention, apply_rotary_pos_emb, \
+    repeat_kv, LlamaPreTrainedModel, LLAMA_START_DOCSTRING, LlamaDecoderLayer, LlamaForCausalLM, LlamaModel
+from transformers.utils import add_start_docstrings
+from .configuration_decicoder import DeciCoderConfig
+_CONFIG_FOR_DOC = "DeciCoderConfig"
+class DeciCoderAttention(LlamaAttention):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+    def __init__(self, config: DeciCoderConfig):
+        nn.Module.__init__(self)
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.head_dim = self.hidden_size // self.num_heads
+        self.num_key_value_heads = config.num_key_value_heads
+        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+        self.pretraining_tp = config.pretraining_tp
+        self.max_position_embeddings = config.max_position_embeddings
+        self.rope_theta = getattr(config, 'rope_theta', None)
+        if (self.head_dim * self.num_heads) != self.hidden_size:
+            raise ValueError(
+                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                f" and `num_heads`: {self.num_heads})."
+            )
+        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
+        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
+        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
+        self.naive_attention_prefill = config.naive_attention_prefill
+        self.naive_attention_decode_batched = config.naive_attention_decode_batched
+        self.naive_attention_decode_single = config.naive_attention_decode_single
+        self._init_rope()
+    def forward(
+            self,
+            hidden_states: torch.Tensor,
+            attention_mask: Optional[torch.Tensor] = None,
+            position_ids: Optional[torch.LongTensor] = None,
+            past_key_value: Optional[Tuple[torch.Tensor]] = None,
+            output_attentions: bool = False,
+            use_cache: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
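+        bsz, q_len, _ = hidden_states.size()
+        # stays None on the scaled_dot_product_attention paths, which do not materialize attention weights
+        attn_weights = None
+        # a populated KV cache means we are generating token by token (decode), otherwise prefill
+        is_decode = past_key_value is not None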
+        if self.pretraining_tp > 1:
+            key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.pretraining_tp
+            query_slices = self.q_proj.weight.split((self.num_heads * self.head_dim) // self.pretraining_tp, dim=0)
+            key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
+            value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
+            query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
+            query_states = torch.cat(query_states, dim=-1)
+            key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.pretraining_tp)]
+            key_states = torch.cat(key_states, dim=-1)
+            value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.pretraining_tp)]
+            value_states = torch.cat(value_states, dim=-1)
+        else:
+            query_states = self.q_proj(hidden_states)
+            key_states = self.k_proj(hidden_states)
+            value_states = self.v_proj(hidden_states)
+        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+        kv_seq_len = key_states.shape[-2]
+        if past_key_value is not None:
+            kv_seq_len += past_key_value[0].shape[-2]
+        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+        if past_key_value is not None:
+            # reuse k, v, self_attention
+            key_states = torch.cat([past_key_value[0], key_states], dim=2)
+            value_states = torch.cat([past_key_value[1], value_states], dim=2)
+        past_key_value = (key_states, value_states) if use_cache else None
+        # decode: group the query heads per k/v head instead of repeating k/v (this view assumes q_len == 1)
+        if is_decode:
+            query_states = query_states.view(bsz, self.num_key_value_heads, self.num_key_value_groups, self.head_dim)
+            if (self.naive_attention_decode_batched and bsz > 1) or (self.naive_attention_decode_single and bsz == 1):
+                attn_weights = (query_states @ key_states.transpose(-2, -1)) / math.sqrt(key_states.size(-1))
+                if attention_mask is not None:
+                    if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
+                        raise ValueError(
+                            f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+                        )
+                    # the additive mask must be applied before softmax so that masked positions get ~0 weight
+                    attn_weights = attn_weights + attention_mask
+                # upcast attention to fp32
+                attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+                attn_output = torch.matmul(attn_weights, value_states)
+            else:
+                attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, is_causal=False,
+                                                             dropout_p=0.0)
+            attn_output = attn_output.contiguous().view(bsz, q_len, self.hidden_size)
+        else:
+            key_states = repeat_kv(key_states, self.num_key_value_groups)
+            value_states = repeat_kv(value_states, self.num_key_value_groups)
+            if not self.naive_attention_prefill:
+                attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, is_causal=True,
+                                                             dropout_p=0.0)
+            else:
+                attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+                # attn_weights = (query_states @ key_states.transpose(-2, -1)) / math.sqrt(key_states.size(-1))
+                if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
+                    raise ValueError(
+                        f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
+                        f" {attn_weights.size()}"
+                    )
+                if attention_mask is not None:
+                    if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
+                        raise ValueError(
+                            f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+                        )
+                    attn_weights = attn_weights + attention_mask
+                # upcast attention to fp32
+                attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+                attn_output = torch.matmul(attn_weights, value_states)
+            if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
+                raise ValueError(
+                    f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
+                    f" {attn_output.size()}"
+                )
+            attn_output = attn_output.transpose(1, 2).contiguous().view(bsz, q_len, self.hidden_size)
+            # attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+        if self.pretraining_tp > 1:
+            attn_output = attn_output.split(self.hidden_size // self.pretraining_tp, dim=2)
+            o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.pretraining_tp, dim=1)
+            attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.pretraining_tp)])
+        else:
+            attn_output = self.o_proj(attn_output)
+        if not output_attentions:
+            attn_weights = None
+        return attn_output, attn_weights, past_key_value
+class DeciCoderDecoderLayer(LlamaDecoderLayer):
+    def __init__(self, config: DeciCoderConfig):
+        nn.Module.__init__(self)
+        self.hidden_size = config.hidden_size
+        self.self_attn = DeciCoderAttention(config=config)
+        self.mlp = LlamaMLP(config)
+        self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+@add_start_docstrings(
+    "The bare DeciCoder Model outputting raw hidden-states without any specific head on top.",
+    LLAMA_START_DOCSTRING,
+)
+class DeciCoderPreTrainedModel(LlamaPreTrainedModel):
+    config_class = DeciCoderConfig
+    _no_split_modules = ["DeciCoderDecoderLayer"]
+    _keys_to_ignore_on_load_missing = ["self_attn.rotary_emb.inv_freq"]
+@add_start_docstrings(
+    "The bare DeciCoder Model outputting raw hidden-states without any specific head on top.",
+    LLAMA_START_DOCSTRING,
+)
+class DeciCoderModel(LlamaModel, DeciCoderPreTrainedModel):
+    """
+    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`DeciCoderDecoderLayer`]
+    Args:
+        config: DeciCoderConfig
+    """
+    def __init__(self, config: DeciCoderConfig):
+        DeciCoderPreTrainedModel.__init__(self, config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.layers = nn.ModuleList([DeciCoderDecoderLayer(config) for _ in range(config.num_hidden_layers)])
+        self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.gradient_checkpointing = False
+        # Initialize weights and apply final processing
+        self.post_init()
+    def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
+        self._validate_config_supports_attention_mask(attention_mask, input_shape, past_key_values_length)
+        return LlamaModel._prepare_decoder_attention_mask(
+            self, attention_mask, input_shape, inputs_embeds, past_key_values_length)
+    def _validate_config_supports_attention_mask(self, attention_mask, input_shape, past_key_values_length):
+        is_decode = past_key_values_length > 0
+        if not torch.all(torch.eq(attention_mask, 1)).item():
+            if is_decode:
+                if input_shape[0] == 1 and not self.config.naive_attention_decode_single:
+                    raise ValueError(
+                        "For support of custom attention masks please set naive_attention_decode_single to True in the "
+                        "config")
+                elif input_shape[0] > 1 and not self.config.naive_attention_decode_batched:
+                    raise ValueError(
+                        "For support of custom attention masks please set naive_attention_decode_batched to True in the "
+                        "config")
+            else:
+                if not self.config.naive_attention_prefill:
+                    raise ValueError("For support of custom attention masks please set naive_attention_prefill to "
+                                     "True in the config")
+class DeciCoderForCausalLM(LlamaForCausalLM, DeciCoderPreTrainedModel):
+    def __init__(self, config):
+        DeciCoderPreTrainedModel.__init__(self, config)
+        self.model = DeciCoderModel(config)
+        self.pretraining_tp = config.pretraining_tp
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Initialize weights and apply final processing
+        self.post_init()
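
Reviewer note on additive attention masks: in the naive prefill branch of `DeciCoderAttention.forward`, the mask is added to the raw scores and softmax is applied afterwards. The order matters, as the pure-Python sketch below illustrates with made-up scores and a made-up mask value (`-1e9` stands in for the large negative fill that an additive mask uses).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]   # hypothetical attention scores for three key positions
mask = [0.0, 0.0, -1e9]    # additive mask hiding the last position

# Mask first, then normalize: the hidden position gets zero weight.
masked_before = softmax([s + m for s, m in zip(scores, mask)])

# Normalize first, then mask: the result is no longer a probability distribution.
masked_after = [p + m for p, m in zip(softmax(scores), mask)]

print("mask before softmax:", [round(p, 4) for p in masked_before])
print("mask after softmax: ", masked_after)
```

Applying the mask before softmax keeps the weights non-negative and summing to one, which is what the subsequent matmul with `value_states` relies on.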

special_tokens_map.json ADDED

	@@ -0,0 +1,26 @@

+{
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<fim_prefix>",
+    "<fim_middle>",
+    "<fim_suffix>",
+    "<fim_pad>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<empty_output>",
+    "<commit_before>",
+    "<commit_msg>",
+    "<commit_after>",
+    "<reponame>"
+  ],
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "unk_token": "<|endoftext|>"
+}
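
Reviewer note on the `<fim_*>` tokens: the README says the model was trained with a Fill-in-the-Middle objective, and the special tokens above include `<fim_prefix>`, `<fim_suffix>` and `<fim_middle>`. The commit does not document the prompt layout, so the sketch below assumes the prefix-suffix-middle ordering commonly used with StarCoder-style tokenizers; `build_fim_prompt` is a hypothetical helper, not part of this repository.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assumed layout: the model generates the missing middle after <fim_middle>."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt(
    prefix="def absolute_value(x):\n    ",
    suffix="\n    return result\n",
)
print(prompt)
```

If the ordering assumption holds, the resulting string can be tokenized and passed to `model.generate` in the same way as the plain completion prompt in the README.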

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "add_prefix_space": false,
+  "additional_special_tokens": [
+    "<|endoftext|>",
+    "<fim_prefix>",
+    "<fim_middle>",
+    "<fim_suffix>",
+    "<fim_pad>",
+    "<filename>",
+    "<gh_stars>",
+    "<issue_start>",
+    "<issue_comment>",
+    "<issue_closed>",
+    "<jupyter_start>",
+    "<jupyter_text>",
+    "<jupyter_code>",
+    "<jupyter_output>",
+    "<empty_output>",
+    "<commit_before>",
+    "<commit_msg>",
+    "<commit_after>",
+    "<reponame>"
+  ],
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "model_max_length": 1000000000000000019884624838656,
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>",
+  "vocab_size": 49152
+}
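
Reviewer note on `model_max_length`: the value above equals `int(1e30)`, which reads as a "no limit configured" placeholder rather than a real bound, while config.json sets `max_position_embeddings` to 2048. A minimal sketch of clamping prompts to the model's actual window follows; `truncate_left` is a hypothetical helper and the token ids are dummies.

```python
model_max_length = 1000000000000000019884624838656  # from tokenizer_config.json
max_position_embeddings = 2048                       # from config.json

# The tokenizer value is a float-derived sentinel, not a usable limit.
is_sentinel = model_max_length == int(1e30)
effective_max_length = min(model_max_length, max_position_embeddings)


def truncate_left(token_ids, max_new_tokens=100, limit=effective_max_length):
    """Keep the most recent tokens so prompt plus generation fits in the window."""
    budget = limit - max_new_tokens
    return token_ids[-budget:]


dummy_ids = list(range(5000))  # stand-in for a long tokenized prompt
print(is_sentinel, effective_max_length, len(truncate_left(dummy_ids)))
```

Truncating from the left keeps the code closest to the cursor, which is usually the most relevant context for completion.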

vocab.json ADDED

The diff for this file is too large to render. See raw diff