fill some TODOs
README.md
CHANGED
@@ -29,7 +29,7 @@ TODO
 
 ## Model Summary
 
-The
+The StarCoder2-15B model is a 15B parameter model trained on 600+ programming languages from [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2-train), with opt-out requests excluded. The model uses [Grouped Query Attention](https://arxiv.org/abs/2305.13245), [a context window of 16,384 tokens](https://arxiv.org/abs/2205.14135) with [a sliding window attention of 4,096 tokens](https://arxiv.org/abs/2004.05150v2), and was trained using the [Fill-in-the-Middle objective](https://arxiv.org/abs/2207.14255) on 4+ trillion tokens.
 
 - **Project Website:** [bigcode-project.org](https://www.bigcode-project.org)
 - **Paper:** TODO
@@ -47,11 +47,11 @@ The model was trained on GitHub code as well as additional selected data sources
 # pip install -q transformers # TODO: from main
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-checkpoint = "bigcode/
+checkpoint = "bigcode/starcoder2-15b"
 device = "cuda" # for GPU usage or "cpu" for CPU usage
 
 tokenizer = AutoTokenizer.from_pretrained(checkpoint)
-model = AutoModelForCausalLM.from_pretrained(checkpoint
+model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
 
 inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
 outputs = model.generate(inputs)
@@ -71,18 +71,17 @@ The model has been trained on source code from 600+ programming languages. The p
 ## Model
 
 - **Architecture:** Transformer decoder with grouped-query and sliding window attention and Fill-in-the-Middle objective
-- **Pretraining steps:**
+- **Pretraining steps:** 1 million
 - **Pretraining tokens:** 4+ trillion
 - **Precision:** bfloat16
 
 ## Hardware
 
 - **GPUs:** 1024 A100
-- **Training time:** TODO
 
 ## Software
 
-- **Framework:** [
+- **Framework:** [NeMo](https://github.com/NVIDIA/NeMo)
 - **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
 
 # License
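Beyond the diff itself, a note on the quickstart it edits: the snippet loads the 15B checkpoint in full precision onto a single device. A common alternative, sketched below under the assumption that the standard `transformers` loading arguments apply to this checkpoint, is to load the weights in bfloat16 (the precision listed under **Model**), which roughly halves memory use; `device_map="auto"` additionally requires the `accelerate` package. This is an illustrative sketch, not part of the README.

```python
# Sketch only (not from the README diff): load StarCoder2-15B in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-15b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 precision listed under ## Model
    device_map="auto",           # requires `pip install accelerate`; places layers on available devices
)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```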
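The model summary and the **Architecture** bullet both mention the Fill-in-the-Middle objective. Below is a hedged sketch of FIM-style prompting; the sentinel token names `<fim_prefix>`, `<fim_suffix>`, `<fim_middle>` follow the original StarCoder convention and are an assumption here, so verify them against the tokenizer's special tokens before relying on this.

```python
# Sketch only: fill-in-the-middle prompting.
# The sentinel tokens below are assumed from the StarCoder convention; verify them via
# tokenizer.get_vocab() or the model card before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-15b"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

prefix = "def print_one_two_three():\n    print('one')\n    "
suffix = "\n    print('three')"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=32)

# Everything generated after the prompt is the proposed middle segment.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```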
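Finally, the **Model** bullets and the summary cite grouped-query attention, a 16,384-token context window, and 4,096-token sliding-window attention. Here is a small sketch for reading those values off the Hugging Face config object; the attribute names assume the `transformers` Starcoder2 configuration class and are illustrative rather than authoritative.

```python
# Sketch only: inspect architecture details from the model config.
# Attribute names assume the transformers Starcoder2 config class.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bigcode/starcoder2-15b")

print(config.num_attention_heads)      # number of query heads
print(config.num_key_value_heads)      # fewer key/value heads -> grouped-query attention
print(config.max_position_embeddings)  # context window (16,384 per the summary)
print(config.sliding_window)           # sliding-window attention size (4,096 per the summary)
```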