	---
	pipeline_tag: text-generation
	inference: true
	widget:
	- text: 'def print_hello_world():'
	example_title: Hello world
	group: Python
	license: bigscience-openrail-m
	pretrain-datasets:
	- books
	- arxiv
	- c4
	- falcon-refinedweb
	- wiki
	- github-issues
	- stack_markdown
	- self-made dataset of permissive github code
	datasets:
	- bigcode/the-stack-dedup
	- rombodawg/2XUNCENSORED_MegaCodeTraining188k
	- bigcode/commitpackft
	metrics:
	- code_eval
	library_name: transformers
	tags:
	- code
	model-index:
	- name: Refact-1.6B
	results:
	- task:
	type: text-generation
	dataset:
	type: openai_humaneval
	name: HumanEval
	metrics:
	- name: pass@1 (T=0.01)
	type: pass@1
	value: 32.0
	verified: false
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 31.5
	verified: false
	- name: pass@10 (T=0.8)
	type: pass@10
	value: 53.0
	verified: false
	- name: pass@100 (T=0.8)
	type: pass@100
	value: 76.9
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalSynthesize Python
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 35.8
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalSynthesize JavaScript
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 31.6
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalSynthesize Java
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 29.1
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalSynthesize Go
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: -1
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalSynthesize C++
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 26.3
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalSynthesize Rust
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: -1
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalSynthesize Average
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: -1
	verified: false





	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixTests Python
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 18.38
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixTests JavaScript
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 12.28
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixTests Java
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 15.12
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixTests Go
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: -1
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixTests C++
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 13.17
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixTests Rust
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 2.8
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixTests Average
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: -1
	verified: false






	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixDocs Python
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 26.92
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixDocs JavaScript
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 26.85
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixDocs Java
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 30.76
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixDocs Go
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: -1
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixDocs C++
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 25.94
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixDocs Rust
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 8.44
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalFixDocs Average
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: -1
	verified: false




	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalExplain Python
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 26.46
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalExplain JavaScript
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 17.86
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalExplain Java
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 20.94
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalExplain Go
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: -1
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalExplain C++
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 18.78
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalExplain Rust
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: -1
	verified: false
	- task:
	type: text-generation
	dataset:
	type: bigcode/humanevalpack
	name: HumanEvalExplain Average
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: -1
	verified: false


	- task:
	type: text-generation
	dataset:
	type: mbpp
	name: MBPP
	metrics:
	- name: pass@1 (T=0.01)
	type: pass@1
	value: 31.15
	verified: false
	- task:
	type: text-generation
	dataset:
	type: ds1000
	name: DS-1000 (Overall Completion)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 10.1
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (C++)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 21.61
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (C#)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 13.91
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (D)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 9.5
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Go)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 53.57
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Java)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 21.58
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Julia)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 13.75
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (JavaScript)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 26.88
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Lua)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 15.26
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (PHP)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 23.04
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Perl)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 12.1
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Python)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 29.6
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (R)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 13.77
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Ruby)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 12.68
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Racket)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 4.29
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Rust)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 19.54
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Scala)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 18.33
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Bash)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 5.7
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (Swift)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 17.68
	verified: false
	- task:
	type: text-generation
	dataset:
	type: nuprl/MultiPL-E
	name: MultiPL-HumanEval (TypeScript)
	metrics:
	- name: pass@1 (T=0.2)
	type: pass@1
	value: 25
	verified: false

	language:
	- en
	---

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/643a9dd0c5f633a7fa7e804a/HkB0QYV0BbmB3ktMugbZy.png)


	# Refact-1.6B

Finally, the model whose training we announced in our [blog post](https://refact.ai/blog/2023/applying-recent-innovations-to-train-model/) is ready 🎉

After fine-tuning on generated data, it beats Replit 3b, Stability Code 3b and many other models. It almost matches
StarCoder, a model ten times its size!


Model                 | Size          | HumanEval pass@1   | HumanEval pass@10  |
----------------------|---------------|--------------------|--------------------|
DeciCoder-1b          | 1b            | 19.1%              |                    |
<b>Refact-1.6-fim</b> | <b>1.6b</b>   | <b>32.0%</b>       | <b>53.0%</b>       |
StableCode            | 3b            | 20.2%              | 33.8%              |
ReplitCode v1         | 3b            | 21.9%              |                    |
CodeGen2.5-multi      | 7b            | 28.4%              | 47.5%              |
CodeLlama             | 7b            | 33.5%              | 59.6%              |
StarCoder             | 15b           | 33.6%              |                    |

It is likely the best model for practical code completion in your IDE, because it is both smart and fast!
You can start using it right now by downloading the
[Refact plugin](https://refact.ai/). You can also host the model yourself using the
[open source Docker container](https://github.com/smallcloudai/refact).

It is also multi-language (see MultiPL-HumanEval and the other metrics in the model card metadata) and works as a chat model (see the section below).

	# It Works As a Chat

The primary application of this model is code completion (infill) in multiple programming languages,
but it also works quite well as a chat model.

HumanEval results using the instruction-following (chat) format, compared with models specialized for chat only:

Model                  | Size   | pass@1   | pass@10  |
-----------------------|--------|----------|----------|
<b>Refact-1.6-fim</b>  | 1.6b   | 38.4%    | 55.6%    |
StableCode-instruct    | 3b     | 26.9%    | 36.2%    |
OctoGeeX               | 6b     | 44.7%    |          |
CodeLlama-instruct     | 7b     | 34.8%    | 64.3%    |
CodeGen2.5-instruct    | 7b     | 36.2%    | 60.87%   |
CodeLlama-instruct     | 13b    | 42.7%    | 71.6%    |
StarChat-β             | 15b    | 33.5%    |          |
OctoCoder              | 15b    | 46.2%    |          |
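
For readers unfamiliar with the pass@k columns in the tables above, here is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval. This card does not say how its numbers were computed, so treat this as an illustration of the metric rather than the exact evaluation script. The function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 - probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # 20 samples, 5 of them correct.
    print(pass_at_k(20, 5, 1))
    print(pass_at_k(20, 5, 10))
```

With k=1 the estimator reduces to the fraction of passing samples, which is why pass@1 at low temperature is a good proxy for single-shot completion quality.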


	# Example

	Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output:

```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "smallcloudai/Refact-1_6B-fim"
device = "cuda"  # use "cpu" to run without a GPU

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# The model is asked to fill in the docstring between the prefix and the suffix.
prompt = '<fim_prefix>def print_hello_world():\n    """<fim_suffix>\n    print("Hello world!")<fim_middle>'

inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
# do_sample=True is needed for the temperature setting to take effect.
outputs = model.generate(inputs, max_length=100, do_sample=True, temperature=0.2)
print("-" * 80)
print(tokenizer.decode(outputs[0]))
```
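
In an editor integration, the prompt is usually assembled from the text before and after the cursor, and the generated middle has to be cut out of the decoded output. The sketch below shows one way to do that with plain string handling. The three FIM tokens come from the example above; the helper names and the `<|endoftext|>` end marker are assumptions, so check the tokenizer's actual end-of-sequence token before relying on it.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the cursor in the FIM special tokens."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def extract_middle(decoded: str, eos_token: str = "<|endoftext|>") -> str:
    """Return the text generated after <fim_middle>, cut at the end marker.

    eos_token is an assumed default, not something stated in this card.
    """
    _, _, middle = decoded.partition(FIM_MIDDLE)
    return middle.split(eos_token, 1)[0]


def splice(prefix: str, middle: str, suffix: str) -> str:
    """Reassemble the completed source text."""
    return prefix + middle + suffix


if __name__ == "__main__":
    prefix = 'def print_hello_world():\n    """'
    suffix = '\n    print("Hello world!")'
    prompt = build_fim_prompt(prefix, suffix)
    # Stand-in for tokenizer.decode(outputs[0]); a real run calls the model.
    decoded = prompt + 'Print a greeting."""<|endoftext|>'
    print(splice(prefix, extract_middle(decoded), suffix))
```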

	# Chat Format

The same model also works as a chat model (experimental).

```python
# Each turn starts with the <empty_output> token followed by the role name.
prompt_template = (
    "<empty_output>SYSTEM {system}\n"
    "<empty_output>USER {query}\n"
    "<empty_output>ASSISTANT"
)
prompt = prompt_template.format(
    system="You are a programming assistant",
    query="How do I sort a list in Python?",
)
```
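
To use the chat format end to end, the prompt is passed to `model.generate` as in the Example section, and the reply is whatever follows the final `ASSISTANT` marker. A minimal sketch of the string handling is shown here. The helper names and the stop markers (`<empty_output>` for the start of a next turn, `<|endoftext|>` for end of sequence) are assumptions and not part of this card.

```python
CHAT_TEMPLATE = (
    "<empty_output>SYSTEM {system}\n"
    "<empty_output>USER {query}\n"
    "<empty_output>ASSISTANT"
)


def build_chat_prompt(query: str, system: str = "You are a programming assistant") -> str:
    """Fill the single-turn chat template shown above."""
    return CHAT_TEMPLATE.format(system=system, query=query)


def extract_reply(decoded: str, prompt: str,
                  stop_markers=("<empty_output>", "<|endoftext|>")) -> str:
    """Strip the prompt from the decoded text and cut at the first stop marker.

    stop_markers are assumed values; adjust them to the tokenizer in use.
    """
    reply = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    for marker in stop_markers:
        index = reply.find(marker)
        if index != -1:
            reply = reply[:index]
    return reply.strip()


if __name__ == "__main__":
    prompt = build_chat_prompt("How do I sort a list in Python?")
    # Stand-in for tokenizer.decode(outputs[0]); a real run calls the model.
    decoded = prompt + " Use sorted(my_list).<|endoftext|>"
    print(extract_reply(decoded, prompt))
```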

	# Architecture

	As described in more detail in the blog post, we used:

	- [ALiBi](https://arxiv.org/abs/2108.12409) based attention
	- [LayerNorm](https://arxiv.org/abs/1607.06450v1) instead of [RMSNorm](https://arxiv.org/pdf/1910.07467.pdf)
	- [Multi Query Attention](https://arxiv.org/abs/1911.02150)

We also used LiON, flash attention, and early dropout. None of this is so unusual that you can't run the model yourself; see the Example section above.
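
To make the ALiBi item in the list above more concrete, here is a minimal pure-Python sketch of the attention bias as described in the ALiBi paper: each head adds a penalty to the attention scores that grows linearly with the distance between query and key. It assumes a power-of-two number of heads and is not taken from the Refact implementation; the function names are illustrative.

```python
from typing import List


def alibi_slopes(n_heads: int) -> List[float]:
    """Per-head slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ...

    Assumes n_heads is a power of two, the simple case in the ALiBi paper.
    """
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """Causal bias matrix for one head.

    bias[i][j] = -slope * (i - j) for keys at or before the query position,
    and -inf for future keys, which are masked out.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    print(slopes)
    for row in alibi_bias(slopes[0], 4):
        print(row)
```

The bias is added to the query-key scores before the softmax, so no learned or sinusoidal position embeddings are needed.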


	# Pretraining

For the base model, we used our own dataset that contains only code with permissive licenses, plus open text datasets.
Filtering was the key to the success of this model:

- We used only English text
- We kept only topics related to computer science
- We applied heavy deduplication

The text-to-code ratio was 50:50, and the model was trained on 1.2T tokens.

We are not releasing the base model, because in Fill-in-the-Middle (FIM) mode it tends to repeat itself too much, which
limits its practical use. But if you still want it, send us a message on Discord.


	# Finetuning

We tested our hypothesis that chat data should boost base model performance in FIM and
regular left-to-right code completion. We found that a share of just 15% of open
[code](https://huggingface.co/datasets/bigcode/commitpackft) and
[instruction-following](https://huggingface.co/datasets/rombodawg/2XUNCENSORED_MegaCodeTraining188k) datasets,
which we filtered for quality, improves almost all metrics.

Additionally, to improve FIM, we observed common failure modes and prepared a synthetic dataset based on
[The Stack dedup v1.1](https://huggingface.co/datasets/bigcode/the-stack-dedup) to address them.

There is a distribution shift between typical code on the internet and the code you write in your IDE.
The former is likely finished, so the model tries to come up with a suggestion that makes the code complete.
In your IDE, you are likely to have half-written code as you work on it, and no single addition can repair it
fully.

In practice, the model needs a tendency to stop after adding a couple of lines, and sometimes to write
nothing at all. We found that simply training it on empty completions, single-line completions, and multiline
completions that end with a smaller text indent or at least a newline makes it much more usable. This data
made up the remaining 85% of the finetuning dataset.
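
As an illustration of a "multiline completion that ends with a smaller text indent", the sketch below cuts a FIM training sample out of a source file so that the middle stops right before the indentation drops. This is a hypothetical reconstruction for explanation only; the actual data pipeline is not described in this card, and all function names are made up.

```python
from typing import List, Tuple


def indent_of(line: str) -> int:
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def middle_until_dedent(lines: List[str], start: int) -> List[str]:
    """Lines from `start` up to (not including) the first non-blank line
    that is indented less than lines[start]."""
    base = indent_of(lines[start])
    end = start + 1
    while end < len(lines):
        line = lines[end]
        if line.strip() and indent_of(line) < base:
            break
        end += 1
    return lines[start:end]


def make_fim_sample(source: str, start: int) -> Tuple[str, str, str]:
    """Split source into (prefix, middle, suffix) with the middle ending at a dedent."""
    lines = source.splitlines(keepends=True)
    middle = middle_until_dedent(lines, start)
    prefix = "".join(lines[:start])
    suffix = "".join(lines[start + len(middle):])
    return prefix, "".join(middle), suffix


if __name__ == "__main__":
    code = (
        "def f(x):\n"
        "    if x:\n"
        "        y = 1\n"
        "        z = 2\n"
        "    return x\n"
    )
    prefix, middle, suffix = make_fim_sample(code, start=2)
    print(repr(middle))
```

An empty completion corresponds to an empty middle, and a single-line completion to a middle of exactly one line; both can be produced by choosing the cut points differently.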

The final model is the result of several attempts to make it work as well as possible for code completion
and to perform well on a wide range of metrics. The best attempt took 40B tokens.


	# Limitations and Bias

The Refact-1.6B model was trained on English text, although it has seen many more languages in
code comments. Its performance in languages other than English is lower.


	# Model Stats

- Architecture: LLaMA-like model with multi-query attention
- Objectives: Fill-in-the-Middle, Chat
- Context length: 4096 tokens
- Pretraining tokens: 1.2T
- Finetuning tokens: 40B
- Precision: bfloat16
- GPUs: 64 NVIDIA A5000
- Training time: 28 days
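
A quick back-of-the-envelope calculation from the stats above may help with capacity planning. It assumes the nominal 1.6B parameter count from the model name (the exact count may differ) and that all 64 GPUs ran for the full 28 days, which the card does not state explicitly.

```python
# Weight memory: bfloat16 stores each parameter in 2 bytes.
params = 1.6e9            # nominal count from the model name (assumption)
bytes_per_param = 2
weights_gb = params * bytes_per_param / 1e9

# Training budget, assuming every GPU was busy for the whole period.
gpus = 64
days = 28
gpu_hours = gpus * days * 24

print(f"weights in bfloat16: about {weights_gb:.1f} GB")
print(f"training budget: about {gpu_hours} GPU-hours")
```

The weight figure excludes activations and the key-value cache, so actual inference memory is somewhat higher.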


	# License

The model is licensed under the BigScience OpenRAIL-M v1 license agreement.


	# Citation

If you use this model, please link to this page.