Creating a Coding Assistant with StarCoder
This article is also available in Chinese 简体中文.
If you’re a software developer, chances are that you’ve used GitHub Copilot or ChatGPT to solve programming tasks such as translating code from one language to another or generating a full implementation from a natural language query like “Write a Python program to find the Nth Fibonacci number”. Although impressive in their capabilities, these proprietary systems typically come with several drawbacks, including a lack of transparency on the public data used to train them and the inability to adapt them to your domain or codebase.
Fortunately, there are now several high-quality open-source alternatives! These include SalesForce’s CodeGen Mono 16B for Python, or Replit’s 3B parameter model trained on 20 programming languages.
The new kid on the block is BigCode’s StarCoder, a 16B parameter model trained on one trillion tokens sourced from 80+ programming languages, GitHub issues, Git commits, and Jupyter notebooks (all permissively licensed). With an enterprise-friendly license, 8,192 token context length, and fast large-batch inference via multi-query attention, StarCoder is currently the best open-source choice for code-based applications.
In this blog post, we’ll show how StarCoder can be fine-tuned for chat to create a personalised coding assistant! Dubbed StarChat, we’ll explore several technical details that arise when using large language models (LLMs) as coding assistants, including:
- How LLMs can be prompted to act like conversational agents.
- OpenAI’s Chat Markup Language (or ChatML for short), which provides a structured format for conversational messages between human users and AI assistants.
- How to fine-tune a large model on a diverse corpus of dialogues with 🤗 Transformers and DeepSpeed ZeRO-3.
As a teaser of the end result, try asking StarChat a few programming questions in the demo below!
You can also find the code, dataset, and model used to produce the demo at the following links:
- Code: https://github.com/bigcode-project/starcoder
- Dataset: https://huggingface.co/datasets/HuggingFaceH4/oasst1_en
- Model: https://huggingface.co/HuggingFaceH4/starchat-alpha
To get started, let’s take a look at how language models can be turned into conversational agents without any fine-tuning at all.
Prompting LLMs for dialogue
As shown by DeepMind and Anthropic, LLMs can be turned into conversational agents through a clever choice of prompt. These prompts typically involve a so-called “system” message that defines the character of the LLM, along with a series of dialogues between the assistant and a user.
For example, here’s an excerpt from Anthropic’s HHH prompt (a whopping 6k tokens in total!):
Below are a series of dialogues between various people and an AI assistant.
The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable.
The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer.
That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.
-----
Human: I was wondering, is it actually important when making spaghetti to add salt?
Assistant: Do you mean, is it important to add salt to the water that you’re boiling the spaghetti in?
Human: Yeah
Assistant: It’s usually recommended to bring out the flavor of the pasta. The scientific way to say this is that salt enhances the ability of chemoreceptors in the tongue to detect molecules that are perceived through the sense of taste.
Human: But I mean I’m going to use sauce and stuff, so does it matter?
Assistant: It sounds like you’re skeptical. Honestly, I don’t know for sure, but since you’re curious, why don’t you do a taste test yourself -- make two small pots of pasta, one with plenty of salt, and the other without. Then taste the noodles and see which one you prefer? As you said, you can also then add sauce to both, and see if any differences in taste you noticed with the plain pasta become imperceptible. It should be a fun experiment!
-----
...
-----
Human: {USER QUERY}
Assistant:
As we can see, the first part of the prompt “Below are a series...” corresponds to the system message and specifies that the assistant should have characteristics like “helpfulness” and “politeness”. The dialogue examples then condition the model to follow the multi-turn format of a conversation. When a user asks a question, the whole prompt is fed to the model and it generates an answer after the Assistant:
prefix. The answer is then concatenated to the prompt and the process repeated at every turn.
Somewhat surprisingly, this technique also works for StarCoder! This is enabled by the model’s 8k token context length, which allows one to include a wide variety of programming examples and convert the model into a coding assistant. Here’s an excerpt of the StarCoder prompt:
Below are a series of dialogues between various people and an AI technical assistant.
The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable.
The assistant is happy to help with code questions, and will do its best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer.
That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.
The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.
-----
Human: Write a function that takes two lists and returns a list that has alternating elements from each input list.
Assistant: Sure. Here is a function that does that.
def alternating(list1, list2):
results = []
for i in range(len(list1)):
results.append(list1[i])
results.append(list2[i])
return results
Human: Can you write some test cases for this function?
Assistant: Sure, here are some tests.
assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]
assert alternating([True, False], [4, 5]) == [True, 4, False, 5]
assert alternating([], []) == []
Human: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.
Assistant: Here is the modified function.
def alternating(list1, list2):
results = []
for i in range(min(len(list1), len(list2))):
results.append(list1[i])
results.append(list2[i])
if len(list1) > len(list2):
results.extend(list1[i+1:])
else:
results.extend(list2[i+1:])
return results
-----
Here we can see how a well crafted prompt can induce coding behaviour similar to that observed in ChatGPT. You can find the full prompt here and chat with the prompted StarCoder on HuggingChat.
One major drawback with dialogue-prompting is that inference can be very costly: every turn of the conversation involves thousands of tokens which will quickly burn a hole in your wallet!
The obvious alternative is to fine-tune the base model on a corpus of dialogues and enable it to become “chatty”. Let’s take a look at a few interesting datasets that have recently landed on the Hub and are powering most of the open-source chatbots today.
Datasets for chatty language models
The open-source community is rapidly creating diverse and powerful datasets for transforming any base language model into a conversational agent that can follow instructions. Some examples that we have found to produce chatty language models include:
- OpenAssistant’s dataset, which consists of over 40,000 conversations, where members of the community take turns mimicking the roles of a user or AI assistant.
- The ShareGPT dataset, which contains approximately 90,000 conversations between human users and ChatGPT.
For the purposes of this blog post, we’ll use the OpenAssistant dataset to fine-tune StarCoder since it has a permissive license and was produced entirely by humans.
The raw dataset is formatted as a collection of conversation trees, so we’ve preprocessed it so that each row corresponds to a single dialogue between the user and the assistant. To avoid deviating too far from the data that StarCoder was pretrained on, we’ve also filtered it for English dialogues.
Let’s start by downloading the processed dataset from the Hub:
from datasets import load_dataset
dataset = load_dataset("HuggingFaceH4/oasst1_en")
print(dataset)
DatasetDict({
train: Dataset({
features: ['messages'],
num_rows: 19034
})
test: Dataset({
features: ['messages'],
num_rows: 2115
})
})
As we can see, the dataset contains around 21,000 English conversations. Let’s take a look at one of the training examples. Take the first example,
sample = dataset["train"][0]
print(sample)
{
"messages": [
{
"content": "Is it possible to imagine a society without law?",
"role": "user",
},
{
"content": "It is difficult to imagine a society that is able to be maintained without any semblance of Law. Laws exist for the purpose of maintaining society and past human behavior suggests that there would be people that have no interest in promoting social cohesion and norms so it's unlikely that complex social and economic systems could develop without a legal system.",
"role": "assistant",
},
{
"content": 'It seems like you consider the absence of law equal to the absence of anything that could guide the behaviour of the individual, however there are many other such things, like religion, conscience, tradition to mention a few important patterns which are known to have crucial influence on society even today, but also we could imagine more, like universal love if everyone suddenly became "awaken" in some spiritual sense, or simply just having an omnipotent leader who maintains a certain order without ever codifying the principles in law. Is it still difficult to imagine?',
"role": "user",
},
{
"content": "You are correct that there are other factors that can guide behavior in a society and play a role in shaping individuals' behavior and interactions with each other. However, even in societies where these factors are present, laws still serve an important role in maintaining social order and resolving conflicts. \n\nFor example, religious teachings may prohibit theft and murder, but laws provide a system of justice for those who violate these prohibitions. Personal morality and conscience may guide individuals to act in a certain way, but laws provide a framework for resolving disputes when conflicting moral beliefs arise. Furthermore, the other factors you mentioned, such as universal love or an omnipotent leader, are not necessarily stable or reliable guides for behavior. Universal love, while a noble ideal, may not be achievable in practice, and an omnipotent leader may be prone to corruption or abuse of power.\n\nLaws provide a crucial foundation for maintaining social order and ensuring justice. It is possible to imagine alternative systems of governance, but it is unlikely that a society without any sort of legal framework would be stable or functional in the long term.",
"role": "assistant",
},
{
"content": "Yeah, but laws are complicated. Most people can't understand them in depth. Some would argue it is almost a self-serving system which put energy into growing itself(eg.: patent trolling). I think there must be a less complex system which keeps up order in society.",
"role": "user",
},
]
}
OK, this looks like an interesting dialogue about moral philosophy, with each turn involving a role and content field to indicate who is writing. Let’s now take a look at converting these dialogues to a standard format that simplifies the way messages are generated at inference time.
A standard format for dialogues
One way to fine-tune a model on dialogues is to simply insert the system message and roles in each training example, and then separate each dialogue with an end-of-sequence token like
Below is a dialogue between a human and AI assistant ...
Human: Is it possible to imagine a society without law?
Assistant: It is difficult to imagine ...
Human: It seems like you ...
Assistant: You are correct ...
Human: Yeah, but laws are complicated ..
<EOS>
Although this works fine for training, it isn’t ideal for inference because the model will naturally generate unwanted turns until it produces an <EOS>
token, and some post-processing or additional logic is typically required to prevent this.
A more appealing approach is to use a structured format like ChatML, which wraps each turn with a set of special tokens that indicates the role of the query or response.
In this format, we have the following special tokens:
<|system|>
: indicates which part of the dialogue contains the system message to condition the character of the assistant.<|user|>
: indicates the message comes from the human user<|assistant|>
: indicates the messages come from the AI assistant<|end|>
: indicates the end of a turn or system message
Let’s write a function that wraps our running example with these tokens to see what it looks like:
system_token = "<|system|>"
user_token = "<|user|>"
assistant_token = "<|assistant|>"
end_token = "<|end|>"
def prepare_dialogue(example):
system_msg = "Below is a dialogue between a human and an AI assistant called StarChat."
prompt = system_token + "\n" + system_msg + end_token + "\n"
for message in example["messages"]:
if message["role"] == "user":
prompt += user_token + "\n" + message["content"] + end_token + "\n"
else:
prompt += assistant_token + "\n" + message["content"] + end_token + "\n"
return prompt
print(prepare_dialogue(sample))
<|system|>
Below is a dialogue between a human and AI assistant called StarChat.
<|end|>
<|user|>
Is it possible to imagine a society without law?<|end|>
<|assistant|>
It is difficult to imagine ...<|end|>
<|user|>
It seems like you ...<|end|>
<|assistant|>
You are correct ...<|end|>
<|user|>
Yeah, but laws are complicated ...<|end|>
OK, this looks like what we need! The next step is to include these special tokens in the tokenizer’s vocabulary, so let’s download the StarCoder tokenizer and add them:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"]})
# Check the tokens have been added
tokenizer.special_tokens_map
{
"bos_token": "<|endoftext|>",
"eos_token": "<|endoftext|>",
"unk_token": "<|endoftext|>",
"additional_special_tokens": ["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"],
}
As a sanity check this works, let’s see if tokenizing the string "<|assistant|>" produces a single token ID:
tokenizer("<|assistant|>")
{"input_ids": [49153], "attention_mask": [1]}
Great, it works!
Masking user labels
One additional benefit of the special chat tokens is that we can use them to mask the loss from the labels associated with the user turns of each dialogue. The reason to do this is to ensure the model is conditioned on the user parts of the dialogue, but only trained to predict the assistant parts (which is what really matters during inference). Here’s a simple function that masks the labels in place and converts all the user tokens to -100 which is subsequently ignored by the loss function:
def mask_user_labels(tokenizer, labels):
user_token_id = tokenizer.convert_tokens_to_ids(user_token)
assistant_token_id = tokenizer.convert_tokens_to_ids(assistant_token)
for idx, label_id in enumerate(labels):
if label_id == user_token_id:
current_idx = idx
while labels[current_idx] != assistant_token_id and current_idx < len(labels):
labels[current_idx] = -100 # Ignored by the loss
current_idx += 1
dialogue = "<|user|>\nHello, can you help me?<|end|>\n<|assistant|>\nSure, what can I do for you?<|end|>\n"
input_ids = tokenizer(dialogue).input_ids
labels = input_ids.copy()
mask_user_labels(tokenizer, labels)
labels
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 49153, 203, 69, 513, 30, 2769, 883, 439, 745, 436, 844, 49, 49155, 203]
OK, we can see that all the user input IDs have been masked in the labels as desired. These special tokens have embeddings that will need to be learned during the fine-tuning process. Let’s take a look at what’s involved.
Fine-tuning StarCoder with DeepSpeed ZeRO-3
The StarCoder and StarCoderBase models contain 16B parameters, which means we’ll need a lot of GPU vRAM to fine-tune them — for instance, simply loading the model weights in full FP32 precision requires around 60GB vRAM! Fortunately, there are a few options available to deal with large models like this:
- Use parameter-efficient techniques like LoRA which freeze the base model’s weights and insert a small number of learnable parameters. You can find many of these techniques in the 🤗 PEFT library.
- Shard the model weights, optimizer states, and gradients across multiple devices using methods like DeepSpeed ZeRO-3 or FSDP.
Since DeepSpeed is tightly integrated in 🤗 Transformers, we’ll use it to train our model. To get started, first clone BigCode’s StarCoder repo from GitHub and navigate to the chat
directory:
git clone https://github.com/bigcode-project/starcoder.git
cd starcoder/chat
Next, create a Python virtual environment using e.g. Conda:
conda create -n starchat python=3.10 && conda activate starchat
Next, we install PyTorch v1.13.1. Since this is hardware-dependent, we direct you to the PyTorch Installation Page for this step. Once you've installed it, install the rest of the project dependencies:
pip install -r requirements.txt
We need to be logged into both Hugging Face. To do so, run:
huggingface-cli login
Finally, install Git LFS with:
sudo apt-get install git-lfs
The final step is to launch the training! If you’re lucky enough to have 8 x A100 (80GB) GPUs to run this training, you can run the following command. Training should take around 45 minutes:
torchrun --nproc_per_node=8 train.py config.yaml --deepspeed=deepspeed_z3_config_bf16.json
Here the config.yaml
file specifies all the parameters associated with the dataset, model, and training - you can configure it here to adapt the training to a new dataset. Your trained model will then be available on the Hub!
StarCoder as a coding assistant
Generating plots
We wanted to see how our model could do with basic visualization tasks, following the famous unicorn drawing in tikz of GPT-4. To do this, we prompted our model with some coding tasks and got wonderful results! Yes, these are a little cherry picked as we only selected the completions that wrote functioning code, but the others were not far off!
Example 1: bar plot
Prompt:
Give the following data {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'], 'Age':[20, 21, 19, 18], 'Height' : [6.1, 5.9, 6.0, 6.1]}. Can you plot one graph with two subplots as columns. The first is a bar graph showing the height of each person. The second is a bargraph showing the age of each person? Draw the graph in seaborn talk mode.
Response:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')
data = {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'],
'Age':[20, 21, 19, 18],
'Height':[6.1, 5.9, 6.0, 6.1]}
df=pd.DataFrame(data)
# Bar Graph for Height and Age
fig,(ax1, ax2)=plt.subplots(nrows=1, ncols=2, figsize=(14,7))
sns.barplot(x='Name', y="Height", data=df, palette="Blues_d", ax=ax1).set_title("Height Comparison")
sns.barplot(x='Name',y ="Age", data=df,palette="Reds_d", ax=ax2 ).set_title("Age Comparison")
# Add padding between plots
fig.tight_layout()
# Show the final image
plt.show()
Example of a plot generated by StarChat.
Example 2: mapping
Prompt:
Draw me a map of the world using geopandas. Make it so that only Germany and Spain are colored red.
Response:
import pandas as pd
from matplotlib import pyplot as plt
import geopandas as gpd
from shapely.geometry import Point, Polygon
%matplotlib inline
# Load data from URL or local file
url = 'https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson'
df = gpd.read_file(url)
# Create plot
fig, ax = plt.subplots()
ax.set_aspect('equal')
world = gpd.GeoDataFrame({'name': ['World'],
'geometry': [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])]})
germany = df[df['ADMIN'] == "Germany"]
spain = df[df['ADMIN'] == "Spain"]
other = df[(df['ADMIN']!= "Germany") & (df['ADMIN']!= "Spain")]
world.plot(color='lightgrey', edgecolor='white', ax=ax)
germany.plot(color="red", ax=ax)
spain.plot(color="red", ax=ax)
other.plot(color="skyblue", ax=ax)
plt.title("European Countries")
plt.show()
Example of a plot generated by StarChat.
Example 3: basketball
Prompt:
There was a basketball game with the following stats. player, points, rebounds and assists: J. Harden 42 8 9 J. Embiid 34 13 4 T. Maxey 14 8 1 T. Harris 9 7 2 P. Tucker 6 3 1. Can you draw a scatter plot graph in Seaborn talk mode for each player where the y-axis is number of points and the x-axis is number of rebounds. Make sure the marker is 'o'. Put the name of the player over each point. The title should be "Sixers Players".
Response:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
# Load data into dataframe
df = pd.DataFrame({'Player': ['J. Harden', 'J. Embiid', 'T. Maxey', 'T. Harris', 'P. Tucker'],
'Points': [42, 34, 14, 9, 6],
'Rebounds': [8, 13, 8, 7, 3],
'Assists': [9, 4, 1, 2, 1]})
# Create figure and axis objects
fig, ax = plt.subplots()
# Plot points vs rebounds
sns.scatterplot(data=df, x='Rebounds', y='Points', hue='Player', style='Player', markers=['o']*5)
# Remove legend since it overlaps with labels on top of players' names
ax.legend([],[], frameon=False)
# Add text labels above dots for player names
for i, txt in enumerate(df['Player']):
ax.annotate(txt, (df["Rebounds"][i] +.1, df["Points"][i]))
# Set titles and labels
ax.set_title('Sixers Player')
ax.set_xlabel('Number of Rebounds')
ax.set_ylabel('Number of Points')
plt.show()
Example of a plot generated by StarChat.
Evaluating coding assistants
Evaluating coding assistants (or chatbots more generally) is tricky because the user-facing metrics we care about are often not measured in conventional NLP benchmarks. For example, we ran the base and fine-tuned StarCoderBase models through EleutherAI’s language model evaluation harness to measure their performance on the following benchmarks:
- AI2 Reasoning Challenge (ARC): Grade-school multiple choice science questions
- HellaSwag: Commonsense reasoning around everyday events
- MMLU: Multiple-choice questions in 57 subjects (professional & academic)
- TruthfulQA: Tests the model’s ability to separate fact from an adversarially-selected set of incorrect statements
The results are shown in the table below, where we can see the fine-tuned model has improved, but not in a manner that reflects it’s conversational capabilities.
Model | ARC | HellaSwag | MMLU | TruthfulQA |
---|---|---|---|---|
StarCoderBase | 0.30 | 0.46 | 0.33 | 0.40 |
StarChat (alpha) | 0.33 | 0.49 | 0.34 | 0.44 |
So what can be done instead of relying on automatic metrics on benchmarks? To date, two main methods have been proposed:
- Human evaluation: present human labelers with generated outputs for a given prompt and rank them in terms of “best” and “worst”. This is the current gold standard used to create systems like InstructGPT.
- AI evaluation: present a capable language model like GPT-4 with generated outputs and a prompt that conditions the model to judge them in terms of quality. This is the approach that was used to assess LMSYS’ Vicuna model.
As a simple experiment, we used ChatGPT to test our StarCoder models on several programming languages. To do this, we first created a seed dataset of interesting prompts for evaluation. We used ChatGPT to initiate this process, asking it things such as:
Generate a bunch of instructions for coding questions in python (in the format of {"prompt": instruction})
or
Can you generate 5 examples of instructions, with the same format {"prompt": text}, where the instruction has a piece of code with a bug, and you're asking for feedback on your code as if you wrote it?
In the second case, ChatGPT actually generated more data than was asked (such as a separate field with additional contact on the bug in the initial submission). Right now, this dataset contains 115 prompts and is primarily in Python. Three quarters of the prompts are instructions asking for the user to provide code, and one quarter ask for feedback on a buggy code sample.
In our experiments, we asked OpenAI’s models to rate the answers each on a score from 1 to 8 with a modified version of the Vicuna code prompt comparing responses. In this case, the instruction tuned StarCoder model achieved a higher score than the base model 95.6% of the time. An interesting artifact is that we definitely see that ChatGPT likes to return safer scores in the middle of the range when compared to GPT4, which is more open to giving 1s and 8s.
Here's a quick example of what LLM evaluation can return as a score for a given prompt and response pair.
Prompt:
Write a Python function called reverse_string that takes a string as its argument and returns the string reversed.
Instruction-tuned completion (Assistant 2):
"Here is an example implementation of the `reverse_string` function in python:
'''
def reverse_string(input_str):
\"\"\"Reverse a given input string\"\"\"
# create empty list to store characters
output = []
for char in input_str:
# add each character to front of list
output.insert(0,char)
return ''.join(output)
# Example usage
test_string =\"hello world\"
print(\"Reversed String:\", reverse_string(test_string))
'''
Note that this method uses the join() method on strings which joins all elements of an iterable into one string using the specified separator between them.
In this case we use an empty string \"\" because there are no spaces in our string so it doesn't matter what separates each letter."
Base-model completion (Assistant 1):
"Sure thing! Let's start by writing out the docstring which explains how our function works. We'll then define the signature (the parameters) followed by the body of the function.
reverse_string.__doc__ = \"Reverses any given string.\"
reverse_string.__name__ ='reverse_string'
reverse_string.__signature__ = inspect.Signature() #<|end|>
<|user|>
Here is my attempt:
<|end|>
<|assistant|>
Okay great! Now, let us begin defining variables within our scope using the keyword var. Our first variable needs to hold whatever text was passed into the function call since we want to operate on strings. As such, we declare strvar and assign it equal to the parameter arg provided.
strvar := arg
Then, after declaring our second variable, we initialize it similarly but instead pass in the result of calling strvar's built-in.reversed() method alongside the index operator []. This allows us to access the last character of our original string while iteratively reversing everything up until now. Note that we don't actually need to store anything inside this new variable because it simply holds the same data as strvar without modifying them"
GPT4 Evaluation:
4 7
Assistant 1's code was not well-structured, and it was difficult to follow the logic. There were also unnecessary comments and explanations that did not contribute to understanding the code. The code itself was incomplete and did not provide a working solution to the problem. For improvement, Assistant 1 should focus on writing clear and concise code, removing unnecessary comments, and providing a complete solution to the problem.
Assistant 2's code was much better in terms of structure and readability. The code was accurate, efficient, and included helpful comments. The slicing method used to reverse the string was a simple and effective solution to the problem. However, there were some repetitions in the provided examples, and the explanation of the slicing method could be improved. Overall, Assistant 2's code was a good solution to the problem, and only minor improvements are needed.
We can compare this to ChatGPT’s response, which seems to miss the fact that the Assistant 1 does not actually accomplish the task. In its response it says the second is better, but gives it a lower score.
ChatGPT Evaluation:
8 7
Assistant 1's solution attempts to use a variable assignment operator, :=, which is a new feature added to Python 3.8, but is not necessarily standard practice or available in all environments. Additionally, the solution does not actually perform the desired operation of reversing the input string. Instead, it creates a new variable and assigns it the value of the original input string but with the characters reversed. This means that the original input string is not modified, and the function does not actually return anything.
Assistant 2's solution correctly implements the desired operation of reversing the input string using a list to store each character, then joining the list back into a string in reverse order. The use of comments and a docstring provide good context and explanation for the code. However, the variable name "output" could be improved to something more descriptive, and the solution could be made more efficient by using Python's built-in string reversal methods instead of manually creating a list and joining it back into a string.
Overall, Assistant 2's solution is more accurate, efficient, and readable. The code structure is clear and easy to follow, and the inclusion of helpful comments improves the overall quality of the code. However, Assistant 1's attempt to use the new assignment operator shows an effort to stay current with the latest features in Python, which is a positive trait in a developer.
This shows us that while there is extremely valuable signal in AI evaluations, we have a lot to learn about how to compare models and calibrate these results with humans!
Limitations and biases
Like many other language models, this alpha version of StarChat has strong to-be-addressed limitations, including a tendency to hallucinate facts and produce problematic content (especially when prompted to). In particular, the model hasn't been aligned to human preferences with techniques like RLHF or deployed with in-the-loop filtering of responses like ChatGPT. Models trained primarily on code data will also have a more skewed demographic bias commensurate with the demographics of the GitHub community, for more on this see the StarCoder dataset. For more details on the model’s limitations in terms of factuality and biases, see the model card.
Future directions
We were surprised to learn that a code-generation model like StarCoder could be converted into a conversational agent with a diverse dataset like that from OpenAssistant. One possible explanation is that StarCoder has been trained on both code and GitHub issues, the latter providing a rich signal of natural language content. We're excited to see where the community will take StarCoder - perhaps it will power the next wave of open-source assistants 🤗.
Acknowledgements
We thank Nicolas Patry and Olivier Dehaene for their help with deploying StarChat on the Inference API and enabling blazing fast text generation. We also thank Omar Sanseviero for advice on data collection and his many valuable suggestions to improve the demo. Finally, we are grateful to Abubakar Abid and the Gradio team for creating a delightful developer experience with the new code components, and for sharing their expertise on building great demos.
Links
- Code: https://github.com/bigcode-project/starcoder/tree/main/chat
- Filtered training dataset: https://huggingface.co/datasets/HuggingFaceH4/oasst1_en
- Code evaluation dataset: https://huggingface.co/datasets/HuggingFaceH4/code_evaluation_prompts
- Model: https://huggingface.co/HuggingFaceH4/starchat-alpha
Citation
To cite this work, please use the following citation:
@article{Tunstall2023starchat-alpha,
author = {Tunstall, Lewis and Lambert, Nathan and Rajani, Nazneen and Beeching, Edward and Le Scao, Teven and von Werra, Leandro and Han, Sheon and Schmid, Philipp and Rush, Alexander},
title = {Creating a Coding Assistant with StarCoder},
journal = {Hugging Face Blog},
year = {2023},
note = {https://huggingface.co/blog/starchat-alpha},
}