metadata
library_name: transformers
datasets:
- julep-ai/samantha_finetune_dataset_03
language:
- en
Samantha
Technical notes
This model is trained on a specialized dataset and uses special sentinel tokens to demarcate conversations.
Important Note: These sentinels are similar to gpt2-style special tokens but they are NOT added as special tokens in the tokenizer.
Concepts:
- Each conversation consists of n "sections"
- Each section can be one of:
  - me: The model
  - person: The speaker
  - situation: Relevant background information to set the context of the conversation
  - thought: Thoughts generated by the model for parsing intermediate steps, etc.
  - information: External information added into the context by the system running the model
- The model and speaker sections can optionally include a name, like me (Samantha) or person (Dmitry)
Tokens:
- <|section|> token marks the start of a "section"
- <|endsection|> token marks the end of a "section". This is also set to be the default EOS token in the tokenizer
- These are both "special" tokens and are not split up by the tokenizer
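The section and token conventions above can be sketched as a small prompt builder. This is only an illustration and not tooling shipped with the model: the names format_section and build_prompt are hypothetical, and the newline placement (one newline after the section header, one between sections) is an assumption inferred from this card's example conversation.

```python
from typing import Iterable, Optional, Tuple

SECTION_START = "<|section|>"
SECTION_END = "<|endsection|>"

# Section kinds listed in this card; only "me" and "person" may carry a name.
SECTION_KINDS = {"me", "person", "situation", "thought", "information"}
NAMED_KINDS = {"me", "person"}


def format_section(kind: str, text: str, name: Optional[str] = None) -> str:
    """Render one closed section (hypothetical helper, not part of the model's code)."""
    if kind not in SECTION_KINDS:
        raise ValueError(f"unknown section kind: {kind!r}")
    if name is not None and kind not in NAMED_KINDS:
        raise ValueError(f"section kind {kind!r} does not take a name")
    header = f"{kind} ({name})" if name else kind
    # Assumption: the header and the body are separated by a single newline.
    return f"{SECTION_START}{header}\n{text}{SECTION_END}"


def build_prompt(
    sections: Iterable[Tuple[str, Optional[str], str]],
    reply_as: Optional[str] = None,
) -> str:
    """Join (kind, name, text) triples and leave a final "me" section open."""
    parts = [format_section(kind, text, name) for kind, name, text in sections]
    # The open section has no body and no end token, so the model fills it in.
    header = f"me ({reply_as})" if reply_as else "me"
    parts.append(f"{SECTION_START}{header}")
    return "\n".join(parts)
```

Leaving the last section open means generation should stop at the first <|endsection|>, which fits its role as the default EOS token.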
Example
<|section|>situation
I am talking to Diwank. I want to ask him about his food preferences.<|endsection|>
<|section|>person (Diwank)
Hey Samantha! What do you want to talk about?<|endsection|>
<|section|>me (Samantha)
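
A transcript in this format can be split back into its sections with a regular expression. The sketch below is an assumption-laden illustration rather than the authors' parser: Section and parse_sections are hypothetical names, and it assumes a single newline after each section header, as in the example.

```python
import re
from typing import List, NamedTuple, Optional


class Section(NamedTuple):
    kind: str            # e.g. "me", "person", "situation"
    name: Optional[str]  # optional speaker name, e.g. "Samantha"
    text: str            # section body ("" if the section is still open and empty)
    closed: bool         # True if an <|endsection|> token was seen


# Hypothetical pattern: header, optional " (name)", optional body, then the
# end token or the end of the string (for a trailing, still-open section).
_SECTION_RE = re.compile(
    r"<\|section\|>"
    r"(?P<kind>\w+)"
    r"(?: \((?P<name>[^)\n]*)\))?"
    r"(?:\n(?P<text>.*?))?"
    r"(?P<end><\|endsection\|>|\Z)",
    re.DOTALL,
)


def parse_sections(transcript: str) -> List[Section]:
    """Split a transcript into sections, keeping a trailing open section."""
    return [
        Section(m["kind"], m["name"], m["text"] or "", m["end"] != "")
        for m in _SECTION_RE.finditer(transcript)
    ]
```

A final entry with closed set to False corresponds to the open section the model is expected to complete.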