---
library_name: transformers
datasets:
- julep-ai/samantha_finetune_dataset_03
language:
- en
---

## Samantha

### Technical notes

This model is trained on a specialized dataset and uses special sentinel tokens to demarcate conversations.

**Important Note: These sentinels are similar to gpt2-style special tokens but they are <u>NOT</u> added as special tokens in the tokenizer.**

**Concepts**:
- Each conversation consists of n "sections"
- Each section can be one of:
  + `me`: The model
  + `person`: The speaker
  + `situation`: Relevant background information that sets the context of the conversation
  + `thought`: Thoughts generated by the model, e.g. for working through intermediate steps
  + `information`: External information added into the context by the system running the model
- The model and speaker sections can optionally include a name like `me (Samantha)` or `person (Dmitry)`

**Tokens**:
- `<|section|>` token marks the start of a "section"
- `<|endsection|>` token marks the end of a "section". This is also set to be the default `EOS` token in the tokenizer
- Both tokens are "special" only in the sense that the tokenizer does not split them up (see the note above)
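
The section format described above can be assembled with plain string handling. The following is a minimal sketch, not part of any official tooling: the helper names `format_section` and `build_prompt` are hypothetical, and ending the prompt with an open `me` header followed by a newline is an assumption based on the layout of the example in this card.

```python
SECTION_START = "<|section|>"
SECTION_END = "<|endsection|>"

# Section types listed in this model card.
VALID_ROLES = {"me", "person", "situation", "thought", "information"}
# Only the model and speaker sections may carry a name.
NAMED_ROLES = {"me", "person"}


def format_section(role, content, name=None):
    """Render one closed section (hypothetical helper)."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown section type: {role!r}")
    if name is not None and role not in NAMED_ROLES:
        raise ValueError(f"section type {role!r} does not take a name")
    header = f"{role} ({name})" if name else role
    return f"{SECTION_START}{header}\n{content}{SECTION_END}"


def build_prompt(sections, reply_as=None):
    """Join closed sections, then open a `me` section for the model to complete.

    `sections` is a list of (role, name, content) tuples.
    Assumption: the open header is followed by a single newline.
    """
    parts = [format_section(role, content, name) for role, name, content in sections]
    open_header = f"me ({reply_as})" if reply_as else "me"
    parts.append(f"{SECTION_START}{open_header}")
    return "\n".join(parts) + "\n"


prompt = build_prompt(
    [
        ("situation", None, "I am talking to Diwank. I want to ask him about his food preferences."),
        ("person", "Diwank", "Hey Samantha! What do you want to talk about?"),
    ],
    reply_as="Samantha",
)
print(prompt)
```

Because `<|endsection|>` is the default `EOS` token, generation from such a prompt would be expected to stop once the model closes its own section.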

### Example

```
<|section|>situation
I am talking to Diwank. I want to ask him about his food preferences.<|endsection|>
<|section|>person (Diwank)
Hey Samantha! What do you want to talk about?<|endsection|>
<|section|>me (Samantha)
```
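
Going the other way, a transcript in this format can be split back into its sections. The sketch below is illustrative only: `Section` and `parse_sections` are hypothetical names, and the regular expression assumes each header sits on its own line, as in the example above. A trailing section that has not been closed with `<|endsection|>` is skipped.

```python
import re
from typing import List, NamedTuple, Optional


class Section(NamedTuple):
    role: str
    name: Optional[str]
    content: str


# Header line: role, optionally followed by " (name)"; content runs up to the end token.
SECTION_RE = re.compile(
    r"<\|section\|>(?P<role>\w+)(?: \((?P<name>[^)]*)\))?\n"
    r"(?P<content>.*?)<\|endsection\|>",
    re.DOTALL,
)


def parse_sections(text: str) -> List[Section]:
    """Return every closed section found in `text`, in order."""
    return [
        Section(m["role"], m["name"], m["content"])
        for m in SECTION_RE.finditer(text)
    ]


transcript = (
    "<|section|>situation\n"
    "I am talking to Diwank. I want to ask him about his food preferences.<|endsection|>\n"
    "<|section|>person (Diwank)\n"
    "Hey Samantha! What do you want to talk about?<|endsection|>\n"
    "<|section|>me (Samantha)\n"
)

for section in parse_sections(transcript):
    print(section.role, section.name, repr(section.content))
```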