diwank committed on
Commit 9334595
1 Parent(s): ad5c574

Create README.md

  1. README.md +40 -0
README.md ADDED
@@ -0,0 +1,40 @@
+ ---
+ library_name: transformers
+ datasets:
+ - julep-ai/samantha_finetune_dataset_03
+ language:
+ - en
+ ---
+
+ ## Samantha
+
+ ### Technical notes
+
+ This model is trained on a specialized dataset and uses special sentinel tokens to demarcate the sections of each conversation.
+
+ **Important note: These sentinels are similar to GPT-2-style special tokens, but they are <u>NOT</u> added as special tokens in the tokenizer.**
+
+ **Concepts**:
+ - Each conversation consists of n "sections"
+ - Each section can be one of:
+   + `me`: The model
+   + `person`: The speaker
+   + `situation`: Relevant background information that sets the context of the conversation
+   + `thought`: Thoughts generated by the model, e.g. for parsing intermediate steps
+   + `information`: External information added to the context by the system running the model
+ - The `me` and `person` sections can optionally include a name, like `me (Samantha)` or `person (Dmitry)`
+
+ **Tokens**:
+ - `<|section|>` marks the start of a "section"
+ - `<|endsection|>` marks the end of a "section". It is also set as the default `EOS` token in the tokenizer
+ - Both are "special" tokens and are not split up by the tokenizer
+
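The concepts and tokens above can be sketched as a small prompt builder. This is a minimal illustration, not the author's tooling: it assumes each section is written as `<|section|>`, the section label, a newline, the content, then `<|endsection|>`, that sections are joined by single newlines, and that the prompt ends with an open `me` section for the model to complete. The helper names `format_header` and `build_prompt` are hypothetical, and the exact whitespace the model expects is an assumption.

```python
from typing import List, Optional, Tuple

SECTION_START = "<|section|>"
SECTION_END = "<|endsection|>"

# Section kinds listed in the README.
VALID_ROLES = {"me", "person", "situation", "thought", "information"}


def format_header(role: str, name: Optional[str] = None) -> str:
    """Return the opening line of a section, e.g. '<|section|>me (Samantha)'."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown section role: {role!r}")
    label = f"{role} ({name})" if name else role
    return f"{SECTION_START}{label}"


def build_prompt(
    sections: List[Tuple[str, Optional[str], str]],
    reply_as: Optional[str] = None,
) -> str:
    """Join (role, name, content) sections and leave a `me` section open.

    Assumption: sections are separated by a single newline and the
    final header carries no trailing newline.
    """
    parts = [
        f"{format_header(role, name)}\n{content}{SECTION_END}"
        for role, name, content in sections
    ]
    parts.append(format_header("me", reply_as))
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        [
            ("situation", None, "I am talking to Diwank."),
            ("person", "Diwank", "Hey Samantha!"),
        ],
        reply_as="Samantha",
    )
    print(prompt)
```

Because `<|endsection|>` is described as the default `EOS` token, generation from such a prompt would be expected to stop once the model closes its own section.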
+ ### Example
+
+ ```
+ <|section|>situation
+ I am talking to Diwank. I want to ask him about his food preferences.<|endsection|>
+ <|section|>person (Diwank)
+ Hey Samantha! What do you want to talk about?<|endsection|>
+ <|section|>me (Samantha)
+ ```
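
Going the other way, a transcript in the format of the example can be split back into sections. The following is a hedged sketch using only the Python standard library; `Section` and `parse_sections` are hypothetical names, and the regular expression assumes well-formed text in which every parsed section is closed by `<|endsection|>`. An unterminated trailing section, such as the open `me (Samantha)` header in the example, is skipped.

```python
import re
from typing import List, NamedTuple, Optional


class Section(NamedTuple):
    role: str
    name: Optional[str]
    content: str


# Header: '<|section|>role' with an optional ' (Name)', then a newline.
# Body: everything up to the next '<|endsection|>' (non-greedy).
SECTION_RE = re.compile(
    r"<\|section\|>(?P<role>\w+)(?: \((?P<name>[^)\n]*)\))?\n"
    r"(?P<content>.*?)<\|endsection\|>",
    re.DOTALL,
)


def parse_sections(text: str) -> List[Section]:
    """Return every closed section found in `text`, in order."""
    return [
        Section(m["role"], m["name"], m["content"])
        for m in SECTION_RE.finditer(text)
    ]


if __name__ == "__main__":
    transcript = (
        "<|section|>situation\n"
        "I am talking to Diwank.<|endsection|>\n"
        "<|section|>person (Diwank)\n"
        "Hey Samantha!<|endsection|>\n"
        "<|section|>me (Samantha)"
    )
    for section in parse_sections(transcript):
        print(section)
```

Appending the model's generated text (including its closing `<|endsection|>`) to the prompt before parsing would yield the reply as the final `me` section.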