---
license: openrail
datasets:
- humarin/chatgpt-paraphrases
language:
- en
library_name: transformers
inference:
  parameters:
    num_beams: 5
    num_beam_groups: 5
    num_return_sequences: 5
    repetition_penalty: 10.01
    diversity_penalty: 3.01
    no_repeat_ngram_size: 2
    temperature: 0.7
    max_length: 128
widget:
- text: What are the best places to see in New York?
  example_title: New York tourist attractions
- text: When should I go to the doctor?
  example_title: Doctor's time
- text: >-
    Rammstein's album Mutter was recorded in the south of France in May and June
    2000, and mixed in Stockholm in October of that year.
  example_title: Rammstein's album Mutter
pipeline_tag: text2text-generation
---
<a href="https://www.patreon.com/HumarinAI">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/60e85cdf977b59e2fda5c46c/4tcUfRObyMrxirxTfvdXg.png" alt="Descriptive Text of Image" style="width: 60%; height: auto;">
</a>


Join the Humarin community on [Patreon](https://patreon.com/HumarinAI?utm_medium=unknown&utm_source=join_link&utm_campaign=creatorshare_creator&utm_content=copyLink) and help us conduct research to release more exciting models!

This model was trained on our [ChatGPT paraphrase dataset](https://huggingface.co/datasets/humarin/chatgpt-paraphrases).

The dataset is built from the [Quora question pairs](https://www.kaggle.com/competitions/quora-question-pairs) dataset, texts from [SQuAD 2.0](https://huggingface.co/datasets/squad_v2), and the [CNN news dataset](https://huggingface.co/datasets/cnn_dailymail).

This model is based on T5-base. We used transfer learning to make it generate paraphrases comparable in quality to ChatGPT's, and we believe it is one of the best paraphrasers available on Hugging Face.

[Kaggle](https://www.kaggle.com/datasets/vladimirvorobevv/chatgpt-paraphrases) link

[Author 1's LinkedIn](https://www.linkedin.com/in/vladimir-vorobev/) link

[Author 2's LinkedIn](https://www.linkedin.com/in/makual/) link

## Deployment example
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")

model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def paraphrase(
    question,
    num_beams=5,
    num_beam_groups=5,
    num_return_sequences=5,
    repetition_penalty=10.0,
    diversity_penalty=3.0,
    no_repeat_ngram_size=2,
    temperature=0.7,
    max_length=128
):
    # The input is prefixed with the "paraphrase: " task prompt before tokenization.
    input_ids = tokenizer(
        f'paraphrase: {question}',
        return_tensors="pt", padding="longest",
        max_length=max_length,
        truncation=True,
    ).input_ids.to(device)

    # Diverse beam search: num_beams beams split into num_beam_groups groups.
    # Note: temperature is only applied when sampling is enabled (do_sample=True).
    outputs = model.generate(
        input_ids, temperature=temperature, repetition_penalty=repetition_penalty,
        num_return_sequences=num_return_sequences, no_repeat_ngram_size=no_repeat_ngram_size,
        num_beams=num_beams, num_beam_groups=num_beam_groups,
        max_length=max_length, diversity_penalty=diversity_penalty
    )

    # Decode every returned sequence back into plain text.
    res = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return res
```
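
The default arguments of `paraphrase` use diverse (group) beam search, which restricts how the arguments can be combined. The helper below is a minimal sketch of those restrictions as we understand them. `check_generation_args` is a hypothetical name and is not part of `transformers`; the checks the library actually performs may differ between versions.

```python
def check_generation_args(num_beams=5, num_beam_groups=5,
                          num_return_sequences=5, diversity_penalty=3.0):
    """Return a list of problems found in a diverse beam search configuration.

    Hypothetical helper: it mirrors constraints we believe generate() applies,
    but it is not part of the transformers API.
    """
    problems = []
    # Beams are split evenly across groups.
    if num_beams % num_beam_groups != 0:
        problems.append("num_beams must be divisible by num_beam_groups")
    # Beam search cannot return more sequences than it tracks.
    if num_return_sequences > num_beams:
        problems.append("num_return_sequences cannot exceed num_beams")
    # The penalty is what pushes the groups apart.
    if num_beam_groups > 1 and diversity_penalty <= 0.0:
        problems.append("diversity_penalty must be positive when num_beam_groups > 1")
    return problems


print(check_generation_args())                                # defaults used by paraphrase()
print(check_generation_args(num_beams=5, num_beam_groups=2))  # uneven split
```

Running such a check before calling `paraphrase` with custom values can make a bad combination easier to spot than an error raised from inside `generate`.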

## Usage examples

**Input:**
```python
text = 'What are the best places to see in New York?'
paraphrase(text)
```
**Output:**
```python
['What are some must-see places in New York?',
 'Can you suggest some must-see spots in New York?',
 'Where should one go to experience the best NYC has to offer?',
 'Which places should I visit in New York?',
 'What are the top destinations to explore in New York?']
```

**Input:**
```python
text = "Rammstein's album Mutter was recorded in the south of France in May and June 2000, and mixed in Stockholm in October of that year."
paraphrase(text)
```
**Output:**
```python
['In May and June 2000, Rammstein travelled to the south of France to record his album Mutter, which was mixed in Stockholm in October of that year.',
 'The album Mutter by Rammstein was recorded in the south of France during May and June 2000, with mixing taking place in Stockholm in October of that year.',
 'The album Mutter by Rammstein was recorded in the south of France during May and June 2000, with mixing taking place in Stockholm in October of that year. It',
 'Mutter, the album released by Rammstein, was recorded in southern France during May and June 2000, with mixing taking place between October and September.',
 'In May and June 2000, Rammstein recorded his album Mutter in the south of France, with the mix being made at Stockholm during October.']
```
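
In the second output above, one candidate repeats another and only adds a dangling "It". The sketch below shows one possible way to filter such near-duplicates after generation. `clean_candidates` is a hypothetical helper, not something the model or `transformers` provides, and the prefix rule it uses is a simplifying assumption.

```python
def clean_candidates(candidates):
    """Drop exact duplicates and candidates that only extend another one.

    Hypothetical post-processing helper; not part of the model or transformers.
    """
    kept = []
    for text in candidates:
        text = text.strip()
        if not text:
            continue  # ignore empty strings
        # Skip a duplicate, or a candidate that starts with a kept one.
        if any(text.startswith(prev) for prev in kept):
            continue
        # Remove kept candidates that only extend the new, shorter one.
        kept = [prev for prev in kept if not prev.startswith(text)]
        kept.append(text)
    return kept


full = ("The album Mutter by Rammstein was recorded in the south of France "
        "during May and June 2000, with mixing taking place in Stockholm in "
        "October of that year.")
print(clean_candidates([full, full + " It", "Which places should I visit in New York?"]))
```

The prefix rule is deliberately crude: it would also drop a legitimate paraphrase that happens to begin with a shorter candidate, so a stricter similarity measure may be preferable in practice.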


## Training parameters
```python
epochs = 5
batch_size = 64
max_length = 128
lr = 5e-5
batches_qty = 196465
betas = (0.9, 0.999)
eps = 1e-08
```
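
The values above give a rough idea of the training volume. The sketch below works through the arithmetic under an assumption this card does not confirm: that `batches_qty` counts batches over the whole run rather than per epoch. If it is a per-epoch count, the totals are five times larger. The figures are upper bounds either way, since a final batch may be smaller than `batch_size`.

```python
epochs = 5
batch_size = 64
batches_qty = 196465

# Assumption (not stated in the card): batches_qty is the total number of
# batches across all epochs.
samples_seen = batches_qty * batch_size          # samples processed in total
batches_per_epoch = batches_qty // epochs        # batches in a single epoch
samples_per_epoch = batches_per_epoch * batch_size

print(samples_seen)       # 12573760
print(batches_per_epoch)  # 39293
print(samples_per_epoch)  # 2514752
```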

### BibTeX entry and citation info

```bibtex
@inproceedings{chatgpt_paraphraser,
  author={Vladimir Vorobev and Maxim Kuznetsov},
  title={A paraphrasing model based on ChatGPT paraphrases},
  year={2023}
}
```