Update README.md
README.md CHANGED
````diff
@@ -1,16 +1,16 @@
 ---
-language: "en"
+language: "[en]"
 datasets:
 - Spotify Podcasts Dataset
 tags:
 - t5
--
+- summarisation
 - pytorch
 - lm-head
 metrics:
 - ROUGE
 pipeline:
--
+- summarisation
 ---
 
 # T5 for Automatic Podcast Summarisation
@@ -26,24 +26,23 @@ Authors: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
 
 ## Intended uses & limitations
 This model is intended to be used for automatic podcast summarisation. As creator-provided descriptions
-were used for training, the model also learned to generate promotional material in its summaries, as such
+were used for training, the model also learned to generate promotional material (links, hashtags, etc.) in its summaries, so
 some post-processing may be required on the model's outputs.
 
+If run on Colab, the instance will crash if the number of tokens in the transcript exceeds 7000. I discovered that the model
+generated reasonable summaries even when the podcast transcript was truncated to reduce the number of tokens.
+
 #### How to use
-A 'summarize:' tag must be pre-pended to the source text before it is passed to the T5 model.
 
-```
-from transformers import T5Tokenizer, T5ForConditionalGeneration
-
-tokenizer = T5Tokenizer.from_pretrained("paulowoicho/t5-podcast-summarisation")
-model = T5ForConditionalGeneration.from_pretrained("paulowoicho/t5-podcast-summarisation")
-
-podcast_transcript = "summarize: " + podcast_transcript
-tokens = tokenizer.encode(podcast_transcript, return_tensors="pt")
-summary_ids = model.generate(tokens, max_length=150, num_beams=2, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True)
-output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
-
-print(output)
+The model can be used with the summarisation pipeline as follows:
+
+```python
+from transformers import pipeline
+
+summarizer = pipeline("summarization", model="paulowoicho/t5-podcast-summarisation", tokenizer="paulowoicho/t5-podcast-summarisation")
+summary = summarizer(podcast_transcript, min_length=5, max_length=20)
+
+print(summary[0]['summary_text'])
 ```
 
 ## Training data
````
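A note on the front matter in the first hunk: the Hugging Face Hub's model-card metadata expects a bare language code (or a YAML list of codes) under `language`, and the task id under `pipeline_tag`, so `language: "[en]"` and the `pipeline:` key are unlikely to be picked up as intended. A schema-conformant sketch, with field names taken from the Hub documentation and the free-text dataset label kept as the card has it:

```yaml
---
language: en
datasets:
- Spotify Podcasts Dataset
tags:
- t5
- summarisation
- pytorch
- lm-head
metrics:
- ROUGE
pipeline_tag: summarization
---
```

The task id `summarization` (American spelling) is the value the Hub's pipeline widget matches on.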
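Building on the Colab note above, here is a minimal sketch of pre-truncating a transcript before handing it to the pipeline. The `pipeline` call mirrors the new README; the `truncate_transcript` helper, the `AutoTokenizer` use, and the 7000-token budget are illustrative assumptions rather than part of the card:

```python
from transformers import AutoTokenizer, pipeline

MODEL = "paulowoicho/t5-podcast-summarisation"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
summarizer = pipeline("summarization", model=MODEL, tokenizer=MODEL)

def truncate_transcript(text, max_tokens=7000):
    # Hypothetical helper: keep only the first max_tokens tokens so a
    # Colab instance does not crash on very long transcripts.
    ids = tokenizer.encode(text, truncation=True, max_length=max_tokens)
    return tokenizer.decode(ids, skip_special_tokens=True)

podcast_transcript = "..."  # full transcript text goes here
summary = summarizer(truncate_transcript(podcast_transcript), min_length=5, max_length=20)
print(summary[0]["summary_text"])
```

Truncating with the model's own tokenizer keeps the cut aligned to token boundaries, which matches the token-count limit the card describes more closely than cutting on raw characters.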