system (HF staff) committed on
Commit: e8449bc
Parent: b44be3a

Update README.md

Files changed (1):
  1. README.md (+13, -14)
README.md CHANGED
@@ -1,16 +1,16 @@
 ---
-language: "en"
+language: "[en]"
 datasets:
 - Spotify Podcasts Dataset
 tags:
 - t5
-- summarization
+- summarisation
 - pytorch
 - lm-head
 metrics:
 - ROUGE
 pipeline:
-- summarization
+-summarisation
 ---
 
 # T5 for Automatic Podcast Summarisation
@@ -26,24 +26,23 @@ Authors: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
 
 ## Intended uses & limitations
 This model is intended to be used for automatic podcast summarisation. As creator provided descriptions
-were used for training, the model also learned to generate promotional material in its summaries, as such
+were used for training, the model also learned to generate promotional material (links, hashtags, etc) in its summaries, as such
 some post processing may be required on the model's outputs.
 
+If using on Colab, the instance will crash if the number of tokens in the transcript exceeds 7000. I discovered that the model
+generated reasonable summaries even when the podcast transcript was truncated to reduce the number of tokens.
+
 #### How to use
-A 'summarize:' tag must be pre-pended to the source text before it is passed to the T5 model.
 
-```python
-from transformers import T5Tokenizer, T5ForConditionalGeneration
+The model can be used with the summarisation as follows:
 
-tokenizer = T5Tokenizer.from_pretrained('paulowoicho/t5-podcast-summarisation')
-model = T5ForConditionalGeneration.from_pretrained('paulowoicho/t5-podcast-summarisation')
+```python
+from transformers import pipeline
 
-podcast_transcript = 'summarize: ' + podcast_transcript
-tokens = tokenizer.encode(podcast_transcript, return_tensors="pt")
-summary_ids = model.generate(tokens, max_length=150, num_beams=2, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True)
-output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
+summarizer = pipeline("summarization", model="paulowoicho/t5-podcast-summarisation", tokenizer="paulowoicho/t5-podcast-summarisation")
+summary = summarizer(podcast_transcript, min_length=5, max_length=20)
 
-print(output)
+print(summary[0]['summary_text'])
 ```
 
 ## Training data
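
The new README's Colab note (crashes past roughly 7000 tokens, but truncated transcripts still summarise reasonably) implies trimming long transcripts before they reach the model. A minimal sketch of that workaround, using whitespace-delimited words as a stand-in for the model's real subword tokenizer (an assumption for illustration; actual token counts would come from the T5 tokenizer):

```python
# Cap the transcript length before summarisation, per the commit's Colab note.
# Whitespace splitting approximates tokenisation for illustration only.

MAX_TOKENS = 7000  # rough Colab limit mentioned in the updated README


def truncate_transcript(transcript: str, max_tokens: int = MAX_TOKENS) -> str:
    """Keep at most `max_tokens` whitespace-delimited tokens of the transcript."""
    tokens = transcript.split()
    return " ".join(tokens[:max_tokens])


long_transcript = "word " * 10_000          # a transcript well over the limit
short_transcript = truncate_transcript(long_transcript)
print(len(short_transcript.split()))        # 7000
```

The truncated text could then be passed to the summarisation pipeline shown in the diff in place of the full transcript.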