Word timestamps and "return_language" at the same time
#31, opened by Oscaarjs
When setting up a Whisper pipeline like this:
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model=self.model,
    tokenizer=self.processor.tokenizer,
    feature_extractor=self.processor.feature_extractor,
    torch_dtype=self.torch_dtype,
    device=self.device,
)
and then calling it like this:
pipe(
    file_path,
    chunk_length_s=30,
    batch_size=1,
    return_timestamps="word",
    return_language=True,
)
yields the output:
{'text': "...", 'chunks': [{'text': ' My', 'timestamp': (0.0, 0.42)}, {'text': ' name', 'timestamp': (0.42, 0.56)}, {'text': ' is', 'timestamp': (0.56, 0.76)}, {'text': ' Sojdri,', 'timestamp': (0.76, 1.38)}, {'text': " I'm", 'timestamp': (1.38, 1.68)}, {'text': ' 12', 'timestamp': (1.68, 1.9)}, {'text': ' years', 'timestamp': (1.9, 2.14)}, {'text': ' old,', 'timestamp': (2.14, 2.68)}, {'text': ' I', 'timestamp': (2.68, 2.82)}, {'text': ' love', 'timestamp': (2.82, 3.04)}, {'text': ' my', 'timestamp': (3.04, 3.3)}, {'text': ' mom,', 'timestamp': (3.3, 3.8)}, {'text': ' my', 'timestamp': (3.8, 3.9)}, {'text': ' dad,', 'timestamp': (3.9, 4.3)}, {'text': ' my', 'timestamp': (4.3, 4.32)}, {'text': ' older', 'timestamp': (4.32, 4.6)}, {'text': ' brother,', 'timestamp': (4.6, 5.12)}, {'text': ' Ryan,', 'timestamp': (5.12, 5.44)}, {'text': " who's", 'timestamp': (5.44, 5.54)}, {'text': ' 16', 'timestamp': (5.54, 5.82)}, {'text': ' years', 'timestamp': (5.82, 6.08)}, {'text': ' old.', 'timestamp': (6.08, 6.62)}, {'text': ' My', 'timestamp': (6.68, 6.86)}, {'text': ' favorite', 'timestamp': (6.86, 7.16)}, {'text': ' subject', 'timestamp': (7.16, 7.66)}, {'text': ' is', 'timestamp': (7.66, 8.16)}, {'text': ' history,', 'timestamp': (8.16, 9.28)}, {'text': ' and', 'timestamp': (9.28, 9.44)}, {'text': ' my', 'timestamp': (9.44, 9.62)}, {'text': ' favorite', 'timestamp': (9.62, 9.82)}, {'text': ' sport', 'timestamp': (9.82, 10.3)}, {'text': ' is', 'timestamp': (10.3, 10.6)}, {'text': ' hockey.', 'timestamp': (10.6, 12.38)}]}
where the language is missing.
But if I do:
pipe(
    file_path,
    chunk_length_s=30,
    batch_size=1,
    return_timestamps=True,
    return_language=True,
)
I get:
{'text': "...", 'chunks': [{'language': 'english', 'timestamp': (0.0, 6.48), 'text': " My name is Sojdri, I'm 12 years old, I love my mom, my dad, my older brother, Ryan, who's 16 years old."}, {'language': 'english', 'timestamp': (6.48, 11.52), 'text': ' My favorite subject is history, and my favorite sport is hockey.'}]}
in which the language is preserved.
Is this a bug, or is there something that prevents both options from working at the same time?
This is a bug, but I don't know when it will be fixed. Anyway, I am impressed that you are just a kid and delving into this. Big hugs 🤗
Well, the bug is still there.
Same problem here. Has anyone found a solution?
I ran into this bug as well and opened an issue with ideas on how to fix it at https://github.com/huggingface/transformers/issues/29520
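Until the bug is fixed, one possible workaround (not confirmed anywhere in this thread) is to call the pipeline twice, once with return_timestamps="word" and once with return_timestamps=True plus return_language=True, and then copy each segment's language onto the words that start inside it. The sketch below is only an illustration under those assumptions: attach_language is a hypothetical helper, not part of transformers, and it works on plain lists shaped like the 'chunks' outputs shown above. It costs a second transcription pass, and it assumes both passes produce timestamps on the same timeline.

```python
from typing import Any, Dict, List

Chunk = Dict[str, Any]


def attach_language(word_chunks: List[Chunk], segment_chunks: List[Chunk]) -> List[Chunk]:
    """Return copies of word_chunks with a 'language' key taken from the
    segment whose time span contains the word's start time.

    Hypothetical helper for illustration; words that match no segment
    (or have no start time) get language=None.
    """
    tagged = []
    for word in word_chunks:
        start = word["timestamp"][0]
        language = None
        if start is not None:
            for segment in segment_chunks:
                seg_start, seg_end = segment["timestamp"]
                # A missing end timestamp is treated as an open-ended segment.
                if seg_start <= start and (seg_end is None or start < seg_end):
                    language = segment.get("language")
                    break
        # Build a new dict so the original pipeline output is left untouched.
        tagged.append({**word, "language": language})
    return tagged


# Assumed usage, with `pipe` and `file_path` as in the original post:
#   words = pipe(file_path, chunk_length_s=30, batch_size=1,
#                return_timestamps="word")["chunks"]
#   segments = pipe(file_path, chunk_length_s=30, batch_size=1,
#                   return_timestamps=True, return_language=True)["chunks"]
#   words_with_language = attach_language(words, segments)

# Small excerpt of the outputs posted above, used here as sample data.
sample_words = [
    {"text": " My", "timestamp": (0.0, 0.42)},
    {"text": " old.", "timestamp": (6.08, 6.62)},
    {"text": " My", "timestamp": (6.68, 6.86)},
    {"text": " hockey.", "timestamp": (10.6, 12.38)},
]
sample_segments = [
    {"language": "english", "timestamp": (0.0, 6.48), "text": "..."},
    {"language": "english", "timestamp": (6.48, 11.52), "text": "..."},
]

for chunk in attach_language(sample_words, sample_segments):
    print(chunk)
```

Matching on the word's start time rather than its full span is a deliberate choice here: in the outputs above, word and segment boundaries do not line up exactly (the last word ends at 12.38 while the last segment ends at 11.52), so requiring full containment would leave some words without a language.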