adding pauses and dealing with numbers
#14 by dekislev
Just wanted to share what worked for me.
I noticed the model has a bit of an issue dealing with numbers and punctuation, but it handles a ',' quite well (it reads it as a pause),
so I processed my text with:
text = text.replace(".", ",").replace("!", ",").replace("?", ",").replace(":", ",").replace(";", ",")
text = text.replace("(", ",").replace(")", ",").replace("[", ",").replace("]", ",").replace("{", ",").replace("}", ",")
text = text.replace('"', ",").replace("“", ",").replace("”", ",")
text = text.replace("-", " ").replace("_", " ").replace("–", " ").replace("—", " ").replace("…", " ")
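The same replacements can also be written as a single translation table, which is easier to extend than a long chain of replace() calls. This is only a sketch: the names TO_COMMA, TO_SPACE, PUNCT_TABLE and normalize_punctuation are made up for illustration, and the curly quotes, dashes and ellipsis in it are my assumption about which characters the post meant.

```python
# Characters the post turns into a comma (a pause for the model).
# \u201c and \u201d are the curly double quotes (assumed).
TO_COMMA = ".!?:;()[]{}\"\u201c\u201d"

# Characters the post turns into a space.
# \u2013 / \u2014 are the en and em dash, \u2026 is the ellipsis (assumed).
TO_SPACE = "-_\u2013\u2014\u2026"

# One lookup table instead of a chain of str.replace() calls.
PUNCT_TABLE = str.maketrans(
    {**{ch: "," for ch in TO_COMMA}, **{ch: " " for ch in TO_SPACE}}
)


def normalize_punctuation(text):
    """Map pause-like punctuation to ',' and joiners to ' '."""
    return text.translate(PUNCT_TABLE)


print(normalize_punctuation("Hi! Well-done."))
```

Adding another character later is then just a matter of appending it to one of the two strings.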
In addition, I saw it has a bit of a problem pronouncing numbers such as years, so before doing the replacements above I processed the text with:
from num2words import num2words
import re

def convert_numbers_to_text(text):
    # Regular expression pattern to match numbers
    pattern = r'\b\d+\b'

    def replace(match):
        number = int(match.group())
        return num2words(number)

    # Replace numbers in the text with their textual representation
    converted_text = re.sub(pattern, replace, text)
    return converted_text

text = convert_numbers_to_text(text)
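Since years were the specific problem, one possible variation is to read four-digit numbers as years. num2words supports this through its to="year" argument (so 1999 comes out as "nineteen ninety-nine" rather than "one thousand, nine hundred and ninety-nine"). The sketch below is not part of the original recipe: spell_out_numbers, the converter parameter and the 1100-2099 year range are hypothetical choices of mine, and the range is only a heuristic for guessing what is a year.

```python
import re


def _num2words_converter(number, as_year):
    # Imported lazily so the function below can be tried with a stand-in
    # converter when num2words (third-party) is not installed.
    from num2words import num2words
    return num2words(number, to="year") if as_year else num2words(number)


def spell_out_numbers(text, converter=_num2words_converter, year_range=(1100, 2099)):
    """Replace whole numbers with words; treat likely years specially.

    year_range is an assumed heuristic: a four-digit number inside it
    is read as a year, everything else as a plain cardinal number.
    """
    low, high = year_range

    def replace(match):
        digits = match.group()
        number = int(digits)
        as_year = len(digits) == 4 and low <= number <= high
        return converter(number, as_year)

    return re.sub(r"\b\d+\b", replace, text)
```

Usage would be text = spell_out_numbers(text), run before the punctuation replacements, same as the original function.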
Hope it helps you too.
That did the job, thanks a lot!
I actually ended up splitting the paragraphs at the periods and feeding the sentences to the model separately, which gave a better result.
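A rough sketch of that splitting step, in case it helps someone. The function name split_into_sentences is made up, and I am assuming the split has to happen before the periods get replaced with commas. It also splits on '!' and '?', and it will wrongly split after abbreviations like "Dr.", so treat it as a starting point.

```python
import re


def split_into_sentences(paragraph):
    """Split a paragraph after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    # Drop empty pieces, e.g. when the input is an empty string.
    return [part for part in parts if part]


for sentence in split_into_sentences("Hello there. How are you? Fine!"):
    print(sentence)  # each sentence would be fed to the model on its own
```

Because the pattern requires whitespace after the punctuation mark, a decimal such as 3.14 stays in one piece.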