|
--- |
|
license: apache-2.0 |
|
library_name: onnx |
|
tags: |
|
- punctuation |
|
- sentence boundary detection |
|
- truecasing |
|
language: |
|
- af |
|
- am |
|
- ar |
|
- bg |
|
- bn |
|
- de |
|
- el |
|
- en |
|
- es |
|
- et |
|
- fa |
|
- fi |
|
- fr |
|
- gu |
|
- hi |
|
- hr |
|
- hu |
|
- id |
|
- is |
|
- it |
|
- ja |
|
- kk |
|
- kn |
|
- ko |
|
- ky |
|
- lt |
|
- lv |
|
- mk |
|
- ml |
|
- mr |
|
- nl |
|
- or |
|
- pa |
|
- pl |
|
- ps |
|
- pt |
|
- ro |
|
- ru |
|
- rw |
|
- so |
|
- sr |
|
- sw |
|
- ta |
|
- te |
|
- tr |
|
- uk |
|
- zh |
|
--- |
|
# Model Overview |
|
This model accepts as input lower-cased, unpunctuated, unsegmented text in 47 languages and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation). |
|
|
|
All languages are processed by the same algorithm, with no need for language tags or language-specific branches in the graph.

This includes both continuous-script and non-continuous-script languages, as well as the prediction of language-specific punctuation.
|
|
|
# Model Details |
|
|
|
This model generally follows the graph shown below; a brief description of each step follows.
|
|
|
![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1677025540482-62d34c813eebd640a4f97587.png) |
|
|
|
|
|
1. **Encoding**: |
|
The model begins by tokenizing the text with a subword tokenizer. |
|
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k. |
|
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512. |
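
   For illustration, tokenizing an input with `sentencepiece` might look like the following sketch; the file name `tokenizer.spm` is a placeholder for the tokenizer released with this model:

   ```python
   import sentencepiece as spm

   # Load the released subword tokenizer ("tokenizer.spm" is a placeholder name).
   sp = spm.SentencePieceProcessor(model_file="tokenizer.spm")

   # Input is expected to be lower-cased, unpunctuated, and unsegmented.
   text = "hello friend how are you today"
   ids = sp.encode(text)                   # subword IDs fed to the Transformer encoder
   pieces = sp.encode(text, out_type=str)  # the corresponding subword strings
   ```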
|
|
|
2. **Post-punctuation**: |
|
The encoded sequence is then fed into a classification network to predict "post" punctuation tokens. |
|
Post punctuation tokens are those that may appear after a word; this covers most ordinary punctuation.

Post punctuation is predicted once per subword; further discussion appears below.
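
   To make "once per subword" concrete, here is a minimal decode-time sketch of applying per-subword predictions; the label inventory and names are purely illustrative:

   ```python
   # Hypothetical label inventory: index 0 means "no punctuation after this subword".
   POST_LABELS = ["", ",", ".", "?"]

   def apply_post_punct(pieces: list[str], preds: list[int]) -> list[str]:
       """Append each subword's predicted "post" punctuation token."""
       return [piece + POST_LABELS[p] for piece, p in zip(pieces, preds)]

   # e.g. apply_post_punct(["▁hello", "▁friend"], [1, 2]) -> ["▁hello,", "▁friend."]
   ```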
|
|
|
3. **Re-encoding** |
|
All subsequent tasks (true-casing, sentence boundary detection, and "pre" punctuation) are dependent on "post" punctuation. |
|
Therefore, we must condition all further predictions on the post punctuation tokens.

To do so, the predicted punctuation tokens are fed into an embedding layer, where each possible punctuation token is represented by an embedding.

Each time step is mapped to a 4-dimensional embedding, which is concatenated to the 512-dimensional encoding.

The concatenated joint representation is re-encoded to confer global context to each time step, incorporating the punctuation predictions into subsequent tasks.
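
   A minimal PyTorch sketch of this conditioning step, using the dimensions described above (the inventory size and tensors are illustrative, not the model's actual parameters):

   ```python
   import torch
   import torch.nn as nn

   NUM_PUNCT = 15  # assumed size of the post-punctuation inventory

   punct_emb = nn.Embedding(NUM_PUNCT, 4)  # 4-dim embedding per punctuation token

   encoded = torch.randn(1, 8, 512)                # (batch, time, 512) encodings
   post_ids = torch.randint(0, NUM_PUNCT, (1, 8))  # predicted post-punct per subword

   # Concatenate the 4-dim punctuation embedding onto each 512-dim time step;
   # the joint (batch, time, 516) representation is then re-encoded.
   joint = torch.cat([encoded, punct_emb(post_ids)], dim=-1)
   ```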
|
|
|
4. **Pre-punctuation** |
|
After the re-encoding, another classification network predicts "pre" punctuation, i.e., punctuation tokens that may appear before a word.

In practice, this means the inverted question mark, `¿`, as used in Spanish.
|
Note that a `¿` can only appear if a `?` is predicted, hence the conditioning. |
|
|
|
5. **Sentence boundary detection** |
|
Parallel to the "pre" punctuation, another classification network predicts sentence boundaries from the re-encoded text. |
|
In all languages, sentence boundaries can occur only if a potential full stop is predicted, hence the conditioning. |
|
|
|
6. **Shift and concat sentence boundaries** |
|
In many languages, the first character of each sentence should be upper-cased. |
|
Thus, we should feed the sentence boundary information to the true-case classification network. |
|
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence. |
|
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence. |
|
Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head. |
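
   A small NumPy sketch of the shift-and-concatenate step (the assumption that position 0 always begins a sentence, and all names, are illustrative):

   ```python
   import numpy as np

   sb = np.array([[0, 0, 1, 0, 0, 1, 0, 0]])  # (batch, time) boundary decisions

   # Shift right by one: if token N-1 ends a sentence, token N begins one.
   # The first position is assumed to always begin a sentence.
   first_word = np.concatenate([np.ones_like(sb[:, :1]), sb[:, :-1]], axis=1)
   # first_word == [[1, 0, 0, 1, 0, 0, 1, 0]]

   reencoded = np.random.randn(1, 8, 516)  # stand-in for the re-encoded text
   joint = np.concatenate([reencoded, first_word[..., None]], axis=-1)
   ```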
|
|
|
7. **True-case prediction** |
|
Armed with knowledge of the punctuation and sentence boundaries, a classification network predicts true-casing.

Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subword.

(In practice, `N` is the longest possible subword, and the extra predictions are ignored.)
|
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald". |
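
   As an illustration of the per-character scheme, this hypothetical helper applies binary upper-case decisions to a subword, ignoring extra predictions beyond the subword's length:

   ```python
   def apply_truecase(piece: str, char_preds: list[int]) -> str:
       """Upper-case each character whose prediction is 1; `zip` drops the
       extra predictions made for subwords shorter than the maximum length."""
       return "".join(c.upper() if p else c for c, p in zip(piece, char_preds))

   # e.g. apply_truecase("nato", [1, 1, 1, 1, 0, 0])               -> "NATO"
   #      apply_truecase("macdonald", [1, 0, 0, 1, 0, 0, 0, 0, 0]) -> "MacDonald"
   ```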
|
|
|
|
|
## Post-Punctuation Tokens |
|
This model predicts the following set of "post" punctuation tokens: |
|
|
|
| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| . | Latin full stop | Many |
| , | Latin comma | Many |
| ? | Latin question mark | Many |
| ？ | Full-width question mark | Chinese, Japanese |
| ， | Full-width comma | Chinese, Japanese |
| 。 | Full-width full stop | Chinese, Japanese |
| 、 | Ideographic comma | Chinese, Japanese |
| ・ | Middle dot | Japanese |
| । | Danda | Hindi, Bengali, Oriya |
| ؟ | Arabic question mark | Arabic |
| ; | Greek question mark | Greek |
| ። | Ethiopic full stop | Amharic |
| ፣ | Ethiopic comma | Amharic |
| ፧ | Ethiopic question mark | Amharic |
|
|
|
|
|
## Pre-Punctuation Tokens |
|
This model predicts the following set of "pre" punctuation tokens:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| ¿ | Inverted question mark | Spanish |
|
|
|
|
|
# Usage |
|
This model is released in two parts: |
|
|
|
1. The ONNX graph |
|
2. The SentencePiece tokenizer |
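
No reference inference script is reproduced here, so the following is only a hedged sketch of wiring the two parts together with `onnxruntime` and `sentencepiece`. The file names and the input name `input_ids` are assumptions; check the actual names with `session.get_inputs()` and `session.get_outputs()`:

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Placeholder file names for the two released artifacts.
sp = spm.SentencePieceProcessor(model_file="tokenizer.spm")
session = ort.InferenceSession("model.onnx")

text = "hola amigo como estas hoy es un buen dia"
ids = np.array([sp.encode(text)], dtype=np.int64)  # (1, time)

# Inspect the graph's actual input/output names before running.
print([i.name for i in session.get_inputs()])
print([o.name for o in session.get_outputs()])

# Assuming a single input named "input_ids":
outputs = session.run(None, {"input_ids": ids})
# The outputs carry the punctuation, true-case, and sentence boundary
# predictions described above; their order and names depend on the graph.
```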
|
|
|
|
|
|
|
# Training Details |
|
This model was trained in the NeMo framework. |
|
|
|
## Training Data |
|
This model was trained with News Crawl data from WMT. |
|
|
|
1M lines of text were used for each language, except for a few low-resource languages, for which less data may have been used.
|
|
|
Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author. |
|
|
|
# Limitations |
|
This model was trained on news data, and may not perform well on conversational or informal data. |
|
|
|
This is also a base-sized model with many languages and many tasks, so capacity may be limited. |
|
|
|
This model predicts punctuation only once per subword. |
|
This implies that some acronyms, e.g., 'U.S.', cannot be properly punctuated.
|
This concession was accepted on two grounds: |
|
1. Such acronyms are rare, especially in the context of multi-lingual models |
|
2. Punctuated acronyms are typically pronounced as individual characters, e.g., 'U.S.' vs. 'NATO'. |
|
Since the expected use case of this model is the output of an ASR system, it is presumed that such pronunciations would be transcribed as separate tokens, e.g., 'u s' vs. 'us' (though this depends on the model's pre-processing).
|
|
|
# Evaluation |
|
|