1-800-BAD-CODE committed acb3269 (parent: 3e20b04): Update README.md

README.md CHANGED
---
license: apache-2.0
library_name: onnx
language:
- ar
- bn

# Model Overview
This model performs sentence boundary detection (SBD) for 25 languages.

This model segments a long, punctuated text into one or more constituent sentences.

# Model Architecture
This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT encoder, and a linear decoder/classifier.

Given that this is an easy NLP task, the model contains ~5M parameters.

The BERT encoder is based on the following configuration (see the sketch after this list):

* 8 heads
* 4 layers
* 128 hidden dim
* 512 intermediate/ff dim
* 32000 embeddings/vocab tokens
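
As a rough illustration, these dimensions could be expressed with `transformers.BertConfig`; this mapping is an assumption for readability, since the actual training configuration is not part of this commit:

```python
from transformers import BertConfig

# Illustrative only: the dimensions listed above mapped onto a BertConfig.
# The real encoder config used to train this model is not included here.
encoder_config = BertConfig(
    vocab_size=32000,        # embeddings / vocab tokens
    hidden_size=128,         # hidden dim
    num_hidden_layers=4,     # layers
    num_attention_heads=8,   # heads
    intermediate_size=512,   # intermediate / feed-forward dim
)
```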

# Model Inputs and Outputs
The model inputs should be **punctuated** texts.

The classification and SPE models have both been trained with 50% of the training data lower-cased.
The model should perform similarly with either lower- or true-cased data.
All-capitalized text will probably not work well.

The inputs should be packed into a batch with shape `[B, T]`, with padding being the SPE model's `<pad>` token ID.

For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
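
A minimal sketch of packing a `[B, T]` batch with `sentencepiece`; the tokenizer file name `spe.model` and the `<pad>` piece lookup are assumptions, so check the files shipped in this repo:

```python
import numpy as np
import sentencepiece as spm

# Assumed file name for the SPE tokenizer distributed with the model.
sp = spm.SentencePieceProcessor(model_file="spe.model")
pad_id = sp.piece_to_id("<pad>")  # assumes the SPE model defines a <pad> piece

texts = [
    "hello world. how are you?",
    "this is a test.",
]

# Encode each text and right-pad to the longest sequence: shape [B, T].
ids = [sp.encode(t) for t in texts]
max_len = max(len(x) for x in ids)
input_ids = np.full((len(ids), max_len), pad_id, dtype=np.int64)
for row, x in enumerate(ids):
    input_ids[row, : len(x)] = x
```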

## Language input specifics
The training data was pre-processed for language-specific punctuation and spacing rules (examples follow the list).

* All spaces were removed from continuous-script languages (Chinese, Japanese). Inputs in these languages should not contain spaces.
* Chinese punctuation: Chinese and Japanese use full-width periods, question marks, and commas. Chinese input with Latin punctuation may not work well.
* Hindi/Bengali punctuation: These languages use the danda `।` as a full stop, not `.`.
* Arabic punctuation: Arabic uses the reversed question mark `؟`, not `?`.
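
For illustration, a few hypothetical inputs that follow these conventions:

```python
# Hypothetical example inputs following the language-specific rules above.
examples = {
    "zh": "你好吗？我很好。",           # no spaces, full-width punctuation
    "ja": "お元気ですか？元気です。",    # no spaces, full-width punctuation
    "hi": "आप कैसे हैं। मैं ठीक हूँ।",     # danda as the full stop
    "ar": "كيف حالك؟ أنا بخير.",         # Arabic question mark
}
```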

# Example Usage

This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.

The predictions are applied to the input by splitting the token sequence wherever the predicted sentence-boundary probability exceeds a threshold.

```bash
# Assumed dependencies for the example below; this commit does not pin exact requirements.
pip install onnxruntime sentencepiece numpy
```
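
A minimal end-to-end sketch with `onnxruntime`. The file names (`model.onnx`, `spe.model`), the ONNX input/output names, and the exact output layout are assumptions here; inspect the repo files and `session.get_inputs()` / `session.get_outputs()` for the real values.

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Assumed file names; substitute the actual files from this repo.
sp = spm.SentencePieceProcessor(model_file="spe.model")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

text = "hello world. how are you? i am fine."
ids = sp.encode(text)                        # subword IDs, length T
input_ids = np.array([ids], dtype=np.int64)  # batch of one: shape [1, T]

# Assumes a single input tensor and that the first output holds per-token
# boundary probabilities of shape [B, T]; adjust if the exported graph differs.
input_name = session.get_inputs()[0].name
probs = session.run(None, {input_name: input_ids})[0][0]

# Split the token sequence wherever the boundary probability exceeds a threshold.
threshold = 0.5
sentences, start = [], 0
for t, p in enumerate(probs):
    if p > threshold:
        sentences.append(sp.decode(ids[start : t + 1]))
        start = t + 1
if start < len(ids):
    sentences.append(sp.decode(ids[start:]))

print(sentences)
```

The 0.5 threshold is only a placeholder; the card above just says "a threshold", so tune it for your data.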

# Known Issues
This is essentially a prototype model, and it has some known issues. These will be improved in a later version.

If you're interested in any particular aspect being improved, let me know for the next version.

## Limited vocabulary
This model covers 25 languages with a tokenizer of only 32k tokens.

Chinese has a lot of out-of-vocabulary tokens, which will manifest as the unknown-token surface appearing in the output for some Chinese texts.

This also results in longer-than-necessary sequences of short tokens, but that shouldn't be visible on the surface given that this is a very small, fast model.

## Noisy training data
This model was trained on `OpenSubtitles`, which is notoriously noisy data. The model may have learned some bad habits from it.