1-800-BAD-CODE committed
Commit f25200c • 1 Parent(s): 0647eaf
Update README.md

README.md

- zh
---

# Model Overview

This model performs sentence boundary detection (SBD) in 25 common languages.

This model segments a long, punctuated text into one or more constituent sentences.

A key feature is that the model is multilingual and language-agnostic at inference time. Therefore, language tags do not need to be used, and a single batch can contain multiple languages.

## Architecture

This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier.

Given that this is a relatively easy NLP task, the model contains ~5M parameters (~4M of which are embeddings). This makes the model very fast and cheap at inference time, as SBD should be.

The BERT encoder is based on the following configuration:

* 512 intermediate/ff dim
* 32000 embeddings/vocab tokens
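
For readers who prefer code, here is a minimal PyTorch sketch of the stack described above (embeddings, BERT-style encoder, per-token linear classifier). Only the feed-forward dim (512) and vocab size (32000) are stated in this card; the hidden size, depth, and head count below are illustrative assumptions (hidden size 128 is chosen only because it puts the embedding table near the stated ~4M parameters).

```python
import torch
import torch.nn as nn

class SentenceBoundaryClassifier(nn.Module):
    """Sketch: embeddings -> small BERT-style encoder -> per-token classifier.
    hidden/layers/heads are assumptions; ff_dim and vocab_size come from the
    configuration listed above."""

    def __init__(self, vocab_size=32000, hidden=128, ff_dim=512, layers=4, heads=4):
        super().__init__()
        # 32000 x 128 ~= 4.1M parameters, in line with "~4M of which are embeddings"
        self.embedding = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=ff_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, ids: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # ids: [B, T] subword IDs; pad_mask: [B, T], True at padding positions
        x = self.encoder(self.embedding(ids), src_key_padding_mask=pad_mask)
        # probability that each subword is the final token of a sentence
        return torch.sigmoid(self.classifier(x)).squeeze(-1)
```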

## Training

This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.

The model was trained on an A100 for ~150k steps with a batch size of 256.

### Training Data

This model was trained on `OpenSubtitles`. Although this corpus is very noisy, it is one of the few large-scale text corpora which have been manually segmented.

We must avoid using an automatically-segmented corpus for at least two reasons:

1. Our deep-learning model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (i.e., me).

Heuristics were used to attempt to clean the data before training. Some examples of the cleaning, sketched in code after this list, are:

* Drop sentences which start with a lower-case letter, assuming these lines are errorful.
* For inputs that do not end with a full stop, append the default full stop for that language, assuming that full stops are not important for single-sentence declarative subtitles.
* Drop inputs that have more than 20 words (or 32 characters, for continuous-script languages), assuming these lines contain more than one sentence, which prevents the creation of reliable targets.
* Drop objectively junk lines: all punctuation/special characters, empty lines, etc.
* Normalize punctuation: no more than one consecutive punctuation token (except Spanish, where inverted punctuation can appear after non-inverted punctuation).
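
A minimal sketch of a few of these heuristics, assuming each input line is tagged with a language code; the `DEFAULT_FULL_STOP` table and the function itself are illustrative, not the actual training code:

```python
import re

# Illustrative defaults; the real per-language table is an assumption here.
DEFAULT_FULL_STOP = {"en": ".", "zh": "。", "ja": "。", "hi": "।", "bn": "।"}
FULL_STOPS = ".!?。؟।"

def clean_line(line: str, lang: str):
    """Return a cleaned line, or None if the line should be dropped."""
    line = line.strip()
    if not line or not any(c.isalnum() for c in line):
        return None  # junk: empty, or all punctuation/special characters
    if line[0].islower():
        return None  # starts lower-case: assume the line is errorful
    too_long = len(line) > 32 if lang in ("zh", "ja") else len(line.split()) > 20
    if too_long:
        return None  # probably multiple sentences: targets would be unreliable
    if line[-1] not in FULL_STOPS:
        line += DEFAULT_FULL_STOP.get(lang, ".")  # append default full stop
    # normalize runs of repeated punctuation down to a single token
    line = re.sub(r"([.!?。؟।])\1+", r"\1", line)
    return line
```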

### Example Generation

To create examples for the model, we

1. Assume each input line is exactly one sentence
2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets

For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).

This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long. If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
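
A sketch of this procedure, assuming a loaded `SentencePiece` processor; the function and its sampling details are illustrative rather than the exact training code:

```python
import random
import sentencepiece as spm

def make_example(lines, sp: spm.SentencePieceProcessor, max_len=256):
    """Concatenate 1-9 single-sentence lines into one example.
    A target of 1 marks the last subword of each non-final sentence,
    so n sentences yield n-1 positive targets (0 to 8 here)."""
    sentences = random.sample(lines, random.randint(1, 9))
    ids, targets = [], []
    for i, sent in enumerate(sentences):
        piece_ids = sp.encode(sent, out_type=int)
        boundary = 1 if i < len(sentences) - 1 else 0  # concatenation points only
        ids.extend(piece_ids)
        targets.extend([0] * (len(piece_ids) - 1) + [boundary])
    # truncate over-long examples to the first 256 subwords
    return ids[:max_len], targets[:max_len]
```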

50% of input texts were lower-cased for both the tokenizer and classification models. This provides some augmentation, but more importantly allows this model to be inserted into an NLP pipeline either before or after true-casing. Using this model before true-casing would allow the true-casing model to exploit the conditional probability of sentence boundaries w.r.t. capitalization.

### Language-Specific Rules

The training data was pre-processed for language-specific punctuation and spacing rules. The following guidelines were used during training; if inference inputs differ, the model may perform poorly.

* All spaces were removed from continuous-script languages (Chinese, Japanese).
* Chinese/Japanese: these languages use full-width periods "。", question marks "?", and commas ",".
* Hindi/Bengali: these languages use the danda "।" as a full stop, not ".".
* Arabic: uses reversed question marks "؟", not "?".
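
To make these rules concrete, here are examples of well-formed inputs (the sentences themselves are invented for illustration):

```python
# Invented examples of inputs formatted per the rules above.
well_formed = {
    "zh": "你好。你叫什么名字?",  # full-width punctuation, no spaces
    "hi": "नमस्ते। आप कैसे हैं?",  # danda as the full stop
    "ar": "مرحبا. كيف حالك؟",  # reversed question mark
}
```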

# Model Inputs and Outputs

The model inputs should be **punctuated** texts.

The inputs should be packed into a batch with shape `[B, T]`, with padding being the SPE model's `<pad>` token ID. The `<pad>` ID is required to generate a proper attention mask.
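
A minimal sketch of this packing, assuming the `SentencePiece` model is available locally as `spe.model` (the file name and example texts are assumptions):

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spe.model")  # assumed file name
texts = ["hello world. how are you?", "i am fine. thank you."]

ids = [sp.encode(t, out_type=int) for t in texts]
pad_id = sp.pad_id()  # the SPE model's <pad> token ID
T = max(len(x) for x in ids)

# Pack into [B, T], padding with <pad> so an attention mask can be derived.
batch = np.full((len(ids), T), pad_id, dtype=np.int64)
for i, x in enumerate(ids):
    batch[i, : len(x)] = x
attention_mask = batch != pad_id  # True at real tokens
```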

The model was trained on a maximum sequence length of 256 (subwords) and may crash or perform poorly if a longer batch is processed. Optimal handling of longer sequences would require some inference-time logic (wrapping/overlapping inputs and re-combining outputs), sketched below.
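
One possible shape for that logic (a sketch, not something shipped with the model): split over-long token sequences into overlapping windows, then trust each window's predictions only away from its edges.

```python
def chunk_ids(ids, max_len=256, overlap=32):
    """Split an over-long token sequence into overlapping windows.
    When re-combining, keep predictions from each window only outside
    the overlap region, where the model has full left/right context."""
    if len(ids) <= max_len:
        return [ids]
    stride = max_len - overlap
    return [ids[i : i + max_len] for i in range(0, len(ids) - overlap, stride)]
```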

For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).

# Example Usage

This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.

The predictions are applied to the input by separating the token sequence where the predicted value exceeds a threshold for sentence boundary classification.
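
Continuing the packing sketch above, here is a minimal `onnxruntime` loop that applies such a threshold; the model file name, input name, and output semantics (probabilities vs. logits) are assumptions to verify against the repo's example script.

```python
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")  # assumed file name
input_name = session.get_inputs()[0].name
probs = session.run(None, {input_name: batch})[0]  # assumed [B, T] probabilities

threshold = 0.5
for row_ids, row_probs in zip(ids, probs):
    sentences, start = [], 0
    for t in range(len(row_ids)):  # iterate real tokens, skipping padding
        if row_probs[t] > threshold:  # subword t ends a sentence: split here
            sentences.append(sp.decode(row_ids[start : t + 1]))
            start = t + 1
    if start < len(row_ids):
        sentences.append(sp.decode(row_ids[start:]))  # trailing sentence
    print(sentences)
```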

This model can be run directly with a couple of dependencies which most developers may already have installed.

The following snippet will install the dependencies, clone this repo, and run an example script which points to the local files.

```bash
$ pip install sentencepiece onnxruntime
```

Outputs:

```
let him go.
let him go.
let me see your license and i.d. card.
```

</details>

# Limitations and known issues

This is a prototype model and has some issues. These will be improved in a later version. If you're interested in any particular aspect being improved, let me know for the next version.

Chinese has a lot of out-of-vocabulary tokens, which will manifest as the unknown token. This also results in longer-than-necessary sequences of short tokens, but that shouldn't be visible on the surface given that this is a very small, fast model.

## Noisy training data

This model was trained on `OpenSubtitles`, data which is notoriously noisy. The model may have learned some bad habits from this data.

## Language-specific expectations

As discussed in a previous section, each language should be formatted and punctuated per that language's rules. E.g., Chinese text should contain full-width periods, not Latin periods, and no spaces.

In practice, data often does not adhere to these rules, but the model has not been augmented to deal with this potential issue.

## Metrics

It's difficult to properly evaluate this model, since we rely on the proposition that the input data contains exactly one sentence per line. In reality, the data sets used thus far are noisy and often contain more than one sentence per line.

Metrics are not published for now, and evaluation is limited to manual spot-checking. Sufficient test sets for this analytic are being sought.
|