1-800-BAD-CODE committed
Commit 138a7d5
Parent(s): 7eb80ca
Update README.md

README.md CHANGED
@@ -39,8 +39,11 @@ This model segments a long, punctuated text into one or more constituent sentences.
A key feature is that the model is multi-lingual and language-agnostic at inference time.
Therefore, language tags do not need to be used and a single batch can contain multiple languages.

+As emphasized later in this card, this is a prototype model and there will be future versions which are cheap to train.
+Feel free to provide input, suggestions, or requests in a discussion.
+
## Architecture
-This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier.
+This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.

Given that this is a relatively easy NLP task, the model contains \~5M parameters (\~4M of which are embeddings).
This makes the model very fast and cheap at inference time, as SBD should be.
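For readers who want to picture the pieces, here is a minimal sketch of the architecture described in the hunk above. It assumes PyTorch; the class name, hidden size, depth, and head count are illustrative guesses (a hidden size of 128 is only inferred from the 32k vocabulary and the \~4M embedding parameters), and positional encodings are omitted for brevity. It is not the NeMo implementation actually used.

```python
# Minimal, hypothetical sketch of the described architecture (not the NeMo implementation).
import torch
import torch.nn as nn


class SentenceBoundaryDetector(nn.Module):
    """SentencePiece IDs -> BERT-style encoder -> per-subword boundary logits."""

    def __init__(self, vocab_size: int = 32000, hidden: int = 128, layers: int = 2, heads: int = 4):
        super().__init__()
        # With a 32k vocabulary, this embedding table holds roughly 4M of the ~5M parameters.
        self.embedding = nn.Embedding(vocab_size, hidden)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        # One binary decision per subword: "does a sentence end here?"
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden_states = self.encoder(self.embedding(input_ids))
        return self.classifier(hidden_states).squeeze(-1)  # [batch, seq_len] logits


# A batch may mix languages freely, since no language tags are used.
logits = SentenceBoundaryDetector()(torch.randint(0, 32000, (2, 256)))
print(logits.shape)  # torch.Size([2, 256])
```

At inference time, a sigmoid over the per-subword logits gives boundary probabilities that can be thresholded to split the text.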
@@ -56,8 +59,8 @@ The BERT encoder is based on the following configuration:
## Training
This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.

-
-Model was
+Training was performed on an A100 for \~150k steps with a batch size of 256, within a $3 budget on the [Lambda cloud](https://cloud.lambdalabs.com/).
+The model was roughly converged after 25M training sentences (1M per language).

### Training Data
This model was trained on `OpenSubtitles`.
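The card does not state the training objective, but a per-subword binary classifier like this is typically trained with binary cross-entropy over the boundary logits. The snippet below is an assumed, self-contained sketch of one such step; the tiny stand-in model and all hyperparameters are illustrative, not the values used for this checkpoint.

```python
# Assumed objective: per-subword binary cross-entropy on the boundary logits.
# Everything here (model, optimizer, learning rate) is an illustrative stand-in.
import torch
import torch.nn as nn


class TinySBD(nn.Module):
    def __init__(self, vocab_size: int = 32000, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.classifier = nn.Linear(hidden, 1)  # encoder omitted for brevity

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.embedding(input_ids)).squeeze(-1)


model = TinySBD()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

# One step; random tensors stand in for a batch of 256 tokenized examples.
input_ids = torch.randint(0, 32000, (256, 256))    # [batch, seq_len] subword IDs
targets = torch.randint(0, 2, (256, 256)).float()  # 1.0 where a subword ends a sentence

optimizer.zero_grad()
loss = criterion(model(input_ids), targets)
loss.backward()
optimizer.step()
print(float(loss))
```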
@@ -67,7 +70,7 @@ Although this corpus is very noisy, it is one of few large-scale text corpora wh
We must avoid using an automatically-segmented corpus for at least two reasons:

1. Our deep-learning model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
-2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (
+2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (e.g., me).

Heuristics were used to attempt to clean the data before training.
Some examples of the cleaning are:
@@ -85,6 +88,7 @@ To create examples for the model, we
2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets

For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).
+The number of sentences to use was chosen uniformly at random, so each example had, on average, 4 sentence boundaries.

This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long.
If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
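A self-contained sketch of that recipe is below. The whitespace tokenizer and the `make_example` helper are stand-ins for illustration only; the real pipeline uses the `SentencePiece` tokenizer and lives in the NeMo fork.

```python
# Illustrative sketch of example construction: concatenate 1-9 sentences and mark the
# concatenation points as targets. The whitespace tokenizer is a stand-in for SentencePiece.
import random
from typing import Callable, List, Tuple

MAX_LEN = 256  # maximum sequence length used by this model


def make_example(
    sentences: List[str],
    tokenize: Callable[[str], List[str]],
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    # Uniform over 1..9 sentences, i.e. 0..8 boundaries (4 on average).
    num_sentences = rng.randint(1, 9)
    chosen = [sentences[rng.randrange(len(sentences))] for _ in range(num_sentences)]

    tokens: List[str] = []
    targets: List[int] = []
    for i, sentence in enumerate(chosen):
        subwords = tokenize(sentence)
        tokens.extend(subwords)
        # The last subword of every sentence except the final one is a positive target.
        is_boundary = 1 if i < len(chosen) - 1 else 0
        targets.extend([0] * (len(subwords) - 1) + [is_boundary])

    # If the concatenation exceeds the maximum length, keep only the first 256 subwords.
    return tokens[:MAX_LEN], targets[:MAX_LEN]


corpus = ["how are you", "fine thanks", "see you tomorrow"]
tokens, targets = make_example(corpus, tokenize=str.split, rng=random.Random(0))
print(list(zip(tokens, targets)))
```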
@@ -337,7 +341,7 @@ If you're interested in any particular aspect being improved, let me know for th
## Limited vocabulary
This model has 25 languages and a tokenizer with only 32k tokens.

-Chinese has a lot of out-of-vocabulary tokens, which will
+Chinese has a lot of out-of-vocabulary tokens, which will manifest as the unknown token's surface form appearing in the outputs for some Chinese texts.

This also results in longer-than-necessary sequences of short tokens, but that shouldn't be visible on the surface given that this is a very small, fast model.
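For illustration, the effect can be reproduced with the `sentencepiece` library roughly as follows; the model path is hypothetical, and whether the unknown token surfaces as `<unk>` or ` ⁇ ` depends on how the tokenizer was built.

```python
# Rough illustration of the out-of-vocabulary effect; "tokenizer.model" is a hypothetical
# path to this model's 32k-token SentencePiece model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "这是一个罕见字的句子。"  # a Chinese sentence that may contain rare characters
ids = sp.encode(text, out_type=int)

# Characters not covered by the 32k vocabulary map to the unknown ID, and decoding them
# reproduces the unknown token's surface form instead of the original character.
print([i == sp.unk_id() for i in ids])
print(sp.decode(ids))
```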