Commit 138a7d5 (1 parent: 7eb80ca), committed by 1-800-BAD-CODE

Update README.md

Files changed (1): README.md (+9, -5)
README.md CHANGED
@@ -39,8 +39,11 @@ This model segments a long, punctuated text into one or more constituent sentences.
  A key feature is that the model is multi-lingual and language-agnostic at inference time.
  Therefore, language tags do not need to be used and a single batch can contain multiple languages.

+ As emphasized later in this card, this is a prototype model and there will be future versions which are cheap to train.
+ Feel free to provide input, suggestions, or requests in a discussion.
+
  ## Architecture
- This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier.
+ This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.

  Given that this is a relatively-easy NLP task, the model contains \~5M parameters (\~4M of which are embeddings).
  This makes the model very fast and cheap at inference time, as SBD should be.
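As a rough illustration of the architecture described above, the sketch below wires an embedding table for a `SentencePiece`-sized vocabulary into a tiny BERT-style encoder with a per-subword binary classifier. The class name and hyperparameters are illustrative assumptions rather than the actual NeMo configuration, although a 128-dimensional embedding over a 32k-token vocabulary does roughly account for the \~4M embedding parameters mentioned above.

```python
import torch
import torch.nn as nn

class ToySentenceBoundaryDetector(nn.Module):
    """Minimal SBD sketch: embeddings -> small BERT-style encoder -> per-subword boundary logit."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 128, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One logit per subword: "does a sentence end at this subword?"
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, input_ids: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embedding(input_ids), src_key_padding_mask=pad_mask)
        return self.classifier(hidden).squeeze(-1)  # [batch, seq_len] boundary logits
```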
@@ -56,8 +59,8 @@ The BERT encoder is based on the following configuration:
  ## Training
  This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.

- Model was trained on an A100 for \~150k steps with a batch size of 256, with a $3 budget on the [Lambda cloud](https://cloud.lambdalabs.com/).
- Model was allowed to converge with 25M training sentences (1M per language).
+ Training was performed on an A100 for \~150k steps with a batch size of 256, with a $3 budget on the [Lambda cloud](https://cloud.lambdalabs.com/).
+ The model was roughly converged after seeing 25M training sentences (1M per language).

  ### Training Data
  This model was trained on `OpenSubtitles`.
@@ -67,7 +70,7 @@ Although this corpus is very noisy, it is one of few large-scale text corpora wh
  We must avoid using an automatically-segmented corpus for at least two reasons:

  1. Our deep-learning model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
- 2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (i.e., me).
+ 2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (e.g., me).

  Heuristics were used to attempt to clean the data before training.
  Some examples of the cleaning are:
@@ -85,6 +88,7 @@ To create examples for the model, we
  2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets

  For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).
+ The number of sentences to use was chosen uniformly at random, so each example had, on average, 4 sentence boundaries.

  This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long.
  If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
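Below is a minimal sketch of how such examples might be assembled. It assumes sentences are already tokenized into subword IDs, and the function name is hypothetical (the real pipeline lives in the linked NeMo branch). Only the concatenation points are positive targets, so an example built from `n` sentences carries `n - 1` boundaries.

```python
import random
from typing import List, Tuple

MAX_LEN = 256  # maximum sequence length used by this model

def make_example(sentences: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate 1-9 tokenized sentences; concatenation points become the positive targets."""
    n = random.randint(1, 9)  # uniform, so on average 4 boundaries per example
    chosen = random.sample(sentences, n)
    tokens, targets = [], []
    for i, sent in enumerate(chosen):
        tokens.extend(sent)
        is_boundary = 1 if i < n - 1 else 0  # the end of the final sentence is not a target
        targets.extend([0] * (len(sent) - 1) + [is_boundary])
    # Truncate overly long concatenations to the first MAX_LEN subwords
    return tokens[:MAX_LEN], targets[:MAX_LEN]
```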
@@ -337,7 +341,7 @@ If you're interested in any particular aspect being improved, let me know for th
  ## Limited vocabulary
  This model has 25 languages and a tokenizer with only 32k tokens.

- Chinese has a lot of out-of-vocabulary tokens, which will manifeset as the unknown surface appearing in the outputs of some Chinese texts.
+ Chinese has a lot of out-of-vocabulary tokens, which will manifest as the unknown surface appearing in the outputs of some Chinese texts.

  This also results in longer-than-necessary sequences of short tokens, but that shouldn't be visible on the surface given that this is a very small, fast model.
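For the curious, the out-of-vocabulary effect can be observed directly with the `sentencepiece` library; the model path and sample text below are placeholders, not files shipped with this card.

```python
import sentencepiece as spm

# Placeholder path: any small multilingual SentencePiece model shows the same effect.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

ids = sp.encode("这是一个测试句子。", out_type=int)
n_unk = sum(i == sp.unk_id() for i in ids)
# Out-of-vocabulary characters map to the unknown ID, and decoding them back
# produces the "unknown surface" that can show up in segmented Chinese output.
print(f"{n_unk} unknown tokens -> {sp.decode(ids)}")
```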