1-800-BAD-CODE committed
Commit f25200c
1 parent: 0647eaf

Update README.md

Files changed (1):
  1. README.md +83 -23

README.md CHANGED
@@ -29,15 +29,21 @@ language:
  - zh
  ---
 
  # Model Overview
- This model performs sentence boundary prediction (SBD) with 25 languages.
 
  This model segments a long, punctuated text into one or more constituent sentences.
 
- # Model Architecture
- This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT encoder, and a linear decoder/classifier.
 
- Given that this is an easy NLP task, the model contains ~5M parameters.
 
  The BERT encoder is based on the following configuration:
 
@@ -47,26 +53,67 @@ The BERT encoder is based on the following configuration:
  * 512 intermediate/ff dim
  * 32000 embeddings/vocab tokens
 
  # Model Inputs and Outputs
  The model inputs should be **punctuated** texts.
 
- The classification and SPE models have both been trained with 50% of the training data lower-cased.
- The model should perform similarly with either lower- or true-cased data.
- All-capitalized text will probably not work well.
-
  The inputs should be packed into a batch with shape `[B, T]`, with padding being the SPE model's `<pad>` token ID.
- The model was trained on a maximum sequence length of 256, but will not crash until sequences exceed 512.
  Optimal handling of longer sequences would require some inference-time logic (wrapping/overlapping inputs and re-combining outputs).
 
  For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
 
- ## Language input specifics
- The training data was pre-processed for language-specific punctuation and spacing rules.
-
- * All spaces were removed from continuous-script languages (Chinese, Japanese). Inputs in these languages should not contain spaces.
- * Chinese punctuation: Chinese and Japanese use full-width periods, question marks, and commas. Chinese input with Latin punctuation may not work well.
- * Hindi/Bengali punctuation: These languages use the danda `।` as a full-stop, not a `.`.
- * Arabic punctuation: Uses reverse question marks `؟`, not a `?`.
 
  # Example Usage
 
@@ -74,9 +121,9 @@ This model has been exported to `ONNX` (opset 17) alongside the associated `Sent
 
  The predictions are applied to the input by separating the token sequence where the predicted value exceeds a threshold for sentence boundary classification.
 
- This model can be run directly with a couple of dependencies which most developers likely already have installed.
 
- The following snipper will install the dependencies, clone this repo, and run an example script which points to the local files.
 
  ```bash
  $ pip install sentencepiece onnxruntime
@@ -275,15 +322,13 @@ Outputs:
  let him go.
  let him go.
  let me see your license and i.d. card.
-
-
  ```
 
  </details>
 
 
- # Known Issues
- This is essentially a prototype model, and has some issues. These will be improved in a later version.
 
  If you're interested in any particular aspect being improved, let me know for the next version.
 
@@ -295,5 +340,20 @@ Chinese has a lot of out-of-vocabulary tokens, which will manifeset as the unkno
  This also results in longer-than-necessary sequences of short tokens, but that shouldn't be visible on the surface given that this is a very small, fast model.
 
  ## Noisy training data
- This model was trained on `OpenSubtitles`, data which is natoriously noisy. The model may have learned some bad habits from this data.
 
  - zh
  ---
 
+
  # Model Overview
+
+ This model performs sentence boundary prediction (SBD) for 25 common languages.
 
  This model segments a long, punctuated text into one or more constituent sentences.
 
+ A key feature is that the model is multi-lingual and language-agnostic at inference time.
+ Therefore, language tags are not needed, and a single batch can contain multiple languages.
+
+ ## Architecture
+ This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier.
 
+ Given that this is a relatively easy NLP task, the model contains ~5M parameters (~4M of which are embeddings).
+ This makes the model very fast and cheap at inference time, as SBD should be.
 
  The BERT encoder is based on the following configuration:
 
  * 512 intermediate/ff dim
  * 32000 embeddings/vocab tokens
 
+ ## Training
+ This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.
+
+ The model was trained on an A100 for ~150k steps with a batch size of 256.
+
+ ### Training Data
+ This model was trained on `OpenSubtitles`.
+
+ Although this corpus is very noisy, it is one of the few large-scale text corpora which have been manually segmented.
+
+ We must avoid using an automatically-segmented corpus for at least two reasons:
+
+ 1. Our deep-learning model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
+ 2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (i.e., me).
+
+ Heuristics were used to clean the data before training.
+ Some examples of the cleaning are:
+
+ * Drop sentences which start with a lower-case letter, assuming these lines are erroneous.
+ * For inputs that do not end with a full stop, append the default full stop for that language, assuming that full stops are often omitted in single-sentence declarative subtitles.
+ * Drop inputs that have more than 20 words (or 32 chars, for continuous-script languages), assuming these lines contain more than one sentence, so reliable targets cannot be created.
+ * Drop objectively junk lines: all punctuation/special characters, empty lines, etc.
+ * Normalize punctuation: no more than one consecutive punctuation token (except Spanish, where inverted punctuation can appear after non-inverted punctuation).
+
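Roughly, the heuristics above amount to a filter like the following sketch. `clean_line` is a hypothetical helper, not the actual preprocessing code; the word-count threshold and terminal-punctuation set mirror the description above.

```python
import re

MAX_WORDS = 20  # word-count threshold described above

def clean_line(line, default_full_stop="."):
    """Apply the cleaning heuristics described above.

    Returns the cleaned line, or None if the line should be dropped.
    """
    line = line.strip()
    # Drop objectively junk lines: empty, or all punctuation/special characters.
    if not line or not re.search(r"\w", line):
        return None
    # Drop lines starting with a lower-case letter (assumed erroneous).
    if line[0].islower():
        return None
    # Drop lines that likely contain more than one sentence.
    if len(line.split()) > MAX_WORDS:
        return None
    # Append the language's default full stop if there is no terminal punctuation.
    if line[-1] not in ".!?।؟。?":
        line += default_full_stop
    return line
```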
+ ### Example Generation
+ To create examples for the model, we:
+
+ 1. Assume each input line is exactly one sentence
+ 2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets
+
+ For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).
+
+ This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long.
+ If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
+
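The two steps above can be sketched as follows. This is an illustrative reconstruction, not the actual NeMo training code; `make_example` is a hypothetical name, and whitespace splitting stands in for `SentencePiece` tokenization.

```python
import random

def make_example(lines, max_sentences=9, max_len=256):
    """Build one training example from single-sentence lines.

    Concatenates 1..max_sentences lines; the last subword of each sentence
    except the final one is a positive (sentence boundary) target, so n
    sentences yield n-1 positive targets.
    """
    n = random.randint(1, min(max_sentences, len(lines)))
    sentences = random.sample(lines, n)
    tokens, targets = [], []
    for i, sentence in enumerate(sentences):
        subwords = sentence.split()  # stand-in for SentencePiece tokenization
        is_boundary = i < len(sentences) - 1  # concatenation points only
        tokens.extend(subwords)
        targets.extend([0] * (len(subwords) - 1) + [1 if is_boundary else 0])
    # Truncate to the first `max_len` subwords, as described above.
    return tokens[:max_len], targets[:max_len]
```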
+ 50% of input texts were lower-cased for both the tokenizer and classification models.
+ This provides some augmentation, but more importantly allows this model to be inserted into an NLP pipeline either before or after true-casing.
+ Using this model before true-casing would allow the true-casing model to exploit the conditional probability of sentence boundaries w.r.t. capitalization.
+
+ ### Language-Specific Rules
+ The training data was pre-processed for language-specific punctuation and spacing rules.
+
+ The following guidelines were used during training. If inference inputs differ, the model may perform poorly.
+
+ * All spaces were removed from continuous-script languages (Chinese, Japanese).
+ * Chinese: Chinese and Japanese use full-width periods "。", question marks "?", and commas ",".
+ * Hindi/Bengali: These languages use the danda "।" as a full-stop, not ".".
+ * Arabic: Uses reverse question marks "؟", not "?".
+
  # Model Inputs and Outputs
  The model inputs should be **punctuated** texts.
 
  The inputs should be packed into a batch with shape `[B, T]`, with padding being the SPE model's `<pad>` token ID.
+ The `<pad>` ID is required to generate a proper attention mask.
+
+ The model was trained on a maximum sequence length of 256 (subwords), and may crash or perform poorly if a longer batch is processed.
  Optimal handling of longer sequences would require some inference-time logic (wrapping/overlapping inputs and re-combining outputs).
 
  For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
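The splitting this implies can be sketched as below. `split_on_boundaries` is an illustrative helper (with an assumed default threshold of 0.5) operating on token IDs and per-token probabilities; the repo's example script shows the actual end-to-end usage with `sentencepiece` and `onnxruntime`.

```python
def split_on_boundaries(ids, probs, threshold=0.5):
    """Split a token-ID sequence into per-sentence chunks wherever the
    predicted boundary probability exceeds the threshold."""
    sentences, start = [], 0
    for t, p in enumerate(probs):
        if p > threshold:  # subword t is predicted to end a sentence
            sentences.append(ids[start : t + 1])
            start = t + 1
    if start < len(ids):  # trailing tokens with no predicted boundary
        sentences.append(ids[start:])
    return sentences
```

Each returned chunk of IDs can then be decoded back to text with the SPE model.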
 
  # Example Usage
 
  The predictions are applied to the input by separating the token sequence where the predicted value exceeds a threshold for sentence boundary classification.
 
+ This model can be run directly with a couple of dependencies which most developers may already have installed.
 
+ The following snippet will install the dependencies, clone this repo, and run an example script which points to the local files.
 
  ```bash
  $ pip install sentencepiece onnxruntime
 
  let him go.
  let him go.
  let me see your license and i.d. card.
  ```
 
  </details>
 
 
+ # Limitations and known issues
+ This is a prototype model and has some issues. These will be improved in a later version.
 
  If you're interested in any particular aspect being improved, let me know for the next version.
 
 
 
  This also results in longer-than-necessary sequences of short tokens, but that shouldn't be visible on the surface given that this is a very small, fast model.
 
  ## Noisy training data
+ This model was trained on `OpenSubtitles`, data which is notoriously noisy. The model may have learned some bad habits from this data.
+
+ ## Language-specific expectations
+ As discussed in a previous section, each language should be formatted and punctuated per that language's rules.
+
+ E.g., Chinese text should contain full-width periods, not Latin periods, and should contain no spaces.
+
+ In practice, data often does not adhere to these rules, but the model has not been augmented to deal with this potential issue.
+
+ ## Metrics
+ It's difficult to properly evaluate this model, since we rely on the proposition that the input data contains exactly one sentence per line.
+ In reality, the data sets used thus far are noisy and often contain more than one sentence per line.
+
+ Metrics are not published for now, and evaluation is limited to manual spot-checking.
+
+ Suitable test sets for this task are being investigated.