1-800-BAD-CODE committed
Commit f25200c • 1 Parent(s): 0647eaf
Update README.md

README.md

- zh
---

# Model Overview

This model performs sentence boundary detection (SBD) in 25 common languages.

This model segments a long, punctuated text into one or more constituent sentences.

A key feature is that the model is multilingual and language-agnostic at inference time. Therefore, language tags do not need to be used, and a single batch can contain multiple languages.

## Architecture

This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier.

Given that this is a relatively easy NLP task, the model contains ~5M parameters (~4M of which are embeddings). This makes the model very fast and cheap at inference time, as SBD should be.

The BERT encoder is based on the following configuration:

* 512 intermediate/ff dim
* 32000 embeddings/vocab tokens
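
For readers who prefer code, here is a minimal PyTorch sketch of the stack described above (embeddings, BERT-style encoder, per-token linear classifier). Only the feed-forward dim (512) and vocab size (32000) are stated in this card; the hidden size, depth, and head count below are illustrative assumptions (hidden size 128 is chosen only because it puts the embedding table near the stated ~4M parameters).

```python
import torch
import torch.nn as nn

class SentenceBoundaryClassifier(nn.Module):
    """Sketch: embeddings -> small BERT-style encoder -> per-token classifier.
    hidden/layers/heads are assumptions; ff_dim and vocab_size come from the
    configuration listed above."""

    def __init__(self, vocab_size=32000, hidden=128, ff_dim=512, layers=4, heads=4):
        super().__init__()
        # 32000 x 128 ~= 4.1M parameters, in line with "~4M of which are embeddings"
        self.embedding = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=ff_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, ids: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # ids: [B, T] subword IDs; pad_mask: [B, T], True at padding positions
        x = self.encoder(self.embedding(ids), src_key_padding_mask=pad_mask)
        # probability that each subword is the final token of a sentence
        return torch.sigmoid(self.classifier(x)).squeeze(-1)
```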

## Training

This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.

The model was trained on an A100 for ~150k steps with a batch size of 256.

### Training Data

This model was trained on `OpenSubtitles`. Although this corpus is very noisy, it is one of the few large-scale text corpora which have been manually segmented.

We must avoid using an automatically-segmented corpus for at least two reasons:

1. Our deep-learning model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (i.e., me).

Heuristics were used to attempt to clean the data before training. Some examples of the cleaning, sketched in code after this list, are:

* Drop sentences which start with a lower-case letter, assuming these lines are errorful.
* For inputs that do not end with a full stop, append the default full stop for that language, assuming that full stops are not important for single-sentence declarative subtitles.
* Drop inputs that have more than 20 words (or 32 characters, for continuous-script languages), assuming these lines contain more than one sentence, which prevents the creation of reliable targets.
* Drop objectively junk lines: all punctuation/special characters, empty lines, etc.
* Normalize punctuation: no more than one consecutive punctuation token (except Spanish, where inverted punctuation can appear after non-inverted punctuation).
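
A minimal sketch of a few of these heuristics, assuming each input line is tagged with a language code; the `DEFAULT_FULL_STOP` table and the function itself are illustrative, not the actual training code:

```python
import re

# Illustrative defaults; the real per-language table is an assumption here.
DEFAULT_FULL_STOP = {"en": ".", "zh": "。", "ja": "。", "hi": "।", "bn": "।"}
FULL_STOPS = ".!?。؟।"

def clean_line(line: str, lang: str):
    """Return a cleaned line, or None if the line should be dropped."""
    line = line.strip()
    if not line or not any(c.isalnum() for c in line):
        return None  # junk: empty, or all punctuation/special characters
    if line[0].islower():
        return None  # starts lower-case: assume the line is errorful
    too_long = len(line) > 32 if lang in ("zh", "ja") else len(line.split()) > 20
    if too_long:
        return None  # probably multiple sentences: targets would be unreliable
    if line[-1] not in FULL_STOPS:
        line += DEFAULT_FULL_STOP.get(lang, ".")  # append default full stop
    # normalize runs of repeated punctuation down to a single token
    line = re.sub(r"([.!?。؟।])\1+", r"\1", line)
    return line
```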

### Example Generation

To create examples for the model, we

1. Assume each input line is exactly one sentence
2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets

For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).

This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long. If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
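
A sketch of this procedure, assuming a loaded `SentencePiece` processor; the function and its sampling details are illustrative rather than the exact training code:

```python
import random
import sentencepiece as spm

def make_example(lines, sp: spm.SentencePieceProcessor, max_len=256):
    """Concatenate 1-9 single-sentence lines into one example.
    A target of 1 marks the last subword of each non-final sentence,
    so n sentences yield n-1 positive targets (0 to 8 here)."""
    sentences = random.sample(lines, random.randint(1, 9))
    ids, targets = [], []
    for i, sent in enumerate(sentences):
        piece_ids = sp.encode(sent, out_type=int)
        boundary = 1 if i < len(sentences) - 1 else 0  # concatenation points only
        ids.extend(piece_ids)
        targets.extend([0] * (len(piece_ids) - 1) + [boundary])
    # truncate over-long examples to the first 256 subwords
    return ids[:max_len], targets[:max_len]
```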

50% of input texts were lower-cased for both the tokenizer and classification models. This provides some augmentation, but more importantly allows this model to be inserted into an NLP pipeline either before or after true-casing. Using this model before true-casing would allow the true-casing model to exploit the conditional probability of sentence boundaries w.r.t. capitalization.

### Language-Specific Rules

The training data was pre-processed for language-specific punctuation and spacing rules. The following guidelines were used during training; if inference inputs differ, the model may perform poorly.

* All spaces were removed from continuous-script languages (Chinese, Japanese).
* Chinese/Japanese: these languages use full-width periods "。", question marks "?", and commas ",".
* Hindi/Bengali: these languages use the danda "।" as a full stop, not ".".
* Arabic: uses reversed question marks "؟", not "?".
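
To make these rules concrete, here are examples of well-formed inputs (the sentences themselves are invented for illustration):

```python
# Invented examples of inputs formatted per the rules above.
well_formed = {
    "zh": "你好。你叫什么名字?",  # full-width punctuation, no spaces
    "hi": "नमस्ते। आप कैसे हैं?",  # danda as the full stop
    "ar": "مرحبا. كيف حالك؟",  # reversed question mark
}
```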

# Model Inputs and Outputs

The model inputs should be **punctuated** texts.

The inputs should be packed into a batch with shape `[B, T]`, with padding being the SPE model's `<pad>` token ID. The `<pad>` ID is required to generate a proper attention mask.
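
A minimal sketch of this packing, assuming the `SentencePiece` model is available locally as `spe.model` (the file name and example texts are assumptions):

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spe.model")  # assumed file name
texts = ["hello world. how are you?", "i am fine. thank you."]

ids = [sp.encode(t, out_type=int) for t in texts]
pad_id = sp.pad_id()  # the SPE model's <pad> token ID
T = max(len(x) for x in ids)

# Pack into [B, T], padding with <pad> so an attention mask can be derived.
batch = np.full((len(ids), T), pad_id, dtype=np.int64)
for i, x in enumerate(ids):
    batch[i, : len(x)] = x
attention_mask = batch != pad_id  # True at real tokens
```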

The model was trained on a maximum sequence length of 256 (subwords) and may crash or perform poorly if a longer batch is processed. Optimal handling of longer sequences would require some inference-time logic (wrapping/overlapping inputs and re-combining outputs), sketched below.
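
One possible shape for that logic (a sketch, not something shipped with the model): split over-long token sequences into overlapping windows, then trust each window's predictions only away from its edges.

```python
def chunk_ids(ids, max_len=256, overlap=32):
    """Split an over-long token sequence into overlapping windows.
    When re-combining, keep predictions from each window only outside
    the overlap region, where the model has full left/right context."""
    if len(ids) <= max_len:
        return [ids]
    stride = max_len - overlap
    return [ids[i : i + max_len] for i in range(0, len(ids) - overlap, stride)]
```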

For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).

# Example Usage

This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.

The predictions are applied to the input by separating the token sequence where the predicted value exceeds a threshold for sentence boundary classification.
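
Continuing the packing sketch above, here is a minimal `onnxruntime` loop that applies such a threshold; the model file name, input name, and output semantics (probabilities vs. logits) are assumptions to verify against the repo's example script.

```python
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")  # assumed file name
input_name = session.get_inputs()[0].name
probs = session.run(None, {input_name: batch})[0]  # assumed [B, T] probabilities

threshold = 0.5
for row_ids, row_probs in zip(ids, probs):
    sentences, start = [], 0
    for t in range(len(row_ids)):  # iterate real tokens, skipping padding
        if row_probs[t] > threshold:  # subword t ends a sentence: split here
            sentences.append(sp.decode(row_ids[start : t + 1]))
            start = t + 1
    if start < len(row_ids):
        sentences.append(sp.decode(row_ids[start:]))  # trailing sentence
    print(sentences)
```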

This model can be run directly with a couple of dependencies which most developers may already have installed.

The following snippet will install the dependencies, clone this repo, and run an example script which points to the local files.

```bash
$ pip install sentencepiece onnxruntime
```

Outputs:

```
let him go.
let him go.
let me see your license and i.d. card.
```

</details>

# Limitations and known issues

This is a prototype model and has some issues. These will be improved in a later version. If you're interested in any particular aspect being improved, let me know for the next version.

Chinese has a lot of out-of-vocabulary tokens, which will manifest as the unknown token. This also results in longer-than-necessary sequences of short tokens, but that shouldn't be visible on the surface given that this is a very small, fast model.

## Noisy training data

This model was trained on `OpenSubtitles`, data which is notoriously noisy. The model may have learned some bad habits from this data.

## Language-specific expectations

As discussed in a previous section, each language should be formatted and punctuated per that language's rules. E.g., Chinese text should contain full-width periods, not Latin periods, and no spaces.

In practice, data often does not adhere to these rules, but the model has not been augmented to deal with this potential issue.

## Metrics

It's difficult to properly evaluate this model, since we rely on the proposition that the input data contains exactly one sentence per line. In reality, the data sets used thus far are noisy and often contain more than one sentence per line.

Metrics are not published for now, and evaluation is limited to manual spot-checking. Sufficient test sets for this analytic are being sought.
|