1-800-BAD-CODE committed
Commit 138a7d5
Parent(s): 7eb80ca
Update README.md

README.md CHANGED
@@ -39,8 +39,11 @@ This model segments a long, punctuated text into one or more constituent sentences.
A key feature is that the model is multi-lingual and language-agnostic at inference time.
Therefore, language tags do not need to be used and a single batch can contain multiple languages.

+As emphasized later in this card, this is a prototype model and there will be future versions which are cheap to train.
+Feel free to provide input, suggestions, or requests in a discussion.
+
## Architecture
-This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier.
+This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.

Given that this is a relatively easy NLP task, the model contains \~5M parameters (\~4M of which are embeddings).
This makes the model very fast and cheap at inference time, as SBD should be.
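For readers who want to picture the pieces, here is a minimal sketch of the architecture described in the hunk above. It assumes PyTorch; the class name, hidden size, depth, and head count are illustrative guesses (a hidden size of 128 is only inferred from the 32k vocabulary and the \~4M embedding parameters), and positional encodings are omitted for brevity. It is not the NeMo implementation actually used.

```python
# Minimal, hypothetical sketch of the described architecture (not the NeMo implementation).
import torch
import torch.nn as nn


class SentenceBoundaryDetector(nn.Module):
    """SentencePiece IDs -> BERT-style encoder -> per-subword boundary logits."""

    def __init__(self, vocab_size: int = 32000, hidden: int = 128, layers: int = 2, heads: int = 4):
        super().__init__()
        # With a 32k vocabulary, this embedding table holds roughly 4M of the ~5M parameters.
        self.embedding = nn.Embedding(vocab_size, hidden)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        # One binary decision per subword: "does a sentence end here?"
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden_states = self.encoder(self.embedding(input_ids))
        return self.classifier(hidden_states).squeeze(-1)  # [batch, seq_len] logits


# A batch may mix languages freely, since no language tags are used.
logits = SentenceBoundaryDetector()(torch.randint(0, 32000, (2, 256)))
print(logits.shape)  # torch.Size([2, 256])
```

At inference time, a sigmoid over the per-subword logits gives boundary probabilities that can be thresholded to split the text.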
@@ -56,8 +59,8 @@ The BERT encoder is based on the following configuration:
## Training
This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.

-
-Model was
+Training was performed on an A100 for \~150k steps with a batch size of 256, within a $3 budget on the [Lambda cloud](https://cloud.lambdalabs.com/).
+The model was roughly converged after 25M training sentences (1M per language).

### Training Data
This model was trained on `OpenSubtitles`.
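The card does not state the training objective, but a per-subword binary classifier like this is typically trained with binary cross-entropy over the boundary logits. The snippet below is an assumed, self-contained sketch of one such step; the tiny stand-in model and all hyperparameters are illustrative, not the values used for this checkpoint.

```python
# Assumed objective: per-subword binary cross-entropy on the boundary logits.
# Everything here (model, optimizer, learning rate) is an illustrative stand-in.
import torch
import torch.nn as nn


class TinySBD(nn.Module):
    def __init__(self, vocab_size: int = 32000, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.classifier = nn.Linear(hidden, 1)  # encoder omitted for brevity

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.embedding(input_ids)).squeeze(-1)


model = TinySBD()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

# One step; random tensors stand in for a batch of 256 tokenized examples.
input_ids = torch.randint(0, 32000, (256, 256))    # [batch, seq_len] subword IDs
targets = torch.randint(0, 2, (256, 256)).float()  # 1.0 where a subword ends a sentence

optimizer.zero_grad()
loss = criterion(model(input_ids), targets)
loss.backward()
optimizer.step()
print(float(loss))
```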
@@ -67,7 +70,7 @@ Although this corpus is very noisy, it is one of few large-scale text corpora wh
We must avoid using an automatically-segmented corpus for at least two reasons:

1. Our deep-learning model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
-2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (
+2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (e.g., me).

Heuristics were used to attempt to clean the data before training.
Some examples of the cleaning are:
@@ -85,6 +88,7 @@ To create examples for the model, we
2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets

For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).
+The number of sentences to use was chosen uniformly at random, so each example had, on average, 4 sentence boundaries.

This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long.
If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
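A self-contained sketch of that recipe is below. The whitespace tokenizer and the `make_example` helper are stand-ins for illustration only; the real pipeline uses the `SentencePiece` tokenizer and lives in the NeMo fork.

```python
# Illustrative sketch of example construction: concatenate 1-9 sentences and mark the
# concatenation points as targets. The whitespace tokenizer is a stand-in for SentencePiece.
import random
from typing import Callable, List, Tuple

MAX_LEN = 256  # maximum sequence length used by this model


def make_example(
    sentences: List[str],
    tokenize: Callable[[str], List[str]],
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    # Uniform over 1..9 sentences, i.e. 0..8 boundaries (4 on average).
    num_sentences = rng.randint(1, 9)
    chosen = [sentences[rng.randrange(len(sentences))] for _ in range(num_sentences)]

    tokens: List[str] = []
    targets: List[int] = []
    for i, sentence in enumerate(chosen):
        subwords = tokenize(sentence)
        tokens.extend(subwords)
        # The last subword of every sentence except the final one is a positive target.
        is_boundary = 1 if i < len(chosen) - 1 else 0
        targets.extend([0] * (len(subwords) - 1) + [is_boundary])

    # If the concatenation exceeds the maximum length, keep only the first 256 subwords.
    return tokens[:MAX_LEN], targets[:MAX_LEN]


corpus = ["how are you", "fine thanks", "see you tomorrow"]
tokens, targets = make_example(corpus, tokenize=str.split, rng=random.Random(0))
print(list(zip(tokens, targets)))
```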
@@ -337,7 +341,7 @@ If you're interested in any particular aspect being improved, let me know for th
## Limited vocabulary
This model has 25 languages and a tokenizer with only 32k tokens.

-Chinese has a lot of out-of-vocabulary tokens, which will
+Chinese has a lot of out-of-vocabulary tokens, which will manifest as the unknown token's surface form appearing in the outputs for some Chinese texts.

This also results in longer-than-necessary sequences of short tokens, but that shouldn't be visible on the surface given that this is a very small, fast model.
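For illustration, the effect can be reproduced with the `sentencepiece` library roughly as follows; the model path is hypothetical, and whether the unknown token surfaces as `<unk>` or ` ⁇ ` depends on how the tokenizer was built.

```python
# Rough illustration of the out-of-vocabulary effect; "tokenizer.model" is a hypothetical
# path to this model's 32k-token SentencePiece model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "这是一个罕见字的句子。"  # a Chinese sentence that may contain rare characters
ids = sp.encode(text, out_type=int)

# Characters not covered by the 32k vocabulary map to the unknown ID, and decoding them
# reproduces the unknown token's surface form instead of the original character.
print([i == sp.unk_id() for i in ids])
print(sp.decode(ids))
```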