1-800-BAD-CODE committed acb3269 (parent: 3e20b04): Update README.md

README.md CHANGED
---
license: apache-2.0
library_name: onnx
language:
- ar
- bn

# Model Overview
This model performs sentence boundary detection (SBD) for 25 languages.

This model segments a long, punctuated text into one or more constituent sentences.

# Model Architecture
This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT encoder, and a linear decoder/classifier.

Given that this is an easy NLP task, the model contains ~5M parameters.

The BERT encoder is based on the following configuration (see the sketch after this list):

* 8 heads
* 4 layers
* 128 hidden dim
* 512 intermediate/ff dim
* 32000 embeddings/vocab tokens
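
As a rough illustration, these dimensions could be expressed with `transformers.BertConfig`; this mapping is an assumption for readability, since the actual training configuration is not part of this commit:

```python
from transformers import BertConfig

# Illustrative only: the dimensions listed above mapped onto a BertConfig.
# The real encoder config used to train this model is not included here.
encoder_config = BertConfig(
    vocab_size=32000,        # embeddings / vocab tokens
    hidden_size=128,         # hidden dim
    num_hidden_layers=4,     # layers
    num_attention_heads=8,   # heads
    intermediate_size=512,   # intermediate / feed-forward dim
)
```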

# Model Inputs and Outputs
The model inputs should be **punctuated** texts.

The classification and SPE models have both been trained with 50% of the training data lower-cased.
The model should perform similarly with either lower- or true-cased data.
All-capitalized text will probably not work well.

The inputs should be packed into a batch with shape `[B, T]`, with padding being the SPE model's `<pad>` token ID.

For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
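
A minimal sketch of packing a `[B, T]` batch with `sentencepiece`; the tokenizer file name `spe.model` and the `<pad>` piece lookup are assumptions, so check the files shipped in this repo:

```python
import numpy as np
import sentencepiece as spm

# Assumed file name for the SPE tokenizer distributed with the model.
sp = spm.SentencePieceProcessor(model_file="spe.model")
pad_id = sp.piece_to_id("<pad>")  # assumes the SPE model defines a <pad> piece

texts = [
    "hello world. how are you?",
    "this is a test.",
]

# Encode each text and right-pad to the longest sequence: shape [B, T].
ids = [sp.encode(t) for t in texts]
max_len = max(len(x) for x in ids)
input_ids = np.full((len(ids), max_len), pad_id, dtype=np.int64)
for row, x in enumerate(ids):
    input_ids[row, : len(x)] = x
```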

## Language input specifics
The training data was pre-processed for language-specific punctuation and spacing rules (examples follow the list).

* All spaces were removed from continuous-script languages (Chinese, Japanese). Inputs in these languages should not contain spaces.
* Chinese punctuation: Chinese and Japanese use full-width periods, question marks, and commas. Chinese input with Latin punctuation may not work well.
* Hindi/Bengali punctuation: These languages use the danda `।` as a full stop, not `.`.
* Arabic punctuation: Arabic uses the reversed question mark `؟`, not `?`.
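
For illustration, a few hypothetical inputs that follow these conventions:

```python
# Hypothetical example inputs following the language-specific rules above.
examples = {
    "zh": "你好吗？我很好。",           # no spaces, full-width punctuation
    "ja": "お元気ですか？元気です。",    # no spaces, full-width punctuation
    "hi": "आप कैसे हैं। मैं ठीक हूँ।",     # danda as the full stop
    "ar": "كيف حالك؟ أنا بخير.",         # Arabic question mark
}
```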

# Example Usage

This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.

The predictions are applied to the input by splitting the token sequence wherever the predicted sentence-boundary probability exceeds a threshold.

```bash
# Assumed dependencies for the example below; this commit does not pin exact requirements.
pip install onnxruntime sentencepiece numpy
```
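
A minimal end-to-end sketch with `onnxruntime`. The file names (`model.onnx`, `spe.model`), the ONNX input/output names, and the exact output layout are assumptions here; inspect the repo files and `session.get_inputs()` / `session.get_outputs()` for the real values.

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Assumed file names; substitute the actual files from this repo.
sp = spm.SentencePieceProcessor(model_file="spe.model")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

text = "hello world. how are you? i am fine."
ids = sp.encode(text)                        # subword IDs, length T
input_ids = np.array([ids], dtype=np.int64)  # batch of one: shape [1, T]

# Assumes a single input tensor and that the first output holds per-token
# boundary probabilities of shape [B, T]; adjust if the exported graph differs.
input_name = session.get_inputs()[0].name
probs = session.run(None, {input_name: input_ids})[0][0]

# Split the token sequence wherever the boundary probability exceeds a threshold.
threshold = 0.5
sentences, start = [], 0
for t, p in enumerate(probs):
    if p > threshold:
        sentences.append(sp.decode(ids[start : t + 1]))
        start = t + 1
if start < len(ids):
    sentences.append(sp.decode(ids[start:]))

print(sentences)
```

The 0.5 threshold is only a placeholder; the card above just says "a threshold", so tune it for your data.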

# Known Issues
This is essentially a prototype model, and it has some known issues. These will be improved in a later version.

If you're interested in any particular aspect being improved, let me know for the next version.

## Limited vocabulary
This model covers 25 languages with a tokenizer of only 32k tokens.

Chinese has a lot of out-of-vocabulary tokens, which will manifest as the unknown-token surface appearing in the output for some Chinese texts.

This also results in longer-than-necessary sequences of short tokens, but that shouldn't be visible on the surface given that this is a very small, fast model.

## Noisy training data
This model was trained on `OpenSubtitles`, which is notoriously noisy data. The model may have learned some bad habits from it.