1-800-BAD-CODE committed on
Commit acb3269
1 Parent(s): 3e20b04

Update README.md

Files changed (1):
  1. README.md +50 -5
README.md CHANGED
@@ -1,5 +1,6 @@
  ---
  license: apache-2.0
  language:
  - ar
  - bn
@@ -31,19 +32,63 @@ language:
  # Model Overview
  This model performs sentence boundary prediction (SBD) with 25 languages.

- This model accepts as input arbitraily-long, punctuated texts and produces as output the consituent sentences of the input.

  # Model Architecture
- This is a data-driven approach to SBD.

- Input texts are encoded with a SentencePiece model, then encoded with a BERT-style encoder, then projected to sentence boundary probabilities via a linear layer.

- For each input token `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).

  # Example Usage

- This model has been exported to ONNX alongside an SentencePiece tokenizer.

  ```bash

  ```
  ---
  license: apache-2.0
+ library_name: onnx
  language:
  - ar
  - bn

  # Model Overview
  This model performs sentence boundary prediction (SBD) with 25 languages.

+ This model segments a long, punctuated text into one or more constituent sentences.
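
For illustration, a hypothetical input/output pair for this segmentation task (the text below is an invented example, not taken from the model card or training data):

```python
# Hypothetical illustration of the task: one punctuated string in,
# its constituent sentences out.
text = "Hello there. How are you today? I'm fine, thanks."
expected_sentences = [
    "Hello there.",
    "How are you today?",
    "I'm fine, thanks.",
]
```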

  # Model Architecture
+ This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT encoder, and a linear decoder/classifier.

+ Given that SBD is a relatively easy NLP task, the model contains only ~5M parameters.

+ The BERT encoder is based on the following configuration:
+
+ * 8 heads
+ * 4 layers
+ * 128 hidden dim
+ * 512 intermediate/ff dim
+ * 32000 embeddings/vocab tokens
+
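
As a rough sketch of the scale, the numbers above correspond to something like the following hypothetical `transformers.BertConfig`; the released artifact is an ONNX graph, so this is an illustration only, not how the model actually ships:

```python
from transformers import BertConfig

# Hypothetical config mirroring the numbers listed above; illustration only.
config = BertConfig(
    vocab_size=32000,        # embeddings / vocab tokens
    hidden_size=128,         # hidden dim
    num_hidden_layers=4,     # layers
    num_attention_heads=8,   # heads
    intermediate_size=512,   # intermediate / feed-forward dim
)
```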
+ # Model Inputs and Outputs
+ The model inputs should be **punctuated** texts.
+
+ The classification and SPE models have both been trained with 50% of the training data lower-cased.
+ The model should perform similarly with either lower- or true-cased data.
+ All-capitalized text will probably not work well.
+
+ The inputs should be packed into a batch with shape `[B, T]`, with padding being the SPE model's `<pad>` token ID.
+
+ For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
+
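
A minimal sketch of packing a batch into `[B, T]` with the SPE `<pad>` ID; the tokenizer file name `spe.model` is an assumption, not a documented artifact name:

```python
import numpy as np
import sentencepiece as spm

# Assumed tokenizer file name; use the SPE model actually shipped with this repo.
sp = spm.SentencePieceProcessor(model_file="spe.model")
pad_id = sp.pad_id()  # the SPE model's <pad> token ID (may be -1 if undefined)

texts = ["Hello there. How are you?", "I'm fine, thanks."]
ids = [sp.encode(t, out_type=int) for t in texts]

# Pad every sequence to the longest length in the batch -> shape [B, T].
T = max(len(seq) for seq in ids)
batch = np.full((len(ids), T), pad_id, dtype=np.int64)
for i, seq in enumerate(ids):
    batch[i, : len(seq)] = seq
```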
+ ## Language input specifics
+ The training data was pre-processed for language-specific punctuation and spacing rules (see the examples after this list).
+
+ * All spaces were removed from continuous-script languages (Chinese, Japanese). Inputs in these languages should not contain spaces.
+ * Chinese punctuation: Chinese and Japanese use full-width periods, question marks, and commas. Chinese input with Latin punctuation may not work well.
+ * Hindi/Bengali punctuation: These languages use the danda `।` as a full stop, not a `.`.
+ * Arabic punctuation: Arabic uses the reversed question mark `؟`, not `?`.
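
For example, well-formed inputs following these conventions would look roughly like the invented samples below:

```python
# Illustrative inputs following the conventions above (invented examples).
examples = {
    "zh": "你好。你好吗?",            # full-width punctuation, no spaces
    "ja": "こんにちは。元気ですか?",  # full-width punctuation, no spaces
    "hi": "नमस्ते। आप कैसे हैं?",        # danda as the full stop
    "ar": "مرحبا. كيف حالك؟",          # reversed question mark
}
```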

  # Example Usage

+ This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.
+
+ Predictions are applied to the input by splitting the token sequence at each position where the predicted boundary probability exceeds a classification threshold.
+

  ```bash

  ```
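
Below is a hedged sketch of the flow described above: encode with `SentencePiece`, run the ONNX model, and split the token sequence where the boundary probability exceeds a threshold. The file names, the single-input/single-output assumption, and the `[B, T]` probability output shape are assumptions, not documented specifics.

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Assumed artifact names; substitute the files actually shipped with this repo.
sp = spm.SentencePieceProcessor(model_file="spe.model")
session = ort.InferenceSession("model.onnx")

text = "Hello there. How are you today? I'm fine, thanks."
ids = sp.encode(text, out_type=int)
batch = np.array([ids], dtype=np.int64)  # shape [B=1, T]

# Assumed single input and single output; check session.get_inputs() /
# session.get_outputs() for the real tensor names and shapes.
input_name = session.get_inputs()[0].name
probs = session.run(None, {input_name: batch})[0][0]  # assumed shape [T]: P(boundary)

# Split the token sequence wherever the predicted probability exceeds a threshold.
threshold = 0.5
sentences, start = [], 0
for t, p in enumerate(probs):
    if p > threshold:
        sentences.append(sp.decode(ids[start : t + 1]))
        start = t + 1
if start < len(ids):
    sentences.append(sp.decode(ids[start:]))

print(sentences)
```

If the exported graph emits logits rather than probabilities, apply a sigmoid before thresholding.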
+
+ # Known Issues
+ This is essentially a prototype model and has some known issues. These will be improved in a later version.
+
+ If you'd like any particular aspect improved, let me know for the next version.
+
+ ## Limited vocabulary
+ This model covers 25 languages with a tokenizer of only 32k tokens.
+
+ Chinese has a lot of out-of-vocabulary tokens, which will manifest as the unknown token's surface form appearing in the outputs for some Chinese texts.
+
+ This also results in longer-than-necessary sequences of short tokens, but that shouldn't be noticeable given how small and fast the model is.
+
+ ## Noisy training data
+ This model was trained on `OpenSubtitles`, a dataset which is notoriously noisy. The model may have learned some bad habits from this data.
+