---
license: apache-2.0
library_name: onnx
language:
  - ar
  - bn
  - de 
  - en
  - es
  - et
  - fi
  - fr
  - hi
  - id
  - is
  - it
  - ja
  - lt
  - lv
  - ko
  - nl
  - no
  - pl
  - pt
  - ru
  - tr
  - sv
  - uk
  - zh
---


# Model Overview

This model performs sentence boundary detection (SBD) for 25 common languages.

This model segments a long, punctuated text into one or more constituent sentences.

A key feature is that the model is multi-lingual and language-agnostic at inference time. 
Therefore, language tags do not need to be used and a single batch can contain multiple languages.

As emphasized later in this card, this is a prototype model; future versions, which are cheap to train, will follow.
Feel free to provide input, suggestions, or requests in a discussion.

## Architecture
This is a data-driven approach to SBD. The model uses a `SentencePiece` tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.

Given that this is a relatively easy NLP task, the model contains \~5M parameters (\~4M of which are embeddings).
This makes the model very fast and cheap at inference time, as SBD should be.

The BERT encoder is based on the following configuration (an illustrative, roughly-equivalent config is sketched after the list):

* 8 heads
* 4 layers
* 128 hidden dim
* 512 intermediate/ff dim
* 32000 embeddings/vocab tokens
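
As an illustration of the model's size only (the actual model was built and trained in a NeMo fork, linked below), a roughly equivalent encoder can be instantiated with the Hugging Face `transformers` library:

```python
# Illustrative only: an encoder with roughly the same shape as this model.
# The real model was implemented in a NeMo fork, not with `transformers`.
from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=32000,             # SentencePiece vocabulary size
    hidden_size=128,              # hidden dim
    num_hidden_layers=4,          # layers
    num_attention_heads=8,        # heads
    intermediate_size=512,        # feed-forward dim
    max_position_embeddings=256,  # matches the 256-subword training length
)
encoder = BertModel(config)
print(sum(p.numel() for p in encoder.parameters()))  # ~5M, mostly embeddings
```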

## Training
This model was trained on a personal fork of [NeMo](http://github.com/NVIDIA/NeMo), specifically this [sbd](https://github.com/1-800-BAD-CODE/NeMo/tree/sbd) branch.

Training was performed on an A100 for \~150k steps with a batch size of 256, with a $3 budget on the [Lambda cloud](https://cloud.lambdalabs.com/).
The model had roughly converged after 25M training sentences (1M per language).

### Training Data
This model was trained on `OpenSubtitles`. 

Although this corpus is very noisy, it is one of the few large-scale text corpora which have been manually segmented.

We must avoid using an automatically-segmented corpus, for at least two reasons:

1. Our deep-learning model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
2. Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (e.g., me).

Heuristics were used to attempt to clean the data before training. 
Some examples of the cleaning rules (a rough code sketch follows the list) are:

* Drop sentences which start with a lower-case letter. Assume these lines are erroneous.
* For inputs that do not end with a full stop, append the default full stop for that language. Assume that subtitles often omit the final full stop of a single declarative sentence.
* Drop inputs that have more than 20 words (or 32 chars, for continuous-script languages). Assume these lines contain more than one sentence, and therefore we cannot create reliable targets.
* Drop objectively junk lines: all punctuation/special characters, empty lines, etc.
* Normalize punctuation: no more than one consecutive punctuation token (except Spanish, where inverted punctuation can appear after non-inverted punctuation).
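
As a rough, hypothetical sketch of these filters (not the actual preprocessing code; the language codes, thresholds, and full-stop table below are illustrative, and punctuation normalization is omitted):

```python
import re
from typing import Optional

# Hypothetical sketch of the cleaning heuristics described above; not the real code.
DEFAULT_FULL_STOP = {"en": ".", "hi": "।", "zh": "。"}  # partial, per-language table
SENTENCE_FINAL = set(".!?。?।؟")

def clean_line(line: str, lang: str) -> Optional[str]:
    line = line.strip()
    # Drop objectively junk lines: empty, or no word characters at all.
    if not line or not re.search(r"\w", line):
        return None
    # Drop lines starting with a lower-case letter (assumed erroneous).
    if line[0].islower():
        return None
    # Drop lines that probably contain more than one sentence.
    too_long = len(line) > 32 if lang in {"zh", "ja"} else len(line.split()) > 20
    if too_long:
        return None
    # Append the language's default full stop if the line does not end with one.
    if line[-1] not in SENTENCE_FINAL:
        line += DEFAULT_FULL_STOP.get(lang, ".")
    return line
```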

### Example Generation
To create examples for the model, we:

1. Assume each input line is exactly one sentence
2. Concatenate sentences together, with the concatenation points becoming the sentence boundary targets

For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries).
The number of sentences was chosen uniformly at random, so each example had, on average, 4 sentence boundaries.

This model uses a maximum sequence length of 256, which for `OpenSubtitles` is relatively long. 
If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.

50% of input texts were lower-cased for both the tokenizer and classification models.
This provides some augmentation, but more importantly allows this model to be inserted into an NLP pipeline either before or after true-casing.
Using this model before true-casing would allow the true-casing model to exploit the conditional probability of sentence boundaries w.r.t. capitalization.
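
A hypothetical sketch of this example-generation scheme (not the actual NeMo data pipeline; the tokenizer path is a placeholder):

```python
import random
import sentencepiece as spm

# Hypothetical sketch of example generation; not the actual NeMo data pipeline.
spe = spm.SentencePieceProcessor(model_file="spe.model")  # placeholder path
MAX_LEN = 256

def make_example(sentences: list):
    """Concatenate 1-9 cleaned sentences; mark the last subword of each as a boundary."""
    chosen = random.sample(sentences, random.randint(1, 9))
    if random.random() < 0.5:  # 50% lower-casing augmentation
        chosen = [s.lower() for s in chosen]
    ids, targets = [], []
    for sent in chosen:
        piece_ids = spe.encode(sent)  # each sentence tokenized independently here
        if not piece_ids:
            continue
        ids.extend(piece_ids)
        targets.extend([0] * (len(piece_ids) - 1) + [1])  # boundary on final subword
    return ids[:MAX_LEN], targets[:MAX_LEN]  # truncate to the first 256 subwords
```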

### Language Specific Rules
The training data was pre-processed for language-specific punctuation and spacing rules.

The following guidelines were used during training; if inference inputs differ, the model may perform poorly. A small normalization sketch follows the list.

* All spaces were removed from continuous-script languages (Chinese, Japanese).
* Chinese/Japanese: These languages use full-width periods "。", question marks "?", and commas ",". 
* Hindi/Bengali: These languages use the danda "।" as a full-stop, not ".".
* Arabic: Uses reverse question marks "؟", not "?".
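
If inference data does not already follow these conventions, a naive normalization pass toward them might look like the following (purely illustrative; the character mappings simply restate the rules above and do not handle edge cases such as decimal points):

```python
# Illustrative normalization toward the training-time conventions above; not exhaustive.
def normalize(text: str, lang: str) -> str:
    if lang in {"zh", "ja"}:
        text = text.replace(" ", "")  # continuous-script languages: remove spaces
        text = text.replace(".", "。").replace("?", "?").replace(",", ",")
    elif lang in {"hi", "bn"}:
        text = text.replace(".", "।")  # danda as the full stop
    elif lang == "ar":
        text = text.replace("?", "؟")  # reverse question mark
    return text
```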


# Model Inputs and Outputs
The model inputs should be **punctuated** texts.  

The inputs should be packed into a batch with shape `[B, T]`, with padding being the SPE model's `<pad>` token ID.
The `<pad>` ID is required to generate a proper attention mask.

The model was trained on a maximum sequence length of 256 (subwords), and may crash or perform poorly if a longer batch is processed. 
Optimal handling of longer sequences would require some inference-time logic (wrapping/overlapping inputs and re-combining outputs).

For each input subword `t`, this model predicts the probability that `t` is the final token of a sentence (i.e., a sentence boundary).
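
As a rough illustration of this I/O contract (not the repo's `run_example.py`): the file names, the `int64` input dtype, and an output of per-token boundary probabilities with shape `[B, T]` are assumptions here; inspect the exported graph or `run_example.py` to confirm the actual names and shapes.

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Hypothetical file names; use the tokenizer and ONNX graph shipped in this repo.
spe = spm.SentencePieceProcessor(model_file="spe.model")
session = ort.InferenceSession("model.onnx")

texts = [
    "R.J. MacReady, helicopter pilot, U.S. Outpost Number 31. How will I get to L.A.?",
    "C'est bon. Merci beaucoup.",
]
pad_id = spe.pad_id()

# Tokenize and pad to a common length (keep total length <= 256 subwords).
ids = [spe.encode(t) for t in texts]
max_len = max(len(x) for x in ids)
batch = np.full((len(ids), max_len), pad_id, dtype=np.int64)
for i, x in enumerate(ids):
    batch[i, : len(x)] = x

input_name = session.get_inputs()[0].name
probs = session.run(None, {input_name: batch})[0]  # assumed shape [B, T]

# Split each input at subwords whose boundary probability exceeds a threshold.
for i, x in enumerate(ids):
    start = 0
    for j in range(len(x)):
        if probs[i, j] > 0.5 or j == len(x) - 1:
            print(spe.decode(x[start : j + 1]))
            start = j + 1
```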


# Example Usage

This model has been exported to `ONNX` (opset 17) alongside the associated `SentencePiece` tokenizer.

This model runs with a script after checking out this repo; if there is any interest in running it via the HF API, let me know. 
For now, I assume no one cares.

This model can be run directly with a couple of dependencies which most developers may already have installed.

The following snippet will install the dependencies, clone this repo, and run an example script which points to the local files.

```bash
$ pip install sentencepiece onnxruntime
$ git clone https://huggingface.co/1-800-BAD-CODE/sentence_boundary_detection_multilang
$ cd sentence_boundary_detection_multilang
# Inspect the content before running an arbitrary file
# $ python run_example.py
```

This script runs the following examples:

* One text from each language. Randomly selected from the test sets, and not necessarily difficult or interesting.
* An English text with many acronyms. The model should not predict sentence boundaries simply because a period appears.
* An English text with many acronyms, all lower-cased.

We should see the following:

<details>
  <summary>Example output</summary>
  
```text
Example from each language:
Input 0: إنه يبدو غالياً قليلاً بالنسبة لمبنى يحتاج إلى بعض الإصلاحات، هذا كل ما فى الأمر. هذا جيد. ماما وادك بالمنزل.
Outputs:
	إنه يبدو غالياً قليلاً بالنسبة لمبنى يحتاج إلى بعض الإصلاحات، هذا كل ما فى الأمر.
	هذا جيد.
	ماما وادك بالمنزل.
	
Input 1: তবে ট্রেলারে তার দেখা মিলল না। পরে ব্যাগটি তল্লাশি করে ইয়াবা, চারটি স্বর্ণের বার ও নগদ সাড়ে চার লাখ টাকা পাওয়া যায়। এতে মাছটির দাম বেড়ে গেছে।
Outputs:
	তবে ট্রেলারে তার দেখা মিলল না।
	পরে ব্যাগটি তল্লাশি করে ইয়াবা, চারটি স্বর্ণের বার ও নগদ সাড়ে চার লাখ টাকা পাওয়া যায়।
	এতে মাছটির দাম বেড়ে গেছে।
	
Input 2: Auf alle Fälle sind 90 Minuten voller Einsatz und Konzentration gefordert. Amos Vogel legte seinen Fokus auf die Kontextualisierung der einzelnen Filme durch gezielte Programmarbeit. Georgi verspricht Es wird Feierlichkeiten geben.
Outputs:
	Auf alle Fälle sind 90 Minuten voller Einsatz und Konzentration gefordert.
	Amos Vogel legte seinen Fokus auf die Kontextualisierung der einzelnen Filme durch gezielte Programmarbeit.
	Georgi verspricht Es wird Feierlichkeiten geben.
	
Input 3: Irwin wrote on Instagram. However, money does talk and the Roosters reported $130000 offer might not be up to scratch. There are doctors, but since they joined the civil disobedience protest movement, they can't come to the hospital.
Outputs:
	Irwin wrote on Instagram.
	However, money does talk and the Roosters reported $130000 offer might not be up to scratch.
	There are doctors, but since they joined the civil disobedience protest movement, they can't come to the hospital.
	
Input 4: Y sí, efectivamente, nos estamos refiriendo a la NASA, la ESA, etc. También tenía un servicio de bar y venta de refrescos y bebidas alcohólicas, así como otros productos. En cuanto a la Libre Determinación, su importe se distribuirá de forma anual.
Outputs:
	Y sí, efectivamente, nos estamos refiriendo a la NASA, la ESA, etc.
	También tenía un servicio de bar y venta de refrescos y bebidas alcohólicas, así como otros productos.
	En cuanto a la Libre Determinación, su importe se distribuirá de forma anual.
	
Input 5: Hiljuti suure tagasituleku teinud Mike Tyson naaseb taas poksiringi. Riigikogu liikme Kert Kingo hinnangul on valitsus riigikokku toonud seaduseelnõu, millega tahetakse riigis kehtestada totalitaarset kontrolli. Kehakaalu tõusu tingib liigne energia tarbimine, mida organism ei jõua päeva jooksul ära kulutada.
Outputs:
	Hiljuti suure tagasituleku teinud Mike Tyson naaseb taas poksiringi.
	Riigikogu liikme Kert Kingo hinnangul on valitsus riigikokku toonud seaduseelnõu, millega tahetakse riigis kehtestada totalitaarset kontrolli.
	Kehakaalu tõusu tingib liigne energia tarbimine, mida organism ei jõua päeva jooksul ära kulutada.
	
Input 6: Yksi ongelma ratkaistu. Kuinka kauan olette olleet Meksikossa? Kävin jokin aika sitten tapaamassa komisario Fournieria tärkeässä asiassa.
Outputs:
	Yksi ongelma ratkaistu.
	Kuinka kauan olette olleet Meksikossa?
	Kävin jokin aika sitten tapaamassa komisario Fournieria tärkeässä asiassa.
	
Input 7: Avec cette série, on sent quelque chose. En rafale, il pourra dépasser les 65 kmh. Le mercure restera homogène à 16 °C. Désormais à 10 contre 10 avec un but de retard, Arsenal est complètement relancé dans cette demi finale aller.
Outputs:
	Avec cette série, on sent quelque chose.
	En rafale, il pourra dépasser les 65 kmh.
	Le mercure restera homogène à 16 °C.
	Désormais à 10 contre 10 avec un but de retard, Arsenal est complètement relancé dans cette demi finale aller.
	
Input 8: प्रभु चरणणों में यही विनती है कि हमारे विचाररों को सदैव पवित्र बनाए रखने की कृपा बनी रहे। भारत में यह सेरेमनी डिज्नी स्टार के चैनललों पर देखी जा सकती है। इस कार्रवाई से होटललों और ललांज में अफरा तफरी मच गई है।
Outputs:
	प्रभु चरणणों में यही विनती है कि हमारे विचाररों को सदैव पवित्र बनाए रखने की कृपा बनी रहे।
	भारत में यह सेरेमनी डिज्नी स्टार के चैनललों पर देखी जा सकती है।
	इस कार्रवाई से होटललों और ललांज में अफरा तफरी मच गई है।
	
Input 9: Masyarakat juga harus menjalankan protokol kesehatan dengan disiplin dan juga segera divaksinasi bagi yang belum. Pelaku ingin nikah, tetapi tidak direstui bapaknya. Untuk selanjutnya, setiap produksi Seven Bucks, baik TV, film atau apa pun, kami tidak akan lagi menggunakan senjata sungguhan.
Outputs:
	Masyarakat juga harus menjalankan protokol kesehatan dengan disiplin dan juga segera divaksinasi bagi yang belum.
	Pelaku ingin nikah, tetapi tidak direstui bapaknya.
	Untuk selanjutnya, setiap produksi Seven Bucks, baik TV, film atau apa pun, kami tidak akan lagi menggunakan senjata sungguhan.
	
Input 10: Síðustu vikurnar voru móður okkar hreint kvalræði. Helena Ólafsdóttir og Guðlaug Jónsdóttir knattspyrnukempur gengu í heilagt hjónaband um helgina. Þess vegna er alltaf verið að tala um að vera ekki við hraunjaðarinn.
Outputs:
	Síðustu vikurnar voru móður okkar hreint kvalræði.
	Helena Ólafsdóttir og Guðlaug Jónsdóttir knattspyrnukempur gengu í heilagt hjónaband um helgina.
	Þess vegna er alltaf verið að tala um að vera ekki við hraunjaðarinn.
	
Input 11: Credevo di potermi fidare di lei. Andare con loro, portarli fuori dal porto? I romani, prima di andarsene, sommersero la città. I tuoi antenati custodirono il segreto.
Outputs:
	Credevo di potermi fidare di lei.
	Andare con loro, portarli fuori dal porto?
	I romani, prima di andarsene, sommersero la città.
	I tuoi antenati custodirono il segreto.
	
Input 12: 毛先をそろえる程度で?持たないよ。レベッカ上品な物言いで頼む。またイェーガーに乗ったらあなたは死にます。
Outputs:
	毛先をそろえる程度で?
	持たないよ。
	レベッカ上品な物言いで頼む。
	またイェーガーに乗ったらあなたは死にます。
	
Input 13: Mokytojau, nusiimk skrybėlę. Kur aš tokia eisiu? Mes galime važiuoti kitą vakarą. Tik arbatos, ačiū.
Outputs:
	Mokytojau, nusiimk skrybėlę.
	Kur aš tokia eisiu?
	Mes galime važiuoti kitą vakarą.
	Tik arbatos, ačiū.
	
Input 14: Viņa ir gatava izveidot uzbrucēja fotorobotu. Jā. Dažas no labākajām snaudām mūžā esmu izbaudījis Karaliskajā Šekspīra teātrī. Kas viņu iedrošinās vēl vairāk.
Outputs:
	Viņa ir gatava izveidot uzbrucēja fotorobotu.
	Jā.
	Dažas no labākajām snaudām mūžā esmu izbaudījis Karaliskajā Šekspīra teātrī.
	Kas viņu iedrošinās vēl vairāk.
	
Input 15: 하지만 원전 추진에 관한 내부 문건이 분명히 더 있을 겁니다. 한국은 어떤가. 반면 글로벌 제약사들은 이미 콜드체인에 관해 큰 관심을 갖고 있었다.
Outputs:
	하지만 원전 추진에 관한 내부 문건이 분명히 더 있을 겁니다.
	한국은 어떤가.
	반면 글로벌 제약사들은 이미 콜드체인에 관해 큰 관심을 갖고 있었다.
	
Input 16: Ash Ketchum drømmer om eventyr. Ingen i Starfleet kunne gjøre dette. Vil du bli påkjørt? Det går så bra her.
Outputs:
	Ash Ketchum drømmer om eventyr.
	Ingen i Starfleet kunne gjøre dette.
	Vil du bli påkjørt?
	Det går så bra her.
	
Input 17: Ik heb een paar dagen nodig om de mannen weer te verzamelen. Heeft ze een interessant accent? We moeten deze kant op. Ik kon er niks aan doen.
Outputs:
	Ik heb een paar dagen nodig om de mannen weer te verzamelen.
	Heeft ze een interessant accent?
	We moeten deze kant op.
	Ik kon er niks aan doen.
	
Input 18: Potraktowaliśmy ją bardzo poważnie. Prace Larsa Vilksa są obecnie wystawiane na wystawie Sztuka polityczna w Centrum Sztuki Współczesnej Zamek Ujazdowski w Warszawie. Czy do uznania choroby zawodowej trzeba wykazać, że nigdzie poza pracą nie było się narażonym na szkodliwe czynniki?
Outputs:
	Potraktowaliśmy ją bardzo poważnie.
	Prace Larsa Vilksa są obecnie wystawiane na wystawie Sztuka polityczna w Centrum Sztuki Współczesnej Zamek Ujazdowski w Warszawie.
	Czy do uznania choroby zawodowej trzeba wykazać, że nigdzie poza pracą nie było się narażonym na szkodliwe czynniki?
	
Input 19: John Kennedy, titular em sua ausência, está suspenso com três cartões amarelos. Esses três pilares para a imigração, citados pela pesquisa do ESCOE, começaram a cambalear com a pandemia. As poucas oportunidades que teve no profissional do Defensor o fizeram reavaliar o futuro.
Outputs:
	John Kennedy, titular em sua ausência, está suspenso com três cartões amarelos.
	Esses três pilares para a imigração, citados pela pesquisa do ESCOE, começaram a cambalear com a pandemia.
	As poucas oportunidades que teve no profissional do Defensor o fizeram reavaliar o futuro.
	
Input 20: Они не знали, что я просто угараю, следует из композиции Элджея. Нас слишком мало, чтобы потерять сотни жизней из за злоупотребления алкоголем, когда мы выйдем из пандемии коронавируса. Катар начал предлагать туры для болельщиков на ЧМ 2022.
Outputs:
	Они не знали, что я просто угараю, следует из композиции Элджея.
	Нас слишком мало, чтобы потерять сотни жизней из за злоупотребления алкоголем, когда мы выйдем из пандемии коронавируса.
	Катар начал предлагать туры для болельщиков на ЧМ 2022.
Input 21: Yani konjonktürel şartlarla pozisyon alan NATO üyeleri kurumsal işbirliğine zarar veren bir dönemi geçirdi NATO. Daha sonra insanları sosyal medyadan bilgilendirdik. Devam eden şampiyonada ülkemizi temsil eden diğer sporculara başarılar diliyorum dedi.
Outputs:
	Yani konjonktürel şartlarla pozisyon alan NATO üyeleri kurumsal işbirliğine zarar veren bir dönemi geçirdi NATO.
	Daha sonra insanları sosyal medyadan bilgilendirdik.
	Devam eden şampiyonada ülkemizi temsil eden diğer sporculara başarılar diliyorum dedi.
	
Input 22: Öppna dina fläskiga, sympatiska armar? Vi spelar. 50 dollar per poäng. Jag menari gär, förraveckan.
Outputs:
	Öppna dina fläskiga, sympatiska armar?
	Vi spelar.
	50 dollar per poäng.
	Jag menari gär, förraveckan.
	
Input 23: Треба бути тепер готовими до наступного. Багато дітей опинилися ізольованими вдома, а час, який вони проводять у мережі, збільшився у рази. Це мала б супроводжувати дискусія у ЗМІ та суспільстві, аби було зрозуміло, чому саме це пропаганда, були наведені докази.
Outputs:
	Треба бути тепер готовими до наступного.
	Багато дітей опинилися ізольованими вдома, а час, який вони проводять у мережі, збільшився у рази.
	Це мала б супроводжувати дискусія у ЗМІ та суспільстві, аби було зрозуміло, чому саме це пропаганда, були наведені докази.
	
Input 24: 不断刷新的数字背后是满满民生情。但另两名非盟官员说,莫德纳要到明年才能交付疫苗,导致谈判破裂。温室气体等大气本底观测是一项专业性很强的工作,容不得一点马虎。
Outputs:
	不断刷新的数字背后是满满民生情。
	但另两名非盟官员说,莫德纳要到明年才能交付疫苗,导致谈判破裂。
	温室气体等大气本底观测是一项专业性很强的工作,容不得一点马虎。
	
Example with lots of acronyms:
Input 0: R.J. MacReady, helicopter pilot, U.S. Outpost Number 31. How will I get to L.A.? He's a U.S. Marshal. Let him go. Let him go. Let me see your license and I.D. Card.
Outputs:
	R.J. MacReady, helicopter pilot, U.S. Outpost Number 31.
	How will I get to L.A.?
	He's a U.S. Marshal.
	Let him go.
	Let him go.
	Let me see your license and I.D. Card.
	
Example with lower-cased inputs:
Input 0: r.j. macready, helicopter pilot, u.s. outpost number 31. how will i get to l.a.? he's a u.s. marshal. let him go. let him go. let me see your license and i.d. card.
Outputs:
	r.j. macready, helicopter pilot, u.s. outpost number 31.
	how will i get to l.a.?
	he's a u.s. marshal.
	let him go.
	let him go.
	let me see your license and i.d. card.
```

</details>


# Limitations and known issues
This is a prototype model and has some issues. These will be improved in a later version.

If you're interested in any particular aspect being improved, let me know for the next version.

## Limited vocabulary
This model covers 25 languages but uses a tokenizer with only 32k tokens. 

Chinese has a lot of out-of-vocabulary tokens, which will manifest as the unknown token's surface form appearing in the outputs for some Chinese texts.

This also results in longer-than-necessary sequences of short tokens, but that shouldn't be visible on the surface given that this is a very small, fast model.

## Noisy training data
This model was trained on `OpenSubtitles`, data which is notoriously noisy. The model may have learned some bad habits from this data.


## Language-specific expectations
As discussed in a previous section, each language should be formatted and punctuated per that language's rules.

E.g., Chinese text should contain full-width periods, not Latin periods, and contain no spaces.

In practice, data often does not adhere to these rules, but the model has not been augmented to deal with this potential issue.

## Metrics
It's difficult to properly evaluate this model, since we rely on the proposition that the input data contains exactly one sentence per line. 
In reality, the data sets used thus far are noisy and often contain more than one sentence per line.

Metrics are not published for now, and evaluation is limited to manual spot-checking.

Suitable test sets for this task are being investigated.