---
language:
- de
---
# Scene Segmenter for the Shared Task on Scene Segmentation

This is the scene segmenter model used in [LLpro](https://github.com/cophi-wue/LLpro). At each sentence boundary, it predicts one of the following labels:
- `Scene-B`: the preceding sentence began a new *Scene*.
- `Nonscene-B`: the preceding sentence began a new *Non-Scene*.
- `Scene`: the preceding sentence belongs to a *Scene*, but does not begin a new one, i.e., the scene continues.
- `Nonscene`: the preceding sentence belongs to a *Non-Scene*, but does not begin a new one, i.e., the non-scene continues.
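
Given such a label sequence, the scene structure of a text can be recovered by grouping consecutive sentences into segments. A minimal sketch (the helper `labels_to_segments` is hypothetical and not part of the model or LLpro):

```python
# Hypothetical helper: group a predicted label sequence into contiguous
# (segment type, sentences) pairs. A '…-B' label opens a new segment;
# a plain 'Scene'/'Nonscene' label continues the current one.
def labels_to_segments(labels, sentences):
    segments = []
    for label, sent in zip(labels, sentences):
        seg_type = 'Scene' if label.startswith('Scene') else 'Nonscene'
        if label.endswith('-B') or not segments or segments[-1][0] != seg_type:
            segments.append((seg_type, [sent]))
        else:
            segments[-1][1].append(sent)
    return segments

print(labels_to_segments(['Scene', 'Scene-B', 'Scene'], ['s1', 's2', 's3']))
# → [('Scene', ['s1']), ('Scene', ['s2', 's3'])]
```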

Broadly speaking, the model is used in a token classification setup: a sequence of sentences is represented by interspersing the respective tokenizations with the special `[SEP]` token.
On these `[SEP]` tokens, a linear classification layer predicts one of the four classes above.
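
As a plain-string sketch of that input layout (no model required): the sentences are joined with the literal string `[SEP]`, and the BERT tokenizer appends one further `[SEP]` at the end of the input, so that each sentence is followed by exactly one `[SEP]` position at which the classifier head is read out.

```python
# Sketch: the input string for three sentences, before tokenization.
sentences = ['Satz eins.', 'Satz zwei.', 'Satz drei.']
text = ' [SEP] '.join(sentences)
print(text)
# → Satz eins. [SEP] Satz zwei. [SEP] Satz drei.

# The joined string contains len(sentences) - 1 separators; the tokenizer
# adds one final [SEP], yielding one [SEP] token per sentence.
print(text.count('[SEP]'))  # → 2
```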

The model was trained on the dataset of the [KONVENS 2021 Shared Task on Scene Segmentation](http://lsx-events.informatik.uni-wuerzburg.de/stss-2021/task.html) ([Zehe et al., 2021](http://ceur-ws.org/Vol-3001/#paper1)) by fine-tuning the domain-adapted [lkonle/fiction-gbert-large](https://huggingface.co/lkonle/fiction-gbert-large). ([Training code](https://github.com/cophi-wue/LLpro/blob/main/contrib/train_scene_segmenter.py))

F1 scores:
- **40.22** on Track 1 (in-domain dime novels)
- **35.09** on Track 2 (out-of-domain highbrow novels)

The respective test datasets are available only to the task organizers, who evaluated this model on their private test set and reported the scores above. See the [KONVENS paper](http://ceur-ws.org/Vol-3001/#paper1) for a description of their metric.

---

**Demo Usage**:

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained('aehrm/stss-scene-segmenter')
model = BertForTokenClassification.from_pretrained('aehrm/stss-scene-segmenter', sep_token_id=tokenizer.sep_token_id).eval()


sentences = ['Und so begann unser kleines Abenteuer auf Hoher See...', 'Es war früh am Morgen, als wir in See stechen wollten.', 'Das Wasser war still.']
inputs = tokenizer(' [SEP] '.join(sentences), return_tensors='pt')

# inference on the model
with torch.no_grad():
    logits = model(**inputs).logits

# concentrate on the logits corresponding to the [SEP] tokens
relevant_logits = logits[inputs.input_ids == tokenizer.sep_token_id]

predicted_ids = relevant_logits.argmax(axis=1).numpy()
predicted_labels = [ model.config.id2label[x] for x in predicted_ids ]

# print the associated prediction for each sentence / [SEP] token
for label, sent in zip(predicted_labels, sentences):
    print(label, sent)
# >>> Scene Und so begann unser kleines Abenteuer auf Hoher See...
# >>> Scene-B Es war früh am Morgen, als wir in See stechen wollten.   (This sentence begins a new scene.)
# >>> Scene Das Wasser war still.

# alternatively, decode the respective bridge type
prev = None
for label, sent in zip(predicted_labels, sentences):
    # note: prev may itself be a '-B' label when two segments start back-to-back
    if prev in ('Scene', 'Scene-B') and label == 'Scene-B':
        bridge = 'SCENE-TO-SCENE'
    elif prev in ('Scene', 'Scene-B') and label == 'Nonscene-B':
        bridge = 'SCENE-TO-NONSCENE'
    elif prev in ('Nonscene', 'Nonscene-B') and label == 'Scene-B':
        bridge = 'NONSCENE-TO-SCENE'
    else:
        bridge = 'NOBORDER'

    if prev is not None:
        print(bridge)
    print(sent)
    prev = label
# >>> Und so begann unser kleines Abenteuer auf Hoher See...
# >>> SCENE-TO-SCENE
# >>> Es war früh am Morgen, als wir in See stechen wollten.
# >>> NOBORDER
# >>> Das Wasser war still.
```
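
The demo above fits comfortably within BERT's 512-token input limit; whole novels do not. One simple approach is to split the sentence list into consecutive windows and run the model window by window. A minimal sketch (the helper `sentence_windows` is hypothetical; the window size in sentences would need tuning so that each window stays under the token limit):

```python
# Hypothetical helper: yield consecutive, non-overlapping windows of
# sentences, each of which can be joined with ' [SEP] ' and fed to the
# model as in the demo above.
def sentence_windows(sentences, window_size=10, stride=10):
    for start in range(0, len(sentences), stride):
        yield sentences[start:start + window_size]

chunks = list(sentence_windows([f'Satz {i}.' for i in range(25)], 10, 10))
print([len(c) for c in chunks])  # → [10, 10, 5]
```

With an overlapping stride (`stride < window_size`), each boundary would be seen with more context, at the cost of having to reconcile duplicate predictions.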


**Cite**:

Please cite the following paper when using this model.

```
@inproceedings{ehrmanntraut-et-al-llpro-2023,
	location = {Ingolstadt, Germany},
	title = {{LLpro}: A Literary Language Processing Pipeline for {German} Narrative Text},
	booktitle = {Proceedings of the 19th Conference on Natural Language Processing ({KONVENS} 2023)},
	publisher = {{KONVENS} 2023 Organizers},
	author = {Ehrmanntraut, Anton and Konle, Leonard and Jannidis, Fotis},
	date = {2023},
}
```