---
language:
- de
---

# Scene Segmenter for the Shared Task on Scene Segmentation

This is the scene segmenter model used in [LLpro](https://github.com/cophi-wue/LLpro). On borders between sentences, it predicts one of the following labels:

- `Scene-B`: the preceding sentence began a new *Scene*.
- `Nonscene-B`: the preceding sentence began a new *Non-Scene*.
- `Scene`: the preceding sentence belongs to a *Scene* but does not begin a new one, i.e., the scene continues.
- `Nonscene`: the preceding sentence belongs to a *Non-Scene* but does not begin a new one, i.e., the non-scene continues.
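
The exact label inventory is stored in the model config and can be inspected directly; a quick check (the printed mapping below is illustrative, the actual id order may differ):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('aehrm/stss-scene-segmenter')
print(config.id2label)
# e.g. {0: 'Nonscene', 1: 'Nonscene-B', 2: 'Scene', 3: 'Scene-B'}
```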

Broadly speaking, the model is used in a token classification setup: a sequence of sentences is represented by interspersing their tokenizations with the special `[SEP]` token, and on these `[SEP]` tokens a linear classification layer predicts one of the four classes above.
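
For illustration, a minimal sketch of this input layout (the example sentences are made up; the tokenizer itself adds `[CLS]` and the final `[SEP]`):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('aehrm/stss-scene-segmenter')

# joining with ' [SEP] ' places one separator after each sentence except the
# last; the tokenizer appends the final [SEP] on its own
sentences = ['Erster Satz.', 'Zweiter Satz.']
inputs = tokenizer(' [SEP] '.join(sentences))

print(tokenizer.convert_ids_to_tokens(inputs.input_ids))
# e.g. ['[CLS]', 'Erster', 'Satz', '.', '[SEP]', 'Zweiter', 'Satz', '.', '[SEP]']
# (exact subword splits may vary); each sentence ends up followed by exactly
# one [SEP], which is where the classifier predicts its label
```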

The model was trained on the dataset of the [KONVENS 2021 Shared Task on Scene Segmentation](http://lsx-events.informatik.uni-wuerzburg.de/stss-2021/task.html) ([Zehe et al., 2021](http://ceur-ws.org/Vol-3001/#paper1)), fine-tuning the domain-adapted [lkonle/fiction-gbert-large](https://huggingface.co/lkonle/fiction-gbert-large). ([Training code](https://github.com/cophi-wue/LLpro/blob/main/contrib/train_scene_segmenter.py))

F1-Score:

- **40.22** on Track 1 (in-domain dime novels)
- **35.09** on Track 2 (out-of-domain highbrow novels)

The respective test datasets are available only to the task organizers; they evaluated this model on their private test set and reported the scores above. See the [shared task paper](http://ceur-ws.org/Vol-3001/#paper1) for a description of their metric.

---

**Demo Usage**:

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained('aehrm/stss-scene-segmenter')
model = BertForTokenClassification.from_pretrained('aehrm/stss-scene-segmenter', sep_token_id=tokenizer.sep_token_id).eval()

sentences = ['Und so begann unser kleines Abenteuer auf Hoher See...', 'Es war früh am Morgen, als wir in See stechen wollten.', 'Das Wasser war still.']
inputs = tokenizer(' [SEP] '.join(sentences), return_tensors='pt')

# inference on the model
with torch.no_grad():
    logits = model(**inputs).logits

# keep only the logits at the [SEP] token positions, one per sentence
relevant_logits = logits[inputs.input_ids == tokenizer.sep_token_id]

predicted_ids = relevant_logits.argmax(dim=1).numpy()
predicted_labels = [model.config.id2label[x] for x in predicted_ids]

# print the associated prediction for each sentence / [SEP] token
for label, sent in zip(predicted_labels, sentences):
    print(label, sent)
# >>> Scene Und so begann unser kleines Abenteuer auf Hoher See...
# >>> Scene-B Es war früh am Morgen, als wir in See stechen wollten. (This sentence begins a new scene.)
# >>> Scene Das Wasser war still.

# alternatively, decode the bridge type at each sentence border
prev = None
for label, sent in zip(predicted_labels, sentences):
    if prev is not None:
        # prefix match on the previous label so that e.g. 'Scene-B' also
        # counts as a scene sentence
        bridge = 'NOBORDER'
        if prev.startswith('Scene') and label == 'Scene-B':
            bridge = 'SCENE-TO-SCENE'
        elif prev.startswith('Scene') and label == 'Nonscene-B':
            bridge = 'SCENE-TO-NONSCENE'
        elif prev.startswith('Nonscene') and label == 'Scene-B':
            bridge = 'NONSCENE-TO-SCENE'
        print(bridge)
    print(sent)
    prev = label
# >>> Und so begann unser kleines Abenteuer auf Hoher See...
# >>> SCENE-TO-SCENE
# >>> Es war früh am Morgen, als wir in See stechen wollten.
# >>> NOBORDER
# >>> Das Wasser war still.
```
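
To go from per-sentence labels to contiguous scene/non-scene spans, open a new span at every `-B` label. A minimal sketch reusing `predicted_labels` and `sentences` from the demo above (the helper `group_scenes` is not part of the model or LLpro, just an illustration):

```python
def group_scenes(labels, sentences):
    """Group sentences into (segment_type, sentences) spans; a '-B' label opens a new span."""
    spans = []
    for label, sent in zip(labels, sentences):
        segment_type = 'Scene' if label.startswith('Scene') else 'Nonscene'
        if label.endswith('-B') or not spans:
            spans.append((segment_type, [sent]))
        else:
            spans[-1][1].append(sent)
    return spans

for segment_type, sents in group_scenes(predicted_labels, sentences):
    print(segment_type, '::', ' '.join(sents))
# >>> Scene :: Und so begann unser kleines Abenteuer auf Hoher See...
# >>> Scene :: Es war früh am Morgen, als wir in See stechen wollten. Das Wasser war still.
```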

**Cite**:

Please cite the following paper when using this model.

```bibtex
@inproceedings{ehrmanntraut-et-al-llpro-2023,
  location = {Ingolstadt, Germany},
  title = {{LLpro}: A Literary Language Processing Pipeline for {German} Narrative Text},
  booktitle = {Proceedings of the 19th Conference on Natural Language Processing ({KONVENS} 2023)},
  publisher = {{KONVENS} 2023 Organizers},
  author = {Ehrmanntraut, Anton and Konle, Leonard and Jannidis, Fotis},
  date = {2023},
}
```