---
language:
  - de
---

# Scene Segmenter for the Shared Task on Scene Segmentation

This is the scene segmenter model used in LLpro. On each border between two sentences, it predicts one of the following labels:

- `Scene-B`: the preceding sentence begins a new Scene.
- `Nonscene-B`: the preceding sentence begins a new Non-Scene.
- `Scene`: the preceding sentence belongs to a Scene, but does not begin a new one; the scene continues.
- `Nonscene`: the preceding sentence belongs to a Non-Scene, but does not begin a new one; the non-scene continues.
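
The mapping between class indices and these labels ships with the checkpoint and can be inspected from the model config. A minimal sketch (not part of the original card; the index order is whatever the checkpoint defines):

```python
from transformers import AutoConfig

# id2label maps the class indices of the classification head
# to the four labels listed above
config = AutoConfig.from_pretrained('aehrm/stss-scene-segmenter')
print(config.id2label)
```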

Broadly speaking, the model is used in a token classification setup. A sequence of multiple sentences is represented by interspersing the respective tokenizations with the special `[SEP]` token. On these `[SEP]` tokens, a linear classification layer predicts one of the four classes above.
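
To illustrate the input layout, here is a minimal sketch (not part of the original card) of how the joined sequence looks after tokenization; the exact subword splits depend on the GBERT vocabulary:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('aehrm/stss-scene-segmenter')
sentences = ['Erster Satz.', 'Zweiter Satz.']

# joining with ' [SEP] ' intersperses the special token between sentences;
# the tokenizer adds [CLS] at the start and one final [SEP] at the end, so
# every sentence is followed by exactly one [SEP] token
encoding = tokenizer(' [SEP] '.join(sentences))
print(tokenizer.convert_ids_to_tokens(encoding.input_ids))
# roughly: ['[CLS]', <tokens of sentence 1>, '[SEP]', <tokens of sentence 2>, '[SEP]']
```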

The model was trained on the dataset of the KONVENS 2021 Shared Task on Scene Segmentation (Zehe et al., 2021) by fine-tuning the domain-adapted lkonle/fiction-gbert-large. (Training code)

F1-Score:

- 40.22 on Track 1 (in-domain dime novels)
- 35.09 on Track 2 (out-of-domain highbrow novels)

The respective test datasets are available only to the task organizers; they evaluated this model on their private test set and report the scores above. See the KONVENS paper for a description of their metric.


## Demo Usage

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained('aehrm/stss-scene-segmenter')
model = BertForTokenClassification.from_pretrained('aehrm/stss-scene-segmenter', sep_token_id=tokenizer.sep_token_id).eval()


sentences = ['Und so begann unser kleines Abenteuer auf Hoher See...', 'Es war früh am Morgen, als wir in See stechen wollten.', 'Das Wasser war still.']
inputs = tokenizer(' [SEP] '.join(sentences), return_tensors='pt')

# inference on the model
with torch.no_grad():
    logits = model(**inputs).logits

# keep only the logits corresponding to the [SEP] tokens
relevant_logits = logits[inputs.input_ids == tokenizer.sep_token_id]

predicted_ids = relevant_logits.argmax(dim=1).numpy()
predicted_labels = [ model.config.id2label[x] for x in predicted_ids ]

# print the associated prediction for each sentence / [SEP] token
for label, sent in zip(predicted_labels, sentences):
    print(label, sent)
# >>> Scene Und so begann unser kleines Abenteuer auf Hoher See...
# >>> Scene-B Es war früh am Morgen, als wir in See stechen wollten.   (This sentence begins a new scene.)
# >>> Scene Das Wasser war still.

# alternatively, decode the bridge type between consecutive sentences; a
# '*-B' label marks a border directly before the respective sentence
prev = None
for label, sent in zip(predicted_labels, sentences):
    # reduce the previous label to its segment type, so that e.g. a preceding
    # 'Scene-B' also counts as being inside a Scene
    prev_type = prev.replace('-B', '') if prev is not None else None
    if prev_type == 'Scene' and label == 'Scene-B':
        bridge = 'SCENE-TO-SCENE'
    elif prev_type == 'Scene' and label == 'Nonscene-B':
        bridge = 'SCENE-TO-NONSCENE'
    elif prev_type == 'Nonscene' and label == 'Scene-B':
        bridge = 'NONSCENE-TO-SCENE'
    else:
        bridge = 'NOBORDER'

    if prev is not None:
        print(bridge)
    print(sent)
    prev = label
# >>> Und so begann unser kleines Abenteuer auf Hoher See...
# >>> SCENE-TO-SCENE
# >>> Es war früh am Morgen, als wir in See stechen wollten.
# >>> NOBORDER
# >>> Das Wasser war still.
```
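
Note that BERT-based models accept at most 512 input tokens, so a whole novel does not fit into one forward pass. One practical workaround (an assumption on my part, not something this model card prescribes) is a sliding window over the sentence list, keeping the first prediction made for each sentence:

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained('aehrm/stss-scene-segmenter')
model = BertForTokenClassification.from_pretrained('aehrm/stss-scene-segmenter').eval()

def predict_labels(sents):
    # same procedure as the demo above: one prediction per [SEP], i.e. per sentence;
    # truncation guards against overlong windows (trailing sentences may be cut)
    inputs = tokenizer(' [SEP] '.join(sents), return_tensors='pt', truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    relevant_logits = logits[inputs.input_ids == tokenizer.sep_token_id]
    return [model.config.id2label[int(i)] for i in relevant_logits.argmax(dim=1)]

# in practice: the full novel, one string per sentence
all_sentences = ['Und so begann unser kleines Abenteuer auf Hoher See...',
                 'Es war früh am Morgen, als wir in See stechen wollten.',
                 'Das Wasser war still.']

window, stride = 8, 4  # hypothetical values; choose them so a window stays well below 512 tokens

labels = {}
for start in range(0, len(all_sentences), stride):
    chunk = all_sentences[start:start + window]
    for offset, label in enumerate(predict_labels(chunk)):
        labels.setdefault(start + offset, label)  # keep the first prediction per sentence
```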

## Cite

Please cite the following paper when using this model.

```bibtex
@inproceedings{ehrmanntraut-et-al-llpro-2023,
    location = {Ingolstadt, Germany},
    title = {{LLpro}: A Literary Language Processing Pipeline for {German} Narrative Text},
    booktitle = {Proceedings of the 19th Conference on Natural Language Processing ({KONVENS} 2023)},
    publisher = {{KONVENS} 2023 Organizers},
    author = {Ehrmanntraut, Anton and Konle, Leonard and Jannidis, Fotis},
    date = {2023},
}
```