quote extraction & attribution on DirectQuote dataset with BERT-based token classification 💬
This repository stores the code to train, and run inference with, a DistilBERT model on the DirectQuote corpus (Zhang, et al. 2021).
directquote-variedStyles 💬 is a fine-tuned distilbert-base-cased model that performs token classification on a modified version of the DirectQuote dataset. It achieves the following results on the evaluation set:
- Loss: 0.2339
- Precision: 0.7440
- Recall: 0.9090
- F1: 0.8182
- Accuracy: 0.9355
Model description
directquote-variedStyles performs Quote Extraction and Attribution (QEA) on texts, enabling NLP applications to suitably process quotations in texts and corpora. Further implementations of QEA have been proposed in the realm of 'modular journalism' (See: 'Talking sense: using machine learning to understand quotes').
Intended uses & limitations
More information needed
Training and evaluation data
The DirectQuote dataset presented by Zhang, et al. (2021) is a corpus of 19,760 paragraphs containing 10,279 direct quotations. According to the authors, this manually annotated corpus is "the largest and most complete corpus focusing on direct quotations in news texts" [1].
# DirectQuote Distribution of Data Sources
| Region | Name | Numbers |
|-------------|-------------------------------------|-------------|
| U.S. | Associated Press | 438 |
| | Cable News Network | 627 |
| | American Broadcasting Company | 240 |
| | New York Times | 5,642 |
| | CBS Broadcasting | 4,890 |
| UK | British Broadcasting Corporation | 926 |
| | Reuters | 5,836 |
| | The Guardian | 4,302 |
| Canada | The Globe and Mail | 1,955 |
| | The Star | 13,769 |
| New Zealand | NZ Herald | 115 |
| Australia | Australian Broadcasting Corporation | 312 |
| | Sydney Morning Herald | 93 |
Quote extraction and attribution appears to be an underserved area of NLP. A small handful of systems do perform this task, most notably Stanford's CoreNLP model bundle [2]. Quote Extraction and Attribution (QEA) solutions generally fall into one of two broad categories: (1) rule-based systems that identify quotation marks and common verbiage associated with a quotation (see Textacy QEA [3]), or (2) probabilistic model-based approaches that typically rely on LSTMs and other neural network architectures.
Existing solutions in both categories lack the speed or accuracy of newer, transformer-based systems. CoreNLP does not support GPU-optimised inference. Rule-based systems such as Textacy are significantly faster but sorely lacking in precision: Textacy refused to process 28% of documents in a 1,000-document sample of the Whisp corpus. This issue is compounded by the wide range of quotation mark 'styles' available in Unicode; as shown below, there are well over a dozen distinct quotation marks.
Modifications to the DirectQuote Corpus
As per the CoreNLP documentation on quote extraction and attribution (QEA), there exists a multitude of varying quotation styles (12+), any of which may appear in texts ingested by Whisp. For the reasons outlined in the introduction, it is necessary to adapt the DirectQuote dataset to represent a wider range of quotation styles.
Considers regular ascii ("", '', ``'', and `') as well as "smart" and international quotation marks as follows: “”,‘’, «», ‹›, 「」, 『』, „”, and ‚’.
From CoreNLP Docs ~ Pipeline > Quote Extraction And Attribution
I have included 11 quotation 'sets' to replace/populate pre-existing quotation marks in the DirectQuote dataset. These styles include both ASCII and Unicode quotation marks, including a small variety of international styles, largely confined to those used by French- and German-speaking populations in Europe. Chinese-style quotation marks have not been included due to the limited overlap in publishing between Mandarin and English content.
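The restyling step described above could be sketched as follows. This is a minimal illustration and not the script used to build the modified corpus: the `QUOTE_STYLES` list, the `restyle_quotes` helper, and the assumption that quotation marks appear as standalone `"` tokens that alternate between opening and closing are all hypothetical. The 11 sets actually used are not enumerated in this document.

```python
import random

# Hypothetical style inventory: (opening, closing) pairs taken from the ASCII,
# "smart", French and German conventions mentioned above. This is NOT the
# author's list of 11 sets, which is not given in this document.
QUOTE_STYLES = [
    ('"', '"'),
    ("'", "'"),
    ("``", "''"),
    ("\u201c", "\u201d"),  # “ ”
    ("\u2018", "\u2019"),  # ‘ ’
    ("\u00ab", "\u00bb"),  # « »
    ("\u2039", "\u203a"),  # ‹ ›
    ("\u201e", "\u201d"),  # „ ”
    ("\u201a", "\u2019"),  # ‚ ’
]


def restyle_quotes(tokens, style, quote_token='"'):
    """Replace each `quote_token` with the opening or closing mark of `style`.

    Quote tokens are assumed to alternate open/close within a paragraph.
    The number of tokens is unchanged, so per-token labels stay aligned.
    """
    opening, closing = style
    restyled = []
    inside_quote = False  # True while we are between an opening and closing mark
    for token in tokens:
        if token == quote_token:
            restyled.append(closing if inside_quote else opening)
            inside_quote = not inside_quote
        else:
            restyled.append(token)
    return restyled


rng = random.Random(42)  # fixed seed so the chosen style is reproducible
tokens = ['"', "We", "will", "appeal", ",", '"', "she", "said", "."]
print(restyle_quotes(tokens, rng.choice(QUOTE_STYLES)))
```

Working at the token level rather than on raw strings keeps the existing token labels valid, because each quotation mark is swapped one-for-one.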
Training procedure
Token Labels
The DirectQuote corpus provides the following 5 labels, following the IOB1 format:
- LeftSpeaker: quotation; the corresponding speaker is in the preceding text
- RightSpeaker: quotation; the corresponding speaker is in the following text
- Unknown: quotation; no corresponding speaker
- Speaker: speaker
- Out: N/A
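To make the labelling scheme concrete, the sketch below groups a sequence of IOB1 tags into labelled spans. In IOB1, a span normally begins with an `I-` tag, and `B-` is used only when a span directly follows another span of the same label. The exact tag strings (`I-RightSpeaker`, `O`, and so on), the choice to label the quotation marks as part of the quotation, and the `iob1_to_spans` helper are assumptions for illustration, not taken from the corpus or the training code.

```python
def iob1_to_spans(tags, outside="O"):
    """Group IOB1 tags into (label, start, end) spans; `end` is exclusive."""
    spans = []
    current_label, start = None, None
    for i, tag in enumerate(tags):
        if tag == outside:
            prefix, label = None, None
        else:
            prefix, label = tag.split("-", 1)
        # A span ends on an outside tag, a label change, or an explicit B- tag.
        if current_label is not None and (label != current_label or prefix == "B"):
            spans.append((current_label, start, i))
            current_label = None
        # Open a new span if this token is labelled and no span is open.
        if label is not None and current_label is None:
            current_label, start = label, i
    if current_label is not None:
        spans.append((current_label, start, len(tags)))
    return spans


tokens = ['"', "I", "will", "resign", ",", '"', "said", "Smith", "."]
tags = ["I-RightSpeaker"] * 6 + ["O", "I-Speaker", "O"]
for label, start, end in iob1_to_spans(tags):
    print(label, tokens[start:end])
```

Here the quotation is tagged RightSpeaker because its speaker ("Smith") appears in the text that follows it.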
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
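As a rough illustration of what the `linear` scheduler implies for these settings, the sketch below computes the learning rate at each epoch boundary. It assumes no warmup (none is listed above) and 4,880 optimizer steps in total, which is the final step count reported in the training results (976 steps per epoch over 5 epochs). `linear_lr` is a hypothetical helper written for this example, not a Transformers function.

```python
def linear_lr(step, base_lr=2e-5, total_steps=4880, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start of training and at the end of each epoch.
for epoch in range(6):
    step = epoch * 976
    print(f"epoch {epoch}: step {step:4d}, lr {linear_lr(step):.2e}")
```

Under these assumptions the rate falls by the same amount each epoch and reaches zero at the final step.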
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.3392 | 1.0 | 976 | 0.2050 | 0.7040 | 0.8297 | 0.7617 | 0.9327 |
| 0.1915 | 2.0 | 1952 | 0.1996 | 0.7417 | 0.8990 | 0.8128 | 0.9337 |
| 0.1668 | 3.0 | 2928 | 0.2023 | 0.7373 | 0.9066 | 0.8132 | 0.9369 |
| 0.1447 | 4.0 | 3904 | 0.2125 | 0.7458 | 0.9107 | 0.8200 | 0.9367 |
| 0.1248 | 5.0 | 4880 | 0.2339 | 0.7440 | 0.9090 | 0.8182 | 0.9355 |
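The F1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean, 2PR / (P + R). The snippet below is a small sanity check written for this README, not part of the training code. Because the table values are rounded to four decimals, agreement is only expected to within rounding.

```python
# (epoch, precision, recall, reported F1) copied from the table above.
rows = [
    (1, 0.7040, 0.8297, 0.7617),
    (2, 0.7417, 0.8990, 0.8128),
    (3, 0.7373, 0.9066, 0.8132),
    (4, 0.7458, 0.9107, 0.8200),
    (5, 0.7440, 0.9090, 0.8182),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for epoch, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"epoch {epoch}: computed F1 {computed:.4f}, reported {reported:.4f}")
```

In the table, validation loss is lowest after epoch 2 and F1 is highest after epoch 4, while the headline figures at the top of this card correspond to the final (epoch 5) checkpoint.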
Framework versions
- Transformers 4.25.1
- Pytorch 1.10.2+cu113
- Datasets 2.8.0
- Tokenizers 0.13.2
References
[1] Zhang, Y., & Liu, Y. (2021, October 15). DirectQuote: A dataset for direct quotation extraction and attribution in news articles. arXiv. https://arxiv.org/abs/2110.07827
[2] Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
[3] Chartbeat Labs, & DeWilde, B. (2016, February). Information Extraction. Textacy ~ NLP, before and after spaCy. https://textacy.readthedocs.io/en/latest/api_reference/extract.html#textacy.extract.triples.direct_quotations