Quote extraction & attribution on the DirectQuote dataset with BERT-based token classification 💬

This repository stores the code to train and run inference with a DistilBERT model on the DirectQuote corpus (Zhang & Liu, 2021).

directquote-variedStyles 💬 is a fine-tuned distilbert-base-cased model that performs token classification on a modified version of the DirectQuote dataset. It achieves the following results on the evaluation set:

Loss: 0.2339
Precision: 0.7440
Recall: 0.9090
F1: 0.8182
Accuracy: 0.9355

Model description

directquote-variedStyles performs Quote Extraction and Attribution (QEA) on texts, enabling NLP applications to suitably process quotations in texts and corpora. Further implementations of QEA have been proposed in the realm of 'modular journalism' (See: 'Talking sense: using machine learning to understand quotes').
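
A minimal usage sketch follows. It assumes the model is published on the Hugging Face Hub under the id `whispAI/DirectQuote-SentLevel-DistilBERT` and that the predicted label names match those listed under Training procedure. `load_classifier` and `tag_spans` are illustrative helper names, not part of this repository.

```python
from typing import Any, Callable, Dict, List

# Assumed Hub id; adjust if the model is stored elsewhere.
MODEL_ID = "whispAI/DirectQuote-SentLevel-DistilBERT"


def load_classifier(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the weights on first use)."""
    from transformers import pipeline  # lazy import: only needed for real inference

    # aggregation_strategy="simple" merges adjacent word pieces that share a label.
    return pipeline(
        "token-classification", model=model_id, aggregation_strategy="simple"
    )


def tag_spans(
    text: str, classifier: Callable[[str], List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    """Run the classifier and keep only quotation and speaker spans."""
    spans = []
    for item in classifier(text):
        label = item["entity_group"]
        if label in ("O", "Out"):  # assumed names for the "outside" label
            continue
        spans.append(
            {
                "label": label,
                "text": text[item["start"]:item["end"]],
                "start": item["start"],
                "end": item["end"],
            }
        )
    return spans
```

With `transformers` installed, `tag_spans(text, load_classifier())` should return one dictionary per predicted quotation or speaker span, each with its label and character offsets.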

Intended uses & limitations

More information needed

Training and evaluation data

The DirectQuote dataset presented by Zhang & Liu (2021) is a corpus of 19,760 paragraphs containing 10,279 direct quotations. According to the authors, this manually annotated corpus is "the largest and most complete corpus focusing on direct quotations in news texts" [1].

# DirectQuote Distribution of Data Sources
| Region      | Name                                | Numbers     |
|-------------|-------------------------------------|-------------|
| U.S.        | Associated Press                    | 438         |
|             | Cable News Network                  | 627         |
|             | American Broadcasting Company       | 240         |
|             | New York Times                      | 5,642       |
|             | CBS Broadcasting                    | 4,890       |
| UK          | British Broadcasting Corporation    | 926         |
|             | Reuters                             | 5,836       |
|             | The Guardian                        | 4,302       |
| Canada      | The Globe and Mail                  | 1,955       |
|             | The Star                            | 13,769      |
| New Zealand | NZ Herald                           | 115         |
| Australia   | Australian Broadcasting Corporation | 312         |
|             | Sydney Morning Herald               | 93          |

Quote extraction and attribution (QEA) appears to be an underserved area of NLP. A small handful of systems perform this task, most notably Stanford's CoreNLP model bundle [2]. QEA solutions generally fall into one of two broad categories: (1) rule-based systems that identify quotation marks and the wording that commonly accompanies a quotation (see Textacy QEA [3]), or (2) probabilistic model-based approaches that typically rely on LSTMs and other neural network architectures.

Existing solutions in both categories lack the speed and accuracy of newer, transformer-based systems. CoreNLP does not support GPU-optimised inference. Rule-based systems, such as Textacy, are significantly faster but much less reliable (Textacy failed to process 28% of documents in a 1,000-document sample of the Whisp corpus). This issue is compounded by the wide range of quotation mark 'styles' available in Unicode: as listed below, there are well over a dozen distinct quotation marks.

Modifications to the DirectQuote Corpus

As per the CoreNLP documentation on quote extraction and attribution (QEA), there exists a multitude of varying quotation styles (12+), any of which may appear in texts ingested by Whisp. For the reasons outlined in the introduction, it is necessary to adapt the DirectQuote dataset to represent a wider range of quotation styles.

Considers regular ascii ("", '', ``'', and `') as well as "smart" and international quotation marks as follows: “”, ‘’, «», ‹›, 「」, 『』, „”, and ‚’.

From CoreNLP Docs ~ Pipeline > Quote Extraction And Attribution

I have included 11 quotation 'sets' to replace the pre-existing quotation marks in the DirectQuote dataset. These sets cover both ASCII and Unicode quotation marks, including a small variety of international styles, largely confined to those used by French- and German-speaking populations in Europe. Chinese-style quotation marks have not been included due to the limited overlap in publishing between Mandarin and English content.
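
The replacement step could be sketched as follows. This is an illustration under assumptions, not the script used to build the modified corpus: the `QUOTE_STYLES` list is a hypothetical subset rather than the exact 11 sets, and the function assumes each quotation mark in the source data is a single token, so that swapping it leaves the token labels aligned.

```python
import random
from typing import List, Tuple

# Hypothetical (opening, closing) pairs; NOT the exact 11 sets used for the corpus.
QUOTE_STYLES: List[Tuple[str, str]] = [
    ('"', '"'),
    ("``", "''"),
    ("\u201c", "\u201d"),  # English curly double quotes
    ("\u2018", "\u2019"),  # English curly single quotes
    ("\u00ab", "\u00bb"),  # French guillemets
    ("\u201e", "\u201d"),  # German low-high double quotes
]

OPENERS = {"\u201c", "``"}   # tokens that always open a quotation
CLOSERS = {"\u201d", "''"}   # tokens that always close a quotation
AMBIGUOUS = {'"'}            # straight quotes: open or close depending on state


def restyle_tokens(tokens: List[str], style: Tuple[str, str]) -> List[str]:
    """Swap every quotation-mark token for the chosen style, one token for one."""
    open_q, close_q = style
    restyled = []
    inside = False  # tracks whether we are currently inside a quotation
    for tok in tokens:
        if tok in OPENERS or (tok in AMBIGUOUS and not inside):
            restyled.append(open_q)
            inside = True
        elif tok in CLOSERS or (tok in AMBIGUOUS and inside):
            restyled.append(close_q)
            inside = False
        else:
            restyled.append(tok)
    return restyled


def random_restyle(tokens: List[str], rng: random.Random) -> List[str]:
    """Pick one style at random and apply it to the whole token sequence."""
    return restyle_tokens(tokens, rng.choice(QUOTE_STYLES))
```

Because the output has the same length as the input, the existing per-token labels can be reused unchanged.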

Training procedure

Token Labels

The DirectQuote corpus provides the following five labels in the IOB1 format:

LeftSpeaker — Quotation, the corresponding speaker is in the preceding text
RightSpeaker — Quotation, the corresponding speaker is in the following text
Unknown — Quotation, no corresponding speaker
Speaker — Speaker
Out — N/A
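
To illustrate how these labels can be turned into quote-speaker pairs, here is a small sketch. It assumes tag strings of the form `I-LeftSpeaker` or `B-Speaker`, with `O` or `Out` for the outside label, and it attributes each quotation to the nearest speaker span in the indicated direction. Both the tag spellings and the nearest-speaker rule are assumptions made for the example, not details confirmed by the corpus.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, start_token, end_token_exclusive)
OUTSIDE = {"O", "Out"}       # assumed spellings of the "outside" tag


def iob1_spans(tags: List[str]) -> List[Span]:
    """Group IOB1 tags into labelled token spans."""
    spans: List[Span] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags):
        if tag in OUTSIDE:
            new_label, begins = None, False
        else:
            prefix, new_label = tag.split("-", 1)
            # In IOB1, "B-" only marks a chunk that directly follows another
            # chunk of the same type; "I-" can also open a chunk.
            begins = prefix == "B" or new_label != label
        if label is not None and (new_label is None or begins):
            spans.append((label, start, i))
            label = None
        if new_label is not None and label is None:
            label, start = new_label, i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


def attribute(spans: List[Span]) -> List[Tuple[Tuple[int, int], Optional[Tuple[int, int]]]]:
    """Pair each quotation span with a speaker span (or None)."""
    speakers = [s for s in spans if s[0] == "Speaker"]
    pairs = []
    for label, start, end in spans:
        if label == "LeftSpeaker":     # speaker precedes the quotation
            before = [s for s in speakers if s[2] <= start]
            speaker = max(before, key=lambda s: s[2], default=None)
        elif label == "RightSpeaker":  # speaker follows the quotation
            after = [s for s in speakers if s[1] >= end]
            speaker = min(after, key=lambda s: s[1], default=None)
        elif label == "Unknown":       # quotation without a speaker
            speaker = None
        else:
            continue                   # speaker spans are not quotations
        pairs.append(((start, end), None if speaker is None else (speaker[1], speaker[2])))
    return pairs
```

For example, the tag sequence for `Smith said , “ Hello there ” .` with the speaker on the left yields a single pair linking the quotation tokens to the `Smith` token.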

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 5
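
As a rough illustration of the linear schedule, the sketch below computes the learning rate at a given optimizer step. It assumes zero warmup steps (the warmup setting is not stated above) and 4,880 total steps, the final step count reported in the training results. `linear_lr` is a hypothetical helper, not a function from this repository.

```python
def linear_lr(
    step: int,
    total_steps: int = 4880,   # 5 epochs x 976 steps, per the training results
    base_lr: float = 2e-05,
    warmup_steps: int = 0,     # assumption: no warmup
) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    # Print the rate at the end of each epoch.
    for epoch in range(6):
        print(epoch, linear_lr(epoch * 976))
```

Under these assumptions the rate falls from 2e-05 at step 0 to half that value midway through training and reaches zero at the final step.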

Training results

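| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|---------------|-------|------|-----------------|-----------|--------|--------|----------|
| 0.3392        | 1.0   | 976  | 0.2050          | 0.7040    | 0.8297 | 0.7617 | 0.9327   |
| 0.1915        | 2.0   | 1952 | 0.1996          | 0.7417    | 0.8990 | 0.8128 | 0.9337   |
| 0.1668        | 3.0   | 2928 | 0.2023          | 0.7373    | 0.9066 | 0.8132 | 0.9369   |
| 0.1447        | 4.0   | 3904 | 0.2125          | 0.7458    | 0.9107 | 0.8200 | 0.9367   |
| 0.1248        | 5.0   | 4880 | 0.2339          | 0.7440    | 0.9090 | 0.8182 | 0.9355   |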
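
The F1 column can be cross-checked against precision and recall with the usual harmonic-mean formula. The short sketch below recomputes it from the reported (rounded) values, so small differences in the last digit are expected; `ROWS` simply restates the training results.

```python
# (epoch, precision, recall, reported F1), copied from the training results.
ROWS = [
    (1, 0.7040, 0.8297, 0.7617),
    (2, 0.7417, 0.8990, 0.8128),
    (3, 0.7373, 0.9066, 0.8132),
    (4, 0.7458, 0.9107, 0.8200),
    (5, 0.7440, 0.9090, 0.8182),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for epoch, p, r, reported in ROWS:
        print(f"epoch {epoch}: recomputed {f1_score(p, r):.4f}, reported {reported:.4f}")
```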

Framework versions

Transformers 4.25.1
Pytorch 1.10.2+cu113
Datasets 2.8.0
Tokenizers 0.13.2

References

[1] Zhang, Y., & Liu, Y. (2021, October 15). DirectQuote: A dataset for direct quotation extraction and attribution in news articles. arXiv. https://arxiv.org/abs/2110.07827

[2] Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

[3] Chartbeat Labs, & DeWilde, B. (2016, February). Information Extraction. Textacy ~ NLP, before and after spaCy. https://textacy.readthedocs.io/en/latest/api_reference/extract.html#textacy.extract.triples.direct_quotations
