	---
	language:
	- en
	- de
	- fr
	- it
	- nl

	tags:
	- punctuation prediction
	- punctuation

	datasets: wmt/europarl
	license: mit
	widget:
	- text: "Ho sentito che ti sei laureata il che mi fa molto piacere"
	  example_title: "Italian"
	- text: "Tous les matins vers quatre heures mon père ouvrait la porte de ma chambre"
	  example_title: "French"
	- text: "Ist das eine Frage Frau Müller"
	  example_title: "German"
	- text: "My name is Clara and I live in Berkeley California"
	  example_title: "English"

	metrics:
	- f1
	---


	# Model Card for fullstop-punctuation-multilingual-base

	# Model Details

	## Model Description

	This model predicts the end of sentence (EOS) and punctuation marks in automatically generated or transcribed text.

	- Developed by: Oliver Guhr
	- Shared by [Optional]: Oliver Guhr
	- Model type: Token Classification
	- Language(s) (NLP): English, German, French, Italian, Dutch
	- License: MIT
	- Parent Model: xlm-roberta-base
	- Resources for more information:
	  - [GitHub Repo](https://github.com/oliverguhr/fullstop-deep-punctuation-prediction)
	  - [Associated Paper](https://www.researchgate.net/profile/Oliver-Guhr/publication/355038679_FullStop_Multilingual_Deep_Models_for_Punctuation_Prediction/links/615a0ce3a6fae644fbd08724/FullStop-Multilingual-Deep-Models-for-Punctuation-Prediction.pdf)



	# Uses


	## Direct Use
	This model can be used for token classification: it assigns each token of an unpunctuated text a label indicating which punctuation mark, if any, follows it.

	## Downstream Use [Optional]

	More information needed.

	## Out-of-Scope Use

	The model should not be used to intentionally create hostile or alienating environments for people.

	# Bias, Risks, and Limitations


	Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.



	## Recommendations


	Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

	# Training Details

	## Training Data

	The model authors note in the [associated paper](https://www.researchgate.net/profile/Oliver-Guhr/publication/355038679_FullStop_Multilingual_Deep_Models_for_Punctuation_Prediction/links/615a0ce3a6fae644fbd08724/FullStop-Multilingual-Deep-Models-for-Punctuation-Prediction.pdf):
	> The task consists in predicting EOS and punctuation marks on unpunctuated lowercased text. The organizers of the SeppNLG shared task provided 470 MB of English, German, French, and Italian text. This data set consists of a training and a development set.
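
To make the task format concrete, here is a minimal sketch of how a punctuated sentence could be turned into an unpunctuated, lowercased word sequence with one label per word. It is an illustration only, not the authors' preprocessing: the function name `make_example` is hypothetical, and the label set (`0` for "no punctuation" plus `.`, `,`, `?`, `-`, `:`) is assumed from the classification report in the Results section.

```python
import string
from typing import List, Tuple

# Assumed label set; "0" stands for "no punctuation after this word".
MARKS = {".", ",", "?", "-", ":"}
NO_PUNCT = "0"


def make_example(text: str) -> Tuple[List[str], List[str]]:
    """Turn punctuated text into (lowercased words, per-word labels)."""
    words: List[str] = []
    labels: List[str] = []
    for raw in text.split():
        # The label is the trailing character if it is one of the target marks.
        label = raw[-1] if raw[-1] in MARKS else NO_PUNCT
        # Strip leading/trailing ASCII punctuation and lowercase the word.
        word = raw.strip(string.punctuation).lower()
        if word:
            words.append(word)
            labels.append(label)
        elif label != NO_PUNCT and labels:
            # A free-standing mark (e.g. " - ") is attached to the previous word.
            labels[-1] = label
    return words, labels
```

For example, `make_example("Ist das eine Frage, Frau Müller?")` yields the words `ist das eine frage frau müller` with the labels `0 0 0 , 0 ?`. A model trained on such pairs sees only the words and has to recover the labels.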


	## Training Procedure


	### Preprocessing

	More information needed





	### Speeds, Sizes, Times
	More information needed


	# Evaluation


	## Testing Data, Factors & Metrics

	### Testing Data

	More information needed


	### Factors
	More information needed

	### Metrics

	More information needed


	## Results

	### Classification report over all languages
	```
	              precision    recall  f1-score   support

	           0       0.99      0.99      0.99  47903344
	           .       0.94      0.95      0.95   2798780
	           ,       0.85      0.84      0.85   3451618
	           ?       0.88      0.85      0.87     88876
	           -       0.61      0.32      0.42    157863
	           :       0.72      0.52      0.60    103789

	    accuracy                           0.98  54504270
	   macro avg       0.83      0.75      0.78  54504270
	weighted avg       0.98      0.98      0.98  54504270
	```
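
The gap between the macro average (0.78) and the weighted average (0.98) follows from the class imbalance: the `0` (no punctuation) class accounts for most of the support. The sketch below recomputes both F1 averages from the rounded per-class values in the report, so it can only reproduce the reported figures up to rounding. The function names are illustrative.

```python
from typing import Dict, Tuple

# Per-class (f1-score, support), copied from the classification report.
REPORT: Dict[str, Tuple[float, int]] = {
    "0": (0.99, 47903344),
    ".": (0.95, 2798780),
    ",": (0.85, 3451618),
    "?": (0.87, 88876),
    "-": (0.42, 157863),
    ":": (0.60, 103789),
}


def macro_f1(report: Dict[str, Tuple[float, int]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for f1, _ in report.values()) / len(report)


def weighted_f1(report: Dict[str, Tuple[float, int]]) -> float:
    """Mean of the per-class F1 scores, weighted by class support."""
    total = sum(support for _, support in report.values())
    return sum(f1 * support for f1, support in report.values()) / total


print(f"macro F1:    {macro_f1(REPORT):.2f}")
print(f"weighted F1: {weighted_f1(REPORT):.2f}")
```

The macro average treats the rare `-` and `:` classes the same as the dominant `0` class, so their lower scores pull it down. The weighted average is driven almost entirely by the `0` class.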



	# Model Examination

	More information needed

	# Environmental Impact

	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

	- Hardware Type: More information needed
	- Hours used: More information needed
	- Cloud Provider: More information needed
	- Compute Region: More information needed
	- Carbon Emitted: More information needed

	# Technical Specifications [optional]

	## Model Architecture and Objective

	More information needed

	## Compute Infrastructure

	More information needed

	### Hardware


	More information needed

	### Software

	More information needed.

	# Citation


	BibTeX:


	```bibtex
	@inproceedings{guhr-EtAl:2021:fullstop,
	  title = {FullStop: Multilingual Deep Models for Punctuation Prediction},
	  author = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
	  booktitle = {Proceedings of the Swiss Text Analytics Conference 2021},
	  month = {June},
	  year = {2021},
	  address = {Winterthur, Switzerland},
	  publisher = {CEUR Workshop Proceedings},
	  url = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
	}
	```




	# Glossary [optional]
	More information needed

	# More Information [optional]
	More information needed


	# Model Card Authors [optional]

	Oliver Guhr in collaboration with Ezi Ozoani and the Hugging Face team


	# Model Card Contact

	More information needed

	# How to Get Started with the Model

	Use the code below to get started with the model.

	<details>
	<summary> Click to expand </summary>

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	# Checkpoint name on the Hugging Face Hub
	model_name = "oliverguhr/fullstop-punctuation-multilingual-base"

	# Load the tokenizer and the token-classification model
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)
	```
	</details>
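
Loading the model yields per-token labels, which still have to be mapped back onto words. The following is a minimal sketch of one way to do that with the `transformers` token-classification pipeline. It rests on several assumptions that are not stated in this card: the label names match the classes in the classification report (`0` meaning no punctuation), the pipeline returns character offsets in an `end` field, and the label of a word's last sub-token is used for the whole word. The helper names `word_labels`, `punctuate`, and `restore_punctuation` are hypothetical.

```python
from typing import Dict, List

NO_PUNCT = "0"  # assumed name of the "no punctuation" label


def word_labels(words: List[str], token_preds: List[Dict]) -> List[str]:
    """Collapse per-token predictions into one label per word.

    `token_preds` is expected to look like pipeline output for
    " ".join(words): dicts with an "entity" label and an "end" offset.
    """
    labels: List[str] = []
    char_end = 0  # end offset of the current word in the joined string
    i = 0
    for word in words:
        char_end += len(word)
        label = NO_PUNCT
        # Consume all sub-tokens of this word; the last one decides the label.
        while i < len(token_preds) and token_preds[i]["end"] <= char_end:
            label = token_preds[i]["entity"]
            i += 1
        labels.append(label)
        char_end += 1  # account for the separating space
    return labels


def punctuate(words: List[str], labels: List[str]) -> str:
    """Append each predicted mark to its word and join the result."""
    return " ".join(
        w if label == NO_PUNCT else w + label for w, label in zip(words, labels)
    )


def restore_punctuation(
    text: str,
    model_name: str = "oliverguhr/fullstop-punctuation-multilingual-base",
) -> str:
    """Run the model on a short text and return it with punctuation added."""
    # Imported here so the helpers above work without transformers installed.
    from transformers import pipeline

    pipe = pipeline(
        "token-classification", model=model_name, aggregation_strategy="none"
    )
    words = text.split()
    return punctuate(words, word_labels(words, pipe(" ".join(words))))
```

Calling `restore_punctuation("my name is clara and i live in berkeley california")` downloads the checkpoint on first use. The sketch handles only short inputs. Longer texts would need to be split into chunks that fit the model's maximum sequence length.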