	---
	library_name: span-marker
	tags:
	- span-marker
	- token-classification
	- ner
	- named-entity-recognition
	- generated_from_span_marker_trainer
	datasets:
	- conll2003
	metrics:
	- precision
	- recall
	- f1
	widget:
	- text: Atlanta Games silver medal winner Edwards has called on other leading athletes
	    to take part in the Sarajevo meeting--a goodwill gesture towards Bosnia as it
	    recovers from the war in the Balkans--two days after the grand prix final in Milan.
	- text: Portsmouth:Middlesex 199 and 426 (J. Pooley 111,M. Ramprakash 108,M. Gatting
	    83), Hampshire 232 and 109-5.
	- text: Poland's Foreign Minister Dariusz Rosati will visit Yugoslavia on September
	    3 and 4 to revive a dialogue between the two governments which was effectively
	    frozen in 1992,PAP news agency reported on Friday.
	- text: The authorities are apparently extremely afraid of any political and social
	    discontent," said Xiao,in Manila to attend an Amnesty International conference
	    on human rights in China.
	- text: American Nate Miller successfully defended his WBA cruiserweight title when
	    he knocked out compatriot James Heath in the seventh round of their bout on Saturday.
	pipeline_tag: token-classification
	model-index:
	- name: SpanMarker
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      name: Unknown
	      type: conll2003
	      split: eval
	    metrics:
	    - type: f1
	      value: 0.9550004205568171
	      name: F1
	    - type: precision
	      value: 0.9542780299209951
	      name: Precision
	    - type: recall
	      value: 0.9557239057239058
	      name: Recall
	---

	# SpanMarker

	This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2003](https://huggingface.co/datasets/conll2003) dataset that can be used for Named Entity Recognition.

	## Model Details

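	Important note: I used the tokenizer from `roberta-base`, so it has to be loaded and set on the model explicitly after the model is downloaded (the lines marked `+` below).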
	```diff
	from span_marker import SpanMarkerModel
	from span_marker.tokenizer import SpanMarkerTokenizer

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("lambdavi/span-marker-luke-base-conll2003")
	+tokenizer = SpanMarkerTokenizer.from_pretrained("roberta-base", config=model.tokenizer.config)
	+model.set_tokenizer(tokenizer)

	# Run inference
	entities = model.predict("Portsmouth:Middlesex 199 and 426 (J. Pooley 111,M. Ramprakash 108,M. Gatting 83), Hampshire 232 and 109-5.")
	```

	### Model Description
	- Model Type: SpanMarker
	<!-- - Encoder: [Unknown](https://huggingface.co/unknown) -->
	- Maximum Sequence Length: 512 tokens
	- Maximum Entity Length: 8 words
	- Training Dataset: [conll2003](https://huggingface.co/datasets/conll2003)
	<!-- - Language: Unknown -->
	<!-- - License: Unknown -->
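To give a feel for what the maximum entity length of 8 words means in practice, the sketch below counts the contiguous word windows of at most 8 words in a sentence. It assumes that candidate entities are exactly such windows; `count_candidate_spans` is a hypothetical helper written for this illustration and is not part of the SpanMarker API.

```python
def count_candidate_spans(num_words: int, max_entity_length: int = 8) -> int:
    """Count contiguous word windows of length 1..max_entity_length.

    Illustrative helper only (not a SpanMarker function). A sentence of
    `num_words` words has `num_words - length + 1` windows of a given length.
    """
    longest = min(max_entity_length, num_words)
    return sum(num_words - length + 1 for length in range(1, longest + 1))


# 113 is the longest sentence length listed under Training Set Metrics
for sentence_length in (3, 10, 113):
    print(sentence_length, count_candidate_spans(sentence_length))
```

Under this assumption, any entity mention longer than 8 words falls outside the candidate set and cannot be predicted as a single span.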

	### Model Sources

	- Repository: [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
	- Thesis: [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

	### Model Labels
	| Label | Examples                                                      |
	|:------|:--------------------------------------------------------------|
	| LOC   | "Germany", "BRUSSELS", "Britain"                              |
	| MISC  | "German", "British", "EU-wide"                                |
	| ORG   | "European Commission", "EU", "European Union"                 |
	| PER   | "Werner Zwingmann", "Nikolaus van der Pas", "Peter Blackburn" |
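These four labels correspond to the entity types of the CoNLL-2003 IOB2 tag scheme (`B-PER`, `I-PER`, `B-ORG`, and so on). As a rough illustration of how word-level tags relate to the labelled spans in the table, the sketch below collapses IOB2 tags into `(text, label)` pairs. `bio_to_spans` is a hypothetical helper for this example, not a SpanMarker function, and the sample sentence is only there to exercise it.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse word-level IOB2 tags into (entity text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            # Continuation of the entity that is currently open
            current_words.append(token)
            continue
        # Any other tag closes the open entity, if there is one
        if current_words:
            spans.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
        # "B-" starts a new entity; a stray "I-" is treated the same way
        if prefix in ("B", "I"):
            current_words, current_label = [token], label

    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(bio_to_spans(tokens, tags))
```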

	## Uses

	### Direct Use for Inference

	```python
	from span_marker import SpanMarkerModel
	from span_marker.tokenizer import SpanMarkerTokenizer

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("lambdavi/span-marker-luke-base-conll2003")
	tokenizer = SpanMarkerTokenizer.from_pretrained("roberta-base", config=model.tokenizer.config)
	model.set_tokenizer(tokenizer)

	# Run inference
	entities = model.predict("Portsmouth:Middlesex 199 and 426 (J. Pooley 111,M. Ramprakash 108,M. Gatting 83), Hampshire 232 and 109-5.")
	```
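For a single input string, `model.predict` returns a list of dictionaries, one per detected entity, with keys such as `span`, `label` and `score`. The sketch below assumes that shape and shows one way to filter and group the results. `group_by_label` is a hypothetical helper, and `sample_entities` is a hand-written placeholder list, not real output from this model.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict], min_score: float = 0.0) -> Dict[str, List[str]]:
    """Group entity texts by label, dropping predictions below `min_score`."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written placeholder entities (NOT real model output); labels and
# scores are made up purely to exercise the helper.
sample_entities = [
    {"span": "Portsmouth", "label": "LOC", "score": 0.99},
    {"span": "Middlesex", "label": "ORG", "score": 0.98},
    {"span": "J. Pooley", "label": "PER", "score": 0.40},
]

print(group_by_label(sample_entities, min_score=0.5))
```

In real use, the `entities` list returned by `model.predict(...)` would be passed in place of `sample_entities`.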

	### Downstream Use
	You can finetune this model on your own dataset.

	<details><summary>Click to expand</summary>

	```python
	from datasets import load_dataset
	from span_marker import SpanMarkerModel, Trainer

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("span_marker_model_id")

	# Specify a Dataset with "tokens" and "ner_tags" columns
	dataset = load_dataset("conll2003")  # For example CoNLL2003

	# Initialize a Trainer using the pretrained model & dataset
	trainer = Trainer(
	    model=model,
	    train_dataset=dataset["train"],
	    eval_dataset=dataset["validation"],
	)
	trainer.train()
	trainer.save_model("span_marker_model_id-finetuned")
	```
	</details>

	<!--
	### Out-of-Scope Use

	List how the model may foreseeably be misused and address what users ought not to do with the model.
	-->

	<!--
	## Bias, Risks and Limitations

	What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.
	-->

	<!--
	### Recommendations

	What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.
	-->

	## Training Details

	### Training Set Metrics
	| Training set          | Min | Median  | Max |
	|:----------------------|:----|:--------|:----|
	| Sentence length       | 1   | 14.5019 | 113 |
	| Entities per sentence | 0   | 1.6736  | 20  |

	### Training Hyperparameters
	- learning_rate: 1e-05
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 16
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 5
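The hyperparameters above interact in two ways worth spelling out: the total batch size is the per-device batch size times the gradient accumulation steps, and the linear scheduler ramps the learning rate up during warmup before decaying it to zero. The sketch below is plain Python under stated assumptions: it takes the 4415 total optimizer steps reported under Training Results, assumes warmup steps are rounded up, and assumes the usual warmup-then-linear-decay shape. It is not the training code used for this model.

```python
import math

LEARNING_RATE = 1e-05
TRAIN_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 2
WARMUP_RATIO = 0.1
TOTAL_STEPS = 4415  # final step count reported under Training Results

# 8 samples per step, accumulated over 2 steps, gives 16 per optimizer update
effective_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

# Assumption: the number of warmup steps is rounded up
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - warmup_steps)


print("effective batch size:", effective_batch_size)
print("warmup steps:", warmup_steps)
for step in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, linear_lr(step))
```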

	### Training Results
	| Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
	|:-----:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
	| 1.0   | 883  | 0.0123          | 0.9293               | 0.9274            | 0.9284        | 0.9848              |
	| 2.0   | 1766 | 0.0089          | 0.9412               | 0.9456            | 0.9434        | 0.9882              |
	| 3.0   | 2649 | 0.0077          | 0.9499               | 0.9505            | 0.9502        | 0.9893              |
	| 4.0   | 3532 | 0.0070          | 0.9527               | 0.9537            | 0.9532        | 0.9900              |
	| 5.0   | 4415 | 0.0068          | 0.9543               | 0.9557            | 0.9550        | 0.9902              |
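The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick sanity check, the sketch below recomputes it from the rounded precision and recall values in the table. Because the inputs are already rounded to four decimals, the recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1), copied from the table above
rows = [
    (1, 0.9293, 0.9274, 0.9284),
    (2, 0.9412, 0.9456, 0.9434),
    (3, 0.9499, 0.9505, 0.9502),
    (4, 0.9527, 0.9537, 0.9532),
    (5, 0.9543, 0.9557, 0.9550),
]

for epoch, precision, recall, reported_f1 in rows:
    computed = f1_score(precision, recall)
    print(f"epoch {epoch}: computed F1 = {computed:.4f}, reported F1 = {reported_f1:.4f}")
```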

	### Framework Versions
	- Python: 3.10.12
	- SpanMarker: 1.5.0
	- Transformers: 4.36.0
	- PyTorch: 2.0.0
	- Datasets: 2.16.1
	- Tokenizers: 0.15.0

	## Citation

	### BibTeX
	```
	@software{Aarsen_SpanMarker,
	    author = {Aarsen, Tom},
	    license = {Apache-2.0},
	    title = {{SpanMarker for Named Entity Recognition}},
	    url = {https://github.com/tomaarsen/SpanMarkerNER}
	}
	```

	<!--
	## Glossary

	Clearly define terms in order to be accessible across audiences.
	-->

	<!--
	## Model Card Authors

	Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.
	-->

	<!--
	## Model Card Contact

	Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.
	-->