	---
	language: en
	license: other
	library_name: span-marker
	tags:
	- span-marker
	- token-classification
	- ner
	- named-entity-recognition
	- generated_from_span_marker_trainer
	datasets:
	- tner/bionlp2004
	metrics:
	- precision
	- recall
	- f1
	widget:
	- text: Coexpression of HMG I/Y and Oct-2 in cell lines lacking Oct-2 results in high
	    levels of HLA-DRA gene expression, and in vitro DNA-binding studies reveal that
	    HMG I/Y stimulates Oct-2A binding to the HLA-DRA promoter.
	- text: In erythroid cells most of the transcription activity was contained in a 150
	    bp promoter fragment with binding sites for transcription factors AP2, Sp1 and
	    the erythroid-specific GATA-1.
	- text: 'Synergy between signal transduction pathways is obligatory for expression
	    of c-fos in B and T cell lines: implication for c-fos control via surface immunoglobulin
	    and T cell antigen receptors.'
	- text: CIITA mRNA is normally inducible by IFN-gamma in class II non-inducible,
	    RB-defective lines, and in one line, re-expression of RB has no effect on CIITA
	    mRNA induction levels.
	- text: As we reported previously, MNDA mRNA level in adherent monocytes is elevated
	    by IFN-alpha; in this study, we further assessed MNDA expression in in vitro
	    monocyte-derived macrophages.
	pipeline_tag: token-classification
	co2_eq_emissions:
	  emissions: 45.104
	  source: codecarbon
	  training_type: fine-tuning
	  on_cloud: false
	  gpu_model: 1 x NVIDIA GeForce RTX 3090
	  cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
	  ram_total_size: 31.777088165283203
	  hours_used: 0.296
	model-index:
	- name: SpanMarker with bert-base-uncased on BioNLP2004
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      name: BioNLP2004
	      type: tner/bionlp2004
	      split: test
	    metrics:
	    - type: f1
	      value: 0.7620637836032726
	      name: F1
	    - type: precision
	      value: 0.7289958470876371
	      name: Precision
	    - type: recall
	      value: 0.7982742537313433
	      name: Recall
	---

	# SpanMarker with bert-base-uncased on BioNLP2004

	This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [BioNLP2004](https://huggingface.co/datasets/tner/bionlp2004) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-uncased](https://huggingface.co/bert-base-uncased) as the underlying encoder. See [train.py](train.py) for the training script.

	## Model Details

	### Model Description

	- Model Type: SpanMarker
	- Encoder: [bert-base-uncased](https://huggingface.co/bert-base-uncased)
	- Maximum Sequence Length: 256 tokens
	- Maximum Entity Length: 8 words
	- Training Dataset: [BioNLP2004](https://huggingface.co/datasets/tner/bionlp2004)
	- Language: en
	- License: other

	### Model Sources

	- Repository: [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
	- Thesis: [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

	### Model Labels
	| Label     | Examples                                                                                         |
	|:----------|:-------------------------------------------------------------------------------------------------|
	| DNA       | "immunoglobulin heavy-chain enhancer", "enhancer", "immunoglobulin heavy-chain ( IgH ) enhancer" |
	| RNA       | "GATA-1 mRNA", "c-myb mRNA", "antisense myb RNA"                                                 |
	| cell_line | "monocytic U937 cells", "TNF-treated HUVECs", "HUVECs"                                           |
	| cell_type | "B cells", "non-B cells", "human red blood cells"                                                |
	| protein   | "ICAM-1", "VCAM-1", "NADPH oxidase"                                                              |

	## Evaluation

	### Metrics
	| Label     | Precision | Recall | F1     |
	|:----------|:----------|:-------|:-------|
	| all       | 0.7290    | 0.7983 | 0.7621 |
	| DNA       | 0.7174    | 0.7505 | 0.7336 |
	| RNA       | 0.6977    | 0.7692 | 0.7317 |
	| cell_line | 0.5831    | 0.7020 | 0.6370 |
	| cell_type | 0.8222    | 0.7381 | 0.7779 |
	| protein   | 0.7196    | 0.8407 | 0.7755 |
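
The F1 column is the harmonic mean of precision and recall, so each row can be checked by hand. The sketch below recomputes F1 from the rounded precision and recall values in the table; the helper name `f1_score` is made up for this illustration, and because the inputs are already rounded to four decimals, the recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics table above.
reported = {
    "all": (0.7290, 0.7983, 0.7621),
    "DNA": (0.7174, 0.7505, 0.7336),
    "RNA": (0.6977, 0.7692, 0.7317),
    "cell_line": (0.5831, 0.7020, 0.6370),
    "cell_type": (0.8222, 0.7381, 0.7779),
    "protein": (0.7196, 0.8407, 0.7755),
}

for label, (precision, recall, f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{label:<10} reported={f1:.4f} recomputed={computed:.4f}")
```

Note that `cell_line` has the lowest precision of the five labels, which is what pulls its F1 below the others.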

	## Uses

	### Direct Use

	```python
	from span_marker import SpanMarkerModel

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-bionlp")
	# Run inference
	entities = model.predict("In erythroid cells most of the transcription activity was contained in a 150 bp promoter fragment with binding sites for transcription factors AP2, Sp1 and the erythroid-specific GATA-1.")
	```
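
To the best of my knowledge, `predict` returns a list of dictionaries with `span`, `label`, and `score` keys (plus character offsets), but treat that shape as an assumption and check it against your installed SpanMarker version. Under that assumption, a small helper such as the hypothetical `group_by_label` below can filter low-confidence spans and group the rest by entity type. The sample list is hand-written placeholder data, not actual output of this model.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Group predicted spans by label, dropping spans scored below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written placeholder predictions (NOT real model output), shaped like
# the dictionaries that model.predict(...) is assumed to return.
sample_entities = [
    {"span": "erythroid cells", "label": "cell_type", "score": 0.90},
    {"span": "150 bp promoter fragment", "label": "DNA", "score": 0.40},
    {"span": "GATA-1", "label": "protein", "score": 0.95},
]

grouped = group_by_label(sample_entities, min_score=0.5)
print(grouped)
```

With real predictions, replace `sample_entities` with the `entities` list from the snippet above.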

	### Downstream Use
	You can fine-tune this model on your own dataset.

	<details><summary>Click to expand</summary>

	```python
	from datasets import load_dataset
	from span_marker import SpanMarkerModel, Trainer

	# Download from the 🤗 Hub
	model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-bionlp")

	# Specify a Dataset with "tokens" and "ner_tags" columns
	dataset = load_dataset("conll2003")  # For example CoNLL2003

	# Initialize a Trainer using the pretrained model & dataset
	trainer = Trainer(
	    model=model,
	    train_dataset=dataset["train"],
	    eval_dataset=dataset["validation"],
	)
	trainer.train()
	trainer.save_model("tomaarsen/span-marker-bert-base-uncased-bionlp-finetuned")
	```
	</details>

	## Training Details

	### Training Set Metrics
	| Training set          | Min | Median  | Max |
	|:----------------------|:----|:--------|:----|
	| Sentence length       | 2   | 26.5790 | 166 |
	| Entities per sentence | 0   | 2.7528  | 23  |

	### Training Hyperparameters
	- learning_rate: 5e-05
	- train_batch_size: 32
	- eval_batch_size: 32
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 3
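
To make the `linear` scheduler and `lr_scheduler_warmup_ratio: 0.1` settings concrete, here is a minimal sketch of the usual shape: a linear ramp from 0 to the peak learning rate during warmup, then a linear decay back to 0. It mirrors what the Transformers linear-with-warmup scheduler generally does, but the exact rounding of the warmup step count is an assumption. The total of 1998 steps is also an estimate: the Training Results table reports step 300 at epoch 0.4505, which suggests roughly 666 steps per epoch over 3 epochs.

```python
import math


def linear_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Estimated, not reported: ~666 steps per epoch * 3 epochs.
TOTAL_STEPS = 1998

for step in (0, 100, 200, 1000, 1998):
    print(f"step {step:>4}: lr = {linear_lr(step, TOTAL_STEPS):.2e}")
```

Under these assumptions the peak rate of 5e-05 is reached at about step 200, before the first evaluation at step 300.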

	### Training Results
	| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
	|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
	| 0.4505 | 300  | 0.0210          | 0.7497               | 0.7659            | 0.7577        | 0.9254              |
	| 0.9009 | 600  | 0.0162          | 0.8048               | 0.8217            | 0.8131        | 0.9432              |
	| 1.3514 | 900  | 0.0154          | 0.8126               | 0.8249            | 0.8187        | 0.9434              |
	| 1.8018 | 1200 | 0.0149          | 0.8148               | 0.8451            | 0.8296        | 0.9481              |
	| 2.2523 | 1500 | 0.0150          | 0.8297               | 0.8438            | 0.8367        | 0.9501              |
	| 2.7027 | 1800 | 0.0145          | 0.8280               | 0.8443            | 0.8361        | 0.9501              |
	### Environmental Impact
	Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon).
	- Carbon Emitted: 0.045 kg of CO2
	- Hours Used: 0.296 hours

	### Training Hardware
	- On Cloud: No
	- GPU Model: 1 x NVIDIA GeForce RTX 3090
	- CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
	- RAM Size: 31.78 GB

	### Framework Versions

	- Python: 3.9.16
	- SpanMarker: 1.3.1.dev
	- Transformers: 4.29.2
	- PyTorch: 2.0.1+cu118
	- Datasets: 2.14.3
	- Tokenizers: 0.13.2