Upload 8 files

- README.md +100 -0
- config.json +35 -0
- merges.txt +0 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- vocab.json +0 -0
README.md
ADDED
@@ -0,0 +1,100 @@
---
language:
- en
tags:
- ner
- ncbi
- disease
- pubmed
- bioinformatics
license: apache-2.0
datasets:
- ncbi-disease
- bc5cdr
- tner/bc5cdr
- jnlpba
- bc2gm_corpus
- drAbreu/bc4chemd_ner
- linnaeus
- ncbi_disease
widget:
- text: "Hepatocyte nuclear factor 4 alpha (HNF4α) is regulated by different promoters to generate two isoforms, one of which functions as a tumor suppressor. Here, the authors reveal that induction of the alternative isoform in hepatocellular carcinoma inhibits the circadian clock by repressing BMAL1, and the reintroduction of BMAL1 prevents HCC tumor growth."
---

# NER to find Disease
> The model was trained on the ncbi-disease and BC5CDR datasets, starting from this [pubmed-pretrained RoBERTa model](/raynardj/roberta-pubmed).
All the labels, i.e. the possible token classes:
```json
{
  "label2id": {
    "O": 0,
    "Disease": 1
  }
}
```

Notice that we removed the 'B-', 'I-' prefixes from the data labels. 🗡
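
For illustration only (this helper is our sketch, not part of the training code), collapsing IOB tags down to bare classes looks like:

```python
# Hypothetical preprocessing sketch: drop the "B-"/"I-" prefixes so that
# "B-Disease" and "I-Disease" both become the bare class "Disease".
def strip_iob(tag: str) -> str:
    return tag.split("-", 1)[1] if tag.startswith(("B-", "I-")) else tag

assert strip_iob("B-Disease") == "Disease"
assert strip_iob("I-Disease") == "Disease"
assert strip_iob("O") == "O"
```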

## Suggested template for using the model
```python
from transformers import pipeline

PRETRAINED = "raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed"
ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
ner("Your text", aggregation_strategy="first")
```
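For reference, with `aggregation_strategy="first"` the pipeline merges subword tokens and returns one dict per entity span; the snippet below shows the shape (the score and offsets in the comment are illustrative, not actual model output):

```python
results = ner(
    "Induction of the alternative isoform in hepatocellular carcinoma "
    "inhibits the circadian clock.",
    aggregation_strategy="first",
)
# each element looks like (values illustrative):
# {"entity_group": "Disease", "score": 0.99,
#  "word": " hepatocellular carcinoma", "start": 40, "end": 64}
print(results)
```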
And here is a helper to merge the output into consecutive spans ⭐️
```python
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)

def clean_output(outputs):
    results = []
    current = []
    last_idx = 0
    # group token-level outputs into runs of consecutive positions
    for output in outputs:
        if output["index"] - 1 == last_idx:
            current.append(output)
        else:
            results.append(current)
            current = [output]
        last_idx = output["index"]
    if len(current) > 0:
        results.append(current)

    # merge each run of tokens back into a single string
    strings = []
    for c in results:
        tokens = []
        starts = []
        ends = []
        for o in c:
            tokens.append(o["word"])
            starts.append(o["start"])
            ends.append(o["end"])

        new_str = tokenizer.convert_tokens_to_string(tokens)
        if new_str != "":
            strings.append(dict(
                word=new_str,
                start=min(starts),
                end=max(ends),
                entity=c[0]["entity"],
            ))
    return strings

def entity_table(pipeline, **pipeline_kw):
    if "aggregation_strategy" not in pipeline_kw:
        # clean_output relies on the token-level "index" and "entity" keys,
        # which the pipeline only returns without aggregation
        pipeline_kw["aggregation_strategy"] = "none"
    def create_table(text):
        return pd.DataFrame(
            clean_output(
                pipeline(text, **pipeline_kw)
            )
        )
    return create_table

# returns a DataFrame
entity_table(ner)(YOUR_VERY_CONTENTFUL_TEXT)
```
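Assuming the helper above (fed token-level output), each row of the resulting DataFrame is one merged span with `word`, `start`, `end` and `entity` columns; the example text here is ours, not from the card:

```python
df = entity_table(ner)("Patients with hepatocellular carcinoma were enrolled.")
print(df.columns.tolist())  # ['word', 'start', 'end', 'entity']
```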
> Check out our NER models:
* [gene and gene products](/raynardj/ner-gene-dna-rna-jnlpba-pubmed)
* [chemical substance](/raynardj/ner-chemical-bionlp-bc5cdr-pubmed)
* [disease](/raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed)
config.json
ADDED
@@ -0,0 +1,35 @@
{
  "_name_or_path": "raynardj/roberta-pubmed",
  "architectures": [
    "RobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "Disease"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "Disease": 1,
    "O": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.9.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
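Not part of the upload itself, but as a quick orientation, here is a minimal sketch (our example, not the author's code) of loading the checkpoint and reading back the label mapping recorded in this config:

```python
import torch
from transformers import AutoConfig, AutoModelForTokenClassification, AutoTokenizer

name = "raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed"
config = AutoConfig.from_pretrained(name)
print(config.id2label)  # {0: 'O', 1: 'Disease'}

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

inputs = tokenizer("hepatocellular carcinoma", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# map each token's argmax class id back through id2label
print([config.id2label[i] for i in logits.argmax(-1)[0].tolist()])
```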
merges.txt
ADDED
The diff for this file is too large to render.
See raw diff
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e10d8bbbd5c112c44762c48d04ce312c964b98391fb044685a09a9f7da4b5cdb
size 496313335
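This is a Git LFS pointer file rather than the weights themselves: the ~496 MB binary lives in LFS storage, and the pointer records only the spec version, the SHA-256 object id, and the size in bytes.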
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tokenizer.json
ADDED
The diff for this file is too large to render.
See raw diff
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>", "add_prefix_space": true, "errors": "replace", "sep_token": "</s>", "cls_token": "<s>", "pad_token": "<pad>", "mask_token": "<mask>", "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "raynardj/roberta-pubmed", "tokenizer_class": "RobertaTokenizer"}
vocab.json
ADDED
The diff for this file is too large to render.
See raw diff