Push model using huggingface_hub.
- README.md +9 -145
- config.json +0 -0
- model.safetensors +3 -0
README.md
CHANGED
@@ -1,145 +1,9 @@
- **The Cancer Genome Atlas, [TCGA](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga)** offers sequencing data of small RNAs and is used to evaluate TransfoRNA's classification performance.
- Sequences are annotated using a knowledge-based annotation approach that provides annotations for ~2k different sub-classes belonging to 11 major classes.
- Knowledge-based annotations are divided into three sets of varying confidence levels: a **high-confidence (HICO)** set, a **low-confidence (LOCO)** set, and a **non-annotated (NA)** set for sequences that could not be annotated at all. Only HICO annotations are used for training.
- HICO RNAs cover ~2k sub-classes and constitute 19.6% of all RNAs found in TCGA. The LOCO and NA sets comprise 66.9% and 13.6% of RNAs, respectively.
- HICO RNAs are further divided into **in-distribution (ID)** (374 sub-classes) and **out-of-distribution (OOD)** (1549 sub-classes) sets.
- Criterion for ID vs. OOD: sub-classes containing more than 8 sequences are considered ID; otherwise OOD.
- An additional **putative 5' adapter affixes set** contains 294 sequences known to be technical artefacts: their 5' end perfectly matches the last five or more nucleotides of the 5' adapter sequence commonly used in small RNA sequencing.
- The knowledge-based annotation (KBA) pipeline, including an installation guide, is located under `kba_pipeline`.
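The ID/OOD criterion above is a simple count threshold over sub-classes. A minimal sketch (the labels below are made-up toys, not TCGA data):

```python
from collections import Counter

def split_id_ood(subclass_per_sequence, threshold=8):
    """Partition sub-classes into ID/OOD: a sub-class with more than
    `threshold` sequences is ID, otherwise OOD (the criterion above)."""
    counts = Counter(subclass_per_sequence)
    id_set = {sc for sc, n in counts.items() if n > threshold}
    ood_set = set(counts) - id_set
    return id_set, ood_set

# Toy labels: 'miR-x' has 9 sequences (> 8, so ID), 'tRF-y' has 2 (so OOD).
labels = ["miR-x"] * 9 + ["tRF-y"] * 2
id_set, ood_set = split_id_ood(labels)
print(id_set, ood_set)  # {'miR-x'} {'tRF-y'}
```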

## Models

There are 5 classifier models currently available, each with a different input representation.
- Baseline:
  - Input: (single input) Sequence
  - Model: An embedding layer that converts sequences into vectors, followed by a classification feed-forward layer.
- Seq:
  - Input: (single input) Sequence
  - Model: A transformer-based encoder model.
- Seq-Seq:
  - Input: (dual inputs) Sequence divided into even and odd tokens.
  - Model: One transformer encoder for the odd tokens and another for the even tokens.
- Seq-Struct:
  - Input: (dual inputs) Sequence + secondary structure
  - Model: A transformer encoder for the sequence and another for the secondary structure.
- Seq-Rev (best performing):
  - Input: (dual inputs) Sequence
  - Model: A transformer encoder for the sequence and another for the sequence reversed.

*Note: These transformer-based models show overlapping and distinct capabilities. Consequently, an ensemble model is created to leverage those capabilities.*
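The even/odd split consumed by the Seq-Seq model's two encoders can be sketched as follows (treating single nucleotides as tokens is an assumption for illustration; the actual tokenization may differ):

```python
def split_even_odd(tokens):
    """Split a token list into even- and odd-positioned tokens,
    one stream per encoder of the Seq-Seq model."""
    return tokens[0::2], tokens[1::2]

# Single-nucleotide tokens, for illustration only.
even, odd = split_even_odd(list("ACGUACGU"))
print(even)  # ['A', 'G', 'A', 'G']
print(odd)   # ['C', 'U', 'C', 'U']
```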

<img width="948" alt="Screenshot 2023-08-16 at 16 39 20" src="https://github.com/gitHBDX/TransfoRNA-Framework/assets/82571392/d7d092d8-8cbd-492a-9ccc-994ffdd5aa5f">

## Data Availability

Downloading the data and the models can be done from [here](https://www.dropbox.com/sh/y7u8cofmg41qs0y/AADvj5lw91bx7fcDxghMbMtsa?dl=0).

This will download three subfolders that should be kept at the same folder level as `src`:
- `data`: Contains three files:
  - `TCGA`: an anndata file with ~75k sequences, whose `var` columns contain the knowledge-based annotations.
  - `HBDxBase.csv`: a list of RNA precursors which are used for data augmentation.
  - `subclass_to_annotation.json`: holds mappings from every sub-class to its major class.
- `models`:
  - `benchmark`: contains benchmark models trained on sncRNA and premiRNA data. (See additional datasets at the bottom.)
  - `tcga`: all models trained on the TCGA data; `TransfoRNA_ID` (for testing and validation) and `TransfoRNA_FULL` (the production version) with higher RNA major- and sub-class coverage. Each of the two folders contains all the models trained separately on major-class and sub-class targets.
- `kba_pipeline`: contains the mapping reference data required to run the knowledge-based pipeline manually.
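`subclass_to_annotation.json` is a flat sub-class-to-major-class mapping. A toy sketch of loading and using it (the entries below are illustrative examples, not the real file contents):

```python
import json

# Stand-in for subclass_to_annotation.json; example entries only.
mapping_json = '{"miR-141-5p": "miRNA", "some-tRF-subclass": "tRNA"}'

subclass_to_major = json.loads(mapping_json)
print(subclass_to_major["miR-141-5p"])  # miRNA
```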

## Repo Structure

- `configs`: Contains the configurations of each model, plus training and inference settings.

The `configs/main_config.yaml` file offers options to change the task, the training settings and the logging. The following shows all the options and the permitted values for each option.

<img width="835" alt="Screenshot 2024-05-22 at 10 19 15" src="https://github.com/gitHBDX/TransfoRNA/assets/82571392/225d2c98-ed45-4ca7-9e86-557a73af702d">

- `transforna` contains two folders:
  - `src`: contains the transforna package. View transforna's architecture [here](https://github.com/gitHBDX/TransfoRNA/blob/master/transforna/src/readme.md).
  - `bin`: contains all scripts necessary for reproducing the manuscript figures.

## Installation

`install.sh` is a script that creates a transforna environment in which all the packages required by TransfoRNA are installed. Simply navigate to the root directory and run from the terminal:

```
# make the install script executable
chmod +x install.sh

# run the script
./install.sh
```

## TransfoRNA API

In `transforna/src/inference/inference_api.py`, all the functionalities of transforna are offered as APIs. There are two functions of interest:

- `predict_transforna`: computes, for a set of sequences and a given model, one of several options: the embeddings, logits, explanatory (similar) sequences, attention masks or UMAP coordinates.
- `predict_transforna_all_models`: same as `predict_transforna`, but computes the desired option for all models and also aggregates the output of the ensemble model.

Both return a pandas DataFrame containing the sequence along with the desired computation.

Check the script at `src/test_inference_api.py` for a basic demo of how to call either of the APIs.

## Inference from terminal

For inference, two paths in `configs/inference_settings/default.yaml` have to be edited:

- `sequences_path`: the full path to a csv file containing the sequences for which annotations are to be inferred.
- `model_path`: the full path of the model. (Currently this points to the Seq model.)

Also, in the `main_config.yaml`, make sure to edit `model_name` to match the input expected by the loaded model:

- `model_name`: the name of the model. One of `"seq"`, `"seq-seq"`, `"seq-struct"`, `"baseline"` or `"seq-rev"` (see above).
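The two edits might then look like this; only the key names come from the description above, and the paths are placeholders to adapt:

```
# configs/inference_settings/default.yaml
sequences_path: /path/to/my_sequences.csv
model_path: /path/to/models/tcga/seq_model_checkpoint
```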

Then, navigate to the repository's root directory and run the following command:

```
python transforna/__main__.py inference=True
```

After inference, an `inference_output` folder will be created under `outputs/`, which will include two files:

- `(model_name)_embedds.csv`: contains a vector embedding per sequence in the inference set (could be used for downstream tasks).

  *Note: The embeddings of each sequence are only logged if `log_embedds` in the `main_config` is `True`.*

- `(model_name)_inference_results.csv`: contains a `Net-Label` column with the predicted label and an `Is Familiar?` boolean column with the model's novelty-predictor output (`True`: familiar, `False`: novel).

  *Note: The output will also contain the logits of the model if `log_logits` in the `main_config` is `True`.*
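The results file can be post-processed with plain csv tooling. A sketch using a toy stand-in for `(model_name)_inference_results.csv` (the sequences and labels below are made up; the column names follow the description above):

```python
import csv
import io

# Toy stand-in for an inference results file; made-up rows.
toy_results = io.StringIO(
    "Sequence,Net-Label,Is Familiar?\n"
    "TGAGGTAGTAGGTTGTATAGTT,miRNA,True\n"
    "ACGTACGTACGTACGTACGT,tRNA,False\n"
)

reader = csv.DictReader(toy_results)
# Keep only sequences the novelty predictor flags as familiar.
familiar = [row["Sequence"] for row in reader if row["Is Familiar?"] == "True"]
print(familiar)  # ['TGAGGTAGTAGGTTGTATAGTT']
```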

## Train on custom data

TransfoRNA can be trained using input data as AnnData, csv or fasta. If the input is AnnData, then `anndata.var` should contain all the sequences. Some changes have to be made (follow `configs/train_model_configs/tcga`):

In `configs/train_model_configs/custom`:

- `dataset_path_train` has to point to the input data, which should contain: a `sequence` column; a `small_RNA_class_annotation` column indicating the major class if available (otherwise NaN); a `five_prime_adapter_filter` column specifying whether the sequence is considered a real sequence or an artifact (`True` for real, `False` for artifact); a `subclass_name` column containing the sub-class name if available (otherwise NaN); and a boolean column `hico` indicating whether a sequence is high confidence or not.
- If sampling from the precursor is required in order to augment the sub-classes, `precursor_file_path` should include the precursors. Follow the scheme of `HBDxBase.csv` and have a look at the `PrecursorAugmenter` class in `transforna/src/processing/augmentation.py`.
- `mapping_dict_path` should contain the mapping from sub-class to major class, e.g. 'miR-141-5p' to 'miRNA'.
- `clf_target` sets the classification target of the model and should be either `sub_class_hico` for training on targets in `subclass_name`, or `major_class_hico` for training on targets in `small_RNA_class_annotation`. For both, only high-confidence sequences are selected for training (based on the `hico` column).
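A minimal sketch of assembling a csv with the expected columns (the sequences and labels are made up; unavailable annotations are left empty, standing in for NaN):

```python
import csv
import io

# Columns required by `dataset_path_train`, per the list above.
fieldnames = ["sequence", "small_RNA_class_annotation",
              "five_prime_adapter_filter", "subclass_name", "hico"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
# One annotated high-confidence row and one unannotated row (made-up values).
writer.writerow({"sequence": "TGAGGTAGTAGGTTGTATAGTT",
                 "small_RNA_class_annotation": "miRNA",
                 "five_prime_adapter_filter": True,
                 "subclass_name": "let-7a-5p",
                 "hico": True})
writer.writerow({"sequence": "ACGTACGTACGTACGT",
                 "small_RNA_class_annotation": "",
                 "five_prime_adapter_filter": True,
                 "subclass_name": "",
                 "hico": False})
print(buf.getvalue().splitlines()[0])  # the header row
```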

In `configs/main_config`, some changes should be made:

- change `task` to `custom`, or to whatever name `custom.py` has been renamed to.
- set `model_name` as desired.

For training TransfoRNA from the root directory:

```
python transforna/__main__.py
```

Using [Hydra](https://hydra.cc/), any option in the main config can be changed. For instance, to train a `Seq-Struct` TransfoRNA model without using a validation split:

```
python transforna/__main__.py train_split=False model_name='seq-struct'
```

After training, an output folder is automatically created in the root directory, where training is logged.

The structure of the output folder is chosen by Hydra to be `/day/time/results folders`. The results folders are a set of folders created during training:

- `ckpt`: contains the latest checkpoint of the model.
- `embedds`:
  - Contains a file per split (train/valid/test/ood/na).
  - Each file is a csv containing the sequences plus their embeddings (obtained by the model; a numeric representation of a given RNA sequence) as well as the logits. The logits are the values the model produces for each sequence, reflecting its confidence that a sequence belongs to a certain class.
- `meta`: a folder containing a yaml file with all the hyperparameters used for the current run.
- `analysis`: contains the learned novelty threshold separating the in-distribution set (familiar) from the out-of-distribution set (novel).
- `figures`: some figures are saved, including the Normalized Levenshtein Distance (NLD) distribution per split.
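The NLD in those figures is a normalized edit distance. A common definition, used here as an assumption (the exact normalization in the codebase may differ), is the Levenshtein distance divided by the length of the longer sequence:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nld(a: str, b: str) -> float:
    """Normalized Levenshtein distance in [0, 1]."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

print(nld("ACGT", "ACGT"))  # 0.0
print(nld("ACGT", "ACGA"))  # 0.25
```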

## Additional Datasets (Objective):

- sncRNA, collected from [RFam](https://rfam.org/) (classification of RNA precursors into 13 classes)
- premiRNA, collected from [human miRNAs](http://www.mirbase.org) (classification of true vs. pseudo precursors)

---
tags:
- pytorch_model_hub_mixin
- model_hub_mixin
---

This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:

- Library: [More Information Needed]
- Docs: [More Information Needed]
config.json
ADDED
The diff for this file is too large to render.
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:edbf105ab477ca595e2a03ca92eaa23b93488720cd4743cb8d4fac183899238c
size 7089068