ellisdoro
/

agro-all-MiniLM-L6-v2_cross_attention_gcn_h512_o64_cosine_e1024_early-on2vec-koji-early

ellisdoro committed on Sep 19

Commit

206dca2

verified ·

1 Parent(s): 63071e9

Upload agro_all-MiniLM-L6-v2_cross_attention_gcn_h512_o64_cosine_e1024_early model created with on2vec

Files changed (15)

.gitattributes +1 -0
1_TokenOntologyFusionModule/config.json +8 -0
1_TokenOntologyFusionModule/ontology_data.json +3 -0
1_TokenOntologyFusionModule/pytorch_model.bin +3 -0
2_Pooling/config.json +10 -0
README.md +122 -0
config.json +25 -0
config_sentence_transformers.json +14 -0
model.safetensors +3 -0
modules.json +20 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +65 -0
vocab.txt +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+1_TokenOntologyFusionModule/ontology_data.json filter=lfs diff=lfs merge=lfs -text

1_TokenOntologyFusionModule/config.json ADDED

	@@ -0,0 +1,8 @@

+{
+  "base_model_dim": 384,
+  "top_k_concepts": 5,
+  "fusion_method": "cross_attention",
+  "concept_weight": 0.2,
+  "relevance_threshold": 0.3,
+  "max_concepts_per_batch": 100
+}
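
The parameters above hint at how the fusion step might select and blend ontology concepts. The sketch below is one plausible reading, not on2vec's actual implementation: it assumes `relevance_threshold` is a cosine-similarity cutoff, `top_k_concepts` caps how many concepts are attended to, and `concept_weight` blends the attended concept vector into the token vector. The function name `fuse_token` and the idea that concept vectors are already projected to the token dimension are assumptions made for illustration.

```python
import math

# Values copied from 1_TokenOntologyFusionModule/config.json.
TOP_K_CONCEPTS = 5
CONCEPT_WEIGHT = 0.2
RELEVANCE_THRESHOLD = 0.3


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse_token(token_vec, concept_vecs):
    """Hypothetical fusion: attend over relevant concepts, then blend into the token."""
    scored = [(cosine(token_vec, c), c) for c in concept_vecs]
    # Assumed meaning of relevance_threshold: drop weakly related concepts.
    relevant = [(s, c) for s, c in scored if s >= RELEVANCE_THRESHOLD]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    relevant = relevant[:TOP_K_CONCEPTS]
    if not relevant:
        return list(token_vec)  # nothing relevant: token passes through unchanged
    # Softmax over the similarity scores gives the attention weights.
    top = max(s for s, _ in relevant)
    exps = [math.exp(s - top) for s, _ in relevant]
    weights = [e / sum(exps) for e in exps]
    context = [
        sum(w * c[i] for w, (_, c) in zip(weights, relevant))
        for i in range(len(token_vec))
    ]
    # Assumed meaning of concept_weight: linear blend of token and concept context.
    return [(1 - CONCEPT_WEIGHT) * t + CONCEPT_WEIGHT * x
            for t, x in zip(token_vec, context)]
```

Under these assumptions, a token with no concept above the threshold is returned unchanged, so text outside the ontology's vocabulary would behave like the base model.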

1_TokenOntologyFusionModule/ontology_data.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c2fe699e5b734dd7c4a1e776fc8bdb9507e80fb6cf6254a550f2c8f36b6d9283
+size 34497895
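
The three lines above are a Git LFS pointer rather than the file itself: they record the spec version, a SHA-256 digest, and the size in bytes of the real content. A minimal sketch of parsing such a pointer and checking downloaded bytes against it follows; `parse_lfs_pointer` and `matches_pointer` are illustrative helper names, not part of any library.

```python
import hashlib

# Pointer text copied from the ontology_data.json hunk above (without the diff "+").
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:c2fe699e5b734dd7c4a1e776fc8bdb9507e80fb6cf6254a550f2c8f36b6d9283
size 34497895
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """True if the bytes have the recorded size and hash."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
```

The same check applies to the other LFS-tracked files in this commit (`pytorch_model.bin` and `model.safetensors`).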

1_TokenOntologyFusionModule/pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a3169aa4697ed1e620c4b2c7c223179ec13c16c88fa24f99c26db8fd56ebb07a
+size 3553160

2_Pooling/config.json ADDED

	@@ -0,0 +1,10 @@

+{
+    "word_embedding_dimension": 384,
+    "pooling_mode_cls_token": false,
+    "pooling_mode_mean_tokens": true,
+    "pooling_mode_max_tokens": false,
+    "pooling_mode_mean_sqrt_len_tokens": false,
+    "pooling_mode_weightedmean_tokens": false,
+    "pooling_mode_lasttoken": false,
+    "include_prompt": true
+}
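
With only `pooling_mode_mean_tokens` enabled, the sentence vector is the average of the token vectors, with padding positions left out. The sketch below shows that computation on plain Python lists; it illustrates the idea and is not the sentence-transformers `Pooling` code, and the name `mean_pool` is made up for this example.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors whose attention-mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token vectors; the last one is padding and must not affect the mean.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
```

In the real pipeline each token vector has 384 components, matching `word_embedding_dimension`.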

README.md ADDED

	@@ -0,0 +1,122 @@

+---
+base_model: all-MiniLM-L6-v2
+library_name: sentence-transformers
+license: apache-2.0
+pipeline_tag: sentence-similarity
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- ontology
+- on2vec
+- graph-neural-networks
+- base-all-MiniLM-L6-v2
+- general
+- general-ontology
+- fusion-cross_attention
+- gnn-gcn
+- medium-ontology
+---
+# agro_all-MiniLM-L6-v2_cross_attention_gcn_h512_o64_cosine_e1024_early
+This is a sentence-transformers model created with [on2vec](https://github.com/david4096/on2vec), which augments text embeddings with ontological knowledge using Graph Neural Networks.
+## Model Details
+- **Base Text Model**: all-MiniLM-L6-v2
+  - Text Embedding Dimension: 384
+- **Ontology**: agro.owl
+- **Domain**: general
+- **Ontology Concepts**: 4,162
+- **Concept Alignment**: 4,162/4,162 (100.0%)
+- **Fusion Method**: cross_attention
+- **GNN Architecture**: GCN
+- **Structural Embedding Dimension**: 4162
+- **Output Embedding Dimension**: 64
+- **Hidden Dimensions**: 512
+- **Dropout**: 0.0
+- **Training Date**: 2025-09-19
+- **on2vec Version**: 0.1.0
+- **Source Ontology Size**: 7.2 MB
+- **Model Size**: 123.8 MB
+- **Library**: on2vec + sentence-transformers
+## Technical Architecture
+This model uses a multi-stage architecture:
+1. **Text Encoding**: Input text is encoded using the base sentence-transformer model
+2. **Ontological Embedding**: Pre-trained GNN embeddings capture structural relationships
+3. **Fusion Layer**: Cross-attention combines the text and ontological embeddings
+**Embedding Flow:**
+- Text: 384 dimensions → 512 hidden → 64 output
+- Structure: 4162 concepts → GNN → 64 output
+- Fusion: cross_attention → Final embedding
+## How It Works
+This model combines:
+1. **Text Embeddings**: Generated using the base sentence-transformer model
+2. **Ontological Embeddings**: Created by training Graph Neural Networks on OWL ontology structure
+3. **Fusion Layer**: Combines both embedding types using the specified fusion method
+The ontological knowledge helps the model better understand domain-specific relationships and concepts.
+## Usage
+```python
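+from sentence_transformers import SentenceTransformer
+from sentence_transformers.util import cos_sim
+# Load the model by its full Hub repo id (namespace/name, as shown on this repo's page).
+# modules.json references a custom on2vec module, so the on2vec package likely needs to be installed.
+model = SentenceTransformer('ellisdoro/agro-all-MiniLM-L6-v2_cross_attention_gcn_h512_o64_cosine_e1024_early-on2vec-koji-early')
+# Generate embeddings (one row per input sentence)
+sentences = ['Example sentence 1', 'Example sentence 2']
+embeddings = model.encode(sentences)
+# Compute the cosine similarity between the two sentences
+similarity = cos_sim(embeddings[0], embeddings[1])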
+```
+## Training Process
+This model was created using the on2vec pipeline:
+1. **Ontology Processing**: The OWL ontology was converted to a graph structure
+2. **GNN Training**: Graph Neural Networks were trained to learn ontological relationships
+3. **Text Integration**: Base model text embeddings were combined with ontological embeddings
+4. **Fusion Training**: The fusion layer was trained to optimally combine both embedding types
+## Intended Use
+This model is intended for:
+- General domain text processing
+- Tasks requiring understanding of domain-specific relationships
+- Semantic similarity in specialized domains
+- Classification tasks with domain knowledge requirements
+## Limitations
+- Performance may vary on domains different from the training ontology
+- Ontological knowledge is limited to concepts present in the source OWL file
+- May have higher computational requirements than vanilla text models
+## Citation
+If you use this model, please cite the on2vec framework:
+```bibtex
+@software{on2vec,
+  title={on2vec: Ontology Embeddings with Graph Neural Networks},
+  author={David Steinberg},
+  url={https://github.com/david4096/on2vec},
+  year={2024}
+}
+```
+---
+Created with [on2vec](https://github.com/david4096/on2vec) 🧬→🤖

config.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "dtype": "float32",
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 384,
+  "initializer_range": 0.02,
+  "intermediate_size": 1536,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 6,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.56.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
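
The sizes in this config are enough to estimate the encoder's parameter count. The sketch below assumes the standard `BertModel` layout (word, position, and token-type embeddings with a LayerNorm; per layer, four attention projections, two feed-forward projections, and two LayerNorms; plus a pooler) and that weights are stored as 4-byte `float32` values, as the `dtype` field suggests. It is a back-of-the-envelope check, not a reading of the actual checkpoint.

```python
# Values copied from config.json.
HIDDEN = 384
INTERMEDIATE = 1536
LAYERS = 6
HEADS = 12
VOCAB = 30522
MAX_POSITIONS = 512
TYPE_VOCAB = 2


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm
per_layer = (
    4 * linear(HIDDEN, HIDDEN)      # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, INTERMEDIATE)  # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)  # feed-forward down-projection
    + layer_norm
)
pooler = linear(HIDDEN, HIDDEN)

total_params = embeddings + LAYERS * per_layer + pooler
float32_bytes = total_params * 4
head_dim = HIDDEN // HEADS

print(f"{total_params:,} parameters, {float32_bytes:,} bytes, head_dim={head_dim}")
```

Under these assumptions the estimate is 22,713,216 parameters, or 90,852,864 bytes, which is about 11 KB below the 90,864,192 bytes recorded for `model.safetensors` in this commit; a gap of that size would be consistent with file header metadata.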

config_sentence_transformers.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "model_type": "SentenceTransformer",
+  "__version__": {
+    "sentence_transformers": "5.1.0",
+    "transformers": "4.56.1",
+    "pytorch": "2.6.0"
+  },
+  "prompts": {
+    "query": "",
+    "document": ""
+  },
+  "default_prompt_name": null,
+  "similarity_fn_name": "cosine"
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1377e9af0ca0b016a9f2aa584d6fc71ab3ea6804fae21ef9fb1416e2944057ac
+size 90864192

modules.json ADDED

	@@ -0,0 +1,20 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_TokenOntologyFusionModule",
+    "type": "on2vec.sentence_transformer_hub.TokenOntologyFusionModule"
+  },
+  {
+    "idx": 2,
+    "name": "2",
+    "path": "2_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
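
The entries above define the pipeline order: the transformer runs first, the ontology fusion module operates on its token-level output, and pooling comes last. The sketch below reads the same JSON, lists the stages in `idx` order, and collects the top-level package of each `type`. That each of those packages must be importable at load time is an assumption based on the dotted class paths, not something this commit states.

```python
import json

# Copy of modules.json from this commit (without the diff "+" prefixes).
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "",
   "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_TokenOntologyFusionModule",
   "type": "on2vec.sentence_transformer_hub.TokenOntologyFusionModule"},
  {"idx": 2, "name": "2", "path": "2_Pooling",
   "type": "sentence_transformers.models.Pooling"}
]
"""

modules = sorted(json.loads(MODULES_JSON), key=lambda m: m["idx"])

# Class name of each stage, in execution order.
pipeline = [m["type"].rsplit(".", 1)[-1] for m in modules]

# Top-level packages named by the dotted class paths.
required_packages = sorted({m["type"].split(".", 1)[0] for m in modules})

print(" -> ".join(pipeline))
print("packages:", ", ".join(required_packages))
```

Because fusion sits before pooling, it acts on per-token vectors rather than on the final sentence vector, which matches the module's name.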

sentence_bert_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+    "max_seq_length": 256,
+    "do_lower_case": false
+}

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,65 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_length": 128,
+  "model_max_length": 256,
+  "never_split": null,
+  "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}
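
The special-token ids and length settings above determine how a single input sequence is framed before it reaches the encoder. The sketch below is a simplified illustration of BERT-style framing using the ids from `added_tokens_decoder`, right-side truncation to `model_max_length`, and right-side padding; the helper `frame` is hypothetical, and the real tokenizer handles many more cases (sentence pairs, strides, and so on).

```python
# Ids copied from added_tokens_decoder in tokenizer_config.json.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102
MODEL_MAX_LENGTH = 256


def frame(token_ids, pad_to=None):
    """Wrap ids as [CLS] ... [SEP], truncating and padding on the right."""
    # Reserve two positions for [CLS] and [SEP]; drop overflow from the right.
    body = list(token_ids[: MODEL_MAX_LENGTH - 2])
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)
    if pad_to is not None and pad_to > len(ids):
        extra = pad_to - len(ids)
        ids += [PAD_ID] * extra   # padding_side is "right"
        mask += [0] * extra       # padded positions are masked out
    return ids, mask


ids, mask = frame([7, 8], pad_to=6)
```

The attention mask produced here is what lets mean pooling ignore the padded positions.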

vocab.txt ADDED

The diff for this file is too large to render.