---
license: wtfpl
inference: true
datasets:
- Onion_News
- wikitext-2-raw-v1-preprocessed
- wikipedia/wikimedia
language:
- en
metrics:
- f1
- accuracy
- precision
- perplexity
base_model:
- Wonder-Griffin/TraXL
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation-inference
- causal-LM
- question-answering
- conversational
---
# TraXLMistral
Created by: Morgan Griffin & WongrifferousAI (Wonder-Griffin)
# Model Description
TraXLMistral is a custom language model based on the GPT-2 architecture with additional enhancements for various tasks including causal language modeling, sequence classification, and question answering. The model incorporates several advanced techniques such as sparse attention, memory-augmented neural networks (MANN), adaptive computation time (ACT), and latent space clustering, making it suitable for both reasoning and general-purpose text generation.
# Key Features:
- Sparse Attention: Efficient attention mechanism inspired by Mistral, focusing computational resources on important elements in the sequence.
- Memory-Augmented Neural Networks (MANN): Enhances model capacity by adding external memory to better handle long-term dependencies and complex reasoning tasks.
- Adaptive Computation Time (ACT): Dynamically adjusts the number of computation steps based on the complexity of the input.
- Latent Space Clustering: Clusters latent representations for improved interpretability and task-specific adjustments.
- Logical Transformer Layer: Improves the model's reasoning capabilities by integrating logical transformations.
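
The card does not say which sparsity pattern the attention layer uses. As an illustration only, the sketch below builds a causal sliding-window mask, the pattern used by Mistral. The function name `sliding_window_causal_mask` and the sizes are hypothetical and are not taken from the TraXLMistral code.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Position i sees itself and the (window - 1) positions before it,
    and never sees future positions.
    """
    if seq_len < 0 or window < 1:
        raise ValueError("seq_len must be >= 0 and window must be >= 1")
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Illustrative sizes only; '#' marks an allowed (query, key) pair.
    for row in sliding_window_causal_mask(seq_len=6, window=3):
        print("".join("#" if allowed else "." for allowed in row))
```

With a fixed window, each query attends to at most `window` keys, so the number of attention scores grows linearly with sequence length instead of quadratically.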
# Intended Uses & Limitations
# Use Cases:
- Text Generation: Generating coherent and contextually relevant text in a wide range of domains, including conversational agents, story generation, and creative writing.
- Question Answering: Providing accurate and concise answers to natural language questions.
- Sequence Classification: Classification of text into predefined categories such as sentiment analysis, document categorization, or other NLP tasks.
- Conversational AI: Suitable for applications requiring interactive and context-aware conversation.
# Limitations:
- This model may require additional fine-tuning for domain-specific tasks where the input data differs significantly from the training data.
- Due to the use of sparse attention and memory modules, the model may require more resources (GPU memory) compared to simpler architectures.
# Training Procedure
The model was trained using the Wikitext-raw-01 dataset (details needed) and fine-tuned for various tasks such as causal language modeling, question answering, and sequence classification.
# Training Hyperparameters:
- Learning Rate: 5e-05
- Train Batch Size: 8
- Eval Batch Size: 8
- Optimizer: Adam (betas = (0.9, 0.999), epsilon = 1e-08)
- LR Scheduler: Linear
- Training Steps: 100,000
- Seed: 42
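
To make the linear scheduler concrete, the sketch below computes the learning rate at a given step from the values listed above. The card does not mention warmup, so zero warmup steps and decay to zero are assumptions here, and `linear_lr` is a hypothetical helper rather than part of the training code.

```python
BASE_LR = 5e-05        # from the card
TOTAL_STEPS = 100_000  # from the card
WARMUP_STEPS = 0       # assumption: the card does not mention warmup


def linear_lr(
    step: int,
    base_lr: float = BASE_LR,
    total_steps: int = TOTAL_STEPS,
    warmup_steps: int = WARMUP_STEPS,
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 25_000, 50_000, 100_000):
        print(step, linear_lr(step))
```

Under these assumptions the rate starts at 5e-05, is halved by step 50,000, and reaches zero at step 100,000.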
# Training Environment:
- Transformers version: 4.45.0.dev0
- PyTorch version: 2.4.0+cu124
- Datasets version: 2.20.0
- Tokenizers version: 0.19.1
- GPU: The model is trained using GPU acceleration, with checks for CUDA availability and multiple GPUs.
# Model Architecture
## Configuration:
- Model Type: Hybrid Transformer with GPT/Mistral/TransformerXL (Causal LM)
- Vocab Size: 50256
- Hidden Size: 768
- Number of Layers: 4
- Number of Attention Heads: 4
- Feedforward Expansion Factor: 4
- RNN Units: 128
- Max Sequence Length: 256
- Dropout Rate: 0.1
- Sparse Attention: Enabled
- Memory Size: 256
- Max Computation Steps: 5
- Dynamic Routing: Enabled
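
The configuration values can be collected in a small dataclass to show the sizes they imply. This is a sketch: the class name `TraXLMistralConfigSketch` and its field names are hypothetical, and the derived sizes assume the usual transformer conventions (hidden size split evenly across heads, feedforward width equal to the expansion factor times the hidden size).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraXLMistralConfigSketch:
    # Values copied from the configuration list; field names are illustrative.
    vocab_size: int = 50256
    hidden_size: int = 768
    num_layers: int = 4
    num_heads: int = 4
    ff_expansion_factor: int = 4
    rnn_units: int = 128
    max_seq_len: int = 256
    dropout: float = 0.1
    use_sparse_attention: bool = True
    memory_size: int = 256
    max_computation_steps: int = 5
    use_dynamic_routing: bool = True

    @property
    def head_dim(self) -> int:
        # Assumes the hidden size is split evenly across attention heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    @property
    def ffn_inner_size(self) -> int:
        # Assumes the expansion factor multiplies the hidden size.
        return self.hidden_size * self.ff_expansion_factor


if __name__ == "__main__":
    cfg = TraXLMistralConfigSketch()
    print(cfg.head_dim, cfg.ffn_inner_size)
```

Under those assumptions each of the 4 heads works on 192 dimensions and the feedforward layer is 3072 units wide.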
## Special Modules:
- Sparse Attention Layer: Improves efficiency by reducing unnecessary attention computation.
- Adaptive Computation Time (ACT): Adjusts computation time based on input complexity.
- Memory-Augmented Neural Networks (MANN): Provides external memory to help with long-term dependencies.
- Latent Space Clustering: Clusters latent representations for improved task-specific behavior.
- Logical Transformer Layer: Improves reasoning and logic-based tasks.
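
As a rough illustration of how ACT decides when to stop, the sketch below follows the standard halting rule: accumulate per-step halting probabilities until they reach `1 - eps` or the step budget runs out, and give the final step the remainder. The helper `act_weights` and the value of `eps` are assumptions; only the budget of 5 steps comes from the configuration, and the model's own implementation may differ.

```python
def act_weights(halting_probs: list[float], max_steps: int = 5, eps: float = 0.01) -> list[float]:
    """Turn per-step halting probabilities into ACT mixing weights.

    Steps are consumed until the running sum reaches 1 - eps or the step
    budget is hit; the final step receives the remainder so the weights sum to 1.
    """
    if not halting_probs or max_steps < 1:
        raise ValueError("need at least one halting probability and max_steps >= 1")
    weights: list[float] = []
    total = 0.0
    for step, h in enumerate(halting_probs[:max_steps], start=1):
        is_last = step == max_steps or step == len(halting_probs)
        if total + h >= 1.0 - eps or is_last:
            weights.append(1.0 - total)  # remainder goes to the halting step
            break
        weights.append(h)
        total += h
    return weights


if __name__ == "__main__":
    print(act_weights([0.995]))           # a confident input halts after one step
    print(act_weights([0.1, 0.2, 0.9]))   # a harder input uses three steps
    print(len(act_weights([0.1] * 10)))   # never more than max_steps
```

The length of the returned list is the number of computation steps used, and the weights are what the intermediate states would be averaged with to form the final output.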
## Supported Tasks:
- Causal Language Modeling (causal_lm): Generates text sequences based on a given prompt.
- Question Answering (qa): Extracts relevant answers from a context given a question.
- Sequence Classification: Classifies input sequences into one of the predefined labels.
## Evaluation
The model was evaluated on several NLP benchmarks, but detailed results are pending. The primary metrics used for evaluation include accuracy, F1-score, and precision.
Evaluation Metrics:
- Accuracy
- F1-score
- Precision
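
For reference, the listed metrics can be computed as in the sketch below. It assumes a binary classification setting with 1 as the positive class; the helper `binary_metrics` is hypothetical, and the labels in the usage example are toy data, not model outputs or reported results.

```python
def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1}, with 1 as the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy labels for illustration only.
    print(binary_metrics([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
```

F1 is the harmonic mean of precision and recall, which is why recall is computed even though the card does not list it separately.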
# Intended Users
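This model is designed for researchers, developers, and organizations looking to implement advanced NLP models in production. It can be used for building conversational agents, question-answering systems, text generation applications, and more.
# How to Use
## Inference Example
`TraXLMistral` is not a class exported by the upstream `transformers` package, so the example below assumes the model repository ships custom modeling code that can be loaded with `trust_remote_code=True`.

```python
from transformers import AutoModelForCausalLM, BertTokenizerFast

# The card pairs the model with the bert-base-uncased tokenizer.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Assumption: the repository provides custom modeling code for the Auto classes.
model = AutoModelForCausalLM.from_pretrained('Wonder-Griffin/TraXLMistral', trust_remote_code=True)

input_text = "What is the capital of France?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)

# generate() returns token IDs; decode them to text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

# Limitations and Future Work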
- Limited Training Data: Future iterations should focus on expanding the dataset and improving performance across different languages and domains.
- Memory Usage: Due to its complex architecture, this model might require optimizations for resource-constrained environments.
# Acknowledgements
Created by Morgan Griffin and WongrifferousAI (Wonder-Griffin)