---
language: en
tags:
- exbert
license: mit
datasets:
- botintel-community/AVAINT
base_model:
- botintel-community/abu-ai-001
pipeline_tag: text-generation
---
# ABU-AI-001

You can test the model's full generation capabilities here: ABU-AI-001 Model Demo

Pretrained model on the English language using a causal language modeling (CLM) objective. This model was created by BotIntel X and first released at this page.
## Model description
ABU-AI-001 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on raw texts only, with no human labeling involved, using an automatic process to generate inputs and labels from those texts. More precisely, it was trained to predict the next word in sentences.
Inputs are sequences of continuous text, and the targets are the same sequences shifted one token (word or piece of word) to the right. The model uses an internal masking mechanism so that the prediction for token `i` only uses the inputs from tokens `1` to `i`.
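A minimal sketch of how inputs and labels relate under this objective (the tokens below are illustrative, not drawn from the model's tokenizer):

```python
# Illustrative only: next-token prediction pairs each position with the following token.
tokens = ["The", "cat", "sat", "on", "the", "mat"]

inputs = tokens[:-1]   # ["The", "cat", "sat", "on", "the"]
labels = tokens[1:]    # ["cat", "sat", "on", "the", "mat"]

# At position i the model may only attend to tokens 1..i (causal mask),
# so each prediction depends solely on the preceding context.
for context_end, target in enumerate(labels, start=1):
    print(inputs[:context_end], "->", target)
```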
This model is well-suited for generating coherent and contextually relevant texts based on a given prompt.
ABU-AI-001 is a lightweight 137M-parameter model designed for fast inference and versatile usage.
Related Models: Coming soon.
## Intended uses & limitations
You can use the raw model for text generation or fine-tune it for specific downstream tasks. Visit the model hub to explore fine-tuned versions of ABU-AI-001 for different tasks.
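As a rough starting point, fine-tuning can follow the standard `transformers` Trainer workflow for causal language modeling; the dataset, hyperparameters, and output path below are illustrative assumptions, not values from this model's actual training setup.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "botintel-community/abu-ai-001"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-style tokenizers often lack a pad token; reuse EOS for padding (assumption).
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tiny illustrative dataset; replace with your own corpus.
raw = Dataset.from_dict({"text": ["Example sentence one.", "Example sentence two."]})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# mlm=False selects the causal (next-token) objective used for pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="abu-ai-001-finetuned",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```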
### How to use
You can use this model directly with a pipeline for text generation. Since generation relies on randomness, setting a seed ensures reproducibility:
```python
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='botintel-community/abu-ai-001')
>>> set_seed(42)
>>> generator("Hello, I'm ABU-AI-001,", max_length=30, num_return_sequences=3)

[{'generated_text': "Hello, I'm ABU-AI-001, a model designed to understand and generate coherent responses."},
 {'generated_text': "Hello, I'm ABU-AI-001, your intelligent assistant for creative tasks."},
 {'generated_text': "Hello, I'm ABU-AI-001, here to help you generate text and explore ideas."}]
```
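Generation behavior can be tuned through the standard `generate()` keyword arguments that the pipeline forwards to the model; the specific values below are illustrative assumptions.

```python
>>> # Sampling controls such as temperature and top_k are forwarded to generate().
>>> generator("Hello, I'm ABU-AI-001,", max_length=30, do_sample=True, top_k=50, temperature=0.8)
```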
Here is how to use this model to get features of a given text in PyTorch:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('botintel-community/abu-ai-001')
model = AutoModel.from_pretrained('botintel-community/abu-ai-001')

# Tokenize the text and run it through the model to obtain hidden-state features.
text = "Replace me with any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained('botintel-community/abu-ai-001')
model = TFAutoModel.from_pretrained('botintel-community/abu-ai-001')

# Tokenize the text and run it through the TensorFlow model to obtain hidden-state features.
text = "Replace me with any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
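In both cases the extracted features live on the returned output object; assuming a standard transformer encoder output, the per-token hidden states can be read as follows:

```python
# Hidden states for each input token: (batch_size, sequence_length, hidden_size)
features = output.last_hidden_state
print(features.shape)
```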
### Limitations and bias
This model was trained on unfiltered internet data, which may include biased or non-neutral content. Use cases requiring unbiased, factual outputs should be handled with care. Always verify critical information generated by the model.
## Training data
The training corpus consists of diverse English texts gathered from publicly available datasets. The dataset focuses on high-quality content while striving for a balance across different domains.
## Training procedure
### Preprocessing
The texts were tokenized using a byte-level version of Byte Pair Encoding (BPE) with a vocabulary size of 50,257 tokens.
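As a quick sanity check (assuming the tokenizer hosted with the model reflects this configuration), the vocabulary size can be inspected directly:

```python
from transformers import AutoTokenizer

# Load the byte-level BPE tokenizer and report its vocabulary size (expected: 50,257).
tokenizer = AutoTokenizer.from_pretrained('botintel-community/abu-ai-001')
print(len(tokenizer))
```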