---
license: llama3.1
base_model:
- meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- Text Generation
- llama3.1
- text-generation-inference
- Inference Endpoints
- Transformers
- Fusion
language:
- en
---

# Llama-3.1-8B-Fusion-9010

## Overview

`Llama-3.1-8B-Fusion-9010` is a mixed model that blends the weights of two Llama 3.1 8B models: [arcee-ai/Llama-3.1-SuperNova-Lite](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite) and [mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated](https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated). The weights are blended in a 9:1 ratio: 90% from SuperNova-Lite and 10% from the abliterated Meta-Llama-3.1-8B-Instruct model.

**Although this is a simple mix, the model is usable and has not produced gibberish so far.**

This is an experiment. I plan to test the 8:2, 7:3, 6:4, and 5:5 ratios separately to see how much the ratio affects the model.
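
The 9:1 blend described above can be illustrated with a minimal sketch. The exact merging procedure used for this model is not documented here, so this assumes plain linear interpolation applied to each parameter independently. The function name `blend_weights`, the `primary_share` argument, and the toy values are all hypothetical; plain Python lists stand in for the real weight tensors.

```python
from typing import Dict, List

# Parameter name -> flattened values (a stand-in for real weight tensors).
Weights = Dict[str, List[float]]


def blend_weights(primary: Weights, secondary: Weights, primary_share: float) -> Weights:
    """Linearly interpolate two sets of weights, parameter by parameter."""
    if not 0.0 <= primary_share <= 1.0:
        raise ValueError("primary_share must be between 0 and 1")
    if primary.keys() != secondary.keys():
        raise ValueError("both models must expose the same parameter names")

    blended: Weights = {}
    for name, a in primary.items():
        b = secondary[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Assumed rule: blended = share * primary + (1 - share) * secondary
        blended[name] = [
            primary_share * x + (1.0 - primary_share) * y for x, y in zip(a, b)
        ]
    return blended


# Toy stand-ins for the two checkpoints (hypothetical values, not real weights).
supernova_lite = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
abliterated = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}

# 9:1 ratio: 90% SuperNova-Lite, 10% abliterated model.
fusion_9010 = blend_weights(supernova_lite, abliterated, primary_share=0.9)
```

Under the same assumption, the planned 8:2, 7:3, 6:4, and 5:5 experiments would correspond to `primary_share` values of 0.8, 0.7, 0.6, and 0.5. With real checkpoints, the same formula would be applied to each tensor in the two models' state dicts rather than to Python lists.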

## Model Details

- **Base Models:**
  - [arcee-ai/Llama-3.1-SuperNova-Lite](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite) (90%)
  - [mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated](https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated) (10%)
- **Model Size:** 8B parameters
- **Architecture:** Llama 3.1
- **Mixing Ratio:** 9:1 (SuperNova-Lite:Meta-Llama-3.1-8B-Instruct-abliterated)

## Key Features

- **SuperNova-Lite Contributions (90%):** Llama-3.1-SuperNova-Lite is an 8B-parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
- **Meta-Llama-3.1-8B-Instruct-abliterated Contributions (10%):** An uncensored version of Llama 3.1 8B Instruct, created with abliteration.

## Usage

You can use this mixed model in your applications by loading it with Hugging Face's `transformers` library:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import time

mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-9010"

# Use CUDA if it is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
mixed_model = AutoModelForCausalLM.from_pretrained(mixed_model_name, device_map=device, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

# Ensure the tokenizer has pad_token_id set
tokenizer.pad_token_id = tokenizer.eos_token_id

# Input loop
print("Start inputting text for inference (type 'exit' to quit)")
while True:
    prompt = input("Enter your prompt: ")
    if prompt.lower() == "exit":
        print("Exiting inference loop.")
        break

    # Inference phase: generate text using the mixed model
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Prepare input data
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    # Use TextStreamer for streaming output
    streamer = TextStreamer(tokenizer, skip_special_tokens=True)

    # Record the start time
    start_time = time.time()

    # Generate text and stream the output to stdout as it is decoded
    outputs = mixed_model.generate(
        input_ids,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        streamer=streamer  # Enable streaming output
    )

    # Record the end time
    end_time = time.time()

    # Count the generated tokens (output length minus prompt length)
    generated_tokens = outputs[0][input_ids.shape[-1]:].shape[0]

    # Calculate the total time taken
    total_time = end_time - start_time

    # Calculate tokens generated per second
    tokens_per_second = generated_tokens / total_time

    print(f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, generating {tokens_per_second:.2f} tokens per second.")