minor syntax fixing
README.md CHANGED
@@ -19,7 +19,9 @@ tags:
## Model Details

**Base Model (and tokenizer)**: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)

**Context Window/Max Length**: 16384 tokens

**Usage**: Instruction model fine-tuned to generate a title and a summary and to extract keywords from articles/blogs/posts in one shot. Ideal for high-volume backend content processing. I would NOT recommend it for chat.
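For orientation, all three outputs come back together in a single JSON object. A hypothetical sketch of the shape (the exact field names and schema are set by the training prompt below, not by this example):

```python
# Hypothetical digest output; the actual schema comes from the training prompt.
example_digest = {
    "title": "A Short Regenerated Title",
    "summary": "A roughly 128-word summary of the article...",
    "keywords": ["keyword one", "keyword two", "keyword three"],
}
```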
### Input Prompt

I used the following prompt to train it, so if you want the output to be similar, use this prompt.
@@ -66,13 +68,15 @@ For an average of 1536 - 2048 input tokens it produces roughly 200 tokens (high
| Model | Quality and adherence rate |
| ---------------------------- | -------------------------- |
| Merged model or LoRA adapter | High-quality content generation but a lower adherence rate than the lower-precision quantized models; 7-8 out of 2500 inputs produce non-JSON output |
| Q8_0 | Same quality as the merged model; better adherence rate to the response format (1 out of 3000 inputs is non-JSON) |
| Q5_K_M | High quality, recommended. Similar to the Q4 model; no visible difference |
| Q4_K_M | High quality, recommended. Better adherence rate to the response format (1 out of ~4000 inputs is non-JSON) but a shorter summary (~100 words as opposed to 128 words) |
| Q2_K | Straight-up trash. Don't use it |
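Whichever variant you run, a small fraction of outputs will not be valid JSON, so bulk pipelines should guard the parse. A minimal sketch (the helper name and the skip-on-failure policy are my assumptions, not from this card):

```python
import json
from typing import Optional

def parse_digest(resp: str) -> Optional[dict]:
    """Pull the JSON object out of a raw completion; return None if the model drifted."""
    start, end = resp.find('{'), resp.rfind('}')
    if start == -1 or end == -1:
        return None  # no JSON object at all: skip or re-queue this input
    try:
        return json.loads(resp[start:end + 1])
    except json.JSONDecodeError:
        return None  # malformed JSON: treat the same as a non-JSON output
```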
## Training Details

**Dataset**: [soumitsr/article-digests](https://huggingface.co/datasets/soumitsr/article-digests/viewer/default/train?p=255&row=25536). This was generated by feeding real news articles, blogs, Reddit posts and YC Hacker News posts into GPT-4o-mini for responses.

Trained using Kaggle's free T4 GPU and Unsloth. Here is the [Notebook](https://www.kaggle.com/code/soumitsalman/finetuning-llama-3-2-1b). On that note, [Unsloth](https://unsloth.ai/) will change your life. To the creators of Unsloth: You are AWESOME! THANK YOU!
## Sample Code

### Prompt

```python
@@ -110,7 +114,7 @@ resp = tokenizer.decode(outputs[0], skip_special_tokens=True)
response_json = json.loads(resp[resp.find('{'):resp.rfind('}')+1])
```
### Using Llama.CPP (No GPU)

Download one of the GGUF files to a local directory and use that as the model path.
```python
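# A minimal sketch assuming the llama-cpp-python bindings; the GGUF filename
# and the prompt string are placeholder assumptions, not the author's code.
import json

from llama_cpp import Llama

# Load the downloaded GGUF from a local path, sized to the model's context window.
llm = Llama(model_path="./llama-3.2-1b-q4_k_m.gguf", n_ctx=16384)

# Placeholder; in practice, use the training prompt from earlier in the card
# with the article text substituted in.
prompt = "<training prompt with the article text>"

# Low temperature keeps the JSON response format stable.
output = llm(prompt, max_tokens=512, temperature=0.1)
resp = output["choices"][0]["text"]

# Same JSON extraction as in the transformers example above.
response_json = json.loads(resp[resp.find('{'):resp.rfind('}')+1])
```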