soumitsr committed · Commit b03f95b · verified · Parent: 36de3b1

minor syntax tweak

Files changed (1): README.md (+12 -6)

README.md CHANGED: the base_model metadata and the "Finetuned from model" line now point to meta-llama/Llama-3.2-1B-Instruct instead of unsloth/llama-3.2-1b-instruct-bnb-4bit, the Model Details entries are formatted as a list, and a few blank lines were added. The updated content follows.

---
base_model: meta-llama/Llama-3.2-1B-Instruct
language:
- en
license: apache-2.0

(The rest of the metadata block, including tags, is collapsed in this diff view.)

- **Developed by:** soumitsr
- **License:** apache-2.0
- **Finetuned from model:** meta-llama/Llama-3.2-1B-Instruct

## Model Details
- **Base Model (and tokenizer)**: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
- **Context Window/Max Length**: 16384 tokens (see the truncation sketch below)
- **Usage**: Instruction model fine-tuned to generate a title and summary and extract keywords from articles/blogs/posts in one shot. Ideal for high-volume backend processing of content. I would NOT recommend it for chat.
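
The 16384-token window matters mostly when you push long articles through the model in bulk. Here is a minimal sketch (mine, not from this card) that pre-truncates the article text with the stated base tokenizer before the prompt is built; the 14000-token budget is an arbitrary headroom assumption:

```python
# Sketch: keep the article comfortably inside the 16384-token context window.
# The 14000 budget is an assumed headroom value, not a number from this card.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

def truncate_to_fit(article: str, max_input_tokens: int = 14000) -> str:
    ids = tokenizer(article, truncation=True, max_length=max_input_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```
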
### Input Prompt
I used the following prompt to train it, so if you want the output to look the same, use this prompt (only the tail of the template is visible in this excerpt).
```python
INPUT:
{text}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
```
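
The template ends in a Python triple-quoted string with a single `{text}` placeholder, so building a prompt is just a substitution. A small sketch of my own, where `PROMPT_TEMPLATE` stands in for the full template string from the card:

```python
# PROMPT_TEMPLATE stands in for the card's full training prompt (only its tail is
# shown above); it contains a single {text} placeholder for the article body.
PROMPT_TEMPLATE = "...INPUT:\n{text}\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>"

def build_prompt(article: str) -> str:
    # plain replace avoids str.format tripping over any literal braces
    # the full template might contain
    return PROMPT_TEMPLATE.replace("{text}", article)
```
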
 
### Response Format
The output will be a JSON object with no additional text or delimiters around it. (The JSON schema and the example response are collapsed in this diff view.)

The training dataset was designed to force the model to produce a bare JSON structure without any additional text or delimiters, so LangChain's JSON parser will likely die on it, because that parser looks for JSON wrapped in a delimiter.
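
Since the completion is bare JSON, and (per the quantization notes below) a rare completion is not valid JSON at all, a tolerant slice-and-parse step is safer than a delimiter-based parser. A minimal sketch using the same first-`{` / last-`}` trick as the sample code at the bottom of this card:

```python
import json
from typing import Optional

def parse_digest(raw: str) -> Optional[dict]:
    """Best-effort parse of the model's bare-JSON completion."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None  # the rare non-JSON completion (see the quantization table)
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
```
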
 
## Model Paths:
- LoRA adapter (for Llama-3.2-1B-Instruct): https://huggingface.co/soumitsr/llama-v3p2-article-digestor-lora (a plain-peft loading sketch follows this list)
- Merged 16-bit model: https://huggingface.co/soumitsr/llama-v3p2-article-digestor
- GGUFs for llama.cpp: https://huggingface.co/soumitsr/llama-v3p2-article-digestor-gguf
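
If you would rather not use Unsloth, the adapter repo above can also be attached to the stock base model with plain transformers + peft. A rough sketch under that assumption (not the card's own sample code; requires a GPU):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")

# attach the LoRA weights from the adapter repo listed above
model = PeftModel.from_pretrained(base, "soumitsr/llama-v3p2-article-digestor-lora")
model.eval()
```
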
 
### Performance:
For an average input of 1536-2048 tokens it produces roughly 200 output tokens (more with the LoRA adapter, fewer with Q4_K_M).
- T4, LoRA adapter in 4-bit: ~3.8 seconds

(Other timings are collapsed in this diff view.)

| Quantization | Notes |
|---|---|
| Q5_K_M | High quality, recommended. Similar to the Q4 model; no visible difference. |
| Q4_K_M | High quality, recommended. Better adherence to the response format (about 1 in ~4,000 inputs comes back non-JSON) but a shorter summary (~100 words as opposed to ~128 words). |
| Q2_K | Straight up trash. Don't use it. |
 
## Training Details
**Dataset**: [soumitsr/article-digests](https://huggingface.co/datasets/soumitsr/article-digests/viewer/default/train?p=255&row=25536). This was generated by feeding real news articles, blogs, Reddit posts and YC Hacker News posts into GPT-4o-mini for responses.

Trained using Kaggle's free T4 GPU and Unsloth. Here is the [Notebook](https://www.kaggle.com/code/soumitsalman/finetuning-llama-3-2-1b). On that note, [Unsloth](https://unsloth.ai/) will change your life. To the creators of Unsloth: You are AWESOME! THANK YOU!
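
For orientation, an Unsloth LoRA run on this kind of setup generally looks like the skeleton below. This is a sketch, not the notebook's code: the hyperparameters are placeholders, it assumes the dataset rows are rendered into a single "text" column using the prompt template above, and it uses the older trl-style SFTTrainer arguments.

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    max_seq_length=16384,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # placeholder; the real values are in the notebook
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("soumitsr/article-digests", split="train")
# assumption: each row is the prompt template filled with the article plus the
# expected JSON answer, stored in a single "text" column

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=16384,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        fp16=True,
        output_dir="outputs",
    ),
)
trainer.train()
```
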

## Sample Code

### Prompt
```python
# this was the prompt template the model was trained with (see the Input Prompt section above)

input_text = "whatever article, blog, post or novella you want to digest"
```
 
### Using LoRA Adapter (Requires GPU)
```python
from unsloth import FastLanguageModel
# ... (the rest of the sample code is collapsed in this diff view; only the final JSON-parsing line survives)

response_json = json.loads(resp[resp.find('{'):resp.rfind('}')+1])
```
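
Most of the LoRA example above is hidden in this excerpt, so here is a rough sketch of what Unsloth inference with the adapter typically looks like; the generation settings and the prompt-building step are my assumptions, not the card's exact code:

```python
import json
from unsloth import FastLanguageModel

# load the adapter repo from Model Paths on top of the base model, in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="soumitsr/llama-v3p2-article-digestor-lora",
    max_seq_length=16384,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's faster inference path

prompt = PROMPT_TEMPLATE.replace("{text}", input_text)  # template + article, see the Prompt section
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
resp = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

response_json = json.loads(resp[resp.find('{'):resp.rfind('}') + 1])
```
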
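
The GGUF/llama.cpp path from Model Paths is likewise not visible in this excerpt; below is a minimal llama-cpp-python sketch, with the local file name and generation settings as assumptions:

```python
import json
from llama_cpp import Llama

# the .gguf file name is illustrative; check the GGUF repo for the actual Q4_K_M/Q5_K_M files
llm = Llama(model_path="llama-v3p2-article-digestor-Q4_K_M.gguf", n_ctx=16384)

prompt = PROMPT_TEMPLATE.replace("{text}", input_text)  # template + article, see the Prompt section
out = llm.create_completion(prompt=prompt, max_tokens=512, temperature=0.1)
resp = out["choices"][0]["text"]

response_json = json.loads(resp[resp.find('{'):resp.rfind('}') + 1])
```
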
 
## Appendix - Purpose of this model
I wanted a token-efficient and cheap way to get a quality summary, title, and named entities. The initial aim was to parse through volumes of click-bait garbage articles and blogs. For simpler tasks that involve processing a given text, ChatGPT is incredibly good at adhering to the given instruction and response format. Llama-3.2-1B is a powerful base model, but it is inconsistent about sticking to the response format, and when it does stick to it, it produces super generic content, e.g. a title that doesn't mean anything and a summary that is one line of BS. So I wanted to create something that would give me ChatGPT-level quality and consistency for basic tasks like summary, title and tag generation. Et voilà.
 