---
base_model: meta-llama/Llama-3.2-1B-Instruct
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- gguf
---

# Uploaded model

- **Developed by:** soumitsr
- **License:** apache-2.0
- **Finetuned from model:** meta-llama/Llama-3.2-1B-Instruct

## Model Details
- **Base Model (and tokenizer)**: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
- **Context Window/Max Length**: 16384 tokens
- **Usage**: Instruction model fine-tuned to generate a title and summary and extract keywords from articles/blogs/posts in one shot. Ideal for backend bulk processing of content. I would NOT recommend it for chat.

### Input Prompt
I used the following prompt to train it, so if you want similar output, use the same prompt.
```python
prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
response_format:json_object
<|eot_id|><|start_header_id|>user<|end_header_id|>
TASK: create title, summary and tags (e.g. company, organization, person, catastrophic event, product, process, security vulnerability, stock ticker symbol, geographic location). title should be 10 - 20 words, summary should be 100 - 200 words and tags (entities) should a string of comma separated phrases.
INPUT:
{text}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
```

### Response Format
The output will be a JSON object without any additional text or delimiters:
```json
{
	"title": "some 10 - 20 words title",
	"summary": "some 100 - 180 word summary",
	"tags": "comma separated list of named entities"
}
```

For example:
```json
{ 
	"title": "The Future of Space Missions: How 3D Printing is Revolutionizing Astronaut Logistics", 
	"summary": "The 3D printing market is poised for significant growth, with an estimated value of US$95 billion by 2032, according to BCG. While it may never replace traditional manufacturing on Earth, its potential in space is transformative. Astronauts aboard the International Space Station (ISS) manage complex logistics, relying on substantial deliveries of spare parts—over 7,000 pounds annually—with additional supplies stored on Earth and the ISS itself. However, this model is unsustainable for future manned missions to Mars and the Moon, where astronauts will face isolation and the need for adaptability. 3D printing offers a viable solution, enabling the in-situ production of parts and tools as needed, thus facilitating a new era of space exploration where self-sufficiency becomes essential for survival and success.", 
	"tags": "3D printing, space exploration, International Space Station, manufacturing, Mars, Moon, logistics, astronauts, spare parts, BCG" 
}
```

The training dataset was designed to force the model to produce a bare JSON structure without any additional text or delimiters. As a result, LangChain's JSON parser will likely choke, because it looks for the JSON wrapped in a delimiter.
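
If you are processing at volume, a tiny parsing helper keeps the occasional non-JSON completion from crashing a pipeline. Below is a minimal sketch (the `safe_parse` name is mine, not part of any library): it slices out the first `{ ... }` span and returns `None` on the rare malformed output so the caller can retry or skip that item.
```python
import json

def safe_parse(raw: str) -> dict | None:
    """Extract and parse the JSON object from a raw model completion."""
    start, end = raw.find('{'), raw.rfind('}')
    if start == -1 or end == -1:
        return None  # no JSON object found: retry or skip this item
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None  # malformed JSON (rare; see adherence rates below)
```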

## Model Paths:
- LoRA adapter (for Llama-3.2-1B-Instruct): https://huggingface.co/soumitsr/llama-v3p2-article-digestor-lora
- Merged 16-bit model: https://huggingface.co/soumitsr/llama-v3p2-article-digestor
- GGUFs for llama.cpp: https://huggingface.co/soumitsr/llama-v3p2-article-digestor-gguf
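
For the llama.cpp route, the GGUF file can also be pulled programmatically with `huggingface_hub`. A minimal sketch; the `filename` below is a guess, so check the GGUF repo's file list for the actual names:
```python
from huggingface_hub import hf_hub_download

# filename is an assumption; browse the repo's "Files" tab for the exact GGUF names
model_file_path = hf_hub_download(
    repo_id="soumitsr/llama-v3p2-article-digestor-gguf",
    filename="llama-v3p2-article-digestor-Q4_K_M.gguf",
)
```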

### Performance:
For an average of 1,536 - 2,048 input tokens it produces roughly 200 output tokens (more with the LoRA adapter, fewer with Q4_K_M):
- T4 using LoRA adapter in 4-bit: ~3.8 seconds
- T4 using merged 16-bit model: ~5.2 seconds
- A100 using LoRA adapter: <0.4 seconds
- CPU (4 cores) using Q4_K_M: 38-40 seconds

| Model                        | Quality and adherence rate                                                                                                                                        |
| ---------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Merged model or LoRA adapter | High quality content generation but lower adherence rate compared to the lower-precision quantized models. 7-8 out of 2,500 inputs will produce non-JSON output    |
| Q8_0                         | Same quality as the merged model. Better adherence rate to response format (1 out of 3000 inputs are non-JSON)                                                    |
| Q5_K_M                       | High quality, recommended. Similar to Q4 model. No visible difference.                                                                                            |
| Q4_K_M                       | High quality, recommended. Better adherence rate to response format (1 out of ~4000 inputs are non-JSON) but smaller summary (~100 words as opposed to 128 words) |
| Q2_K                         | Straight up trash. Don't use it.                                                                                                                                  |

## Training Details
**Dataset**: [soumitsr/article-digests](https://huggingface.co/datasets/soumitsr/article-digests/viewer/default/train?p=255&row=25536). This was generated by feeding real news articles, blogs, Reddit posts and YC Hacker News posts into ChatGPT-4o-mini and collecting its responses.

Trained using Kaggle's free T4 GPU and Unsloth. Here is the [Notebook](https://www.kaggle.com/code/soumitsalman/finetuning-llama-3-2-1b). On that note, [Unsloth](https://unsloth.ai/) will change your life. To the creators of Unsloth: you are AWESOME! THANK YOU!
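
For reference, the Unsloth setup looks roughly like the sketch below. The hyperparameters are illustrative placeholders, not the exact values from the notebook, so check the notebook for the real configuration.
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# load the base model in 4-bit and attach LoRA adapters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    max_seq_length=16384,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # illustrative rank, not the notebook's exact value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)

# each training row is the prompt template filled with article text plus the expected JSON
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # prepared from soumitsr/article-digests
    dataset_text_field="text",
    max_seq_length=16384,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        output_dir="outputs",
    ),
)
trainer.train()
```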

## Sample Code

### Prompt
```python
# this was the prompt template the model was trained with
prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
response_format:json_object
<|eot_id|><|start_header_id|>user<|end_header_id|>
TASK: create title, summary and tags (e.g. company, organization, person, catastrophic event, product, process, security vulnerability, stock ticker symbol, geographic location). title should be 10 - 20 words, summary should be 100 - 200 words and tags (entities) should a string of comma separated phrases.
INPUT:
{text}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>""" 

input_text = "whatever article, blog, post or novella you want to digest"
```

### Using Lora Adapter (Requires GPU)
```python
import json

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "soumitsr/llama-v3p2-article-digestor-lora",
    max_seq_length = 16384
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer(prompt_template.format(text=input_text), return_tensors="pt").to(model.device)
# feel free to play with max_new_tokens and temperature
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.1
)
resp = tokenizer.decode(outputs[0], skip_special_tokens=True)

# the decoded text includes the prompt, so slice out the JSON object
response_json = json.loads(resp[resp.find('{'):resp.rfind('}')+1])
```

### Using Llama.CPP (No GPU)

Download one of the GGUFs to a local directory and use that as the model path:
```python
import json
import os

from llama_cpp import Llama

model = Llama(model_path=model_file_path, n_ctx=16384, n_threads=os.cpu_count(), embedding=False, verbose=False)

resp = model.create_completion(
    prompt=prompt_template.format(text=input_text),
    max_tokens=384,
    frequency_penalty=0.3, # feel free to play with these numbers
    temperature=0.2
)['choices'][0]['text']

response_json = json.loads(resp[resp.find('{'):resp.rfind('}')+1])
```

## Appendix - Purpose of this model
I wanted a token-efficient and cheap way to get a quality summary, title and named entities. The initial aim was to parse through volumes of click-bait garbage articles and blogs. For simpler tasks that amount to processing a given text, ChatGPT is incredibly good at adhering to the given instruction and response format. Llama-3.2-1B is a powerful base model, but it is inconsistent at sticking to the response format, and even when it does, it produces super generic content, e.g. a title that doesn't mean anything and a summary that is one-line BS. So I wanted to create something that gives me ChatGPT-level quality and consistency for basic tasks like summary, title and tag generation. Et voila.