soumitsr committed
Commit 56eb3b8 · verified · 1 Parent(s): e354f1d

Updating with usage instructions and details

Files changed (1): README.md +111 -2
README.md CHANGED
@@ -17,6 +17,115 @@ tags:
  - **License:** apache-2.0
  - **Finetuned from model:** unsloth/llama-3.2-1b-instruct-bnb-4bit

- This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
+ ## Model Details
+ **Base Model (and tokenizer)**: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
+ **Context Window/Max Length**: 16384 tokens
+ **Usage**: An instruction model fine-tuned to generate a title and a summary and to extract keywords (named entities) from articles/blogs/posts in one shot. Ideal for high-volume backend processing of content. I would NOT recommend it for chat.
+ ### Input Prompt
+ I trained it with the following prompt, so if you want similar output, use this prompt:
+ ```python
+ prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
+ response_format:json_object
+ <|eot_id|><|start_header_id|>user<|end_header_id|>
+ TASK: create title, summary and tags (e.g. company, organization, person, catastrophic event, product, process, security vulnerability, stock ticker symbol, geographic location). title should be 10 - 20 words, summary should be 100 - 200 words and tags (entities) should a string of comma separated phrases.
+ INPUT:
+ {text}
+ <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
+ ```
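+
+ At inference time the only templating needed is dropping the raw article text into `{text}`. A minimal example (`article_text` is just a placeholder name):
+ ```python
+ article_text = "... the article/blog/post you want to digest ..."
+ prompt = prompt_template.format(text=article_text)
+ ```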
+ ### Response Format
+ The output will be a bare JSON object, with no additional text or delimiters:
+ ```json
+ {
+     "title": "some 10 - 20 word title",
+     "summary": "some 100 - 180 word summary",
+     "tags": "comma-separated list of named entities"
+ }
+ ```
+
+ For example:
+ ```json
+ {
+     "title": "The Future of Space Missions: How 3D Printing is Revolutionizing Astronaut Logistics",
+     "summary": "The 3D printing market is poised for significant growth, with an estimated value of US$95 billion by 2032, according to BCG. While it may never replace traditional manufacturing on Earth, its potential in space is transformative. Astronauts aboard the International Space Station (ISS) manage complex logistics, relying on substantial deliveries of spare parts—over 7,000 pounds annually—with additional supplies stored on Earth and the ISS itself. However, this model is unsustainable for future manned missions to Mars and the Moon, where astronauts will face isolation and the need for adaptability. 3D printing offers a viable solution, enabling the in-situ production of parts and tools as needed, thus facilitating a new era of space exploration where self-sufficiency becomes essential for survival and success.",
+     "tags": "3D printing, space exploration, International Space Station, manufacturing, Mars, Moon, logistics, astronauts, spare parts, BCG"
+ }
+ ```
+
+ The training dataset was designed to force the model to produce a bare JSON structure without any additional text or delimiters. LangChain's JSON parser will therefore likely fail, because it expects the JSON to be wrapped in a delimiter.
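+
+ Parsing the output yourself is straightforward: slice from the first `{` to the last `}` and load it. A minimal sketch (`raw_output` is a placeholder for whatever text the model returned):
+ ```python
+ import json
+
+ def parse_digest(raw_output: str) -> dict:
+     # the model emits a bare JSON object, so everything between the
+     # first '{' and the last '}' is the payload
+     start = raw_output.find('{')
+     end = raw_output.rfind('}') + 1
+     if start == -1 or end == 0:
+         raise ValueError("no JSON object found in model output")
+     return json.loads(raw_output[start:end])
+ ```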
+ ## Model Paths
+ - LoRA adapter (for Llama-3.2-1B-Instruct): https://huggingface.co/soumitsr/llama-v3p2-article-digestor-lora
+ - Merged 16-bit model: https://huggingface.co/soumitsr/llama-v3p2-article-digestor
+ - GGUFs for llama.cpp: https://huggingface.co/soumitsr/llama-v3p2-article-digestor-gguf
+ ### Performance
+ For 1536 - 2048 input tokens, the model produces roughly 200 output tokens (more with the LoRA adapter, fewer with Q4_K_M). Approximate latency per article:
+ - T4, LoRA adapter in 4-bit: ~3.8 seconds
+ - T4, merged 16-bit model: ~5.2 seconds
+ - A100, LoRA adapter: <0.4 seconds
+ - CPU (4 cores), Q4_K_M: 38 - 40 seconds
+
+ | Model | Quality and adherence rate |
+ | --- | --- |
+ | Merged model or LoRA adapter | High-quality generation, but lower adherence to the response format than the lower-precision quantized models: roughly 7 - 8 out of 2,500 inputs produce non-JSON output. |
+ | **Q8_0** | Same quality as the merged model, with better adherence (about 1 in 3,000 inputs is non-JSON). |
+ | **Q5_K_M** | High quality, recommended. Similar to the Q4 model; no visible difference. |
+ | Q4_K_M | High quality, recommended. Best adherence (about 1 in ~4,000 inputs is non-JSON), but shorter summaries (~100 words instead of ~128). |
+ | Q2_K | Straight-up trash. Don't use it. |
+ ## Training Details
+ **Dataset**: [soumitsr/article-digests](https://huggingface.co/datasets/soumitsr/article-digests/viewer/default/train?p=255&row=25536). Generated by feeding real news articles, blogs, Reddit posts and YC Hacker News posts into GPT-4o-mini for responses.
+ Trained using Kaggle's free T4 GPU and Unsloth. Here is the [notebook](https://www.kaggle.com/code/soumitsalman/finetuning-llama-3-2-1b). On that note, [Unsloth](https://unsloth.ai/) will change your life. To the creators of Unsloth: you are AWESOME! THANK YOU!
+ ## Sample Code
+ ### Prompt
+ ```python
+ # this is the prompt template the model was trained with
+ prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
+ response_format:json_object
+ <|eot_id|><|start_header_id|>user<|end_header_id|>
+ TASK: create title, summary and tags (e.g. company, organization, person, catastrophic event, product, process, security vulnerability, stock ticker symbol, geographic location). title should be 10 - 20 words, summary should be 100 - 200 words and tags (entities) should a string of comma separated phrases.
+ INPUT:
+ {text}
+ <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
+
+ input_text = "whatever article, blog, post or novella you want to digest"
+ ```
+ ### Using the LoRA Adapter (Requires GPU)
+ ```python
+ import json
+
+ from unsloth import FastLanguageModel
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name = "soumitsr/llama-v3p2-article-digestor-lora",
+     max_seq_length = 16384
+ )
+ FastLanguageModel.for_inference(model) # enable native 2x faster inference
+
+ # move the tokenized prompt onto the same device as the model
+ inputs = tokenizer(prompt_template.format(text=input_text), return_tensors="pt").to(model.device)
+ # feel free to play with max_new_tokens and temperature
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=512,
+     temperature=0.1
+ )
+ resp = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ # the model emits a bare JSON object; slice from the first '{' to the last '}'
+ response_json = json.loads(resp[resp.find('{'):resp.rfind('}')+1])
+ ```
+
+ ### Using llama.cpp (No GPU)
+
+ Download one of the GGUFs to a local directory and use that as the model path:
+ ```python
+ import json
+ import os
+
+ from llama_cpp import Llama
+
+ model_file_path = "path/to/downloaded-model.gguf"  # wherever you saved the GGUF
+ model = Llama(model_path=model_file_path, n_ctx=16384, n_threads=os.cpu_count(), embedding=False, verbose=False)
+
+ resp = model.create_completion(
+     prompt=prompt_template.format(text=input_text),
+     max_tokens=384,
+     frequency_penalty=0.3, # feel free to play with these numbers
+     temperature=0.2
+ )['choices'][0]['text']
+
+ response_json = json.loads(resp[resp.find('{'):resp.rfind('}')+1])
+ ```
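+
+ If you'd rather fetch a GGUF programmatically, `huggingface_hub` can pull it down for you. A minimal sketch (the `filename` below is illustrative; check the GGUF repo's file list for the exact name):
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # downloads into the local Hugging Face cache and returns the file path
+ model_file_path = hf_hub_download(
+     repo_id="soumitsr/llama-v3p2-article-digestor-gguf",
+     filename="llama-v3p2-article-digestor.Q4_K_M.gguf",  # illustrative filename
+ )
+ ```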
+ ## Appendix - Purpose of this model
+ I wanted a token-efficient and cheap way to get a quality title, summary and named entities. The initial aim was to churn through volumes of click-bait garbage articles and blogs. For simple tasks that process a given text, ChatGPT is incredibly good at adhering to the given instructions and response format. Llama-3.2-1B is a powerful base model, but it is inconsistent about sticking to the response format, and even when it does, it produces super-generic content: a title that doesn't mean anything and a summary that is one-line BS. So I wanted to create something that gives me ChatGPT-level quality and consistency for basic tasks like summary, title and tag generation. Et voilà.