base_model: meta-llama/Llama-3.2-1B-Instruct
language:
  - en
license: apache-2.0
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - llama
  - gguf

Uploaded model

  • Developed by: soumitsr
  • License: apache-2.0
  • Finetuned from model: meta-llama/Llama-3.2-1B-Instruct

Model Details

  • Base Model (and tokenizer): meta-llama/Llama-3.2-1B-Instruct
  • Context Window/Max Length: 16384 tokens
  • Usage: Instruction model fine-tuned to generate a title, a summary and keywords (tags) for an article, blog or post in one shot. Best suited to high-volume backend processing of content. I would NOT recommend it for chat.

Input Prompt

I trained the model with the following prompt, so use the same prompt if you want similar output.

prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
response_format:json_object
<|eot_id|><|start_header_id|>user<|end_header_id|>
TASK: create title, summary and tags (e.g. company, organization, person, catastrophic event, product, process, security vulnerability, stock ticker symbol, geographic location). title should be 10 - 20 words, summary should be 100 - 200 words and tags (entities) should a string of comma separated phrases.
INPUT:
{text}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
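
Because the prompt, the article and the generated JSON all have to fit in the 16384-token context window, long inputs may need trimming before they are placed in `{text}`. The following is a minimal sketch of one way to do that, not part of the original training setup. The function name `fit_to_context`, the 512-token output reserve and the 128-token allowance for the template are all assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence

CONTEXT_WINDOW = 16384       # context window stated in this model card
RESERVED_FOR_OUTPUT = 512    # assumption: room kept free for the generated JSON
TEMPLATE_OVERHEAD = 128      # assumption: rough token cost of the prompt template


def fit_to_context(
    text: str,
    encode: Callable[[str], List],
    decode: Callable[[Sequence], str],
) -> str:
    """Trim `text` so that template + text + output stay within the window."""
    budget = CONTEXT_WINDOW - RESERVED_FOR_OUTPUT - TEMPLATE_OVERHEAD
    token_ids = encode(text)
    if len(token_ids) <= budget:
        return text  # already short enough, leave untouched
    # keep the beginning of the article and drop the tail
    return decode(token_ids[:budget])
```

With a Hugging Face tokenizer this could be called as `fit_to_context(input_text, lambda t: tokenizer.encode(t, add_special_tokens=False), tokenizer.decode)`. Keeping the head of the article is a design choice made here for simplicity; other truncation strategies may suit your content better.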

Response Format

The output is a JSON object without any additional text or delimiters:

{
    "title": "some 10 - 20 words title",
    "summary": "some 100 - 180 word summary",
    "tags": "comma separated list of named entities"
}
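
Once the output has been parsed into a dictionary, it can be useful to check it against the shape described above. This is a minimal sketch under assumptions: the helper names `digest_issues` and `tag_list` are hypothetical, and the word ranges are taken from the training prompt (10 - 20 words for the title, 100 - 200 for the summary). Treat the ranges as soft targets rather than guarantees.

```python
from typing import Dict, List


def digest_issues(digest: Dict) -> List[str]:
    """Return a list of human-readable problems; empty means the digest looks fine."""
    issues = []
    for key in ("title", "summary", "tags"):
        if not isinstance(digest.get(key), str):
            issues.append(f"{key}: missing or not a string")
    if issues:
        return issues  # no point counting words if fields are missing

    title_words = len(digest["title"].split())
    summary_words = len(digest["summary"].split())
    if not 10 <= title_words <= 20:
        issues.append(f"title: {title_words} words, expected 10 - 20")
    if not 100 <= summary_words <= 200:
        issues.append(f"summary: {summary_words} words, expected 100 - 200")
    return issues


def tag_list(digest: Dict) -> List[str]:
    """Split the comma separated tags string into a clean list."""
    return [tag.strip() for tag in digest["tags"].split(",") if tag.strip()]
```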

For example:

{ 
    "title": "The Future of Space Missions: How 3D Printing is Revolutionizing Astronaut Logistics", 
    "summary": "The 3D printing market is poised for significant growth, with an estimated value of US$95 billion by 2032, according to BCG. While it may never replace traditional manufacturing on Earth, its potential in space is transformative. Astronauts aboard the International Space Station (ISS) manage complex logistics, relying on substantial deliveries of spare parts—over 7,000 pounds annually—with additional supplies stored on Earth and the ISS itself. However, this model is unsustainable for future manned missions to Mars and the Moon, where astronauts will face isolation and the need for adaptability. 3D printing offers a viable solution, enabling the in-situ production of parts and tools as needed, thus facilitating a new era of space exploration where self-sufficiency becomes essential for survival and success.", 
    "tags": "3D printing, space exploration, International Space Station, manufacturing, Mars, Moon, logistics, astronauts, spare parts, BCG" 
}

The training dataset was designed to force the model to produce a bare JSON structure without any additional text or delimiters. As a result, the LangChain JSON parser will likely fail, because it looks for JSON wrapped in a delimiter.
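
If you would rather not depend on a framework parser, the standard library is enough. The sketch below is one possible approach, not the author's pipeline: it tries to parse the whole response first and only then falls back to the outermost `{...}` span. The function name `extract_json` is hypothetical.

```python
import json
from typing import Optional


def extract_json(resp: str) -> Optional[dict]:
    """Parse a bare JSON object, tolerating stray text around it."""
    try:
        parsed = json.loads(resp)
    except json.JSONDecodeError:
        # fall back to the outermost {...} span, if there is one
        start, end = resp.find("{"), resp.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            parsed = json.loads(resp[start:end + 1])
        except json.JSONDecodeError:
            return None
    # the model is expected to return an object, not a list or a scalar
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` instead of raising makes it easy to count or re-queue the rare non-JSON outputs in a batch job.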

Model Paths:

Performance:

For an average input of 1536 - 2048 tokens it produces roughly 200 output tokens (more with the LoRA adapter, fewer with Q4_K_M):

  • T4 using LoRA adapter in 4-bit: ~3.8 seconds
  • T4 using merged 16-bit model: ~5.2 seconds
  • A100 using LoRA adapter: <0.4 seconds
  • CPU (4 cores) using Q4_K_M: 38-40 seconds
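
For volume processing, the timings above translate into a rough wall-clock estimate for a batch. This is a back-of-the-envelope sketch that assumes the timings are per input and that articles are processed one at a time with no batching or parallelism. The dictionary keys and the function name `batch_hours` are hypothetical; the A100 value is an upper bound (the figure above is "<0.4") and the CPU value is the midpoint of 38-40.

```python
# seconds per article, taken from the figures listed above
SECONDS_PER_ARTICLE = {
    "T4, LoRA adapter (4-bit)": 3.8,
    "T4, merged 16-bit": 5.2,
    "A100, LoRA adapter": 0.4,       # upper bound: the card says "<0.4"
    "CPU (4 cores), Q4_K_M": 39.0,   # midpoint of the 38-40 second range
}


def batch_hours(n_articles: int, seconds_per_article: float) -> float:
    """Estimated hours to process n_articles sequentially."""
    return n_articles * seconds_per_article / 3600


for setup, seconds in SECONDS_PER_ARTICLE.items():
    print(f"{setup}: {batch_hours(10_000, seconds):.1f} h for 10,000 articles")
```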

| Model | Quality and adherence rate |
|---|---|
| Merged model or LoRA adapter | High quality content generation but lower adherence rate compared to the lower precision quantized models. 7-8 out of 2500 inputs will produce non-JSON output. |
| Q8_0 | Same quality as the merged model. Better adherence rate to response format (1 out of 3000 inputs is non-JSON). |
| Q5_K_M | High quality, recommended. Similar to Q4 model. No visible difference. |
| Q4_K_M | High quality, recommended. Better adherence rate to response format (1 out of 4000 inputs is non-JSON) but smaller summary (100 words as opposed to 128 words). |
| Q2_K | Straight up trash. Don't use it. |
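
To put the adherence rates in perspective for a large batch, the expected number of non-JSON outputs is simply the batch size times the failure rate. The snippet below is a small illustration using the rates quoted above. The function name `expected_failures` is hypothetical, and 7.5 is used as the midpoint of "7-8 out of 2500".

```python
def expected_failures(n_inputs: int, failures: float, per: int) -> float:
    """Expected number of non-JSON outputs for a batch of n_inputs."""
    return n_inputs * failures / per


batch = 100_000
print("Merged / LoRA:", expected_failures(batch, 7.5, 2500))  # midpoint of 7-8
print("Q8_0:", expected_failures(batch, 1, 3000))
print("Q4_K_M:", expected_failures(batch, 1, 4000))
```

Even at the best quoted rate a large batch will contain a handful of unparseable responses, so a batch job should handle that case rather than assume every output parses.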

Training Details

Dataset: soumitsr/article-digests. It was generated from real news articles, blogs, Reddit posts and YC Hacker News posts, which were fed into GPT-4o-mini to produce the responses.

Trained using Kaggle's free T4 GPU and Unsloth. Here is the Notebook. On that note, Unsloth will change your life. To the creators of Unsloth: You are AWESOME! THANK YOU!

Sample Code

Prompt

# this was the prompt template the model was trained with
prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
response_format:json_object
<|eot_id|><|start_header_id|>user<|end_header_id|>
TASK: create title, summary and tags (e.g. company, organization, person, catastrophic event, product, process, security vulnerability, stock ticker symbol, geographic location). title should be 10 - 20 words, summary should be 100 - 200 words and tags (entities) should a string of comma separated phrases.
INPUT:
{text}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>""" 

input_text = "whatever article, blog, post or novella you want to digest"

Using LoRA Adapter (Requires GPU)

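import json

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "soumitsr/llama-v3p2-article-digestor-lora",
    max_seq_length = 16384
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# tokenize the filled-in prompt and move it to the GPU alongside the model
inputs = tokenizer(prompt_template.format(text=input_text), return_tensors="pt").to("cuda")
# feel free to play with the max_new_tokens and temperature
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True, # temperature only takes effect when sampling is enabled
    temperature=0.1
)
# decode only the newly generated tokens, not the echoed prompt
prompt_length = inputs["input_ids"].shape[1]
resp = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

# keep only the outermost {...} in case there is stray text around the JSON
response_json = json.loads(resp[resp.find('{'):resp.rfind('}')+1])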

Using Llama.CPP (No GPU)

Download one of the GGUF files to a local directory and use its path as the model path.

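import json
import os

from llama_cpp import Llama

# placeholder: replace with the path of the GGUF file you downloaded
model_file_path = "path/to/model.gguf"

model = Llama(model_path=model_file_path, n_ctx=16384, n_threads=os.cpu_count(), embedding=False, verbose=False)

resp = model.create_completion(
    prompt=prompt_template.format(text=input_text),
    max_tokens=384,
    frequency_penalty=0.3, # feel free to play with these numbers
    temperature=0.2
)['choices'][0]['text']

# keep only the outermost {...} in case there is stray text around the JSON
response_json = json.loads(resp[resp.find('{'):resp.rfind('}')+1])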
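
Since a small fraction of outputs will not be valid JSON, a batch job may want to retry instead of crashing. The sketch below is one possible wrapper, not part of the original sample code. `digest_with_retry` and its `generate` argument are hypothetical: `generate` stands for any function that takes a prompt string and returns the model's text, for example a thin wrapper around `create_completion`.

```python
import json
from typing import Callable, Optional


def digest_with_retry(
    generate: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call `generate` until its output parses as JSON, or give up."""
    for _ in range(max_attempts):
        resp = generate(prompt)
        try:
            # same slicing as the sample code: keep the outermost {...}
            return json.loads(resp[resp.find("{"):resp.rfind("}") + 1])
        except json.JSONDecodeError:
            continue  # non-JSON output, try again
    return None  # caller decides what to do with persistent failures
```

Retrying only helps when generation is not fully deterministic; with a very low temperature a second attempt may well produce the same text, so logging the failed inputs for later inspection is a reasonable alternative.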

Appendix - Purpose of this model

I wanted a token-efficient and cheap way to get a quality summary, title and named entities. The initial aim was to parse through volumes of click-bait garbage articles and blogs. For simpler tasks that involve processing a given text, ChatGPT is incredibly good at adhering to the given instructions and response format. Llama-3.2-1b is a powerful base model, but it is inconsistent at sticking to the response format, and when it does, it produces super generic content, e.g. a title that doesn't mean anything and a summary that is one-line BS. So I wanted to create something that would give me ChatGPT quality and consistency for basic tasks like summary, title and tag generation. Et voilà.