Slow response: Text validation #37
opened by GUrubux
I've been stuck at this point for hours on slow inference.
I'm experiencing slow inference and excessive memory usage (maxing out 128 GB of RAM) when running LLaMA 3.1 8B Instruct for text generation tasks. The inference process takes far too long, and system resources are heavily taxed.
What configurations (code, model settings, or infrastructure) should I change to optimize performance?
Should I consider better hardware, or is there a way to make the current setup more efficient?
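One configuration change I've come across but haven't tried or benchmarked yet is loading the model with 4-bit quantization via bitsandbytes. A minimal sketch of what I mean (assuming `bitsandbytes` is installed; the model id and settings here are only placeholders):

```python
import torch
import transformers

# Sketch only, not something I've benchmarked: 4-bit NF4 quantized loading via bitsandbytes.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # or the 70B checkpoint
    model_kwargs={"quantization_config": bnb_config},
    device_map="auto",
)
```

Would something like this meaningfully cut memory, or is the bottleneck elsewhere in my setup?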
Code

```python
import os
import json
import time
import warnings

import pdfplumber
import torch
import transformers

warnings.filterwarnings("ignore")

start_time = time.time()

# NOTE: this currently points at the 70B checkpoint; the 8B variant is commented out.
model_id = "meta-llama/Llama-3.1-70B-Instruct"  # "meta-llama/Llama-3.1-8B-Instruct"


def extract_text_from_pdf(pdf_path):
    text = ""
    # Check that the PDF exists before opening it
    if os.path.exists(pdf_path):
        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages):
                # extract_text() can return None for pages with no extractable text
                text += page.extract_text() or ""
    else:
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    return text


pdf_text = extract_text_from_pdf("some_pdf_file.pdf")
print(len(pdf_text))

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
print(f"Model Load Time : {time.time() - start_time:.2f} seconds")


def get_answer_llm(question):
    inter_start_time = time.time()
    print(f"question: {question}")
    # The full PDF text is prepended to every question
    prompt = f"Context: {pdf_text}\n\n Question:{question}: - Is this correct or accurate as per the Context Yes or No? if not please provide the correct information?"
    print(len(prompt))
    with open("file.txt", "w", encoding="utf-8") as file:
        file.write(prompt)
    output = pipeline(prompt, max_new_tokens=16000)
    # Extract the 'generated_text' and find the 'Answer:' part
    generated_text = output[0]["generated_text"]
    answer_start = generated_text.find("Answer:")
    answer = generated_text[answer_start:]
    # Print only the 'Answer:' part
    print(f"answer: {answer}")
    print(f"{time.time() - inter_start_time:.2f} seconds")


list_of_questions = ["Question1", "Question2", "Question3", "Question4", "Question5"]

# Example usage
for question in list_of_questions:
    get_answer_llm(question)

total_time = time.time()
print(f"Total time taken: {total_time - start_time:.2f} seconds")
```
OUTPUT

```
PS C:\Users\Nam\llama3> python try_HF_gpt2.py
92780
Loading checkpoint shards: 100%|████████████| 30/30 [00:06<00:00, 4.44it/s]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
Model Load Time : 11.77 seconds
question: This is a promotional website intended for UK healthcare professionals only.
92985
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
```
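From the "offloaded to the cpu" warning it looks like part of the model did not fit on the GPU. If it helps, this is how I would inspect where the layers ended up (a sketch, assuming the `pipeline` object from the script above):

```python
# Models loaded with device_map="auto" expose the layer-to-device mapping;
# entries mapped to "cpu" or "disk" are being offloaded, which slows generation down a lot.
print(pipeline.model.hf_device_map)
```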
System Config