Why is granite-docling-258M so slow?
Hi everyone,
I’ve been testing the ibm-granite/granite-docling-258M model (not the ONNX version) and I’m running into severe performance issues, even when using a powerful GPU (RTX 6000 Ada / 48 GB VRAM).
The model takes an unexpectedly long time to process even a single page, and resource usage doesn’t seem to justify the delay. Given that this is a relatively small model (~258 M parameters), I would expect it to be reasonably fast even on CPU, or at least near real-time on GPU.
My main questions:
Is there any internal throttling or hidden preprocessing (e.g. image segmentation, OCR fallback) that could explain the slowdown?
Is there any more detailed documentation on the model architecture, runtime flow, or inference pipeline?
Are there recommended settings or flags (like disabling unused components, OCR, or auto-device detection) to make it run faster, especially on CPU?
I’ve tested both direct loading and pipeline-based inference, and both exhibit the same latency pattern. If anyone from the development team or community has achieved near-real-time results — could you please share your configuration and runtime stats?
only vllm module works
only vllm module works
hgarp-prozis, do you have any sample code on using vllm to serve granite?
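Not from the thread, but for reference, a minimal sketch of one way to query granite-docling served by vLLM's OpenAI-compatible server (started with something like vllm serve ibm-granite/granite-docling-258M); the port and prompt wording are assumptions:

# Hedged sketch: assumes a vLLM server was started with
#   vllm serve ibm-granite/granite-docling-258M
# and is listening on localhost:8000 (the default port).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="ibm-granite/granite-docling-258M",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png"}},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    }],
    max_tokens=8192,
    temperature=0.0,
)
print(response.choices[0].message.content)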
for what it's worth, the bf16 GGUF with llama.cpp is pretty fast on a 4090: https://huggingface.co/ibm-granite/granite-docling-258M/blob/main/assets/new_arxiv.png gets OCRed at 506 t/s.
Yeah I tried the webGPU demo from Xenova and it was slower than expected on my 32GB M1 Pro.
Several billion-parameter models with LM studio output tokens wayyyyy faster than this 258M parameter model for some reason.
I think that when discussing performance it would be useful to be specific about what exactly is being evaluated and how. What kind of file? If PDF, how many pages? There's a huge difference in speed between running the full pipeline (and even that can be run directly from bash, or from their docker's gradio UI) and just the model (what engine? what quant?).
In my case, the same file I mentioned above gets processed in ~35 seconds by the pipeline started from the docker UI, and in ~6 seconds with llama-server and a bf16 GGUF.
Interestingly, the response I get using the "simple", non-pipeline model call is more accurate than the one from the pipeline. Example output:
pipeline (~35 seconds):
Figure 2. Estimated captured of the planet assuming the planet radiates blackbody The captured fux is calculated as the ratio of the integrated blackbody emission within the instrument's band pass to the total emission over 0 all wavelengths, = B(A,T) dX. The captured fux fraction 1s shown for [0.6-2.85 um] (red line); Hubble WFC3 [1.12-1.64 (dashed green line); NIRSpec G395H [2.7-5.15 um] (dash dotted blue line) . The red-shaded region shows the temperature range on WASP-121 b based on our Tef estimates. Red dashed lines indicate the boundaries of the planet'8 temperature range within the NIRISS SOSS captured flux fraction From this we estimate that these observations capture beon orbital phase. the minimum temperature from the NAMELESS this estimate decreases to 50%. In either case; the wavelength coverage of NIRISS exceeds that of any other instrument . flux xrnax i.e., ing Using fit,
llama-server -m danchev/ibm-granite-docling-258M-GGUF (~5 seconds, 486 t/s)
Figure 2. Estimated captured flux of the planet assuming the planet radiates as a blackbody. The captured flux is calculated as the ratio of the integrated blackbody emission within the instrument's band pass to the total emission over all wavelengths, i.e., γ = ∫ λ$_{min}$ λ$_{min}$ B ( λ, T ) dλ/ ∫ ∞ 0 B ( λ, T ) dλ. The captured flux fraction is shown for NIRISS SOSS [0.6-2.85 µm] (red line); Hubble WFC3 [1.12-1.64 µm] (dashed green line); NIRSpec G395H [2.7-5.15 µm] (dash dotted blue line). The red-shaded region shows the temperature range on WASP-121 b based on T$_{eff}$ estimates. Red dashed lines indicate the boundaries of the planet's temperature range within the NIRISS SOSS captured flux fraction. From this we estimate that these observations capture between 55% and 82% of the planet's bolometric flux, depending on orbital phase. Using the minimum temperature from the NAMELESS fit, this estimate decreases to 50%. In either case, the wavelength coverage of NIRISS exceeds that of any other instrument.
While not perfect (it missed the / between NIRISS and SOSS), the direct model call seems more accurate to me while also being much faster. Of course, the output of the direct model call needs to be post-processed as it includes a bunch of loc_ tags, but overall, unless I'm missing something, the direct model call approach seems preferable.
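(For anyone who wants to reproduce the full-pipeline route from Python rather than the docker UI, the standard docling quickstart looks roughly like the sketch below; note that, as discussed later in the thread, the default convert() pipeline is not necessarily routing pages through granite-docling at all, so this is not an apples-to-apples comparison.)

# Hedged sketch of the standard docling quickstart; the input path is a
# placeholder and the default pipeline options may differ from the docker UI.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("paper.pdf")  # hypothetical local PDF
print(result.document.export_to_markdown())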
Hi all! We'll definitely dig into performance concerns and keep everyone posted.
One important thing to keep in mind with VLMs versus LLMs is that each VLM translates images to tokens differently. Some simply scale the image to a fixed size and encode that, others break the image up into tiles and encode each separately, resulting in far more input tokens. Granite Docling does the latter, so a fair amount of the slowness will be caused by simply having a lot of prompt tokens to process in prefill. These types of image tiling strategies can be thought of as a form of test-time compute where images that get tiled more aggressively essentially give the model more tokens to look at. Additionally, the image preprocessing itself can be slow (rescaling, resizing). This portion will depend a lot on which implementation you're using for these preprocessing steps, so you could easily see pretty large variations between inference engines.
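To make the prefill cost concrete, here is a small sketch (not from the thread; it reuses the same processor calls as the model card example further down) that just counts how many tokens a single page image expands into:

# Hedged sketch: count the prompt tokens (text + expanded image tiles) that
# the processor produces for one page image.
from transformers import AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Convert this page to docling."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
print("prompt tokens:", inputs.input_ids.shape[1])  # llama.cpp reports 877 prompt tokens for this page in the perf log below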
For folks experiencing slowness with the model, it would be great if you can share the following:
- What inference engine are you using to run the model?
- Are you using this with the docling library, or the model directly?
- If using the model directly, what are the dimensions of the input image, and what prompt text are you sending with it?
- If using the docling library, any reproducible code snippets and inputs would be extremely helpful.
Thanks for all the great interest in this model and project!
@gabegoodhart Thank you so much for your comments and insights, and especially for creating this framework and putting so much work into making it available in so many different ways!
This might not be directly related to the topic of this discussion (speed), and if needed I can start a new one, but I was wondering if you can share your thoughts about the differences in accuracy between using the model directly vs the docling gradio UI in docker that I described above? Thanks a lot.
Hi @gabegoodhart ,
I ran the granite-docling-258M with transformers on an RTX 4070 GPU. I have used the example code that is shared in the model card to process one image.
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image
from pathlib import Path
# Load model and processor
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForImageTextToText.from_pretrained(
pretrained_model_name_or_path="ibm-granite/granite-docling-258M",
torch_dtype=torch.bfloat16,
).to(DEVICE)
# Prepare inputs
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
trimmed_generated_ids,
skip_special_tokens=False,
)[0].lstrip()
print(f"DocTags: \n{doctags}\n")
# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
# create a docling document
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
print(f"Markdown:\n{doc.export_to_markdown()}\n")
## export as any format.
# Path("out/").mkdir(parents=True, exist_ok=True)
# HTML:
# output_path_html = Path("out/") / "example.html"
# doc.save_as_html(output_path_html)
# Markdown:
# output_path_md = Path("out/") / "example.md"
# doc.save_as_markdown(output_path_md)
The output is good, but processing an entire PDF file is going to take hours with the current generation speed that I am getting.
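For reference, a rough sketch (not part of the original post) of feeding a multi-page PDF through the same transformers setup, page by page; it reuses processor, model, prompt, and DEVICE from the script above and assumes pdf2image (plus poppler) is installed, with a hypothetical input file name:

# Hedged sketch: render each PDF page to an image and run the same generate()
# call on it. Reuses processor, model, prompt, DEVICE from the script above.
from pdf2image import convert_from_path

pages = convert_from_path("paper.pdf", dpi=144)  # hypothetical file
for i, page_image in enumerate(pages):
    inputs = processor(text=prompt, images=[page_image], return_tensors="pt").to(DEVICE)
    generated_ids = model.generate(**inputs, max_new_tokens=8192)
    doctags = processor.batch_decode(
        generated_ids[:, inputs.input_ids.shape[1]:],
        skip_special_tokens=False,
    )[0].lstrip()
    print(f"--- page {i} ---\n{doctags}")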
I switched to llama.cpp and downloaded the half-precision GGUF model from ggml-org. I was able to process the same image in 3.03 s with a generation speed of 394.17 t/s.
Here is how I ran the model with llama.cpp:
llama-mtmd-cli \
--model ~/.cache/llama.cpp/granite-docling-258M-f16.gguf \
--mmproj ~/.cache/llama.cpp/mmproj-granite-docling-258M-f16.gguf \
--n-gpu-layers 999
I loaded the image:
> /image /path/to/image.png
/path/to/image.png image loaded
After that I gave the prompt:
> Convert this page to docling.
Hit Ctrl+C to see the generation speed:
llama_perf_context_print: load time = 183.19 ms
llama_perf_context_print: prompt eval time = 427.23 ms / 877 tokens ( 0.49 ms per token, 2052.75 tokens per second)
llama_perf_context_print: eval time = 4401.00 ms / 1774 runs ( 2.48 ms per token, 403.09 tokens per second)
llama_perf_context_print: total time = 127936.52 ms / 2651 tokens
llama_perf_context_print: graphs reused = 1766
The generation speed is 403.09 t/s, which is much better than what I was getting with transformers.
From 5 minutes to just 3 seconds, that is a 100x speedup! I really wonder why the transformers implementation is so slow.
I have a question: the output from llama.cpp is different from the one from transformers.
llama.cpp:
<loc_114><loc_27><loc_385><loc_34>Energy Budget of WASP-121b from JWST/NIRISS Phase Curve
<loc_454><loc_28><loc_462><loc_34>9
<loc_41><loc_42><loc_241><loc_87>while the kernel weights are structured as ( N$_{slice}$ , N$_{time}$ ). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.
<loc_41><loc_89><loc_241><loc_206>To address this, we follow a similar approach to our sinusoidal fits using emcee , but we increase the total number of steps to 100,000 and use 100 walkers. Naïvely, the fit would include 2 N$_{slice}$ + 1 parameters: N$_{slice}$ for the albedo values, N$_{slice}$ for the emission parameters, and one additional scatter parameter, σ . However, since night-side slices do not contribute to the reflected light component, we exclude these albedo values from the fit. In any case, our choice of 100 walkers ensures a sufficient number of walkers per free parameer. Following Coulombe et al. (2025) we set an upper prior limit of 3/2 on all albedo slices as a fully Lambertian sphere ( A$_{i}$ = 1) corresponds to a geometric albedo of A$_{g}$ = 2/3. For thermal emission we impose a uniform prior between 0 and 500 ppm for each slice.
<loc_41><loc_207><loc_241><loc_270>We choose to fit our detrended lightcurves considering 4, 6 and 8 longitudinal slices ( N$_{slice}$ = 4, 6, 8). However, we show the results of the simplest 4 slice model. As in our previous fits, we conduct an initial run with 25,000 steps (25% of the total run) and use the maximumprobability parameters from this preliminary fit as the starting positions for the final 75,000-step run. We then discard the first 60% of the final run as burn-in.
<loc_73><loc_277><loc_212><loc_283>2.5. Planetary Effective Temperature
<loc_41><loc_286><loc_241><loc_346>Phase curves are the only way to probe thermal emission from the day and nightside of an exoplanet and hence determine its global energy budget (Partimer & Crossfield 2018). The wavelength range of NIRISS/SOSS covers a large portion of the emitted flux of WASP-121 b (~ 50-83%; see Figure 2), enabling a precise and robust constraint of the planet's energy budget.
<loc_41><loc_348><loc_241><loc_364>We convert the fitted F$_{p}$ / F$_{∗}$ emission spectra to brightness temperature by wavelength,
<loc_60><loc_368><loc_240><loc_388>T _ { \text {bright} } = \frac { b c } { k \lambda } \cdot \left [ \ln \left ( \frac { 2 b c ^ { 2 } } { \lambda ^ { 5 } B _ { \lambda , \text {planet} } } + 1 \right ) \right ] ^ { - 1 } \quad , \quad ( 1 6 )
<loc_41><loc_391><loc_178><loc_398>where the planet's thermal emission is
<loc_85><loc_404><loc_240><loc_418>B _ { \lambda , \, p l a n e t } = \frac { F _ { p } / F _ { * } } { ( R _ { p } / R _ { * } ) ^ { 2 } } \cdot B _ { \lambda , \, s t a r } \, .
<loc_41><loc_425><loc_241><loc_455>There are many ways of converting brightness temperatures to effective temperature, including the ErrorWeighted Mean (EWM), Power-Weighted mean (PWM) and with a Gaussian Process (Schwartz & Cowan 2015;
<loc_273><loc_50><loc_454><loc_134><line_chart><loc_261><loc_141><loc_462><loc_265>Figure 2. Estimated captured flux of the planet assuming the planet radiates as a blackbody. The captured flux is calculated as the ratio of the integrated blackbody emission within the instrument's band pass to the total emission over all wavelengths, i.e., γ = ∫ λmax λ$_{min}$ B ( λ,T) dλ/ ∫ ∞ 0 B ( λ,T ) dλ . The captured flux fraction is shown for NIRISS SOSS [0.6-2.85 µm] (red line); Hubble WFC3 [1.12-1.64 µm] (dashed green line); NIRSpec G395H [2.7-5.15 µm] (dash dotted blue line). The red-shaded region shows the temperature range on WASP-121 b based on T$_{eff}$ estimates. Red dashed lines indicate the boundaries of the planet's temperature range within the NIRISS SOSS captured flux fraction. From this we estimate that these observations capture between 55% and 82% of the planet's bolometric flux, depending on orbital phase. Using the minimum temperature from the NAMELESS fit, this estimate decreases to 50%. In either case, the wavelength coverage of NIRISS exceeds that of any other instrument.
<loc_261><loc_274><loc_462><loc_360>Pass et al. 2019). In this work, we elect to compute our effective temperature estimates with a novel method that is essentially a combination of the PWM and EWM. We create the effective temperature by using a simple Monte Carlo process. First, we perturb our F$_{p}$ / F$_{s}$ emission spectra at each point in the orbit by a Gaussian based on the measurement uncertainty. Our new emission spectrum is then used to create an estimate of the brightness temperature spectrum. This process is repeated at each orbital phase. We then estimate the effective temperature, T$_{eff}$ for a given orbital phase as
<loc_317><loc_363><loc_460><loc_382>T _ { e f f } = \frac { \sum _ { i = 1 } ^ { N } w _ { i } T _ { b r i g h t , i } } { \sum _ { i = 1 } ^ { N } w _ { i } } ,
<loc_262><loc_385><loc_462><loc_415>where w$_{i}$ is the weight for the i -th wavelength given by the fraction of the planet's bolometric flux that falls within that wavelength bin scaled by the inverse variance of the measurement,
<loc_306><loc_417><loc_462><loc_437>w _ { i } = \frac { \int _ { \lambda _ { i } + 1 } ^ { \lambda _ { i } + 1 } B ( \lambda _ { i } , T _ { \text {est} } ) \, d \lambda } { \int _ { 0 } ^ { \infty } B ( \lambda _ { i } , T _ { \text {est} } ) \, d \lambda } \cdot \frac { 1 } { \sigma _ { i } ^ { 2 } } ,
<loc_262><loc_441><loc_462><loc_455>with T$_{est}$ representing an estimated effective temperature at the orbital phase of interest. When computing
transformers:
<doctag><page_header><loc_115><loc_27><loc_385><loc_34>Energy Budget of WASP-121 b from JWST/NIRISS Phase Curve</page_header>
<page_header><loc_454><loc_28><loc_459><loc_34>9</page_header>
<text><loc_41><loc_42><loc_239><loc_88>while the kernel weights are structured as ( N$_{slice}$ , N$_{time}$ ). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.</text>
<text><loc_41><loc_89><loc_239><loc_206>To address this, we follow a similar approach to our sinusoidal fits using emcee , but we increase the total number of steps to 100,000 and use 100 walkers. Na¨ıvely, the fit would include 2 N$_{slice}$ + 1 parameters: N$_{slice}$ for the albedo values, N$_{slice}$ for the emission parameters, and one additional scatter parameter, σ . However, since night-side slices do not contribute to the reflected light component, we exclude these albedo values from the fit. In any case, our choice of 100 walkers ensures a sufficient number of walkers per free parameter. Following Coulombe et al. (2025) we set an upper prior limit of 3 / 2 on all albedo slices as a fully Lambertian sphere ( A$_{i}$ = 1 ) corresponds to a geometric albedo of A$_{g}$ = 2 / 3. For thermal emission we impose a uniform prior between 0 and 500 ppm for each slice.</text>
<text><loc_41><loc_207><loc_239><loc_269>We choose to fit our detrended lightcurves considering 4, 6 and 8 longitudinal slices ( N$_{slice}$ = 4 , 6 , 8). However, we show the results of the simplest 4 slice model. As in our previous fits, we conduct an initial run with 25,000 steps (25% of the total run) and use the maximumprobability parameters from this preliminary fit as the starting positions for the final 75,000-step run. We then discard the first 60% of the final run as burn-in.</text>
<section_header_level_1><loc_73><loc_276><loc_207><loc_283>2.5. Planetary Effective Temperature</section_header_level_1>
<text><loc_41><loc_286><loc_239><loc_348>Phase curves are the only way to probe thermal emission from the day and nightside of an exoplanet and hence determine its global energy budget (Partier & Crossfield 2018). The wavelength range of NIRISS/SOSS covers a large portion of the emitted flux of WASP-121 b ( ∼ 50-83%; see Figure 2), enabling a precise and robust constraint of the planet's energy budget.</text>
<text><loc_41><loc_349><loc_239><loc_364>We convert the fitted F$_{p}$ / F$_{∗}$ emission spectra to brightness temperature by wavelength,</text>
<formula><loc_60><loc_368><loc_238><loc_387>T _ { b r i g h t } = \frac { h c } { k \lambda } \cdot \left [ \ln \left ( \frac { 2 b c ^ { 2 } } { \lambda ^ { 5 } B _ { \lambda , p l a n e t } } + 1 \right ) \right ] ^ { - 1 } ,</formula>
<text><loc_41><loc_391><loc_178><loc_398>where the planet's thermal emission is</text>
<formula><loc_84><loc_403><loc_238><loc_419>B _ { \lambda , \text {planet} } = \frac { F _ { p } / F _ { * } } { ( R _ { p } / R _ { * } ) ^ { 2 } } \cdot B _ { \lambda , \text {star} } \, .</formula>
<text><loc_41><loc_425><loc_239><loc_455>There are many ways of converting brightness temperatures to effective temperature, including the ErrorWeighted Mean (EWM), Power-Weighted mean (PWM) and with a Gaussian Process (Schwartz & Cowan 2015;</text>
<chart><loc_273><loc_49><loc_454><loc_134><line_chart><caption><loc_261><loc_141><loc_459><loc_264>Figure 2. Estimated captured flux of the planet assuming the planet radiates as a blackbody. The captured flux is calculated as the ratio of the integrated blackbody emission within the instrument's band pass to the total emission over all wavelengths, i.e., γ = ∫ λ$_{max}$ λ$_{min}$ B ( λ, T ) dλ/ ∫ ∞ 0 B ( λ, T ) dλ . The captured flux fraction is shown for NIRISS SOSS [0.6-2.85 µ m] (red line); Hubble WFC3 [1.12-1.64 µ m] (dashed green line); NIRSpec G395H [2.7-5.15 µ m] (dash dotted blue line). The red-shaded region shows the temperature range on WASP-121 b based on our T$_{eff}$ estimates. Red dashed lines indicate the boundaries of the planet's temperature range within the NIRISS SOSS captured flux fraction. From this we estimate that these observations capture between 55% and 82% of the planet's bolometric flux, depending on orbital phase. Using the minimum temperature from the NAMELESS fit, this estimate decreases to 50%. In either case, the wavelength coverage of NIRISS exceeds that of any other instrument.</caption></chart>
<text><loc_261><loc_273><loc_459><loc_359>Pass et al. 2019). In this work, we elect to compute our effective temperature estimates with a novel method that is essentially a combination of the PWM and EWM. We create the effective temperature by using a simple Monte Carlo process. First, we perturb our F$_{p}$ / F$_{s}$ emission spectra at each point in the orbit by a Gaussian based on the measurement uncertainty. Our new emission spectrum is then used to create an estimate of the brightness temperature spectrum. This process is repeated at each orbital phase. We then estimate the effective temperature, T$_{eff}$ for a given orbital phase as</text>
<formula><loc_317><loc_362><loc_459><loc_382>T _ { \text {eff} } = \frac { \sum _ { i = 1 } ^ { N } w _ { i } T _ { \text {bright,} } , } { \sum _ { i = 1 } ^ { N } w _ { i } } ,</formula>
<text><loc_261><loc_384><loc_459><loc_414>where w$_{i}$ is the weight for the i -th wavelength given by the fraction of the planet's bolometric flux that falls within that wavelength bin scaled by the inverse variance of the measurement,</text>
<formula><loc_305><loc_417><loc_459><loc_437>w _ { i } = \frac { \int _ { \lambda _ { i } } ^ { \lambda _ { i } + 1 } B ( \lambda _ { i } , T _ { \text {est} } ) \, d \lambda } { \int _ { 0 } ^ { \infty } B ( \lambda _ { i } , T _ { \text {est} } ) \, d \lambda } \cdot \frac { 1 } { \sigma _ { i } ^ { 2 } } ,</formula>
<text><loc_261><loc_440><loc_459><loc_454>with T$_{est}$ representing an estimated effective temperature at the orbital phase of interest. When computing</text>
</doctag><|end_of_text|>
It seems that the output from llama.cpp lacks the appropriate Docling tags (like text, formula, etc.).
When I try to pass the output from llama.cpp to the following code, I get an empty string, but it works fine for the output from transformers:
from IPython.display import Markdown, display
from docling_core.types.doc.document import DocTagsDocument, DoclingDocument
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doc_tags], [image])
document = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
extracted_text_markdown = document.export_to_markdown()
display(Markdown(extracted_text_markdown))
Do you know a solution to this issue?
Thanks in advance.
Edit:
I have noticed that the bounding boxes are separated by newline characters. I have used the following code to get the final text from the raw output.
import re
def extract_inner_text(text_chunk: str) -> str:
return re.sub(r"<.*?>", "", text_chunk, flags=re.DOTALL).strip()
extracted_text_llama_cpp = ""
for line in doc_tags_llama_cpp.splitlines():
extracted_text_llama_cpp += extract_inner_text(line) + "\n"
print(extracted_text_llama_cpp)
and especially for creating this framework and putting so much work into making it available in so many different ways!
I definitely appreciate it! The real starrs (pun intended) of the show here are @PeterWJStaar , @dolfim-ibm , and the rest of the Docling team. I'm just the messenger (and llama.cpp interface).
but I was wondering if you can share your thoughts about the differences in accuracy between using the model directly vs the docling gradio UI in docker that I described above?
This would definitely make a good discussion on its own since I expect there are a number of folks with similar questions. I'll let the Docling team comment on the details, but my rough answer is that the full package offers a lot more ability to be customized to your specific documents. The library itself supports multiple backends, both for format conversion and visual parsing. The default settings are tuned for a good quality/speed tradeoff, but you may find that with the default settings you get worse quality than the raw model if the raw model itself isn't part of the default pipeline (I got to here tracing the defaults, but it goes deeper than that to figure out what is part of the defaults).
I ran the granite-docling-258M with transformers on an RTX 4070 GPU
This is really great detail, thanks for sharing it all! I'm on a mac, so I typically go through the mps backend and can't easily verify the slowness you're seeing on CUDA, but with a slightly modified version of your script, I'm definitely seeing it peg my GPU for a long time (still running as we speak).
docling-repro.py
import torch
import time
from datetime import timedelta
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image
from pathlib import Path
# Load model and processor
DEVICE = "cpu"
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
print(f"USING DEVICE: {DEVICE}")
start_time = time.time()
model_path = "/Users/ghart/models/ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
pretrained_model_name_or_path=model_path,
torch_dtype=torch.bfloat16,
).to(DEVICE)
print("==> Done loading model: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))
# Prepare inputs
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
print("==> Done preparing inputs: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
print("==> Done generating: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
trimmed_generated_ids,
skip_special_tokens=False,
)[0].lstrip()
print(f"DocTags: \n{doctags}\n")
# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
# create a docling document
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
print(f"Markdown:\n{doc.export_to_markdown()}\n")
print("==> Done: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))
## export as any format.
# Path("out/").mkdir(parents=True, exist_ok=True)
# HTML:
# output_path_html = Path("out/") / "example.html"
# doc.save_as_html(output_path_html)
# Markdown:
# output_path_md = Path("out/") / "example.md"
# doc.save_as_markdown(output_path_md)
USING DEVICE: mps
`torch_dtype` is deprecated! Use `dtype` instead!
==> Done loading model: 0.582172s
==> Done preparing inputs: 1.439638s
# Ctrl-C after several minutes
I also see that the GPU VRAM steadily climbs. It started out around 30GB total (my baseline is around 12GB with other workloads), and was up to about 50GB when I stopped it. This definitely feels buggy, so we'll dig deeper and see if we can get to the bottom of it.
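One quick way to watch that growth from inside the script (a sketch of mine, assuming a recent PyTorch with the mps backend; it wraps the generate() call from the repro above):

# Hedged sketch: log MPS memory before and after generation.
import torch

def mps_mem_gib() -> tuple[float, float]:
    # current tensor allocations vs. total memory the Metal driver has grabbed
    return (
        torch.mps.current_allocated_memory() / 1024**3,
        torch.mps.driver_allocated_memory() / 1024**3,
    )

print("before generate:", mps_mem_gib())
generated_ids = model.generate(**inputs, max_new_tokens=512)
print("after generate: ", mps_mem_gib())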
Thanks for the reply and for the work you and the team are doing.
I am happy with using llama.cpp for now, but I will keep an eye on this issue. Hopefully, it will get fixed soon.
One interesting piece of debugging: If I fully disable the image inputs, I still see much slower generation than I would expect, so this appears to be an issue in the language model and not the number of image tokens or the preprocessing stack.
Let me try that too.
I have tried keeping just the text input and I observed the same thing.
Interestingly, on CPU, it's 2x faster than on mps with just the llama model portion. Something is definitely working incorrectly here. If I monitor my GPU utilization with nvtop while running it, I see short bursts of GPU utilization, but not sustained usage like I would expect from a model fully allocated to the device.
I created a copy of the text model that acts purely as a LlamaForCausalLM model and I still see the same behavior.
config.json
{
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 100264,
"dtype": "bfloat16",
"eos_token_id": 100257,
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 576,
"initializer_range": 0.02,
"intermediate_size": 1536,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 9,
"num_hidden_layers": 30,
"num_key_value_heads": 3,
"pad_token_id": 100257,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 100000.0,
"tie_word_embeddings": true,
"use_cache": false,
"vocab_size": 100352
}
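If you want the real weights (rather than just the shape) to load under this config, the checkpoint keys presumably also need to lose their model.text_model. prefix, with the vision/connector tensors dropped, so that LlamaForCausalLM can find them. A rough sketch of how that remapping could look, assuming a single model.safetensors shard (my assumption about the conversion, not the exact steps used):

# Hedged sketch: strip the "model.text_model." prefix and drop
# vision_model/connector tensors so LlamaForCausalLM can load the checkpoint.
from safetensors.torch import load_file, save_file

state = load_file("granite-docling-258M/model.safetensors")
prefix = "model.text_model."
remapped = {
    "model." + name[len(prefix):]: tensor
    for name, tensor in state.items()
    if name.startswith(prefix)
}
# lm_head.weight is not needed explicitly: tie_word_embeddings is true, so it
# is shared with model.embed_tokens.weight.
save_file(remapped, "granite-docling-258M-text-only/model.safetensors", metadata={"format": "pt"})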
docling-repro-text-only.py
import torch
import time
from datetime import timedelta
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.image_utils import load_image
from pathlib import Path
# Load model and processor
DEVICE = "cpu"
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
print(f"USING DEVICE: {DEVICE}")
start_time = time.time()
model_path = "/Users/ghart/models/ibm-granite/granite-docling-258M-text-only"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).to(DEVICE)
print("==> Done loading model: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))
# Prepare inputs
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")
messages = [
{
"role": "user",
"content": [
# {"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs.to(DEVICE)
print("==> Done preparing inputs: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=512, use_cache=True)
print("==> Done generating: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
result = tokenizer.batch_decode(
trimmed_generated_ids,
skip_special_tokens=False,
)[0].lstrip()
print("==> Done: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))
This clearly shows the same behavior (CPU is ~3x faster). My assumption is that there is not some glaring bug in modeling_llama.py; otherwise this would have surfaced a long time ago for llama models. This leads me to believe that there's something about the shape of Granite Docling that is causing the overhead.
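One rough way to test the "shape" hypothesis in isolation (my own sketch, not from the thread) is to time a randomly initialized LlamaForCausalLM with the same dimensions on each device and compare decode throughput:

# Hedged sketch: benchmark a random Llama with Granite Docling's text-model
# shape on CPU vs. MPS (or CUDA) to see how the tiny hidden size behaves.
import time
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=576, intermediate_size=1536, num_hidden_layers=30,
    num_attention_heads=9, num_key_value_heads=3,
    vocab_size=100352, max_position_embeddings=8192, tie_word_embeddings=True,
)

def decode_tps(device: str, new_tokens: int = 256) -> float:
    model = LlamaForCausalLM(config).to(device).eval()
    input_ids = torch.randint(0, config.vocab_size, (1, 16), device=device)
    start = time.time()
    with torch.no_grad():
        model.generate(input_ids=input_ids, max_new_tokens=new_tokens,
                       min_new_tokens=new_tokens, do_sample=False, use_cache=True)
    return new_tokens / (time.time() - start)

devices = ["cpu"]
if torch.backends.mps.is_available():
    devices.append("mps")
elif torch.cuda.is_available():
    devices.append("cuda")
for device in devices:
    print(f"{device}: ~{decode_tps(device):.1f} tok/s")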
Here is what I have done: I copied the folder that contains the docling model, renamed it to text-only like you did in the script, and replaced the config.json file.
I loaded that new model using AutoModelForCausalLM. I get this output telling me that some weights were not used:
Unused weights
Some weights of the model checkpoint at /home/imad-saddik/.cache/huggingface/hub/models--ibm-granite--granite-docling-258M-text-only/snapshots/982fe3b40f2fa73c365bdb1bcacf6c81b7184bfe/ were not used when initializing LlamaForCausalLM: ['model.connector.modality_projection.proj.weight', 'model.text_model.embed_tokens.weight', 'model.text_model.layers.0.input_layernorm.weight', 'model.text_model.layers.0.mlp.down_proj.weight', 'model.text_model.layers.0.mlp.gate_proj.weight', 'model.text_model.layers.0.mlp.up_proj.weight', 'model.text_model.layers.0.post_attention_layernorm.weight', 'model.text_model.layers.0.self_attn.k_proj.weight', 'model.text_model.layers.0.self_attn.o_proj.weight', 'model.text_model.layers.0.self_attn.q_proj.weight', 'model.text_model.layers.0.self_attn.v_proj.weight', 'model.text_model.layers.1.input_layernorm.weight', 'model.text_model.layers.1.mlp.down_proj.weight', 'model.text_model.layers.1.mlp.gate_proj.weight', 'model.text_model.layers.1.mlp.up_proj.weight', 'model.text_model.layers.1.post_attention_layernorm.weight', 'model.text_model.layers.1.self_attn.k_proj.weight', 'model.text_model.layers.1.self_attn.o_proj.weight', 'model.text_model.layers.1.self_attn.q_proj.weight', 'model.text_model.layers.1.self_attn.v_proj.weight', 'model.text_model.layers.10.input_layernorm.weight', 'model.text_model.layers.10.mlp.down_proj.weight', 'model.text_model.layers.10.mlp.gate_proj.weight', 'model.text_model.layers.10.mlp.up_proj.weight', 'model.text_model.layers.10.post_attention_layernorm.weight', 'model.text_model.layers.10.self_attn.k_proj.weight', 'model.text_model.layers.10.self_attn.o_proj.weight', 'model.text_model.layers.10.self_attn.q_proj.weight', 'model.text_model.layers.10.self_attn.v_proj.weight', 'model.text_model.layers.11.input_layernorm.weight', 'model.text_model.layers.11.mlp.down_proj.weight', 'model.text_model.layers.11.mlp.gate_proj.weight', 'model.text_model.layers.11.mlp.up_proj.weight', 'model.text_model.layers.11.post_attention_layernorm.weight', 'model.text_model.layers.11.self_attn.k_proj.weight', 'model.text_model.layers.11.self_attn.o_proj.weight', 'model.text_model.layers.11.self_attn.q_proj.weight', 'model.text_model.layers.11.self_attn.v_proj.weight', 'model.text_model.layers.12.input_layernorm.weight', 'model.text_model.layers.12.mlp.down_proj.weight', 'model.text_model.layers.12.mlp.gate_proj.weight', 'model.text_model.layers.12.mlp.up_proj.weight', 'model.text_model.layers.12.post_attention_layernorm.weight', 'model.text_model.layers.12.self_attn.k_proj.weight', 'model.text_model.layers.12.self_attn.o_proj.weight', 'model.text_model.layers.12.self_attn.q_proj.weight', 'model.text_model.layers.12.self_attn.v_proj.weight', 'model.text_model.layers.13.input_layernorm.weight', 'model.text_model.layers.13.mlp.down_proj.weight', 'model.text_model.layers.13.mlp.gate_proj.weight', 'model.text_model.layers.13.mlp.up_proj.weight', 'model.text_model.layers.13.post_attention_layernorm.weight', 'model.text_model.layers.13.self_attn.k_proj.weight', 'model.text_model.layers.13.self_attn.o_proj.weight', 'model.text_model.layers.13.self_attn.q_proj.weight', 'model.text_model.layers.13.self_attn.v_proj.weight', 'model.text_model.layers.14.input_layernorm.weight', 'model.text_model.layers.14.mlp.down_proj.weight', 'model.text_model.layers.14.mlp.gate_proj.weight', 'model.text_model.layers.14.mlp.up_proj.weight', 'model.text_model.layers.14.post_attention_layernorm.weight', 'model.text_model.layers.14.self_attn.k_proj.weight', 
'model.text_model.layers.14.self_attn.o_proj.weight', 'model.text_model.layers.14.self_attn.q_proj.weight', 'model.text_model.layers.14.self_attn.v_proj.weight', 'model.text_model.layers.15.input_layernorm.weight', 'model.text_model.layers.15.mlp.down_proj.weight', 'model.text_model.layers.15.mlp.gate_proj.weight', 'model.text_model.layers.15.mlp.up_proj.weight', 'model.text_model.layers.15.post_attention_layernorm.weight', 'model.text_model.layers.15.self_attn.k_proj.weight', 'model.text_model.layers.15.self_attn.o_proj.weight', 'model.text_model.layers.15.self_attn.q_proj.weight', 'model.text_model.layers.15.self_attn.v_proj.weight', 'model.text_model.layers.16.input_layernorm.weight', 'model.text_model.layers.16.mlp.down_proj.weight', 'model.text_model.layers.16.mlp.gate_proj.weight', 'model.text_model.layers.16.mlp.up_proj.weight', 'model.text_model.layers.16.post_attention_layernorm.weight', 'model.text_model.layers.16.self_attn.k_proj.weight', 'model.text_model.layers.16.self_attn.o_proj.weight', 'model.text_model.layers.16.self_attn.q_proj.weight', 'model.text_model.layers.16.self_attn.v_proj.weight', 'model.text_model.layers.17.input_layernorm.weight', 'model.text_model.layers.17.mlp.down_proj.weight', 'model.text_model.layers.17.mlp.gate_proj.weight', 'model.text_model.layers.17.mlp.up_proj.weight', 'model.text_model.layers.17.post_attention_layernorm.weight', 'model.text_model.layers.17.self_attn.k_proj.weight', 'model.text_model.layers.17.self_attn.o_proj.weight', 'model.text_model.layers.17.self_attn.q_proj.weight', 'model.text_model.layers.17.self_attn.v_proj.weight', 'model.text_model.layers.18.input_layernorm.weight', 'model.text_model.layers.18.mlp.down_proj.weight', 'model.text_model.layers.18.mlp.gate_proj.weight', 'model.text_model.layers.18.mlp.up_proj.weight', 'model.text_model.layers.18.post_attention_layernorm.weight', 'model.text_model.layers.18.self_attn.k_proj.weight', 'model.text_model.layers.18.self_attn.o_proj.weight', 'model.text_model.layers.18.self_attn.q_proj.weight', 'model.text_model.layers.18.self_attn.v_proj.weight', 'model.text_model.layers.19.input_layernorm.weight', 'model.text_model.layers.19.mlp.down_proj.weight', 'model.text_model.layers.19.mlp.gate_proj.weight', 'model.text_model.layers.19.mlp.up_proj.weight', 'model.text_model.layers.19.post_attention_layernorm.weight', 'model.text_model.layers.19.self_attn.k_proj.weight', 'model.text_model.layers.19.self_attn.o_proj.weight', 'model.text_model.layers.19.self_attn.q_proj.weight', 'model.text_model.layers.19.self_attn.v_proj.weight', 'model.text_model.layers.2.input_layernorm.weight', 'model.text_model.layers.2.mlp.down_proj.weight', 'model.text_model.layers.2.mlp.gate_proj.weight', 'model.text_model.layers.2.mlp.up_proj.weight', 'model.text_model.layers.2.post_attention_layernorm.weight', 'model.text_model.layers.2.self_attn.k_proj.weight', 'model.text_model.layers.2.self_attn.o_proj.weight', 'model.text_model.layers.2.self_attn.q_proj.weight', 'model.text_model.layers.2.self_attn.v_proj.weight', 'model.text_model.layers.20.input_layernorm.weight', 'model.text_model.layers.20.mlp.down_proj.weight', 'model.text_model.layers.20.mlp.gate_proj.weight', 'model.text_model.layers.20.mlp.up_proj.weight', 'model.text_model.layers.20.post_attention_layernorm.weight', 'model.text_model.layers.20.self_attn.k_proj.weight', 'model.text_model.layers.20.self_attn.o_proj.weight', 'model.text_model.layers.20.self_attn.q_proj.weight', 'model.text_model.layers.20.self_attn.v_proj.weight', 
'model.text_model.layers.21.input_layernorm.weight', 'model.text_model.layers.21.mlp.down_proj.weight', 'model.text_model.layers.21.mlp.gate_proj.weight', 'model.text_model.layers.21.mlp.up_proj.weight', 'model.text_model.layers.21.post_attention_layernorm.weight', 'model.text_model.layers.21.self_attn.k_proj.weight', 'model.text_model.layers.21.self_attn.o_proj.weight', 'model.text_model.layers.21.self_attn.q_proj.weight', 'model.text_model.layers.21.self_attn.v_proj.weight', 'model.text_model.layers.22.input_layernorm.weight', 'model.text_model.layers.22.mlp.down_proj.weight', 'model.text_model.layers.22.mlp.gate_proj.weight', 'model.text_model.layers.22.mlp.up_proj.weight', 'model.text_model.layers.22.post_attention_layernorm.weight', 'model.text_model.layers.22.self_attn.k_proj.weight', 'model.text_model.layers.22.self_attn.o_proj.weight', 'model.text_model.layers.22.self_attn.q_proj.weight', 'model.text_model.layers.22.self_attn.v_proj.weight', 'model.text_model.layers.23.input_layernorm.weight', 'model.text_model.layers.23.mlp.down_proj.weight', 'model.text_model.layers.23.mlp.gate_proj.weight', 'model.text_model.layers.23.mlp.up_proj.weight', 'model.text_model.layers.23.post_attention_layernorm.weight', 'model.text_model.layers.23.self_attn.k_proj.weight', 'model.text_model.layers.23.self_attn.o_proj.weight', 'model.text_model.layers.23.self_attn.q_proj.weight', 'model.text_model.layers.23.self_attn.v_proj.weight', 'model.text_model.layers.24.input_layernorm.weight', 'model.text_model.layers.24.mlp.down_proj.weight', 'model.text_model.layers.24.mlp.gate_proj.weight', 'model.text_model.layers.24.mlp.up_proj.weight', 'model.text_model.layers.24.post_attention_layernorm.weight', 'model.text_model.layers.24.self_attn.k_proj.weight', 'model.text_model.layers.24.self_attn.o_proj.weight', 'model.text_model.layers.24.self_attn.q_proj.weight', 'model.text_model.layers.24.self_attn.v_proj.weight', 'model.text_model.layers.25.input_layernorm.weight', 'model.text_model.layers.25.mlp.down_proj.weight', 'model.text_model.layers.25.mlp.gate_proj.weight', 'model.text_model.layers.25.mlp.up_proj.weight', 'model.text_model.layers.25.post_attention_layernorm.weight', 'model.text_model.layers.25.self_attn.k_proj.weight', 'model.text_model.layers.25.self_attn.o_proj.weight', 'model.text_model.layers.25.self_attn.q_proj.weight', 'model.text_model.layers.25.self_attn.v_proj.weight', 'model.text_model.layers.26.input_layernorm.weight', 'model.text_model.layers.26.mlp.down_proj.weight', 'model.text_model.layers.26.mlp.gate_proj.weight', 'model.text_model.layers.26.mlp.up_proj.weight', 'model.text_model.layers.26.post_attention_layernorm.weight', 'model.text_model.layers.26.self_attn.k_proj.weight', 'model.text_model.layers.26.self_attn.o_proj.weight', 'model.text_model.layers.26.self_attn.q_proj.weight', 'model.text_model.layers.26.self_attn.v_proj.weight', 'model.text_model.layers.27.input_layernorm.weight', 'model.text_model.layers.27.mlp.down_proj.weight', 'model.text_model.layers.27.mlp.gate_proj.weight', 'model.text_model.layers.27.mlp.up_proj.weight', 'model.text_model.layers.27.post_attention_layernorm.weight', 'model.text_model.layers.27.self_attn.k_proj.weight', 'model.text_model.layers.27.self_attn.o_proj.weight', 'model.text_model.layers.27.self_attn.q_proj.weight', 'model.text_model.layers.27.self_attn.v_proj.weight', 'model.text_model.layers.28.input_layernorm.weight', 'model.text_model.layers.28.mlp.down_proj.weight', 'model.text_model.layers.28.mlp.gate_proj.weight', 
'model.text_model.layers.28.mlp.up_proj.weight', 'model.text_model.layers.28.post_attention_layernorm.weight', 'model.text_model.layers.28.self_attn.k_proj.weight', 'model.text_model.layers.28.self_attn.o_proj.weight', 'model.text_model.layers.28.self_attn.q_proj.weight', 'model.text_model.layers.28.self_attn.v_proj.weight', 'model.text_model.layers.29.input_layernorm.weight', 'model.text_model.layers.29.mlp.down_proj.weight', 'model.text_model.layers.29.mlp.gate_proj.weight', 'model.text_model.layers.29.mlp.up_proj.weight', 'model.text_model.layers.29.post_attention_layernorm.weight', 'model.text_model.layers.29.self_attn.k_proj.weight', 'model.text_model.layers.29.self_attn.o_proj.weight', 'model.text_model.layers.29.self_attn.q_proj.weight', 'model.text_model.layers.29.self_attn.v_proj.weight', 'model.text_model.layers.3.input_layernorm.weight', 'model.text_model.layers.3.mlp.down_proj.weight', 'model.text_model.layers.3.mlp.gate_proj.weight', 'model.text_model.layers.3.mlp.up_proj.weight', 'model.text_model.layers.3.post_attention_layernorm.weight', 'model.text_model.layers.3.self_attn.k_proj.weight', 'model.text_model.layers.3.self_attn.o_proj.weight', 'model.text_model.layers.3.self_attn.q_proj.weight', 'model.text_model.layers.3.self_attn.v_proj.weight', 'model.text_model.layers.4.input_layernorm.weight', 'model.text_model.layers.4.mlp.down_proj.weight', 'model.text_model.layers.4.mlp.gate_proj.weight', 'model.text_model.layers.4.mlp.up_proj.weight', 'model.text_model.layers.4.post_attention_layernorm.weight', 'model.text_model.layers.4.self_attn.k_proj.weight', 'model.text_model.layers.4.self_attn.o_proj.weight', 'model.text_model.layers.4.self_attn.q_proj.weight', 'model.text_model.layers.4.self_attn.v_proj.weight', 'model.text_model.layers.5.input_layernorm.weight', 'model.text_model.layers.5.mlp.down_proj.weight', 'model.text_model.layers.5.mlp.gate_proj.weight', 'model.text_model.layers.5.mlp.up_proj.weight', 'model.text_model.layers.5.post_attention_layernorm.weight', 'model.text_model.layers.5.self_attn.k_proj.weight', 'model.text_model.layers.5.self_attn.o_proj.weight', 'model.text_model.layers.5.self_attn.q_proj.weight', 'model.text_model.layers.5.self_attn.v_proj.weight', 'model.text_model.layers.6.input_layernorm.weight', 'model.text_model.layers.6.mlp.down_proj.weight', 'model.text_model.layers.6.mlp.gate_proj.weight', 'model.text_model.layers.6.mlp.up_proj.weight', 'model.text_model.layers.6.post_attention_layernorm.weight', 'model.text_model.layers.6.self_attn.k_proj.weight', 'model.text_model.layers.6.self_attn.o_proj.weight', 'model.text_model.layers.6.self_attn.q_proj.weight', 'model.text_model.layers.6.self_attn.v_proj.weight', 'model.text_model.layers.7.input_layernorm.weight', 'model.text_model.layers.7.mlp.down_proj.weight', 'model.text_model.layers.7.mlp.gate_proj.weight', 'model.text_model.layers.7.mlp.up_proj.weight', 'model.text_model.layers.7.post_attention_layernorm.weight', 'model.text_model.layers.7.self_attn.k_proj.weight', 'model.text_model.layers.7.self_attn.o_proj.weight', 'model.text_model.layers.7.self_attn.q_proj.weight', 'model.text_model.layers.7.self_attn.v_proj.weight', 'model.text_model.layers.8.input_layernorm.weight', 'model.text_model.layers.8.mlp.down_proj.weight', 'model.text_model.layers.8.mlp.gate_proj.weight', 'model.text_model.layers.8.mlp.up_proj.weight', 'model.text_model.layers.8.post_attention_layernorm.weight', 'model.text_model.layers.8.self_attn.k_proj.weight', 'model.text_model.layers.8.self_attn.o_proj.weight', 
'model.text_model.layers.8.self_attn.q_proj.weight', 'model.text_model.layers.8.self_attn.v_proj.weight', 'model.text_model.layers.9.input_layernorm.weight', 'model.text_model.layers.9.mlp.down_proj.weight', 'model.text_model.layers.9.mlp.gate_proj.weight', 'model.text_model.layers.9.mlp.up_proj.weight', 'model.text_model.layers.9.post_attention_layernorm.weight', 'model.text_model.layers.9.self_attn.k_proj.weight', 'model.text_model.layers.9.self_attn.o_proj.weight', 'model.text_model.layers.9.self_attn.q_proj.weight', 'model.text_model.layers.9.self_attn.v_proj.weight', 'model.text_model.norm.weight', 'model.vision_model.embeddings.patch_embedding.bias', 'model.vision_model.embeddings.patch_embedding.weight', 'model.vision_model.embeddings.position_embedding.weight', 'model.vision_model.encoder.layers.0.layer_norm1.bias', 'model.vision_model.encoder.layers.0.layer_norm1.weight', 'model.vision_model.encoder.layers.0.layer_norm2.bias', 'model.vision_model.encoder.layers.0.layer_norm2.weight', 'model.vision_model.encoder.layers.0.mlp.fc1.bias', 'model.vision_model.encoder.layers.0.mlp.fc1.weight', 'model.vision_model.encoder.layers.0.mlp.fc2.bias', 'model.vision_model.encoder.layers.0.mlp.fc2.weight', 'model.vision_model.encoder.layers.0.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.0.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.0.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.0.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.0.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.0.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.0.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.0.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.1.layer_norm1.bias', 'model.vision_model.encoder.layers.1.layer_norm1.weight', 'model.vision_model.encoder.layers.1.layer_norm2.bias', 'model.vision_model.encoder.layers.1.layer_norm2.weight', 'model.vision_model.encoder.layers.1.mlp.fc1.bias', 'model.vision_model.encoder.layers.1.mlp.fc1.weight', 'model.vision_model.encoder.layers.1.mlp.fc2.bias', 'model.vision_model.encoder.layers.1.mlp.fc2.weight', 'model.vision_model.encoder.layers.1.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.1.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.1.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.1.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.1.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.1.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.1.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.1.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.10.layer_norm1.bias', 'model.vision_model.encoder.layers.10.layer_norm1.weight', 'model.vision_model.encoder.layers.10.layer_norm2.bias', 'model.vision_model.encoder.layers.10.layer_norm2.weight', 'model.vision_model.encoder.layers.10.mlp.fc1.bias', 'model.vision_model.encoder.layers.10.mlp.fc1.weight', 'model.vision_model.encoder.layers.10.mlp.fc2.bias', 'model.vision_model.encoder.layers.10.mlp.fc2.weight', 'model.vision_model.encoder.layers.10.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.10.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.10.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.10.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.10.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.10.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.10.self_attn.v_proj.bias', 
'model.vision_model.encoder.layers.10.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.11.layer_norm1.bias', 'model.vision_model.encoder.layers.11.layer_norm1.weight', 'model.vision_model.encoder.layers.11.layer_norm2.bias', 'model.vision_model.encoder.layers.11.layer_norm2.weight', 'model.vision_model.encoder.layers.11.mlp.fc1.bias', 'model.vision_model.encoder.layers.11.mlp.fc1.weight', 'model.vision_model.encoder.layers.11.mlp.fc2.bias', 'model.vision_model.encoder.layers.11.mlp.fc2.weight', 'model.vision_model.encoder.layers.11.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.11.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.11.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.11.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.11.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.11.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.11.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.11.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.2.layer_norm1.bias', 'model.vision_model.encoder.layers.2.layer_norm1.weight', 'model.vision_model.encoder.layers.2.layer_norm2.bias', 'model.vision_model.encoder.layers.2.layer_norm2.weight', 'model.vision_model.encoder.layers.2.mlp.fc1.bias', 'model.vision_model.encoder.layers.2.mlp.fc1.weight', 'model.vision_model.encoder.layers.2.mlp.fc2.bias', 'model.vision_model.encoder.layers.2.mlp.fc2.weight', 'model.vision_model.encoder.layers.2.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.2.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.2.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.2.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.2.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.2.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.2.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.2.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.3.layer_norm1.bias', 'model.vision_model.encoder.layers.3.layer_norm1.weight', 'model.vision_model.encoder.layers.3.layer_norm2.bias', 'model.vision_model.encoder.layers.3.layer_norm2.weight', 'model.vision_model.encoder.layers.3.mlp.fc1.bias', 'model.vision_model.encoder.layers.3.mlp.fc1.weight', 'model.vision_model.encoder.layers.3.mlp.fc2.bias', 'model.vision_model.encoder.layers.3.mlp.fc2.weight', 'model.vision_model.encoder.layers.3.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.3.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.3.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.3.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.3.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.3.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.3.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.3.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.4.layer_norm1.bias', 'model.vision_model.encoder.layers.4.layer_norm1.weight', 'model.vision_model.encoder.layers.4.layer_norm2.bias', 'model.vision_model.encoder.layers.4.layer_norm2.weight', 'model.vision_model.encoder.layers.4.mlp.fc1.bias', 'model.vision_model.encoder.layers.4.mlp.fc1.weight', 'model.vision_model.encoder.layers.4.mlp.fc2.bias', 'model.vision_model.encoder.layers.4.mlp.fc2.weight', 'model.vision_model.encoder.layers.4.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.4.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.4.self_attn.out_proj.bias', 
'model.vision_model.encoder.layers.4.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.4.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.4.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.4.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.4.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.5.layer_norm1.bias', 'model.vision_model.encoder.layers.5.layer_norm1.weight', 'model.vision_model.encoder.layers.5.layer_norm2.bias', 'model.vision_model.encoder.layers.5.layer_norm2.weight', 'model.vision_model.encoder.layers.5.mlp.fc1.bias', 'model.vision_model.encoder.layers.5.mlp.fc1.weight', 'model.vision_model.encoder.layers.5.mlp.fc2.bias', 'model.vision_model.encoder.layers.5.mlp.fc2.weight', 'model.vision_model.encoder.layers.5.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.5.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.5.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.5.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.5.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.5.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.5.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.5.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.6.layer_norm1.bias', 'model.vision_model.encoder.layers.6.layer_norm1.weight', 'model.vision_model.encoder.layers.6.layer_norm2.bias', 'model.vision_model.encoder.layers.6.layer_norm2.weight', 'model.vision_model.encoder.layers.6.mlp.fc1.bias', 'model.vision_model.encoder.layers.6.mlp.fc1.weight', 'model.vision_model.encoder.layers.6.mlp.fc2.bias', 'model.vision_model.encoder.layers.6.mlp.fc2.weight', 'model.vision_model.encoder.layers.6.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.6.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.6.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.6.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.6.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.6.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.6.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.6.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.7.layer_norm1.bias', 'model.vision_model.encoder.layers.7.layer_norm1.weight', 'model.vision_model.encoder.layers.7.layer_norm2.bias', 'model.vision_model.encoder.layers.7.layer_norm2.weight', 'model.vision_model.encoder.layers.7.mlp.fc1.bias', 'model.vision_model.encoder.layers.7.mlp.fc1.weight', 'model.vision_model.encoder.layers.7.mlp.fc2.bias', 'model.vision_model.encoder.layers.7.mlp.fc2.weight', 'model.vision_model.encoder.layers.7.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.7.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.7.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.7.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.7.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.7.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.7.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.7.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.8.layer_norm1.bias', 'model.vision_model.encoder.layers.8.layer_norm1.weight', 'model.vision_model.encoder.layers.8.layer_norm2.bias', 'model.vision_model.encoder.layers.8.layer_norm2.weight', 'model.vision_model.encoder.layers.8.mlp.fc1.bias', 'model.vision_model.encoder.layers.8.mlp.fc1.weight', 'model.vision_model.encoder.layers.8.mlp.fc2.bias', 
'model.vision_model.encoder.layers.8.mlp.fc2.weight', 'model.vision_model.encoder.layers.8.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.8.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.8.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.8.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.8.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.8.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.8.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.8.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.9.layer_norm1.bias', 'model.vision_model.encoder.layers.9.layer_norm1.weight', 'model.vision_model.encoder.layers.9.layer_norm2.bias', 'model.vision_model.encoder.layers.9.layer_norm2.weight', 'model.vision_model.encoder.layers.9.mlp.fc1.bias', 'model.vision_model.encoder.layers.9.mlp.fc1.weight', 'model.vision_model.encoder.layers.9.mlp.fc2.bias', 'model.vision_model.encoder.layers.9.mlp.fc2.weight', 'model.vision_model.encoder.layers.9.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.9.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.9.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.9.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.9.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.9.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.9.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.9.self_attn.v_proj.weight', 'model.vision_model.post_layernorm.bias', 'model.vision_model.post_layernorm.weight']
- This IS expected if you are initializing LlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /home/imad-saddik/.cache/huggingface/hub/models--ibm-granite--granite-docling-258M-text-only/snapshots/982fe3b40f2fa73c365bdb1bcacf6c81b7184bfe/ and are newly initialized: ['lm_head.weight', 'model.embed_tokens.weight', … (the dump continues with the input_layernorm, post_attention_layernorm, mlp down/gate/up projection, and self_attn k/o/q/v projection weights for model.layers.0 through model.layers.29; truncated here for readability) …, 'model.norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
==> Done loading model: 1.447446s
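For reference, a minimal load-and-time sketch that would produce a warning and timing line like the ones above could look as follows. The repo id is only inferred from the cache path in the warning, the model class from the warning text, and the device placement and timing print are illustrative, so treat all of these as assumptions rather than the poster's exact code:

import time
from transformers import AutoTokenizer, LlamaForCausalLM

# Assumption: repo id inferred from the cache path shown in the warning above.
model_id = "ibm-granite/granite-docling-258M-text-only"

start = time.perf_counter()
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Per the warning, this checkpoint's keys do not line up with LlamaForCausalLM:
# the vision_model weights are skipped and the decoder weights are re-initialized.
model = LlamaForCausalLM.from_pretrained(model_id).to("cuda")
print(f"==> Done loading model: {time.perf_counter() - start:.6f}s")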
After that, I ran generation on the GPU and it finished in 5.5 seconds.
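To time the generation step itself, a wrapper along the following lines should work (continuing from the loading sketch above, with model and tokenizer already on the GPU; the prompt and max_new_tokens are placeholders, not values from the original post):

import torch

# Hypothetical prompt; the original post does not say what was fed to the model.
prompt = "Convert this page to docling."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

start = time.perf_counter()
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(f"Generation finished in {time.perf_counter() - start:.1f}s")
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Note that, per the warning above, the decoder weights were newly initialized, so a run like this mainly measures wall-clock generation latency; the decoded text itself is unlikely to be meaningful.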