
Are there any differences between the Full-model and TE-only versions?

#3
by easygoing0114 - opened

Hello, thank you for sharing this amazing CLIP!

I compared the Full-model and TE-only model in Flux.1, and I noticed subtle differences in the illustrations.

ViT-L-14-GmP-SAE-FULL-model_vs_ViT-L-14-GmP-SAE-TE-only.png

I understand that the TE-only model is extracted from the Full-model, containing only the text encoder part. Are there any other differences?

Or does the part other than the text encoder also contribute to image generation?


For reference, I always use the LongCLIP-SAE-ViT-L-14-FP32 model.

I featured this model in my blog:
Is Your Illustration High-Quality? Comparing T5xxl and CLIP-L with Real Data! | AI Image Journey

Thank you!

That's very interesting, thank you for sharing these results! And it is quite unexpected, to be honest. Both 'ViT-L-14-GmP-SAE-FULL-model.safetensors' and 'ViT-L-14-GmP-SAE-TE-only.safetensors' are partially converted to half precision (as per OpenAI's original CLIP code), so there should be no difference. What are you using to generate these? The only thing I can imagine is that the code somehow handles a full model differently than a TE-only model, i.e. in how it extracts the TE from the full model.

Technically, you could use different outputs from the Text Encoder; SDXL famously switched from using the final layer output to the penultimate layer output. After the final layer, you can also apply a final layer norm, or not, or use the projection. This led to some issues because some CLIP models for specific diffusion models simply lack the layers they don't need for that particular guidance, which then makes them incompatible with other diffusion models, see here for an example.

However, for this very reason, even my "TE-only" models contain absolutely everything: the full Text Encoder up to and including the projection. So there should be no difference, as whatever you are using to generate the images should find the same "extraction point" for embeddings and shouldn't have to fall back to another one due to something missing from "TE-only".

Judging from your (very nice! 👍) blog post, you're using ComfyUI or Forge; I'd open an issue on GitHub and ask about it there (feel free to include my response).

I'd be curious if there's also a difference to the pickle files, which are not converted to HuggingFace format (as the model.safetensors is), as I've seen plenty of research model repos shared with the statement that "safetensors will produce slightly different results than shown in the paper; to reproduce the results in the paper, please use the PyTorch model". Now, I don't expect you to download my pickles (after all, I am the only person to truly know that they are safe; you, on the other hand, can't be sure about that!); but if you did this in ComfyUI AND you'd be willing to share your full workflow, I'd be curious to try it myself!
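One way to rule out the files themselves is to compare the stored tensors byte for byte. The sketch below is only an illustration, not something either poster ran: it reads the safetensors layout directly (an 8-byte little-endian header length, a JSON header, then the raw tensor bytes) using the Python standard library, and the function names `read_safetensors` and `compare_checkpoints` are made up for this example. It assumes both files use the same tensor names; if one file follows the OpenAI naming and the other the Hugging Face naming, the keys would first have to be mapped onto each other.

```python
import json
import struct


def read_safetensors(path):
    """Return {tensor_name: (dtype, shape, raw_bytes)} for a .safetensors file.

    Note: this loads the whole file into memory, which is fine for a
    CLIP-L sized checkpoint but not for very large models.
    """
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
        # Everything after the header is tensor data; offsets are relative to it.
        data = f.read()

    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        begin, end = info["data_offsets"]
        tensors[name] = (info["dtype"], tuple(info["shape"]), data[begin:end])
    return tensors


def compare_checkpoints(path_a, path_b):
    """Report which tensors are missing, identical, or different between two files."""
    a = read_safetensors(path_a)
    b = read_safetensors(path_b)
    shared = sorted(set(a) & set(b))
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        # Tuples compare dtype, shape, and raw bytes in one go.
        "identical": [k for k in shared if a[k] == b[k]],
        "differing": [k for k in shared if a[k] != b[k]],
    }
```

Calling `compare_checkpoints("ViT-L-14-GmP-SAE-FULL-model.safetensors", "ViT-L-14-GmP-SAE-TE-only.safetensors")` would be expected to list the vision tensors under `only_in_a`. If `differing` comes back empty, the text encoder weights are bit-identical and any difference in the images has to come from how the loader treats the two files.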

Thank You for Your Reply

Thank you for your reply. I’m using ComfyUI.

First, I’ve attached the workflow I used for this comparison.

Flux1_Clip_compare_2025.1.12.png

This PNG image includes the ComfyUI workflow.

Since I’m using the T5xxl in the Flan-T5xxl-FP32 format, I’ve configured ComfyUI to start with the --fp32-text-enc option.

With this option ComfyUI also processes the CLIP-SAE-ViT-L-14 model in FP32, but since its weights have already been rounded to FP16, loading them as FP32 shouldn't change the results.
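The reasoning here is that every FP16 value is exactly representable in FP32, so widening weights that were already rounded to FP16 cannot change them. A minimal sketch of that, using only Python's `struct` module and a made-up weight value (the helper names are hypothetical too):

```python
import struct


def round_to_fp16(x):
    """Round a Python float to IEEE 754 half precision and widen it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_fp32(x):
    """Round a Python float to IEEE 754 single precision and widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


original = 0.1234567                     # hypothetical weight, not from the model
stored = round_to_fp16(original)         # the one-time rounding to FP16
loaded_as_fp32 = round_to_fp32(stored)   # what an FP32 loader would see

# Widening FP16 -> FP32 is lossless: the loaded value equals the stored one.
widening_is_lossless = loaded_as_fp32 == stored
# The earlier rounding to FP16 is where precision was actually lost.
rounding_lost_precision = stored != round_to_fp32(original)
```

This only covers the stored weights. The intermediate activations are still computed in whatever precision the runtime uses, so the FP32 and FP16 text encoder settings can give slightly different embeddings even with identical weights.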


Flux.1[dev]

The phenomenon might be specific to the anime model I’m using, blue_pencil-flux1_v0.0.1-FP16. To check, I also compared the results with the original Flux.1[dev] model.

The workflow and conditions, except for the model, were exactly the same as before, including the prompts.

ViT-L-14-GmP-SAE-FULL-model_vs_TE-only_dev.png

The results were still different.


Flux.1[schnell]

Additionally, I conducted a comparison using Flux.1[schnell].

ViT-L-14-GmP-SAE-FULL-model_vs_TE-only_shnell.png

Once again, the results were different. Upon closer inspection,

ViT-L-14-GmP-SAE-FULL-model_vs_TE-only_shnell_close-up.png

the details generated with the ViT-L-14-GmP-SAE-FULL-model seem more refined, and the overall quality appears higher!


Original CLIP-L

Finally, I wondered whether the same phenomenon occurs with the original CLIP-L.

openai/clip-vit-large-patch14/model.safetensors(FP32)
comfyanonymous/flux_text_encoders/clip_l.safetensors(FP16)

original-CLIP-L_shnell_workflow.png

For this test, I configured ComfyUI to start with the --fp16-text-enc option so both are processed in FP16 format.

original-CLIP-L_dev.png

Here, the results were completely identical!


Observations

Your CLIP may separate the text encoder differently from how ComfyUI separates it.

Also, I’m just a hobbyist image generation user, not a technical expert, so it’s possible I’m misunderstanding something. However, if your CLIP creates some form of crosstalk between the text encoder and the vision transformer, I find that concept quite fascinating.

Thank you again for sharing this incredible CLIP, and also for providing it in FP32 format!

Additional Comments

I asked ChatGPT about this.

Is Crosstalk Between the Text Encoder and Vision Transformer in CLIP Theoretically Possible?

CLIP’s design separates the Text Encoder and Vision Transformer as independent modules. However, there are theoretical possibilities for "crosstalk" between them due to the following reasons:


1. Shared Embedding Space

  • CLIP maps both text and images into a shared embedding space (latent space).
  • Since this space is shared, optimization of one module during training could indirectly influence the other.

2. Gradient Propagation During Training

  • While the Text Encoder and Vision Transformer process separate data, the contrastive loss function aligns their outputs in the shared embedding space.
  • This process might induce a form of interaction (crosstalk) between the two modules.

3. Effects of Layer Regularization

  • Some implementations of CLIP employ layer normalization or weight sharing across modules.
  • These mechanisms could lead to unintended interactions between the Text Encoder and Vision Transformer.

4. Customization During Fine-Tuning

  • During fine-tuning for specific tasks, the Text Encoder and Vision Transformer may exhibit behaviors resembling crosstalk.
  • This is especially true when adapting to specific datasets or optimization objectives.

Conclusion

While CLIP’s architecture does not explicitly design for crosstalk, indirect interactions can theoretically occur through the shared embedding space and training dynamics. Customized CLIP models or task-specific fine-tuning could make such effects more noticeable.
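For readers unfamiliar with the contrastive loss mentioned in point 2, here is a rough pure-Python sketch of a symmetric CLIP-style objective. It is a simplified illustration with toy embeddings, not the training code of this model, and `logit_scale=100.0` is just an illustrative default. The point it shows is that the two encoders are coupled only through this loss during training; at inference time, the text embedding is computed from the text encoder's weights alone.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def clip_contrastive_loss(image_embs, text_embs, logit_scale=100.0):
    """Symmetric contrastive loss over a batch of matching image/text pairs.

    Row k of image_embs is assumed to belong with row k of text_embs.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Scaled cosine similarity between every image and every text.
    logits = [[logit_scale * sum(a * b for a, b in zip(i, t)) for t in txts]
              for i in imgs]
    # Image -> text: each image should pick out its own caption.
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text -> image: each caption should pick out its own image.
    loss_texts = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_images + loss_texts) / 2
```

With well-aligned pairs the loss is close to zero, and it grows when captions are matched to the wrong images. Gradients of this loss flow into both encoders during training, which is the indirect interaction the answer above refers to.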
