---
license: mit
datasets:
- SPRIGHT-T2I/spright_coco
---
|
## Update 11/AUG/2024:

New Best-Performing CLIP ViT-L/14 'GmP-smooth' model added (simply download the files named *BEST*!):

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/qb5hYNxSTMB5z7rSs7N9k.png)

Or just create a fine-tune yourself: [https://github.com/zer0int/CLIP-fine-tune](https://github.com/zer0int/CLIP-fine-tune)
|
How?

- Geometric Parametrization (GmP) (same as before)
- Activation value manipulation for the 'adverb neuron' (same as before)
- NEW: Custom loss function with label smoothing! (sketched below this list)
- For in-depth details, see my GitHub. 🤗
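To illustrate that last ingredient: below is a minimal sketch of a label-smoothed CLIP contrastive loss, using PyTorch's built-in `label_smoothing` argument (available since PyTorch 1.10). The function and parameter names here are mine for illustration, not the repo's exact loss - see the GitHub for that:

```python
import torch
import torch.nn.functional as F

def clip_loss_smoothed(image_features, text_features, logit_scale, smoothing=0.1):
    # Normalize embeddings so the dot product is a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits, scaled by CLIP's learned temperature
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Matching image-text pairs sit on the diagonal
    targets = torch.arange(image_features.size(0), device=image_features.device)

    # label_smoothing softens the one-hot targets in both directions
    loss_i = F.cross_entropy(logits_per_image, targets, label_smoothing=smoothing)
    loss_t = F.cross_entropy(logits_per_text, targets, label_smoothing=smoothing)
    return (loss_i + loss_t) / 2
```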
|
----
|
## A fine-tune of OpenAI / CLIP ViT-L/14 with an unprecedented ImageNet/ObjectNet accuracy of ~0.90 (original pre-trained model / OpenAI's CLIP: ~0.85)**

Made possible with Geometric Parametrization (GmP):
|
```
"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
```
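To make the diagram above concrete, here is a minimal PyTorch sketch of how such a GeometricLinear layer could work. This is an illustration of the decomposition only, not the repo's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Sketch: wraps a pre-trained nn.Linear, learning magnitude and direction separately."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data  # shape: (out_features, in_features)
        # radial component 'r': per-row norm of the pre-trained weight
        self.r = nn.Parameter(w.norm(dim=1, keepdim=True))
        # angular component 'theta': the normalized direction
        self.theta = nn.Parameter(F.normalize(w, dim=1))
        self.bias = nn.Parameter(linear.bias.data.clone()) if linear.bias is not None else None

    def forward(self, x):
        # Recompose the weight as magnitude * (re-normalized) direction.
        # At initialization this reproduces the pre-trained weight exactly,
        # which is why directionality and magnitude are preserved.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```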
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/mqIgsH_aWKop_DDQ2KglN.png)
|
✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning - thus, it can be used in the same manner as any state_dict, e.g. for use with ComfyUI as the SDXL / SD3 Text Encoder! 🤗
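Converting back is just recomposing weight = r * normalized(theta) for every decomposed layer. A sketch of such a state_dict-level conversion, assuming the .r/.theta key names shown above (the loop and filenames are illustrative, not the repo's conversion script):

```python
import torch
import torch.nn.functional as F

# illustrative filenames - not the actual files in this repo
state_dict = torch.load("gmp_finetuned.pt", map_location="cpu")
converted = {}
for key, value in state_dict.items():
    if key.endswith(".r"):
        base = key[: -len(".r")]
        # weight = magnitude * normalized direction
        converted[base + ".weight"] = value * F.normalize(state_dict[base + ".theta"], dim=1)
    elif key.endswith(".theta"):
        continue  # consumed together with its '.r' partner
    else:
        converted[key] = value  # biases and all other tensors pass through unchanged
torch.save(converted, "converted_state_dict.pt")
```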
|
- ** For details on training, those numbers, and the eval, please see [https://github.com/zer0int/CLIP-fine-tune](https://github.com/zer0int/CLIP-fine-tune)
  - You can use "exp-acts-ft-finetune-OpenAI-CLIP-ViT-L-14-GmP-manipulate-neurons.py" to replicate my exact model fine-tune.
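To use the shared state_dict outside of ComfyUI, e.g. with OpenAI's CLIP package, loading could look like this. A sketch only - the filename is a placeholder for whichever *BEST* file you downloaded:

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# placeholder filename - substitute the *BEST* file from this repo
state_dict = torch.load("BEST-GmP-smooth.pt", map_location=device)
model.load_state_dict(state_dict)
model.eval()
```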
|
Pre-trained CLIP model by OpenAI, License: [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE)