How to run a text-image pair inference demo?
Using the official OpenAI text model, the text embedding dimension is 768, which does not match the 1280-dimensional image embeddings produced by LLM2CLIP.
from transformers import CLIPModel, CLIPTokenizer

device = "cuda"
texts = ["a photo of a cat"]  # example query
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14-336")
text_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336").to(device)
inputs = tokenizer(text=texts, padding=True, return_tensors="pt").to(device)
text_features = text_model.get_text_features(**inputs)  # shape: [1, 768]
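For comparison, here is a minimal sketch of how I read out the LLM2CLIP image embeddings; the repo id and the trust_remote_code get_image_features call are my assumptions, so the exact usage may differ:

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Assumption: the released LLM2CLIP vision checkpoint exposes get_image_features()
# through its trust_remote_code custom model class.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
clip_model = AutoModel.from_pretrained(
    "microsoft/LLM2CLIP-Openai-L-14-336",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device).eval()

pixels = processor(images=Image.open("demo.jpg"), return_tensors="pt").pixel_values.to(device)
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    image_features = clip_model.get_image_features(pixels)

print(image_features.shape)  # [1, 1280], vs. the [1, 768] text features above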
Hello, we will upload our text model to Hugging Face within the next couple of days and aim to release all the parameters of the text models, adapters, and related components. We previously ran into some delays due to precision issues during the Hugging Face conversion process, but those have been resolved, and we will soon upload all the parameters you might need. We welcome your suggestions and requests and will do our best to update the released versions accordingly, making it more convenient for everyone to conduct research.
@WinstonDeng We have updated the caption-contrastive fine-tuned version, Llama3-8B-CC (https://huggingface.co/microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned), to assist with your retrieval experiments and with training your own CLIP models. Additionally, the parameters for our adapter and projector are now available in our OpenAI ViT-L repository (https://huggingface.co/microsoft/LLM2CLIP-Openai-L-14-336). The retrieval testing procedure is documented in the model card for reference.
Our tests show retrieval performance exceeding the results reported in the paper, and we encourage you to try it out.
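As a rough sketch of that retrieval flow (the llm2vec-based text encoding and the get_image_features / get_text_features calls below are a best-effort approximation; the model card remains the authoritative reference):

import torch
from PIL import Image
from llm2vec import LLM2Vec
from transformers import AutoConfig, AutoModel, AutoTokenizer, CLIPImageProcessor

device = "cuda"

# Vision tower plus adapter/projector (assumption: exposed via trust_remote_code).
clip_model = AutoModel.from_pretrained(
    "microsoft/LLM2CLIP-Openai-L-14-336",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device).eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Caption-contrastive fine-tuned Llama3-8B text encoder, wrapped with LLM2Vec to
# produce pooled sentence embeddings (pooling mode and max_length are assumptions).
llm_name = "microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned"
llm_config = AutoConfig.from_pretrained(llm_name, trust_remote_code=True)
llm_model = AutoModel.from_pretrained(
    llm_name, config=llm_config, torch_dtype=torch.bfloat16, trust_remote_code=True
)
llm_tokenizer = AutoTokenizer.from_pretrained(llm_name)
l2v = LLM2Vec(llm_model, llm_tokenizer, pooling_mode="mean", max_length=512)

captions = ["a diagram", "a dog", "a cat"]
pixels = processor(images=Image.open("demo.jpg"), return_tensors="pt").pixel_values.to(device)

with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    image_features = clip_model.get_image_features(pixels)
    # Raw LLM sentence embeddings; the adapter/projector maps them into the CLIP space.
    llm_embeddings = l2v.encode(captions, convert_to_tensor=True).to(device)
    text_features = clip_model.get_text_features(llm_embeddings)

    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", probs)  # similarity of each caption to the image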
Regarding the EVA series of models, there have been precision mismatches during the conversion to Hugging Face, which are currently being fixed. Updates will be released progressively.
Furthermore, in about a week we will provide detailed instructions on how to use LLM2CLIP to fine-tune your own CLIP models. Please stay tuned!