A Touch, Vision, and Language Dataset for Multimodal Alignment

by Max (Letian) Fu, Gaurav Datta*, Huang Huang*, William Chung-Ho Panitch*, Jaimyn Drake*, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg at UC Berkeley, Meta AI, TU Dresden, and CeTI (*equal contribution).

[Paper] | [Project Page] | [Checkpoints] | [Dataset] | [Citation]

This repo contains the official checkpoints for A Touch, Vision, and Language Dataset for Multimodal Alignment.

The tactile encoder comes in three sizes: ViT-Tiny, ViT-Small, and ViT-Base, all of which are stored in

ckpt/tvl_enc

The TVL-LLaMA models, their generative counterparts, are stored in

ckpt/tvl_llama
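
A minimal loading sketch (assuming the checkpoints are standard PyTorch files; the filename below is hypothetical, so check the ckpt directories for the actual names):

import torch

# Hypothetical filename -- see ckpt/tvl_enc for the actual checkpoint files.
checkpoint = torch.load("ckpt/tvl_enc/tvl_vit_small.pth", map_location="cpu")
# If the file stores a plain state dict, its keys name the encoder weights.
print(list(checkpoint.keys()))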

Inference

Zero-shot classification requires OpenCLIP with the following configuration:

CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"
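
As a rough sketch (the prompts and variable names below are illustrative, not taken from the TVL codebase), this OpenCLIP model can be loaded and used to embed candidate text labels as follows:

import open_clip
import torch

CLIP_VISION_MODEL = "ViT-L-14"
CLIP_PRETRAIN_DATA = "datacomp_xl_s13b_b90k"

# Load the CLIP backbone with its matching preprocessing and tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms(
    CLIP_VISION_MODEL, pretrained=CLIP_PRETRAIN_DATA
)
tokenizer = open_clip.get_tokenizer(CLIP_VISION_MODEL)

# Embed a few example tactile descriptions and normalize them for
# cosine-similarity scoring against tactile/vision embeddings.
with torch.no_grad():
    text = tokenizer(["smooth glass", "rough fabric", "soft sponge"])
    text_features = model.encode_text(text)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)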

For TVL-LLaMA, please request access to the pre-trained LLaMA-2 from this form. In particular, we use llama-2-7b as the base model. The weights here contain the trained adapter, the tactile encoder, and the vision encoder for ease of loading.

For complete details, please see the GitHub repo for instructions on pretraining, fine-tuning, and evaluation with these models.

Citation

Please give us a star 🌟 on GitHub to support us!

Please cite us if you find our work inspiring or use our code in your research:

@article{fu2024tvl,
    title={A Touch, Vision, and Language Dataset for Multimodal Alignment}, 
    author={Letian Fu and Gaurav Datta and Huang Huang and William Chung-Ho Panitch and Jaimyn Drake and Joseph Ortiz and Mustafa Mukadam and Mike Lambeta and Roberto Calandra and Ken Goldberg},
    journal={arXiv preprint arXiv:2402.13232},
    year={2024}
}