
4M: Massively Multimodal Masked Modeling

A framework for training any-to-any multimodal foundation models.
Scalable. Open-sourced. Across tens of modalities and tasks.


Official implementation and pre-trained models for:

4M: Massively Multimodal Masked Modeling, NeurIPS 2023 (Spotlight)
David Mizrahi*, Roman Bachmann*, Oğuzhan Fatih Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, Amir Zamir

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities, arXiv 2024
Roman Bachmann*, Oğuzhan Fatih Kar*, David Mizrahi*, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir

4M is a framework for training "any-to-any" foundation models, using tokenization and masking to scale to many diverse modalities. Models trained using 4M can perform a wide range of vision tasks, transfer well to unseen tasks and modalities, and are flexible and steerable multimodal generative models. We are releasing code and models for "4M: Massively Multimodal Masked Modeling" (here denoted 4M-7), as well as "4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities" (here denoted 4M-21).
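To make the "tokenization and masking" idea concrete, here is a toy sketch in plain Python. It assumes every modality has already been turned into a list of discrete token ids, then draws one random subset of tokens to serve as visible inputs and a second, disjoint subset to serve as prediction targets. The function name `sample_masked_example`, the uniform sampling, and the example token values are all illustrative assumptions; this is not the sampling scheme implemented in the 4M codebase.

```python
import random


def sample_masked_example(tokens_by_modality, num_inputs, num_targets, seed=None):
    """Toy multimodal masking: split tokens into visible inputs and targets.

    tokens_by_modality maps a modality name to its list of token ids.
    Returns two disjoint lists of (modality, position, token) tuples.
    NOTE: uniform sampling is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    # Flatten all modalities into one pool, remembering where each token came from.
    pool = [
        (modality, position, token)
        for modality, tokens in tokens_by_modality.items()
        for position, token in enumerate(tokens)
    ]
    if num_inputs + num_targets > len(pool):
        raise ValueError("token budgets exceed the number of available tokens")
    # Sample without replacement so inputs and targets never overlap.
    picked = rng.sample(pool, num_inputs + num_targets)
    return picked[:num_inputs], picked[num_inputs:]


# Hypothetical token ids for three modalities.
example = {"rgb": [11, 12, 13, 14], "depth": [21, 22, 23], "caption": [31, 32]}
inputs, targets = sample_masked_example(example, num_inputs=4, num_targets=3, seed=0)
```

A model trained this way sees only `inputs` and is asked to predict `targets`, so any modality can end up on either side of the split.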

Installation

For install instructions, please see https://github.com/apple/ml-4m.

Usage

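This model can be loaded from the Hugging Face Hub as follows:

# FM is provided by the fourm package (see Installation above)
from fourm.models.fm import FM
# Downloads the checkpoint from the Hub on first use
fm = FM.from_pretrained('EPFL-VILAB/4M-7-SR_L_CC12M')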

Please see https://github.com/apple/ml-4m/blob/main/README_GENERATION.md for more detailed instructions and https://github.com/apple/ml-4m for other 4M model and tokenizer checkpoints.
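As a quick sanity check after loading, it can help to switch the model to evaluation mode and count its parameters. The sketch below assumes that the object returned by `FM.from_pretrained` behaves like a standard `torch.nn.Module` (exposing `.eval()` and `.parameters()`). The helper names `load_4m` and `count_parameters` are hypothetical and not part of the fourm package.

```python
def count_parameters(model) -> int:
    """Sum the element counts of all parameters of a torch.nn.Module-like object."""
    return sum(p.numel() for p in model.parameters())


def load_4m(model_id: str = "EPFL-VILAB/4M-7-SR_L_CC12M"):
    """Load a 4M checkpoint from the Hugging Face Hub in evaluation mode.

    Assumes the fourm package is installed and that FM is a torch.nn.Module.
    """
    # Imported lazily so this file can be imported without fourm installed.
    from fourm.models.fm import FM

    model = FM.from_pretrained(model_id)
    # eval() disables training-only behavior such as dropout.
    return model.eval()


# Example usage (requires fourm and network access):
#   fm = load_4m()
#   print(f"{count_parameters(fm) / 1e6:.0f}M parameters")
```

Keeping the `fourm` import inside `load_4m` means `count_parameters` can be reused with any PyTorch module, independently of 4M.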

Citation

If you find this repository helpful, please consider citing our work:

@inproceedings{4m,
    title={{4M}: Massively Multimodal Masked Modeling},
    author={David Mizrahi and Roman Bachmann and O{\u{g}}uzhan Fatih Kar and Teresa Yeo and Mingfei Gao and Afshin Dehghan and Amir Zamir},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
}

@article{4m21,
    title={{4M-21}: An Any-to-Any Vision Model for Tens of Tasks and Modalities},
    author={Roman Bachmann and O{\u{g}}uzhan Fatih Kar and David Mizrahi and Ali Garjani and Mingfei Gao and David Griffiths and Jiaming Hu and Afshin Dehghan and Amir Zamir},
    journal={arXiv 2024},
    year={2024},
}

License

The model weights in this repository are released under the Sample Code license as found in the LICENSE file.

Model size: 965M params (Safetensors, tensor type F32)