boxin-wbx committed
Commit 4e2ec6f · verified · 1 parent: b925209

Update README.md

Files changed (1): README.md (+21 −3)
@@ -22,8 +22,13 @@ library_name: transformers
 ## Description
 This family of models performs vision-language and text-only tasks including optical character recognition, multimodal reasoning, localization, common sense reasoning, world knowledge utilization, and coding.
 
+This model is ready for non-commercial use.
+
 ## License/Terms of Use
-[Creative Commons Attribution: Non-Commercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0) <br>
+
+Governing Terms: Deed - [Attribution-NonCommercial 4.0 International - Creative Commons](https://creativecommons.org/licenses/by-nc/4.0/deed.en).
+
+Additional Information: [LICENSE · Qwen/Qwen2-72B-Instruct at main](https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE) for Qwen2-72B-Instruct and [The MIT License – Open Source Initiative](https://opensource.org/license/mit) for InternViT-6B-448px-V1-2.
 
 # Model Details
 
@@ -84,7 +89,20 @@ Results (as of September 17th, 2024) in the multimodal benchmarks are as follows
 
 ## Model Architectures
 
-**Network Architecture:** Decoder-Only Transformer
+**Network Architecture:** Decoder-Only Transformer
+
+**Text-only LLM backbone:** [Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct)
+
+**Vision encoder:** [InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)
+
+### Robustness
+
+The model trained on this dataset cannot regenerate its training data:
+
+1. The model has no image generation capability since its output is only text. Hence it cannot regenerate any image it would have seen during training.
+
+2. The model cannot regenerate training text data: during training, the model takes text and images as inputs, and the model output (text) is conditioned on both inputs. During inference, without training images as input, the model would not be able to reproduce any part of the training text data.
+
 
 ### Input
 **Input Type(s):** Text, Image <br>
@@ -411,7 +429,7 @@ Wenliang Dai* (wdai@nvidia.com), Nayeon Lee* (nayeonl@nvidia.com), Boxin Wang* (
 
 
 ## Ethical Considerations
-NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 
 Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
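
The card's unchanged Input section lists Text and Image as the input types. As a hypothetical sketch only (the key names follow the common Hugging Face multimodal chat convention and are assumptions, not taken from this card; the model's actual processor may expect a different schema), a combined text+image prompt could be structured like this:

```python
# Hypothetical sketch of a text+image user turn in the common Hugging Face
# multimodal chat format. The "type"/"image"/"text" keys are conventions
# assumed here, not documented behavior of this specific model.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/menu.png"},  # placeholder URL
            {"type": "text", "text": "What dishes are listed on this menu?"},
        ],
    }
]

# Since the model's output is text only, a reply would be appended as a
# text-only assistant turn:
messages.append(
    {"role": "assistant", "content": [{"type": "text", "text": "..."}]}
)
```

This mirrors the card's point that generation is conditioned on both text and image inputs while the output modality is text alone.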