Update README
- README.md +462 -93
- bias.md +13 -0
- explainability.md +15 -0
- privacy.md +13 -0
- safety.md +10 -0
README.md
CHANGED
@@ -10,65 +10,58 @@ tags:
- VLM
---

- #
- ## Model Overview
- ### Description
- This model was trained on commercial images for all three stages of training and supports single image inference.

### License/Terms of Use
- **Governing Terms:**
- Your use of the model is governed by the [NVIDIA Open License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
- **Additional Information:**
- Backbone LLM: NVIDIA-Nemotron-Nano-12B-v2.

### Deployment Geography:

- ### Use Case:
- Customers: AI foundry enterprise customers
- Use Cases: Image summarization. Text-image analysis, Optical Character Recognition, Interactive Q&A on images, Text Chain-of-Thought reasoning

- ## Release Date:
- Build.Nvidia.com [June 3rd, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-vl-12b-v2)
- - Hugging Face [June 3rd, 2025]

- ## Model Architecture:
- **Network Type:** Transformer

**Network Architecture:**
- Vision Encoder: [C-RADIOv2-H](https://huggingface.co/nvidia/C-RADIOv2-VLM-H)
Language Encoder: NVIDIA-Nemotron-Nano-12B-v2

- Input Type(s): Image, Text
- - Input Images
- - Language Supported: German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese, English
- Input Format(s): Image (Red, Green, Blue (RGB)), and Text (String)

- Other Properties Related to Input:
- Maximum Resolution: Determined by a 12-tile layout constraint, with each tile being 512 × 512 pixels. This supports aspect ratios such as:
- 4 × 3 layout: up to 2048 × 1536 pixels
- 3 × 4 layout: up to 1536 × 2048 pixels

@@ -76,26 +69,41 @@ Other Properties Related to Input:
- 6 × 2 layout: up to 3072 × 1024 pixels
- Other configurations allowed, provided total tiles ≤ 12
- Channel Count: 3 channels (RGB)
- - Alpha Channel: Not supported (no transparency)

- Output Type(s): Text
- Output Formats: String

- ### Software Integration
- Runtime Engine(s): TensorRT-LLM<br>
- Supported Hardware Microarchitecture Compatibility: H100 SXM 80GB<br>
- Supported Operating System(s): Linux<br>
- NVIDIA-Nemotron-Nano-VL-

## Quick Start

@@ -311,7 +319,6 @@ output_text = processor.batch_decode(
print(output_text)
```

#### Inference with vLLM

Make sure to use the main branch of vLLM. Run the following install command:
@@ -337,50 +344,412 @@ vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8 --trust-remote-code --quant
vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD --trust-remote-code --quantization modelopt_fp4 --video-pruning-rate 0
```

- Data
# Inference:
- **Engine:**
- * 1x NVIDIA H100 SXM 80GB

## Ethical Considerations:
- NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
- Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
- Outputs generated by these models may contain political content or other potentially misleading information, issues with content security and safety, or unwanted bias that is independent of our oversight.
# Model Overview

### Description:
The NVIDIA Nemotron Nano v2 12B VL model enables multi-image reasoning and video understanding, along with strong document intelligence, visual Q&A and summarization capabilities.
<br>

This model is ready for commercial use. <br>

### License/Terms of Use
Governing Terms: Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

### Deployment Geography:
Global <br>

### Use Case: <br>
Nemotron Nano 12B v2 VL is a model for multi-modal document intelligence. It is intended for individuals and businesses that need to process documents such as invoices, receipts, and manuals. The model can handle multiple document images (up to four images at a resolution of 1k x 2k each) along with a long text prompt, for tasks such as summarization and visual question answering (VQA). The model is also expected to offer a significant throughput advantage. <br>

### Release Date: <br>
- Build.Nvidia.com [June 3rd, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-vl-12b-v2)
- Hugging Face [June 3rd, 2025] via [nvidia/NVIDIA-Nemotron-Nano-VL-12B-V2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16)

# Model Architecture:
**Architecture Type:**
Transformer <br>
**Network Architecture:**
Vision Encoder: CRadioV2-H
Language Encoder: NVIDIA-Nemotron-Nano-12B-v2
<br>

**Number of model parameters:** 12.6B <br>

## Computational Load
**Cumulative Compute:** 2.2e+22 <br>
**Estimated Energy and Emissions for Model Training:**
Energy Consumption: 7,827.46 kWh <br>
Carbon Emissions: 3.21 tCO2e <br>
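For context, the emission factor implied by the two figures above works out to roughly 0.41 kg CO2e per kWh (an informal back-of-the-envelope check, not an officially reported number):

```python
energy_kwh = 7_827.46      # reported training energy consumption
emissions_tco2e = 3.21     # reported training carbon emissions

# implied emission factor in kg CO2e per kWh
print(round(emissions_tco2e * 1000 / energy_kwh, 2))  # 0.41
```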

## Input: <br>
**Input Type(s):** Image, Video, Text
**Input Format:** Image (png, jpg), Video (MP4, MKV, FLV, 3GP), Text (String) <br>
**Input Parameters:** Image (2D), Video (3D), Text (1D) <br>
**Other Properties Related to Input:**
- Input Images Supported: 4
- Languages Supported: English only <br>
- Input + Output Tokens: 128K
- Minimum Resolution: 32 × 32 pixels
- Maximum Resolution: Determined by a 12-tile layout constraint, with each tile being 512 × 512 pixels. This supports aspect ratios such as:
  - 4 × 3 layout: up to 2048 × 1536 pixels
  - 3 × 4 layout: up to 1536 × 2048 pixels
  - 6 × 2 layout: up to 3072 × 1024 pixels
  - Other configurations are allowed, provided the total number of tiles is ≤ 12
- Channel Count: 3 channels (RGB)
- Alpha Channel: Not supported (no transparency) <br>
- Frames: Videos are sampled at 2 FPS, with a minimum of 8 and a maximum of 128 frames
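As an informal illustration of the resolution and frame constraints above (a sketch only; the model's actual preprocessing may choose tile layouts differently, e.g., by matching aspect ratio):

```python
import math

TILE = 512        # each tile is 512 x 512 pixels
MAX_TILES = 12    # 12-tile layout constraint

def fits_tile_budget(width: int, height: int) -> bool:
    """Rough check: can the image be covered by at most 12 tiles of 512x512?"""
    return math.ceil(width / TILE) * math.ceil(height / TILE) <= MAX_TILES

def sampled_frame_count(duration_s: float, fps: float = 2.0,
                        min_frames: int = 8, max_frames: int = 128) -> int:
    """Frames sampled at 2 FPS, clamped to the documented 8-128 range."""
    return max(min_frames, min(max_frames, int(duration_s * fps)))

print(fits_tile_budget(2048, 1536))   # True  (4 x 3 tiles)
print(fits_tile_budget(3072, 1024))   # True  (6 x 2 tiles)
print(fits_tile_budget(4096, 2048))   # False (8 x 4 = 32 tiles)
print(sampled_frame_count(300.0))     # 128 (capped)
```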

## Output: <br>
**Output Type(s):** Text <br>
**Output Format:** String <br>
**Output Parameters:** 1D <br>
**Other Properties Related to Output:** Input + Output Tokens: 128K <br>

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

## Software Integration:
**Runtime Engine(s):**
* [vLLM] <br>
* [TRT-LLM] <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA L40S <br>
* NVIDIA A100 <br>
* NVIDIA B200 <br>
* NVIDIA H100/H200 <br>
* NVIDIA RTX PRO 6000 Server Edition <br>
* NVIDIA GB200 <br>

**Preferred/Supported Operating System(s):**
* [Linux] <br>

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>

## Model Version(s):
NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/FP8/NVFP4 <br>

## Quick Start
[…]

print(output_text)
```

#### Inference with vLLM

Make sure to use the main branch of vLLM. Run the following install command:

[…]

```
vllm serve nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD --trust-remote-code --quantization modelopt_fp4 --video-pruning-rate 0
```
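Once the server is up, it can be queried through vLLM's OpenAI-compatible API. The snippet below is a minimal sketch, not part of the official quick start: it assumes the server's default address (`http://localhost:8000/v1`), reuses the served model name from the command above, and points at a placeholder image URL.

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on http://localhost:8000/v1 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD",  # must match the served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},  # placeholder
            {"type": "text", "text": "Summarize this document."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```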
|
| 348 |
+
|
| 349 |
+
### Training Datasets:
|
| 350 |
+
|
| 351 |
+
**Data Modalities** <br>
|
| 352 |
+
** Total Size: 39'486'703 samples <br>
|
| 353 |
+
** Total Number of Datasets: 270 <br>
|
| 354 |
+
** Text-only datasets: 33 <br>
|
| 355 |
+
** Text-and-image datasets: 176 <br>
|
| 356 |
+
** Video-and-text datasets: 61 <br>
|
| 357 |
+
** Total size: 27.7 TB <br>
|
| 358 |
+
|
| 359 |
+
** Data modalities: Text, Image, Video <br>
|
| 360 |
+
** Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic <br>
|
| 361 |
+
** Labeling Method by dataset: Hybrid: Automated, Human, Synthetic <br>
|
| 362 |
+
|
| 363 |
+
** Dataset partition: Training [100%], Testing [0%], Validation [0%] <br>
|
| 364 |
+
** Time period for training data collection: 2023-2025 <br>
|
| 365 |
+
** Time period for testing data collection: N/A <br>
|
| 366 |
+
** Time period for validation data collection: N/A <br>
|
| 367 |
+
|
| 368 |
+
The post-training datasets consist of a mix of internal and public datasets designed for training vision language models across various tasks. It includes:
|
| 369 |
+
|
| 370 |
+
* Public datasets sourced from publicly available images and annotations, supporting tasks like classification, captioning, visual question answering, conversation modeling, document analysis and text/image reasoning.
|
| 371 |
+
* Internal text and image datasets built with public commercial images and internal labels, adapted for the same tasks as listed above.
|
| 372 |
+
* Synthetic image datasets generated programmatically for specific tasks like tabular data understanding and optical character recognition (OCR), for English, Chinese as well as other languages.
|
| 373 |
+
* Video datasets supporting video question answering and reasoning tasks from publicly available video sources, with either publicly available or internally generated annotations.
|
| 374 |
+
* Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).
|
| 375 |
+
* NVIDIA-Sourced Synthetic Datasets for text reasoning.
|
| 376 |
+
* Private datasets for safety alignment or VQA on invoices.
|
| 377 |
+
* Crawled or scraped captioning, VQA, and video datasets.
|
| 378 |
+
* Some datasets were improved with Qwen2.5-72B-Instruct annotations
|
| 379 |
+
|
| 380 |
+
For around ~30% of our total training corpus and several of the domains listed above, we used commercially permissive models to perform:
|
| 381 |
+
* Language translation
|
| 382 |
+
* Re-labeling of annotations for text, image and video datasets
|
| 383 |
+
* Synthetic data generation
|
| 384 |
+
* Generating chain-of-thought (CoT) traces
|
| 385 |
+
|
| 386 |
+
Additional processing for several datasets included rule-based QA generation (e.g., with templates), expanding short answers into longer responses, as well as proper reformatting. More details can be found [here](https://arxiv.org/abs/2501.14818).
|
| 387 |
+
|
| 388 |
+
|
| 389 |
+
** Image based datasets were all scanned against known CSAM to make sure no such content was included in training.<br>
|
| 390 |
+
|
| 391 |
+
# Public Datasets <br>
|
| 392 |
+
| Dataset Name | Type | Modalities | Number of Samples | Size |
|
| 393 |
+
|--------------|------|------------|-------------------|------|
|
| 394 |
+
| Captioning on Open Images (subset, relabeled) | VQA | image, text | 1'278'221 | 378.34 GB |
|
| 395 |
+
| Localized Narratives (subset, relabeled) | VQA | image, text | 503'275 | 147.67 GB |
|
| 396 |
+
| TextCaps (subset) | Image Captioning | image, text | 21'953 | 5.76 GB |
|
| 397 |
+
| TextCaps (subset) | Image Captioning | image, text | 109'765 | 28.81 GB |
|
| 398 |
+
| TextVQA (subset) | Image Captioning | image, text | 34'602 | 9.08 GB |
|
| 399 |
+
| RefCoco | Referring Expression Grounding | image, text | 14'694 | 2.39 GB |
|
| 400 |
+
| VQAv2 | VQA | image, text | 28'555 | 4.41 GB |
|
| 401 |
+
| AOKVQA | VQA | image, text | 20'832 | 3.39 GB |
|
| 402 |
+
| GQA | VQA | image, text | 21'433 | 2.94 GB |
|
| 403 |
+
| AOKVQA | VQA | image, text | 16'131 | 2.62 GB |
|
| 404 |
+
| synthdog-en | OCR | image, text | 29'672 | 2.31 GB |
|
| 405 |
+
| WIT | Image Captioning | image, text | 538'916 | 745.24 GB |
|
| 406 |
+
| CLEVR | Image Reasoning | image, text | 70'000 | 12.57 GB |
|
| 407 |
+
| CLEVR-Math | Image Reasoning | image, text | 70'000 | 12.47 GB |
|
| 408 |
+
| OpenAssistant (oasst1, oasst2) | Text Instruction Tuning | text | 47'118 | 0.09 GB |
|
| 409 |
+
| VATEX | Video Captioning | video, text | 2'880 | 5.50 GB |
|
| 410 |
+
| YouCook2 | Video Captioning | video, text | 36 | 0.17 GB |
|
| 411 |
+
| VCG+ 112K | VideoQA | video, text | 164 | 2.82 GB |
|
| 412 |
+
| Video Localized Narratives | Video Captioning | video, text | 373 | 0.64 GB |
|
| 413 |
+
| CLEVRER | VQA | video, text | 40'000 | 46.05 GB |
|
| 414 |
+
| NExT-QA | VideoQA | video, text | 10'368 | 57.06 GB |
|
| 415 |
+
| CLEVRER | Video Reasoning | video, text | 42'620 | 49.10 GB |
|
| 416 |
+
| ScreenQA | VQA | image, text | 302'004 | 30.52 GB |
|
| 417 |
+
| WikiSQL | Image Reasoning | image, text | N/A | N/A |
|
| 418 |
+
| WikiTableQuestions | TextQA | text | N/A | N/A |
|
| 419 |
+
| RenderedText | OCR | image, text | N/A | N/A |
|
| 420 |
+
| FinQA | Text Reasoning | text | N/A | N/A |
|
| 421 |
+
| TAT-QA | Text Reasoning | text | N/A | N/A |
|
| 422 |
+
| Databricks Dolly 15K | Text Instruction Tuning | text | N/A | N/A |
|
| 423 |
+
| WebSight | Image Classification | image, text | N/A | N/A |
|
| 424 |
+
| RAVEN | Image Reasoning | image, text | N/A | N/A |
|
| 425 |
+
| VizWiz | VQA | image, text | N/A | N/A |
|
| 426 |
+
| Inter-GPS | Image Reasoning | image, text | N/A | N/A |
|
| 427 |
+
| OCR dataset from arXiv data | OCR | image, text | 120'000 | 49.99 GB |
|
| 428 |
+
| OCR dataset from arXiv data | OCR | image, text | 599'927 | 249.93 GB |
|
| 429 |
+
| OCR dataset from arXiv data | OCR | image, text | 1'565'011 | 1637.79 GB |
|
| 430 |
+
| OCR dataset from arXiv data | OCR | image, text | 418'059 | 422.04 GB |
|
| 431 |
+
| OCR dataset from arXiv data | OCR | image, text | 200'001 | 200.89 GB |
|
| 432 |
+
| OCR dataset from arXiv data | OCR | image, text | 200'000 | 198.94 GB |
|
| 433 |
+
| OCR dataset from arXiv data | OCR | image, text | 200'001 | 196.08 GB |
|
| 434 |
+
| OCR dataset from arXiv data | OCR | image, text | 400'000 | 382.95 GB |
|
| 435 |
+
| OCR dataset from arXiv data | OCR | image, text | 400'000 | 388.16 GB |
|
| 436 |
+
| OCR dataset from arXiv data | OCR | image, text | 18'280 | 20.98 GB |
|
| 437 |
+
| DocLayNet (curated) | OCR | image, text | 48'369 | 18.59 GB |
|
| 438 |
+
| DocLayNet (curated & augmented) | OCR | image, text | 48'249 | 9.12 GB |
|
| 439 |
+
| DocLayNet (curated & augmented) | OCR | image, text | 48'267 | 9.09 GB |
|
| 440 |
+
| SynthTabNet | OCR | image, text | 200'000 | 9.70 GB |
|
| 441 |
+
| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'309 | 17.00 GB |
|
| 442 |
+
| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'461 | 7.77 GB |
|
| 443 |
+
| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'462 | 7.99 GB |
|
| 444 |
+
| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'236 | 5.84 GB |
|
| 445 |
+
| OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'232 | 5.92 GB |
|
| 446 |
+
| SynthTables | OCR | image, text | 4'887 | 0.38 GB |
|
| 447 |
+
| TabRecSet | OCR | image, text | 25'281 | 2.46 GB |
|
| 448 |
+
| TabRecSet | OCR | image, text | 25'281 | 1.61 GB |
|
| 449 |
+
| FinTabNet | OCR | image, text | 57'137 | 9.22 GB |
|
| 450 |
+
| FinTabNet | OCR | image, text | 57'131 | 21.76 GB |
|
| 451 |
+
| FinTabNet | OCR | image, text | 57'129 | 21.68 GB |
|
| 452 |
+
| PubTables-1M | OCR | image, text | 224'170 | 29.55 GB |
|
| 453 |
+
| PubTables-1M | OCR | image, text | 224'169 | 36.32 GB |
|
| 454 |
+
| PubTables-1M | OCR | image, text | 225'108 | 36.45 GB |
|
| 455 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 37.13 GB |
|
| 456 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 33.38 GB |
|
| 457 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 32.85 GB |
|
| 458 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 31.15 GB |
|
| 459 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.30 GB |
|
| 460 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 38.40 GB |
|
| 461 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 27.09 GB |
|
| 462 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 29.52 GB |
|
| 463 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.49 GB |
|
| 464 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.14 GB |
|
| 465 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 100.14 GB |
|
| 466 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.82 GB |
|
| 467 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.96 GB |
|
| 468 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.61 GB |
|
| 469 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 89.89 GB |
|
| 470 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 95.75 GB |
|
| 471 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 85.65 GB |
|
| 472 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 91.01 GB |
|
| 473 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.29 GB |
|
| 474 |
+
| OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 84.66 GB |
|
| 475 |
+
| TextOCR | OCR | image, text | 21'727 | 5.83 GB |
|
| 476 |
+
| TextOCR | OCR | image, text | 21'138 | 2.83 GB |
|
| 477 |
+
| Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'359 | 12.92 GB |
|
| 478 |
+
| Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'351 | 14.57 GB |
|
| 479 |
+
| Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'350 | 14.44 GB |
|
| 480 |
+
| HierText | OCR | image, text | 8'278 | 2.60 GB |
|
| 481 |
+
| FUNSD | OCR | image, text | 149 | 0.01 GB |
|
| 482 |
+
| Gretel Synthetic Safety Alignment | Safety | Text | 19'779 | 0.03 GB |
|
| 483 |
+
| Internal safety alignment multimodal dataset | Safety | image, text | 22'559 | 8.27 GB |
|
| 484 |
+
| ALFRED Action | Safety | video, text | 6'524 | 5.92 GB |
|
| 485 |
+
| ALFRED Goal | Safety | video, text | 6'464 | 5.86 GB |
|
| 486 |
+
| VQA-RAD | Safety | image, text | 1'793 | 0.09 GB |
|
| 487 |
+
| SLAKE | Safety | image, text | 9'835 | 0.85 GB |
|
| 488 |
+
| STEM MMLU-aux (subset) | Safety | text | 37'444 | 0.49 GB |
|
| 489 |
+
| Glaive & Xlam | Function call | text | 8'000 | 0.02 GB |
|
| 490 |
+
| Textbooks VQA | VQA | image, text | 46'745 | 10.85 GB |
|
| 491 |
+
| ai2d | VQA | image, text | 12'413 | 2.23 GB |
|
| 492 |
+
| ScienceQA | VQA | image, text | 12'716 | 0.39 GB |
|
| 493 |
+
| ScienceQA from LlaVA-OneVision | VQA | image, text | 19'196 | 0.65 GB |
|
| 494 |
+
| ChartQA | VQA | image, text | 15'121 | 0.68 GB |
|
| 495 |
+
| ChartQA (augmented) | VQA | image, text | 15'050 | 0.65 GB |
|
| 496 |
+
| ChartQA (CoT) | VQA | image, text | 23'571 | 1.04 GB |
|
| 497 |
+
| ChartQA | VQA | image, text | 60'438 | 2.69 GB |
|
| 498 |
+
| Geo170K | VQA | image, text | 13'263 | 0.07 GB |
|
| 499 |
+
| InfographicVQA | VQA | image, text | 23'946 | 8.21 GB |
|
| 500 |
+
| DocVQA | VQA | image, text | 39'463 | 26.29 GB |
|
| 501 |
+
| DocVQA (CoT) | Image Reasoning | image, text | 16'881 | 10.65 GB |
|
| 502 |
+
| ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 524'892 | 96.99 GB |
|
| 503 |
+
| ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 227'776 | 42.52 GB |
|
| 504 |
+
| TabMWP | Image Reasoning | image, text | 23'058 | 0.30 GB |
|
| 505 |
+
| PMC-VQA | VQA | image, text | 2'266 | 0.04 GB |
|
| 506 |
+
| OCR-VQA from The Cauldron | VQA | image, text | 165'746 | 5.79 GB |
|
| 507 |
+
| ST-VQA from The Cauldron | VQA | image, text | 17'232 | 0.68 GB |
|
| 508 |
+
| WebSight from The Cauldron | OCR | image, text | 9'809 | 1.84 GB |
|
| 509 |
+
| EST-VQA | VQA | image, text | 17'043 | 4.25 GB |
|
| 510 |
+
| TAL Handwritten English OCR | OCR | image, text | 9'998 | 0.22 GB |
|
| 511 |
+
| TAL Handwritten Math writing | OCR | image, text | 22'244 | 0.33 GB |
|
| 512 |
+
| SlideVQA | VQA | image, text | 5'773 | 0.42 GB |
|
| 513 |
+
| pixmo-docs | VQA | image, text | 251'165 | 34.88 GB |
|
| 514 |
+
| pixmo-cap | Image Captioning | image, text | 706'897 | 261.63 GB |
|
| 515 |
+
| pixmo-cap-qa | VQA | image, text | 214'978 | 56.72 GB |
|
| 516 |
+
| pixmo-ask-model-anything | Visual Instruction Tuning | image, text | 153'592 | 20.50 GB |
|
| 517 |
+
| TallyQA | VQA | image, text | 68'775 | 10.64 GB |
|
| 518 |
+
| Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 490.37 GB |
|
| 519 |
+
| Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 488.17 GB |
|
| 520 |
+
| Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'128'326 | 324.46 GB |
|
| 521 |
+
| TabMWP (CoT) | Image Reasoning | image, text | 20'305 | 0.28 GB |
|
| 522 |
+
| VisualWebInstruct | Visual Instruction Tuning | image, text | 260'419 | 7.41 GB |
|
| 523 |
+
| Internal collection of public text SFT datasets | Text Instruction Tuning | text | 197'938 | 1.04 GB |
|
| 524 |
+
| ReCTS from ICDAR2019 | OCR | image, text | 20'000 | 1.77 GB |
|
| 525 |
+
| RCTW from ICDAR2017 | OCR | image, text | 8'034 | 7.85 GB |
|
| 526 |
+
| OCR equation heavy dataset from arXiv data | OCR | image, text | 2'000 | 0.03 GB |
|
| 527 |
+
| Mulberry-SFT (CoT) | Image Reasoning | image, text | 191'332 | 30.80 GB |
|
| 528 |
+
| LLaVA-CoT-100k (CoT) | Image Reasoning | image, text | 63'013 | 8.18 GB |
|
| 529 |
+
| GeomVerse (CoT) | Image Reasoning | image, text | 9'298 | 0.90 GB |
|
| 530 |
+
| MapQA (CoT) | Image Reasoning | image, text | 16'832 | 1.77 GB |
|
| 531 |
+
| MetaMathQA (CoT) | Text Reasoning | text | 225'408 | 4.55 GB |
|
| 532 |
+
| MetaMathQA (CoT) | Image Reasoning | image, text | 220'544 | 4.48 GB |
|
| 533 |
+
| PlotQA (CoT) | Image Reasoning | image, text | 16'256 | 0.76 GB |
|
| 534 |
+
| Visual7W Telling (CoT) | Image Reasoning | image, text | 62'592 | 3.21 GB |
|
| 535 |
+
| Visual7W Pointing | VQA | image, text | 25'733 | 0.93 GB |
|
| 536 |
+
| VisText | Image Captioning | image, text | 9'969 | 0.52 GB |
|
| 537 |
+
| ScreenQA | VQA | image, text | 32'724 | 3.51 GB |
|
| 538 |
+
| wave-ui-25k | OCR | image, text | 24'978 | 11.44 GB |
|
| 539 |
+
| Charts2500 | VQA | image, text | 2'486 | 0.09 GB |
|
| 540 |
+
| Cyrillic | OCR | image, text | 72'284 | 1.49 GB |
|
| 541 |
+
| CMM-Math | Image Reasoning | image, text | 13'148 | 0.05 GB |
|
| 542 |
+
| SimChart9K | Image Reasoning | image, text | 9'536 | 0.69 GB |
|
| 543 |
+
| UniChart | Image Reasoning | image, text | 504'885 | 17.04 GB |
|
| 544 |
+
| CASIA-HWDB2-line | OCR | image, text | 2'193 | 0.09 GB |
|
| 545 |
+
| MMTab | VQA | image, text | 232'746 | 59.23 GB |
|
| 546 |
+
| ArxivQA | VQA | image, text | 99'995 | 17.32 GB |
|
| 547 |
+
| docmatix-single | VQA | image, text | 19'992 | 3.94 GB |
|
| 548 |
+
| DocReason525K | Image Reasoning | image, text | 25'863 | 33.80 GB |
|
| 549 |
+
| FigureQA | VQA | image, text | 100'000 | 2.37 GB |
|
| 550 |
+
| LRV-Instruction | Visual Instruction Tuning | image, text | 7'198 | 0.37 GB |
|
| 551 |
+
| VisualWebInstruct (CoT) | Image Reasoning | image, text | 48'929 | 4.37 GB |
|
| 552 |
+
| DocMatix (multi-page) | Image Reasoning | image, text | 19'969 | 8.66 GB |
|
| 553 |
+
| spot-the-diff | Image Reasoning | image, text | 8'007 | 1.45 GB |
|
| 554 |
+
| DocVQA (CoT) | Image Reasoning | image, text | 36'333 | 24.32 GB |
|
| 555 |
+
| DocVQA (CoT) | Image Reasoning | image, text | 45'710 | 2.10 GB |
|
| 556 |
+
| DocVQA (CoT) | Image Reasoning | image, text | 19'548 | 6.70 GB |
|
| 557 |
+
| Mulberry-SFT (subset, CoT) | Image Reasoning | image, text | 103'763 | 18.45 GB |
|
| 558 |
+
| UniGeo (CoT) | Image Reasoning | image, text | 9'728 | 0.05 GB |
|
| 559 |
+
| NIGHTS | Image Reasoning | image, text | 12'906 | 37.01 GB |
|
| 560 |
+
| Mantis-Instruct (CoT) | Image Reasoning | image, text | 67'723 | 13.86 GB |
|
| 561 |
+
| OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 2'858 | 1.23 GB |
|
| 562 |
+
| OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 586 | 0.46 GB |
|
| 563 |
+
| FinTabNet (relabeled) | Image Reasoning | image, text | 8'356 | 3.17 GB |
|
| 564 |
+
| Table OCR on pdfs from CommonCrawl | Image Reasoning | image, text | 4'846 | 3.65 GB |
|
| 565 |
+
| HierText (relabeled for QA) | Image Reasoning | image, text | 514 | 0.07 GB |
|
| 566 |
+
| ECD-10k-Images | Image Reasoning | image, text | 132'613 | 15.38 GB |
|
| 567 |
+
| ActivityNet (open-ended QA) | VideoQA | video, text | 6'490 | 162.22 GB |
|
| 568 |
+
| NExT-QA (multi-choice QA) | VideoQA | video, text | 5'496 | 11.07 GB |
|
| 569 |
+
| NExT-QA (open-ended QA) | VideoQA | video, text | 5'492 | 10.99 GB |
|
| 570 |
+
| NExT-QA (multi-choice QA) | VideoQA | video, text | 52 | 0.74 GB |
|
| 571 |
+
| NExT-QA (open-ended QA) | VideoQA | video, text | 61 | 0.85 GB |
|
| 572 |
+
| NExT-QA (open-ended QA) | VideoQA | video, text | 6'843 | 27.83 GB |
|
| 573 |
+
| NExT-QA (multi-choice QA) | VideoQA | video, text | 6'843 | 27.85 GB |
|
| 574 |
+
| ActivityNet (open-ended QA) | VideoQA | video, text | 7'420 | 102.81 GB |
|
| 575 |
+
| ActivityNet (open-ended QA) | VideoQA | video, text | 3'840 | 25.84 GB |
|
| 576 |
+
| NExT-QA (multi-choice QA) | VideoQA | video, text | 4'633 | 35.38 GB |
|
| 577 |
+
| NExT-QA (open-ended QA) | VideoQA | video, text | 4'694 | 35.84 GB |
|
| 578 |
+
| ActivityNet (open-ended QA) | VideoQA | video, text | 2'580 | 7.46 GB |
|
| 579 |
+
| Perception Test (multi-choice QA) | VideoQA | video, text | 1'785 | 18.67 GB |
|
| 580 |
+
| Perception Test (multi-choice QA) | VideoQA | video, text | 618 | 11.52 GB |
|
| 581 |
+
| NExT-QA | VideoQA | video, text | 34'132 | 150.86 GB |
|
| 582 |
+
| CLEVRER | VideoQA | video, text | 40'000 | 46.03 GB |
|
| 583 |
+
| Video dataset based on Kinetics | VideoQA | video, text | 39'452 | 26.15 GB |
|
| 584 |
+
| EGO4D | VideoQA | video, text | 7'797 | 3.38 GB |
|
| 585 |
+
| TVQA | VideoQA | video, text | 34'868 | 100.05 GB |
|
| 586 |
+
| EgoExoLearn | VideoQA | video, text | 36'373 | 8558.27 GB |
|
| 587 |
+
| Video dataset based on Kinetics | VideoQA | video, text | 647'883 | 890.56 GB |
|
| 588 |
+
| Mementos | VideoQA | video, text | 4'060 | 14.07 GB |
|
| 589 |
+
| Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
|
| 590 |
+
| ActivityNet | VideoQA | video, text | 10'021 | 191.49 GB |
|
| 591 |
+
| EGO4D | VideoQA | video, text | 1'506 | 137.00 GB |
|
| 592 |
+
| FineAction | VideoQA | video, text | 7'504 | 169.76 GB |
|
| 593 |
+
| HACS | VideoQA | video, text | 31'223 | 829.25 GB |
|
| 594 |
+
| HiREST | VideoQA | video, text | 822 | 42.50 GB |
|
| 595 |
+
| Perception Test | VideoQA | video, text | 2'135 | 25.98 GB |
|
| 596 |
+
| ActivityNet | VideoQA | video, text | 9'064 | 181.24 GB |
|
| 597 |
+
| HiREST | VideoQA | video, text | 525 | 27.54 GB |
|
| 598 |
+
| YouCook2 | VideoQA | video, text | 1'180 | 77.65 GB |
|
| 599 |
+
| DiDeMo | VideoQA | video, text | 7'452 | 33.90 GB |
|
| 600 |
+
| EGO4D | VideoQA | video, text | 2'665 | 194.01 GB |
|
| 601 |
+
| MedVidQA | VideoQA | video, text | 933 | 40.35 GB |
|
| 602 |
+
| QuerYD | VideoQA | video, text | 1'562 | 50.69 GB |
|
| 603 |
+
| YouCook2 | VideoQA | video, text | 2'270 | 158.77 GB |
|
| 604 |
+
| EgoExoLearn (open-ended QA) | VideoQA | video, text | 9'998 | 1751.69 GB |
|
| 605 |
+
| Breakfast Actions | VideoQA | video, text | 1'204 | 3.45 GB |
|
| 606 |
+
| EgoExoLearn (multi-choice QA) | VideoQA | video, text | 6'832 | 1196.41 GB |
|
| 607 |
+
| CrossTask (multi-choice QA) | VideoQA | video, text | 75'686 | 417.50 GB |
|
| 608 |
+
| CrossTask (open-ended QA) | VideoQA | video, text | 20'399 | 112.02 GB |
|
| 609 |
+
| EgoProceL (multi-choice QA) | VideoQA | video, text | 4'789 | 42.74 GB |
|
| 610 |
+
| EgoProceL (open-ended QA) | VideoQA | video, text | 5'667 | 50.58 GB |
|
| 611 |
+
| HC-STVG (multi-choice QA) | VideoQA | video, text | 147'799 | 796.18 GB |
|
| 612 |
+
| HC-STVG (open-ended QA) | VideoQA | video, text | 41'050 | 221.82 GB |
|
| 613 |
+
| TAPOS (multi-choice QA) | VideoQA | video, text | 33'941 | 218.50 GB |
|
| 614 |
+
| TAPOS (open-ended QA) | VideoQA | video, text | 13'991 | 88.00 GB |
|
| 615 |
+
| Multi-page OCR based on CommonCrawl pdf data | VQA | image, text | 7'262 | 48.19 GB |
|
| 616 |
+
| Multi-page QA based on CommonCrawl pdf data | VQA | image, text | 455 | 31.88 GB |
|
| 617 |
+
| Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'281 | 0.68 GB |
|
| 618 |
+
| Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'285 | 0.67 GB |
|
| 619 |
+
| Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'282 | 0.67 GB |
|
| 620 |
+
| Selection of public datasets (relabeled) | Image Reasoning | image, text | 13'843 | 4.18 GB |
|
| 621 |
+
| Selection of public datasets (relabeled) | Image Reasoning | image, text | 18'442 | 3.89 GB |
|
| 622 |
+
| Perception Test | VideoQA | video, text | 7'392 | 94.95 GB |
|
| 623 |
+
| Perception Test (CoT) | VideoQA | video, text | 4'977 | 64.55 GB |
|
| 624 |
+
|
| 625 |
+
|
| 626 |
+
<br>
|
| 627 |
+
|
| 628 |
+
# Private Datasets <br>
| Dataset Name | Type | Modalities | Number of Samples | Size |
|--------------|------|------------|-------------------|------|
| Internal safety alignment text dataset | Safety | Text | N/A | N/A |
| Internal safety alignment text dataset | Safety | Text | N/A | N/A |
| Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 445'958 | 9.01 GB |
| Internal QA dataset on invoices | Image Reasoning | image, text | 6'471 | 5.22 GB |
| Internal QA dataset on invoices | Image Reasoning | image, text | 11'258 | 10.19 GB |
<br>

# Data Crawling and Scraping <br>
| Dataset Name | Type | Modalities | Number of Samples | Size |
|--------------|------|------------|-------------------|------|
| Internal video dataset | VideoQA | video, text | 274'472 | 348.84 GB |
| Internal video dataset | VideoQA | video, text | 14'256 | 44.46 GB |
| Internal VQA and captioning dataset | Image Captioning | image, text | 14'872 | 3.27 GB |
| Internal VQA dataset | VQA | image, text | 20'250 | 1.87 GB |
| Internal VQA dataset | VQA | image, text | 20'098 | 2.07 GB |
| Internal Captioning dataset | Image Captioning | image, text | 24'998 | 6.97 GB |
<br>

# User-Sourced Data (Collected by Provider including Prompts) <br>
<br>

# Self-Sourced Synthetic Data <br>
| Dataset Name | Type | Modalities | Number of Samples | Size |
|--------------|------|------------|-------------------|------|
| Random ASCII characters for OCR | OCR | image, text | 14'533 | 5.76 GB |
| Random ASCII characters for OCR | OCR | image, text | 14'533 | 9.26 GB |
| Random Chinese characters for OCR | OCR | image, text | 29'108 | 15.00 GB |
| Random Chinese characters for OCR | OCR | image, text | 29'108 | 24.11 GB |
| Random English characters for OCR | OCR | image, text | 14'525 | 5.65 GB |
| Random English characters for OCR | OCR | image, text | 14'525 | 9.39 GB |
| Synthetic sparse table dataset | OCR | image, text | 100'000 | 14.36 GB |
| Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 1'165'591 | 54.15 GB |
| Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 175'000 | 0.95 GB |
| Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 1'922'012 | 28.00 GB |
| Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 288'000 | 0.57 GB |
| Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 67'000 | 0.22 GB |
| Synthetic tool-calling data with seed tools from ToolBench, Glaive, xLAM and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 403'619 | 6.55 GB |
| Synthetic safety data with responses from DeepSeek-R1-0528 | Text Reasoning | text | 30'710 | 0.12 GB |
| Dummy conversation dataset | Text Reasoning | text | 2'262 | 0.00 GB |
| Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 32'752 | 0.26 GB |
| Chat data with HelpSteer2 HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 3'636 | 0.01 GB |
| Synthetic chat dataset with responses from DeepSeek-R1 | Text Reasoning | text | 389'350 | 3.30 GB |
| Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 353'526 | 2.61 GB |
| Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 361'733 | 1.12 GB |
| Synthetic multilingual STEM from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct | Text Reasoning | text | 4'999'794 | 86.68 GB |
| Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 545'844 | 5.25 GB |
| Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 81'876 | 0.43 GB |
| Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 1'591'641 | 58.63 GB |
| Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 239'467 | 0.52 GB |
| Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Code | text | 1'165'591 | 54.15 GB |
| Synthetic tool calling dataset from DeepSeek-R1-0528 | Text Reasoning | text | 74'044 | 46.43 GB |
<br>

**Properties**<br>
* The dataset collection (for training and evaluation) consists of a mix of internal and public datasets designed for training and evaluation across various tasks. It includes:
  * Internal datasets built with public commercial images and internal labels, supporting tasks like conversation modeling and document analysis.
  * Public datasets sourced from publicly available images and annotations, adapted for tasks such as image captioning and visual question answering.
  * Synthetic datasets generated programmatically for specific tasks like tabular data understanding.
  * Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).

### Evaluation Datasets:
The following external benchmarks are used for evaluating the model: <br>

| Dataset |
|---------|
| [RDTableBench](https://github.com/Filimoa/rd-tablebench?tab=readme-ov-file) |
| NVIDIA internal test set for OCR |
| [MMMU Val with ChatGPT as judge](https://mmmu-benchmark.github.io/) |
| [AI2D Test](https://prior.allenai.org/projects/diagram-understanding) |
| [ChartQA Test](https://github.com/vis-nlp/ChartQA) |
| [InfoVQA Val](https://www.docvqa.org/datasets/infographicvqa) |
| [OCRBench](https://github.com/Yuliang-Liu/MultimodalOCR) |
| [OCRBenchV2](https://github.com/Yuliang-Liu/MultimodalOCR) English |
| [DocVQA Val](https://www.docvqa.org/datasets) |
| [SlideQA Val](https://github.com/nttmdlab-nlp/SlideVQA) |
| [Video MME](https://github.com/MME-Benchmarks/Video-MME) |

Data Collection Method by dataset: <br>
* Hybrid: Human, Automated <br>

Labeling Method by dataset: <br>
* Hybrid: Human, Automated <br>

**Properties (Quantity, Dataset Descriptions, Sensor(s)):** N/A <br>

**Dataset License(s):** N/A <br>

Evaluation benchmark scores: <br>

| Benchmark | Score |
|--------------------|--------------------------|
| MMMU* | 68 |
| MathVista* | 76.9 |
| AI2D | 87.11 |
| OCRBenchv2 | 62.0 |
| OCRBench | 85.6 |
| OCR-Reasoning | 36.4 |
| ChartQA | 89.72 |
| DocVQA | 94.39 |
| Video-MME w/o sub | 65.9 |
| Vision Average | 74.0 |
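The Vision Average appears to be the unweighted mean of the nine benchmark scores above; a quick check of the arithmetic:

```python
scores = [68, 76.9, 87.11, 62.0, 85.6, 36.4, 89.72, 94.39, 65.9]
print(round(sum(scores) / len(scores), 1))  # 74.0
```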

<br>

# Inference:
**Acceleration Engine(s):** [vLLM], [TRT-LLM] <br>

**Test Hardware:** <br>
* NVIDIA L40S <br>
* NVIDIA A100 <br>
* NVIDIA B200 <br>
* NVIDIA H100/H200 <br>
* NVIDIA RTX PRO 6000 Server Edition <br>
* NVIDIA GB200 <br>

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br>
bias.md
ADDED
|
@@ -0,0 +1,13 @@
| Field | Response |
|:---|:---|
| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
| Bias Metric (If Measured): | [BBQ Accuracy Scores in Ambiguous Contexts](https://github.com/nyu-mll/BBQ/) |
| Which characteristic (feature) show(s) the greatest difference in performance?: | The model shows high variance across many characteristics when used at a high temperature, with the greatest measurable difference seen in categories such as Gender Identity and Race x Gender. |
| Which feature(s) have the worst performance overall? | Age (ambiguous) has both the lowest category accuracy listed (0.75) and a notably negative bias score (–0.56), indicating it is the worst-performing feature overall in this evaluation. |
| Measures taken to mitigate against unwanted bias: | None |
| If using internal data, description of methods implemented in data acquisition or processing, if any, to address the prevalence of identifiable biases in the training, testing, and validation data: | The training datasets contain a large amount of synthetic data generated by LLMs. We manually curated prompts. |
| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | Bias Benchmark for Question Answering (BBQ) |
| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | The datasets, which include video datasets (e.g., YouCook2, VCG Human Dataset) and image captioning datasets, do not collectively or exhaustively represent all demographic groups (and proportionally therein). For instance, these datasets do not contain explicit mentions of demographic classes such as age, gender, or ethnicity in over 80% of samples. In the subset where analysis was performed, certain datasets contain skews in the representation of participants; for example, perceived gender of "female" participants may be significant compared to "male" participants for certain datasets. Separately, individuals aged "40 to 49 years" and "20 to 29 years" are the most frequent among ethnic identifiers. Toxicity analysis was additionally performed on several datasets to identify potential not-safe-for-work samples and risks. To mitigate these imbalances, we recommend considering evaluation techniques such as bias audits, fine-tuning with demographically balanced datasets, and mitigation strategies like counterfactual data augmentation to align with the desired model behavior. This evaluation was conducted on a data subset ranging from 200 to 3,000 samples per dataset; as such, certain limitations may exist in the reliability of the embeddings. A baseline of 200 samples was used across all datasets, with larger subsets of up to 3,000 samples utilized for certain in-depth analyses. |
explainability.md
ADDED
|
@@ -0,0 +1,15 @@
Field | Response
:------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------
Intended Task/Domain: | Visual Question Answering
Model Type: | Transformer
Intended Users: | Individuals and businesses that need to process documents such as invoices, receipts, and manuals. Also, users who are building multi-modal agents and RAG systems.
Output: | Text
Tools used to evaluate datasets to identify synthetic data and ensure data authenticity: | We used a Gemma-3 4B-based filtering model fine-tuned on the [Nemotron Content Safety Dataset v2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) to ensure the quality of synthetic data.
Describe how the model works: | The model combines a vision encoder with a Nemotron 5.5H-12B language encoder. It processes multiple input modalities, including text, multiple images, and video. It fuses these inputs and uses its large language model backbone with a 128K context length to perform visual Q&A, summarization, and data extraction.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable
Technical Limitations & Mitigation: | The model has a limited maximum resolution determined by a 12-tile layout constraint, where each tile is 512 x 512 pixels. It also supports a limited number of input images (up to 4) and has a maximum context length of 128K tokens for combined input and output.
Verified to have met prescribed NVIDIA quality standards: | Yes
Performance Metrics: | Accuracy (Visual Question Answering), Latency, Throughput
Potential Known Risks: | The model may produce output that is biased, toxic, or incorrect. It may amplify biases and return toxic responses, especially when prompted with toxic prompts. It may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive. While we have taken safety and security into account and are continuously improving, outputs may still contain political content, misleading information, or unwanted bias beyond our control.
Licensing: | Governing Terms: Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
privacy.md
ADDED
|
@@ -0,0 +1,13 @@
Field | Response
:----------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------
Generatable or reverse engineerable personal data? | No
Personal data used to create this model? | No
Was consent obtained for any personal data used? | Not Applicable
A description of any methods implemented in data acquisition or processing, if any, to address the prevalence of personal data in the training data, where relevant and applicable: | We used only prompts that do not contain any personal data for synthetic data generation.
How often is dataset reviewed? | Before release and during dataset creation and model training
Is there provenance for all datasets used in training? | Yes
Does data labeling (annotation, metadata) comply with privacy laws? | Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data.
Applicable Privacy Policy | [Privacy Policy](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/)

During AI model development, strict adherence to copyright policy ensured compliance through risk mitigation and legal reviews. Post-data collection, reserved-rights content is identified and removed, with verified opt-out processes for rightsholders. Detailed records document due diligence and transparency.

We employ automated tools and data processing techniques to scan for Personally Identifiable Information (PII) during pre-training to identify and filter certain categories of personal information, including public-facing contact details such as email addresses and phone numbers. Scans of Common Crawl, CC-News, and Wikimedia datasets did not detect PII in the majority of samples. However, Microsoft Presidio indicated potential findings, including business contact information embedded in natural language such as email addresses and phone numbers. Verified instances of PII were removed through a combination of automated filtering and human-in-the-loop validation.
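As an illustration of the kind of automated PII scan described above, a minimal sketch using Microsoft Presidio (it assumes the `presidio-analyzer` package with a spaCy English model installed; the example text is invented):

```python
# pip install presidio-analyzer  (plus a spaCy model, e.g. en_core_web_lg)
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
text = "Contact our sales team at jane.doe@example.com or +1-555-0100."

# Scan only for the categories called out above: public-facing contact details.
results = analyzer.analyze(text=text,
                           entities=["EMAIL_ADDRESS", "PHONE_NUMBER"],
                           language="en")
for r in results:
    print(r.entity_type, text[r.start:r.end], round(r.score, 2))
```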
safety.md
ADDED
|
@@ -0,0 +1,10 @@
Field | Response
:---------------------------------------------------|:----------------------------------
Model Application Field(s): | Customer Service, Media & Entertainment, Enterprise Document Intelligence and Processing, Retail
Describe the life critical impact (if present). | Not Applicable
Description of methods implemented in data acquisition or processing, if any, to address other types of potentially harmful data in the training, testing, and validation data: | We used a guard model for content safety to exclude potentially harmful data from training.
Description of any methods implemented in data acquisition or processing, if any, to address illegal or harmful content in the training data, including, but not limited to, child sexual abuse material (CSAM) and non-consensual intimate imagery (NCII): | We used a Gemma-3 4B-based guard model trained on the [Nemotron Content Safety Dataset v2](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0) for content safety to exclude potentially illegal or harmful content from training. We also ran CSAM checks on the image datasets used for training.
Use Case Restrictions: | Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions on dataset access are enforced during training, and dataset license constraints are adhered to.

This AI model was developed based on our policies to ensure responsible data handling and risk mitigation. The datasets used for training have been scanned for harmful and illegal content, consistent with our policies, including scanning for Child Sexual Abuse Material (CSAM). Ongoing review and monitoring mechanisms are in place based on our policies to maintain data integrity.

The model was optimized explicitly for instruction following and, as a result of its instruction tuning, is more susceptible to prompt injection and jailbreaking in various forms. This means that the model should be paired with additional rails or system filtering to limit exposure to instructions from malicious sources, either directly or indirectly by retrieval (e.g., via visiting a website), as they may yield outputs that can lead to harmful, system-level outcomes, up to and including remote code execution in agentic systems, when effective security controls including guardrails are not in place. The model may generate answers that are inaccurate, omit key information, include irrelevant or redundant text, or produce socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
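A minimal sketch of the "additional rails or system filtering" recommended above; `guard_check` is a hypothetical hook to be wired to whatever content-safety or jailbreak classifier is deployed alongside the model, and `vlm_client` stands in for the actual inference client.

```python
def guard_check(text: str) -> bool:
    """Hypothetical hook: return True only if the text passes the deployed
    content-safety / prompt-injection classifier."""
    raise NotImplementedError("wire this to your guard model")

def safe_generate(vlm_client, prompt: str, retrieved_context: str = "") -> str:
    # Screen both direct user input and indirectly retrieved content (e.g. web pages)
    # before they reach the VLM, and keep agent/tool permissions minimal.
    for untrusted in (prompt, retrieved_context):
        if untrusted and not guard_check(untrusted):
            return "Request blocked by content-safety policy."
    return vlm_client.generate(prompt=prompt, context=retrieved_context)
```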