---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- pythia
datasets:
- hellaswag
metrics:
- accuracy
---

# Model Card for EleutherAI/pythia-160m HellaSwag Evaluation

This model card presents the evaluation results of the EleutherAI/pythia-160m model on the HellaSwag task.

## Model Details

### Model Description

- **Developed by:** EleutherAI
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Not applicable; the base EleutherAI/pythia-160m checkpoint was evaluated without additional fine-tuning.

### Model Sources

- **Repository:** [EleutherAI/pythia-160m](https://huggingface.co/EleutherAI/pythia-160m)
- **Paper:** [Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling](https://arxiv.org/abs/2304.01373)

## Uses

### Direct Use

This evaluation measures the model's performance on HellaSwag, a sentence-completion benchmark for commonsense reasoning.

### Out-of-Scope Use

This evaluation is specific to the HellaSwag task and may not be indicative of the model's performance on other tasks or in real-world applications.

## Bias, Risks, and Limitations

The evaluation results should be interpreted within the context of the HellaSwag task. The model may exhibit biases present in its training data or in the evaluation dataset.

### Recommendations

Users should be aware of the model's limitations and consider additional evaluation on task-specific datasets before deploying it in real-world applications.

## How to Get Started with the Model

To load the checkpoint that was evaluated (the scoring procedure is sketched under Metrics below, and the exact harness command is given under Technical Specifications):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint that was evaluated (training step 100,000)
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision="step100000")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision="step100000")

# Quick sanity check with an example prompt: greedily generate a short continuation
inputs = tokenizer("The man picked up the knife and", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

This card documents an evaluation run only; no training or fine-tuning was performed. The model was evaluated on the HellaSwag dataset. For more information, visit [the HellaSwag dataset page](https://huggingface.co/datasets/hellaswag).

### Training Procedure

#### Training Hyperparameters

- **Training regime:** Not applicable; the evaluation was run in float32 precision (`dtype="float"`).

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on the HellaSwag validation split, which consists of 10,042 samples.

#### Metrics

- **Accuracy (acc):** The proportion of examples for which the candidate ending assigned the highest log-likelihood is the correct one.
- **Normalized Accuracy (acc_norm):** The same selection rule, except each ending's log-likelihood is first normalized by the ending's length, reducing the advantage of shorter completions (see the sketch below).
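Both metrics come from the same comparison: the model scores each of the four candidate endings by the log-likelihood it assigns to the ending given the context, `acc` picks the ending with the highest raw log-likelihood, and `acc_norm` first divides each score by the ending's length. The sketch below illustrates this on a made-up HellaSwag-style item; the context, endings, and character-based length normalization are simplifying assumptions rather than the harness's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision="step100000")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision="step100000")
model.eval()

# Hypothetical HellaSwag-style item: one context, four candidate endings.
context = "A man is standing in a kitchen. He picks up a knife and"
endings = [
    " begins to chop an onion on the cutting board.",
    " throws it at the ceiling and walks away.",
    " starts singing loudly to the refrigerator.",
    " dives into the sink head first.",
]

def ending_loglikelihood(context: str, ending: str) -> float:
    """Sum of log-probabilities the model assigns to the ending tokens, given the context."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities of each next token, aligned with the tokens they predict.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that belong to the ending.
    n_ending = full_ids.shape[1] - ctx_ids.shape[1]
    return token_log_probs[0, -n_ending:].sum().item()

scores = [ending_loglikelihood(context, e) for e in endings]
# acc: highest raw log-likelihood; acc_norm: highest length-normalized log-likelihood.
pred_acc = max(range(len(endings)), key=lambda i: scores[i])
pred_acc_norm = max(range(len(endings)), key=lambda i: scores[i] / len(endings[i]))
print("acc choice:", pred_acc, "| acc_norm choice:", pred_acc_norm)
```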
### Results

| Metric | Value | Standard Error |
|--------|-------|----------------|
| Accuracy | 0.28719 | 0.00452 |
| Normalized Accuracy | 0.30821 | 0.00461 |

## Environmental Impact

- **Hardware Type:** Tesla T4 GPU
- **Hours used:** Approximately 0.095 hours (341.39 seconds)
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications

### Model Architecture and Objective

EleutherAI/pythia-160m is a causal language model with approximately 162 million parameters.

### Compute Infrastructure

- **Hardware:** Tesla T4 GPU
- **Software:** PyTorch 2.4.1+cu121, Transformers 4.44.2
- **Date of Evaluation:** October 18, 2024

### Command

```
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
    --tasks hellaswag \
    --device cuda \
    --batch_size auto:4 \
    --output_path hellaswag_test \
    --log_samples
```

#### Command output

```
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (64,64,64,64,64)
|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |     0|acc     |↑  |0.2872|±  |0.0045|
|         |       |none  |     0|acc_norm|↑  |0.3082|±  |0.0046|

2024-10-18 12:25:25.770584: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-18 12:25:25.847675: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-18 12:25:25.887843: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-18 12:25:25.961158: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-18 12:25:27.647707: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-10-18:12:25:29,450 INFO [__main__.py:279] Verbosity set to INFO
2024-10-18:12:25:42,060 INFO [__main__.py:376] Selected Tasks: ['hellaswag']
2024-10-18:12:25:42,062 INFO [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2024-10-18:12:25:42,062 INFO [evaluator.py:201] Initializing hf model, with arguments: {'pretrained': 'EleutherAI/pythia-160m', 'revision': 'step100000', 'dtype': 'float'}
2024-10-18:12:25:42,128 INFO [huggingface.py:129] Using device 'cuda'
2024-10-18:12:25:42,395 INFO [huggingface.py:481] Using model type 'default'
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
2024-10-18:12:25:42,769 INFO [huggingface.py:365] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2024-10-18:12:25:56,709 WARNING [model.py:422] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2024-10-18:12:25:56,710 INFO [task.py:415] Building contexts for hellaswag on rank 0...
100%|██████████| 10042/10042 [00:05<00:00, 1695.72it/s]
2024-10-18:12:26:04,007 INFO [evaluator.py:489] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 40168/40168 [03:53<00:00, 171.85it/s]
fatal: not a git repository (or any of the parent directories): .git
2024-10-18:12:30:36,510 INFO [evaluation_tracker.py:206] Saving results aggregated
2024-10-18:12:30:36,524 INFO [evaluation_tracker.py:287] Saving per-sample results for: hellaswag
```
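The CLI run above can also be launched from Python. The snippet below is a minimal sketch assuming lm-evaluation-harness v0.4.x, where `lm_eval.simple_evaluate` is the top-level entry point; argument names and the result layout may differ in other versions, and `batch_size=64` simply mirrors the batch size that the `auto:4` detection settled on in the logged run.

```python
# pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float",
    tasks=["hellaswag"],
    batch_size=64,
    device="cuda",
)
print(results["results"]["hellaswag"])  # acc, acc_norm and their standard errors
```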