Model Card for EleutherAI/pythia-160m HellaSwag Evaluation

This model card presents the evaluation results of the EleutherAI/pythia-160m model (revision step100000) on the HellaSwag task, run zero-shot with lm_eval.

Model Details

Model Description

Developed by: EleutherAI
Model type: Causal Language Model
Language(s): English
License: Apache 2.0
Finetuned from model: Not applicable (EleutherAI/pythia-160m was evaluated as released, without fine-tuning)

Model Sources

Repository: EleutherAI/pythia-160m
Paper: [More Information Needed]

Uses

Direct Use

This evaluation demonstrates the model's performance on the HellaSwag task, which tests for commonsense reasoning in AI systems.

Out-of-Scope Use

This evaluation is specific to the HellaSwag task and may not be indicative of the model's performance on other tasks or in real-world applications.

Bias, Risks, and Limitations

The evaluation results should be interpreted within the context of the HellaSwag task. The model may exhibit biases present in its training data or the evaluation dataset.

Recommendations

Users should be aware of the model's limitations and consider additional evaluation on task-specific datasets before deployment in real-world applications.

How to Get Started with the Model

To use this model for the HellaSwag task:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision="step100000")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision="step100000")

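# Score one text with the model. HellaSwag is scored by comparing such
# likelihoods across four candidate endings (the sentence here is only an illustration).
inputs = tokenizer("A man is sitting on a roof. He starts pulling up roofing.", return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss  # mean negative log-likelihood per token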

Training Details

Training Data

No training was performed for this model card; the model was only evaluated on the HellaSwag dataset. For more information, visit the HellaSwag dataset page.

Training Procedure

Training Hyperparameters

Training regime: float32 (the precision used for evaluation; no training was run)

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on the HellaSwag validation split, which consists of 10,042 samples. Each sample has four candidate endings, so the run issued 40,168 log-likelihood requests.

Metrics

Accuracy (acc): Measures the proportion of correctly predicted answers.
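Normalized Accuracy (acc_norm): A variant of accuracy that divides each candidate ending's log-likelihood by the ending's length before picking the best one, so that longer endings are not penalized merely for being longer.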
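
The difference between the two metrics can be illustrated with a minimal sketch. It assumes the common convention that acc selects the ending with the highest total log-likelihood, while acc_norm selects the ending with the highest log-likelihood per byte. The function names and all numbers below are hypothetical and are not taken from this evaluation run.

```python
from typing import Sequence


def pick_raw(loglikelihoods: Sequence[float]) -> int:
    """Index of the ending with the highest total log-likelihood (acc)."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def pick_normalized(loglikelihoods: Sequence[float], endings: Sequence[str]) -> int:
    """Index of the ending with the highest log-likelihood per byte (acc_norm)."""
    scores = [
        ll / len(ending.encode("utf-8"))
        for ll, ending in zip(loglikelihoods, endings)
    ]
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy, made-up values for one item with four candidate endings.
endings = ["sits.", "climbs down the ladder and walks away.", "eats.", "flies off."]
lls = [-6.0, -20.0, -7.0, -12.0]
gold = 1  # hypothetical index of the correct ending

raw_choice = pick_raw(lls)
norm_choice = pick_normalized(lls, endings)
print("acc counts this item as correct:", raw_choice == gold)
print("acc_norm counts this item as correct:", norm_choice == gold)
```

In this toy item the long correct ending has the lowest total log-likelihood but the best per-byte score, so the two metrics disagree on it.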

Results

Metric	Value	Standard Error
Accuracy	0.28719	0.00452
Normalized Accuracy	0.30821	0.00461
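
As a rough consistency check, the reported standard errors can be approximated from the accuracies and the sample count alone. The sketch below assumes each sample is an independent 0/1 outcome and uses the sample-variance form of the standard error of a mean; the helper name `binomial_stderr` is hypothetical, and the comparison against the 25% chance level (four endings per sample) is only an approximate z-score.

```python
import math

n = 10042           # evaluated samples, from the Testing Data section
acc = 0.28719       # reported accuracy
acc_norm = 0.30821  # reported normalized accuracy


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of the mean of n 0/1 outcomes (sample-variance form)."""
    return math.sqrt(p * (1.0 - p) / (n - 1))


se_acc = binomial_stderr(acc, n)
se_norm = binomial_stderr(acc_norm, n)

chance = 0.25  # random guessing among four candidate endings
z_acc = (acc - chance) / se_acc

print(f"stderr(acc)      ~ {se_acc:.5f}")
print(f"stderr(acc_norm) ~ {se_norm:.5f}")
print(f"z-score of acc vs. chance ~ {z_acc:.1f}")
```

Both approximations agree with the table to the precision shown there.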

Environmental Impact

Hardware Type: Tesla T4 GPU
Hours used: Approximately 0.095 hours (341.39 seconds)
Cloud Provider: [More Information Needed]
Compute Region: [More Information Needed]
Carbon Emitted: [More Information Needed]

Technical Specifications

Model Architecture and Objective

EleutherAI/pythia-160m is a causal language model with approximately 162 million parameters.

Compute Infrastructure

Hardware: Tesla T4 GPU
Software: PyTorch 2.4.1+cu121, Transformers 4.44.2
Date of Evaluation: October 18, 2024

Command

lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
    --tasks hellaswag \
    --device cuda \
    --batch_size auto:4 \
    --output_path hellaswag_test \
    --log_samples

Command output

Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (64,64,64,64,64)
|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |     0|acc     |↑  |0.2872|±  |0.0045|
|         |       |none  |     0|acc_norm|↑  |0.3082|±  |0.0046|

2024-10-18 12:25:25.770584: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-18 12:25:25.847675: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-18 12:25:25.887843: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-18 12:25:25.961158: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-18 12:25:27.647707: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-10-18:12:25:29,450 INFO     [__main__.py:279] Verbosity set to INFO
2024-10-18:12:25:42,060 INFO     [__main__.py:376] Selected Tasks: ['hellaswag']
2024-10-18:12:25:42,062 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2024-10-18:12:25:42,062 INFO     [evaluator.py:201] Initializing hf model, with arguments: {'pretrained': 'EleutherAI/pythia-160m', 'revision': 'step100000', 'dtype': 'float'}
2024-10-18:12:25:42,128 INFO     [huggingface.py:129] Using device 'cuda'
2024-10-18:12:25:42,395 INFO     [huggingface.py:481] Using model type 'default'
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
2024-10-18:12:25:42,769 INFO     [huggingface.py:365] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2024-10-18:12:25:56,709 WARNING  [model.py:422] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2024-10-18:12:25:56,710 INFO     [task.py:415] Building contexts for hellaswag on rank 0...
100%|██████████| 10042/10042 [00:05<00:00, 1695.72it/s]
2024-10-18:12:26:04,007 INFO     [evaluator.py:489] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 40168/40168 [03:53<00:00, 171.85it/s]
fatal: not a git repository (or any of the parent directories): .git
2024-10-18:12:30:36,510 INFO     [evaluation_tracker.py:206] Saving results aggregated
2024-10-18:12:30:36,524 INFO     [evaluation_tracker.py:287] Saving per-sample results for: hellaswag
