Model Card for EleutherAI/pythia-160m HellaSwag Evaluation
This model card presents the evaluation results of the EleutherAI/pythia-160m model on the HellaSwag task.
Model Details
Model Description
- Developed by: EleutherAI
- Model type: Causal Language Model
- Language(s): English
- License: Apache 2.0
- Evaluated checkpoint: EleutherAI/pythia-160m, revision step100000 (no fine-tuning was performed for this evaluation)
Model Sources
- Repository: EleutherAI/pythia-160m
- Paper: [More Information Needed]
Uses
Direct Use
This evaluation measures the model's zero-shot performance on HellaSwag, a multiple-choice sentence-completion benchmark that tests commonsense reasoning.
Out-of-Scope Use
This evaluation is specific to the HellaSwag task and may not be indicative of the model's performance on other tasks or in real-world applications.
Bias, Risks, and Limitations
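The evaluation results should be interpreted within the context of the HellaSwag task. With four candidate endings per example, chance accuracy is 25%, so the reported scores are only modestly above chance. The model may exhibit biases present in its training data or the evaluation dataset.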
Recommendations
Users should be aware of the model's limitations and consider additional evaluation on task-specific datasets before deployment in real-world applications.
How to Get Started with the Model
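To load the checkpoint evaluated here (revision step100000):
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the same intermediate checkpoint that was evaluated (training step 100,000)
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision="step100000")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision="step100000")
model.eval()  # inference only; disables dropout
# HellaSwag is scored by comparing the log-likelihoods the model assigns to each candidate ending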
Training Details
Training Data
No training was performed for this card; the model was only evaluated on the HellaSwag dataset. For more information, visit the HellaSwag dataset page.
Training Procedure
Training Hyperparameters
- Training regime: float32
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on the HellaSwag validation split, which consists of 10,042 examples. Each example has four candidate endings, giving 40,168 log-likelihood requests in total.
Metrics
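- Accuracy (acc): The proportion of examples for which the ending with the highest log-likelihood is the correct one.
- Normalized Accuracy (acc_norm): Accuracy computed after dividing each ending's log-likelihood by the ending's length, which reduces the bias toward shorter endings.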
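As a rough illustration of how the two metrics can disagree, the sketch below scores a single multiple-choice item from per-ending log-likelihoods. It assumes the lm-evaluation-harness convention that acc_norm divides each log-likelihood by the length of the ending string; the function name `score_item` and the example numbers are hypothetical and are not taken from this evaluation run.

```python
from typing import Dict, Sequence


def score_item(loglikelihoods: Sequence[float], endings: Sequence[str], gold: int) -> Dict[str, float]:
    """Return per-item acc and acc_norm (1.0 or 0.0) for one multiple-choice example."""
    indices = range(len(endings))
    # acc: pick the ending with the highest total log-likelihood.
    pred = max(indices, key=lambda i: loglikelihoods[i])
    # acc_norm: divide by ending length first (assumed: length in characters),
    # so longer endings are not penalized simply for being longer.
    pred_norm = max(indices, key=lambda i: loglikelihoods[i] / len(endings[i]))
    return {"acc": float(pred == gold), "acc_norm": float(pred_norm == gold)}


# Hypothetical two-ending item: the correct ending (index 1) is much longer.
endings = ["sits.", "walks to the kitchen and starts washing the dishes."]
loglikelihoods = [-6.0, -20.0]
result = score_item(loglikelihoods, endings, gold=1)
print(result)
```

In this toy case the short ending wins on raw log-likelihood, while the longer, correct ending wins after length normalization. This is the kind of effect that can make acc_norm differ from acc.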
Results
| Metric | Value | Standard Error |
|---|---|---|
| Accuracy | 0.28719 | 0.00452 |
| Normalized Accuracy | 0.30821 | 0.00461 |
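The reported standard errors are consistent with treating each of the 10,042 examples as an independent Bernoulli trial. The following is a minimal sketch of that check, assuming a simple binomial standard error and a normal-approximation 95% interval; the harness's own estimator may differ slightly (for example by using n - 1 in the denominator). The helper names are illustrative.

```python
import math

N_SAMPLES = 10042   # HellaSwag examples evaluated (from this card)
CHANCE = 0.25       # four candidate endings per example


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


def normal_ci(p: float, se: float, z: float = 1.96) -> tuple:
    """Approximate 95% confidence interval under a normal approximation."""
    return (p - z * se, p + z * se)


acc, acc_norm = 0.28719, 0.30821
acc_se = binomial_stderr(acc, N_SAMPLES)
acc_norm_se = binomial_stderr(acc_norm, N_SAMPLES)
acc_ci = normal_ci(acc, acc_se)

print(f"acc stderr      ~ {acc_se:.5f}")
print(f"acc_norm stderr ~ {acc_norm_se:.5f}")
print(f"acc 95% CI      ~ ({acc_ci[0]:.4f}, {acc_ci[1]:.4f})")
print(f"acc z-score vs. chance ~ {(acc - CHANCE) / acc_se:.1f}")
```

Under these assumptions the interval for accuracy lies entirely above the 25% chance level, so the small margin over chance is unlikely to be sampling noise.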
Environmental Impact
- Hardware Type: Tesla T4 GPU
- Hours used: Approximately 0.095 hours (341.39 seconds)
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications
Model Architecture and Objective
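EleutherAI/pythia-160m is a decoder-only causal language model based on the GPT-NeoX architecture, with approximately 162 million parameters. It is trained with a next-token prediction objective.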
Compute Infrastructure
- Hardware: Tesla T4 GPU
- Software: PyTorch 2.4.1+cu121, Transformers 4.44.2
- Date of Evaluation: October 18, 2024
Command
lm_eval --model hf \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks hellaswag \
--device cuda \
--batch_size auto:4 \
--output_path hellaswag_test \
--log_samples
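After the command finishes, the aggregated metrics can be read back from the directory passed to --output_path. The sketch below assumes that the harness writes a file matching results_*.json somewhere under that directory, and that metrics are stored under keys such as "acc,none" inside results -> hellaswag. Both the file naming and the key layout are assumptions about the harness version used here and should be checked against the actual output. The helper name `load_latest_results` is hypothetical.

```python
import json
from pathlib import Path


def load_latest_results(output_dir: str, task: str = "hellaswag") -> dict:
    """Return the metrics dict for `task` from the newest results_*.json under output_dir."""
    # Assumed layout: <output_dir>/<some subdirectory>/results_<timestamp>.json
    candidates = sorted(Path(output_dir).rglob("results_*.json"), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"no results_*.json found under {output_dir}")
    with candidates[-1].open() as f:
        data = json.load(f)
    return data["results"][task]


# Only attempt to read results if the output directory from the command exists.
if Path("hellaswag_test").exists():
    metrics = load_latest_results("hellaswag_test")
    # Assumed key names; .get() avoids a KeyError if the layout differs.
    print("acc:", metrics.get("acc,none"), "+/-", metrics.get("acc_stderr,none"))
    print("acc_norm:", metrics.get("acc_norm,none"), "+/-", metrics.get("acc_norm_stderr,none"))
```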
Command output
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (64,64,64,64,64)
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag| 1|none | 0|acc |↑ |0.2872|± |0.0045|
| | |none | 0|acc_norm|↑ |0.3082|± |0.0046|
2024-10-18 12:25:25.770584: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-18 12:25:25.847675: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-18 12:25:25.887843: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-18 12:25:25.961158: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-18 12:25:27.647707: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-10-18:12:25:29,450 INFO [__main__.py:279] Verbosity set to INFO
2024-10-18:12:25:42,060 INFO [__main__.py:376] Selected Tasks: ['hellaswag']
2024-10-18:12:25:42,062 INFO [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2024-10-18:12:25:42,062 INFO [evaluator.py:201] Initializing hf model, with arguments: {'pretrained': 'EleutherAI/pythia-160m', 'revision': 'step100000', 'dtype': 'float'}
2024-10-18:12:25:42,128 INFO [huggingface.py:129] Using device 'cuda'
2024-10-18:12:25:42,395 INFO [huggingface.py:481] Using model type 'default'
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
2024-10-18:12:25:42,769 INFO [huggingface.py:365] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2024-10-18:12:25:56,709 WARNING [model.py:422] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2024-10-18:12:25:56,710 INFO [task.py:415] Building contexts for hellaswag on rank 0...
100%|██████████| 10042/10042 [00:05<00:00, 1695.72it/s]
2024-10-18:12:26:04,007 INFO [evaluator.py:489] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 40168/40168 [03:53<00:00, 171.85it/s]
fatal: not a git repository (or any of the parent directories): .git
2024-10-18:12:30:36,510 INFO [evaluation_tracker.py:206] Saving results aggregated
2024-10-18:12:30:36,524 INFO [evaluation_tracker.py:287] Saving per-sample results for: hellaswag