Jae-Won Chung committed commit 4a385c8 (1 parent: 8c6c688)

    Improve pegasus README.md

pegasus/README.md  CHANGED
@@ -47,9 +47,11 @@ $ pegasus b
 
 `b` stands for broadcast. Every command is run once on all (`hostname`, `gpu`) combinations.
 
-##
+## System benchmark
 
-
+This will benchmark each model and get you data for the columns `energy`, `throughput`, `latency`, and `response_length`.
+
+Use Pegasus to run benchmarks for all the models across all nodes.
 
 ```console
 $ cd pegasus
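For intuition, broadcast mode behaves roughly like the nested loop below. This is a minimal sketch of the semantics only, not how Pegasus is implemented; the hostnames, GPU indices, and benchmark command are hypothetical.

```sh
# Rough sketch of `pegasus b` semantics. Hostnames, GPU indices, and the
# command are hypothetical; Pegasus manages the real SSH sessions itself.
for host in node0 node1; do
  for gpu in 0 1; do
    # Every command runs once on every (hostname, gpu) combination.
    ssh "$host" "CUDA_VISIBLE_DEVICES=$gpu python benchmark.py" &
  done
done
wait  # block until all broadcast runs have finished
```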
@@ -59,9 +61,13 @@ $ pegasus q
 
 `q` stands for queue. Each command is run once on the next available (`hostname`, `gpu`) combination.
 
-
+After all the tasks finish, aggregate all the data into one node and run [`compute_system_metrics.py`](../scripts/compute_system_metrics.py) to generate CSV files that the leaderboard can display.
+
+## NLP benchmark
+
+We'll use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) to run models through three NLP datasets: ARC challenge (`arc`), HellaSwag (`hellaswag`), and TruthfulQA (`truthfulqa`).
 
-
+Use Pegasus to run benchmarks for all the models across all nodes.
 
 ```console
 $ cd pegasus
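The README does not spell out the aggregation step, so here is one possible shape for it, assuming results are written under `/data/leaderboard/results` on each node. The node names, paths, and the script's command line are assumptions; check `compute_system_metrics.py` for the arguments it actually takes.

```sh
# Copy results from the worker nodes onto this node (hypothetical paths).
for host in node1 node2; do
  rsync -az "$host:/data/leaderboard/results/" /data/leaderboard/results/
done
# Generate the CSV files the leaderboard displays; run from the repo root,
# and check the script itself for the flags it expects.
python scripts/compute_system_metrics.py
```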
@@ -69,18 +75,32 @@ $ cp nlp-eval.yaml queue.yaml
 $ pegasus q
 ```
 
-
+After all the tasks finish, aggregate all the data into one node and run [`aggregate_nlp_metrics.py`](../scripts/aggregate_nlp_metrics.py) to generate a single `score.csv` that the leaderboard can display.
 
-
+### Dealing with OOM
+
+Some tasks might run out of memory, in which case you should create a container with more GPUs:
+
+1. Create a container with two GPUs, for example:
 
 ```console
-$ docker run -dit
+$ docker run -dit \
+    --name leaderboard01 \
+    --gpus '"device=0,1"' \
+    -v /data/leaderboard:/data/leaderboard \
+    -v $HOME/workspace/leaderboard:/workspace/leaderboard \
+    mlenergy/leaderboard:latest bash
 ```
 
-2.
+2. Revise `nlp-eval.yaml` and run with Pegasus, or run directly like this on LLaMA 7B and ARC, for example:
 
 ```console
-$ docker exec
+$ docker exec leaderboard01 \
+    python lm-evaluation-harness/main.py \
+    --device cuda \
+    --no_cache \
+    --model hf-causal-experimental \
+    --model_args pretrained=/data/leaderboard/weights/metaai/llama-7B,trust_remote_code=True,use_accelerate=True \
+    --tasks arc_challenge \
+    --num_fewshot 25
 ```
-
-change `model`, `task` and `shot` to specific tasks
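If two GPUs still run out of memory, the same pattern extends to more devices. For example, a four-GPU container; the container name is arbitrary, and only `--name` and the `--gpus` device list differ from the two-GPU command above.

```console
$ # Example variation of the documented two-GPU command, not from the README.
$ docker run -dit \
    --name leaderboard0123 \
    --gpus '"device=0,1,2,3"' \
    -v /data/leaderboard:/data/leaderboard \
    -v $HOME/workspace/leaderboard:/workspace/leaderboard \
    mlenergy/leaderboard:latest bash
```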
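To cover the other two datasets, only `--tasks` and `--num_fewshot` should need to change. For example, HellaSwag with 10 few-shot examples; the 10-shot count follows the common Open LLM Leaderboard convention and is an assumption here, as is the exact task name, so check the harness's task list first.

```console
$ # `hellaswag` task name and 10-shot count are assumptions; verify first.
$ docker exec leaderboard01 \
    python lm-evaluation-harness/main.py \
    --device cuda \
    --no_cache \
    --model hf-causal-experimental \
    --model_args pretrained=/data/leaderboard/weights/metaai/llama-7B,trust_remote_code=True,use_accelerate=True \
    --tasks hellaswag \
    --num_fewshot 10
```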
scripts/{compute_metrics.py → compute_system_metrics.py}  RENAMED
File without changes