Jae-Won Chung committed on
Commit 4a385c8
1 Parent(s): 8c6c688

Improve pegasus README.md

pegasus/README.md CHANGED
@@ -47,9 +47,11 @@ $ pegasus b
 
 `b` stands for broadcast. Every command is run once on all (`hostname`, `gpu`) combinations.
 
- ## Benchmark
+ ## System benchmark
 
- Now use Pegasus to run benchmarks for all the models across all nodes.
+ This will benchmark each model and get you data for the columns `energy`, `throughput`, `latency`, and `response_length`.
+
+ Use Pegasus to run benchmarks for all the models across all nodes.
 
 ```console
 $ cd pegasus
@@ -59,9 +61,13 @@ $ pegasus q
 
 `q` stands for queue. Each command is run once on the next available (`hostname`, `gpu`) combination.
 
- ## NLP-eval
+ After all the tasks finish, aggregate all the data into one node and run [`compute_system_metrics.py`](../scripts/compute_system_metrics.py) to generate CSV files that the leaderboard can display.
+
+ ## NLP benchmark
+
+ We'll use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) to run models through three NLP datasets: ARC challenge (`arc`), HellaSwag (`hellaswag`), and TruthfulQA (`truthfulqa`).
 
- Now use Pegasus to run benchmarks for all the models across all nodes.
+ Use Pegasus to run benchmarks for all the models across all nodes.
 
 ```console
 $ cd pegasus
@@ -69,18 +75,32 @@ $ cp nlp-eval.yaml queue.yaml
 $ pegasus q
 ```
 
- for some tasks, if the cuda memory of a single gpu is not enough, you can use more GPUs like follows
+ After all the tasks finish, aggregate all the data into one node and run [`aggregate_nlp_metrics.py`](../scripts/aggregate_nlp_metrics.py) to generate a single `score.csv` that the leaderboard can display.
 
- 1. create a larger docker with more gpus, e.g. 2 gpus:
+ ### Dealing with OOM
+
+ Some tasks might run out of memory, in which case you should create a container with more GPUs:
+
+ 1. Create a container with two GPUs, for example:
 
 ```console
- $ docker run -dit --name leaderboard_nlp_tasks --gpus '"device=0,1"' -v /data/leaderboard:/data/leaderboard -v $HOME/workspace/leaderboard:/workspace/leaderboard ml-energy:latest bash
+ $ docker run -dit \
+     --name leaderboard01 \
+     --gpus '"device=0,1"' \
+     -v /data/leaderboard:/data/leaderboard \
+     -v $HOME/workspace/leaderboard:/workspace/leaderboard \
+     mlenergy/leaderboard:latest bash
 ```
 
- 2. then run the specific task with Pegasus or directly run with
+ 2. Revise `nlp-eval.yaml` and run with Pegasus, or run directly like this on LLaMA 7B and ARC, for example:
 
 ```console
- $ docker exec leaderboard_nlp_tasks python lm-evaluation-harness/main.py --device cuda --no_cache --model hf-causal-experimental --model_args pretrained={{model}},trust_remote_code=True,use_accelerate=True --tasks {{task}} --num_fewshot {{shot}}
+ $ docker exec leaderboard01 \
+     python lm-evaluation-harness/main.py \
+         --device cuda \
+         --no_cache \
+         --model hf-causal-experimental \
+         --model_args pretrained=/data/leaderboard/weights/metaai/llama-7B,trust_remote_code=True,use_accelerate=True \
+         --tasks arc_challenge \
+         --num_fewshot 25
 ```
-
- change `model`, `task` and `shot` to specific tasks
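For the other two NLP datasets, only `--tasks` and `--num_fewshot` change. Below is a sketch of the HellaSwag variant of the command added above; it assumes the pinned lm-evaluation-harness commit exposes the task names `hellaswag` and `truthfulqa_mc`, and that the usual 10-shot (HellaSwag) and 0-shot (TruthfulQA) settings apply, so verify both against `nlp-eval.yaml` before running.

```console
$ docker exec leaderboard01 \
    python lm-evaluation-harness/main.py \
        --device cuda \
        --no_cache \
        --model hf-causal-experimental \
        --model_args pretrained=/data/leaderboard/weights/metaai/llama-7B,trust_remote_code=True,use_accelerate=True \
        --tasks hellaswag \
        --num_fewshot 10
```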
 
 
scripts/{compute_metrics.py → compute_system_metrics.py} RENAMED
File without changes
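The README additions say to aggregate results onto one node before running `compute_system_metrics.py` (system benchmark) and `aggregate_nlp_metrics.py` (NLP benchmark). A minimal sketch of that aggregation step, assuming results are collected under `/data/leaderboard/results` and `node02` is one of the other benchmark nodes; both names are illustrative, and the two scripts' actual command-line arguments should be checked in `scripts/` before use.

```console
$ # Illustrative only: pull results from a hypothetical second node, then post-process.
$ rsync -av node02:/data/leaderboard/results/ /data/leaderboard/results/
$ python scripts/compute_system_metrics.py   # system benchmark -> CSV files (check its CLI)
$ python scripts/aggregate_nlp_metrics.py    # NLP benchmark -> score.csv (check its CLI)
```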