Lin-K76 committed on
Commit 10b7edc
1 Parent(s): 4d1a63a

Update README.md

Files changed (1):
  1. README.md (+70, -11)
README.md CHANGED
@@ -129,16 +129,9 @@ oneshot(
 
 ## Evaluation
 
-The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command.
-A modified version of ARC-C and GSM8k-cot was used for evaluations, in line with Llama 3.1's prompting. It can be accessed on the [Neural Magic fork of the lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct).
-Additional evaluations that were collected for the original Llama 3.1 models will be added in the future.
-```
-lm_eval \
-  --model vllm \
-  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",dtype=auto,gpu_memory_utilization=0.4,add_bos_token=True,max_model_len=4096 \
-  --tasks openllm \
-  --batch_size auto
-```
+The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA.
+Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
+This version of the lm-evaluation-harness includes versions of ARC-Challenge and GSM-8K that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
 
 ### Accuracy
 
@@ -224,4 +217,70 @@ lm_eval \
  <td><strong>99.48%</strong>
  </td>
  </tr>
-</table>
+</table>
+
+### Reproduction
+
+The results were obtained using the following commands:
+
+#### MMLU
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks mmlu \
+  --num_fewshot 5 \
+  --batch_size auto
+```
+
+#### ARC-Challenge
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks arc_challenge_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --batch_size auto
+```
+
+#### GSM-8K
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks gsm8k_cot_llama_3.1_instruct \
+  --apply_chat_template \
+  --num_fewshot 8 \
+  --batch_size auto
+```
+
+#### Hellaswag
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks hellaswag \
+  --num_fewshot 10 \
+  --batch_size auto
+```
+
+#### Winogrande
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks winogrande \
+  --num_fewshot 5 \
+  --batch_size auto
+```
+
+#### TruthfulQA
+```
+lm_eval \
+  --model vllm \
+  --model_args pretrained="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
+  --tasks truthfulqa_mc \
+  --num_fewshot 0 \
+  --batch_size auto
+```
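The recovery figure in the accuracy table above (e.g. the 99.48% average) is presumably the quantized model's score expressed as a percentage of the unquantized baseline's score. A minimal sketch of that calculation — the per-task numbers below are hypothetical, for illustration only, not values from the table:

```python
def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized accuracy as a percentage of the unquantized baseline."""
    return 100.0 * quantized_score / baseline_score

# Hypothetical per-task scores (in percent), NOT the README's actual results.
baseline = {"mmlu": 68.0, "gsm8k": 82.0}
quantized = {"mmlu": 67.6, "gsm8k": 81.7}

per_task = {task: recovery(quantized[task], baseline[task]) for task in baseline}
average = sum(per_task.values()) / len(per_task)

print({task: round(value, 2) for task, value in per_task.items()})
# → {'mmlu': 99.41, 'gsm8k': 99.63}
print(round(average, 2))
# → 99.52
```

A recovery near 100% indicates the FP8 quantization preserved almost all of the baseline model's benchmark accuracy.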