JordiBayarri committed on
Commit
7eeb193
1 Parent(s): e281914

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -341,11 +341,13 @@ We used [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) library. We aligned the
 
  #### Summary
 
- To compare Aloe with the most competitive open models (both general purpose and healthcare-specific) we use popular healthcare datasets (PubMedQA, MedMCQA, MedQA and MMLU for six medical tasks only), together with the new and highly reliable CareQA. We produce the standard MultiMedQA score for reference, by computing the weighted average accuracy on all scores except CareQA. Additionally, we calculate the arithmetic mean across all datasets. The Medical MMLU is calculated by averaging the six medical subtasks: Anatomy, Clinical knowledge, College Biology, College medicine, Medical genetics, and Professional medicine.
 
- Benchmark results indicate the training conducted on Aloe has boosted its performance above Llama3-8B-Instruct. Llama3-Aloe-8B-Alpha outperforms larger models like Meditron 70B, and is close to larger base models, like Yi-34. For the former, this gain is consistent even when using SC-CoT, using their best-reported variant. All these results make Llama3-Aloe-8B-Alpha the best healthcare LLM of its size.
 
- With the help of prompting techniques the performance of Llama3-Aloe-8B-Alpha is significantly improved. Medprompting in particular provides a 7% increase in reported accuracy, after which Llama3-Aloe-8B-Alpha only lags behind the ten times bigger Llama-3-70B-Instruct. This improvement is mostly consistent across medical fields. Llama3-Aloe-8B-Alpha with medprompting beats the performance of Meditron 70B with their self reported 20 shot SC-CoT in MMLU med and is slightly worse in the other benchmarks.
 
  ## Environmental Impact
 
 
  #### Summary
 
+ To compare Aloe with the most competitive open models (both general-purpose and healthcare-specific), we use popular healthcare datasets (PubMedQA, MedMCQA, MedQA and MMLU for six medical tasks only), together with the new and highly reliable CareQA. However, while MCQA benchmarks provide valuable insights into a model's ability to handle structured queries, they fall short of representing the full range of challenges faced in medical practice. Building upon this idea, Aloe-Beta represents the next step in the evolution of the Aloe Family, designed to broaden the scope beyond the multiple-choice question-answering tasks that define Aloe-Alpha.
 
+ Benchmark results indicate that the training conducted on Aloe has boosted its performance, achieving results comparable to SOTA models such as Llama3-OpenBioLLM, Llama3-Med42, MedPalm-2 and GPT-4. Llama3.1-Aloe-Beta-70B also outperforms the other existing medical models on the OpenLLM Leaderboard and in the evaluation of other medical tasks, such as Medical Factuality and Medical Treatment recommendations, among others. All these results make Llama3.1-Aloe-Beta-70B one of the best existing models for healthcare.
+
+ With the help of prompting techniques, the performance of Llama3-Aloe-8B-Beta improves significantly. Medprompting in particular provides a 4% increase in reported accuracy, after which Llama3.1-Aloe-Beta-70B outperforms all existing models that do not use RAG evaluation.
+
 
  ## Environmental Impact
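
For reference, the aggregate scores mentioned in the earlier revision of this section (the MultiMedQA weighted average over all datasets except CareQA, the arithmetic mean across all datasets, and the Medical MMLU average over its six subtasks) can be computed along the following lines. This is only an illustrative sketch: all accuracies and question counts below are placeholders, and weighting MultiMedQA by question count is an assumption rather than the exact recipe used in the Aloe evaluation.

```python
# Illustrative sketch (not taken from the Aloe evaluation code): aggregate
# per-dataset accuracies into the summary scores described in the README.
# All accuracies and question counts are placeholders, and weighting
# MultiMedQA by question count is an assumption.

# Medical MMLU: unweighted mean of the six medical subtasks
mmlu_subtasks = {
    "Anatomy": 0.70,
    "Clinical knowledge": 0.74,
    "College Biology": 0.78,
    "College medicine": 0.65,
    "Medical genetics": 0.80,
    "Professional medicine": 0.66,
}
mmlu_medical = sum(mmlu_subtasks.values()) / len(mmlu_subtasks)

# Per-dataset accuracy and (hypothetical) number of questions
results = {
    "PubMedQA": {"acc": 0.75, "n": 500},
    "MedMCQA": {"acc": 0.60, "n": 4183},
    "MedQA": {"acc": 0.62, "n": 1273},
    "MMLU (medical)": {"acc": mmlu_medical, "n": 1089},
    "CareQA": {"acc": 0.70, "n": 5000},
}

# MultiMedQA: accuracy weighted by question count, CareQA excluded
mmqa = {k: v for k, v in results.items() if k != "CareQA"}
multimedqa = sum(v["acc"] * v["n"] for v in mmqa.values()) / sum(v["n"] for v in mmqa.values())

# Arithmetic mean across all datasets, CareQA included
mean_all = sum(v["acc"] for v in results.values()) / len(results)

print(f"Medical MMLU:           {mmlu_medical:.3f}")
print(f"MultiMedQA (weighted):  {multimedqa:.3f}")
print(f"Mean over all datasets: {mean_all:.3f}")
```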