amezasor committed on
Commit
529eb5f
1 Parent(s): 4ba1c97

data update

Files changed (1)
  1. README.md +8 -1
README.md CHANGED
@@ -301,7 +301,14 @@ print(output)
 
  <!-- TO DO: To be completed once the paper is ready; we may change the title to Supervised Finetuning -->
  ## Training Data
- This model is trained on a mix of open-source and proprietary datasets.
+ Granite Language Instruct models are trained on a collection of publicly available datasets with non-restrictive licenses, as well as an IBM collection of synthetic datasets. We annotated and filtered these datasets to include only high-quality instances from each of them in our final mixture. The resulting selection covers the following domains:
+
+ * English datasets: [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus), [WebInstructSub](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub), [OASST-OctoPack](https://huggingface.co/datasets/bigcode/oasst-octopack), [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater), [SoftAge-Multiturn](https://huggingface.co/datasets/SoftAge-AI/multi-turn_dataset), [Glaive-RAG-v1](https://huggingface.co/datasets/glaiveai/RAG-v1), [EvolKit-20k](https://huggingface.co/datasets/arcee-ai/EvolKit-20k), and [Magpie-Phi3-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Phi3-Pro-300K-Filtered).
+ * Multilingual datasets: [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) and IBM synthetic datasets (e.g., Blue Multilingual, Daring Anteater Translated).
+ * Code datasets: [Glaive Code Assistant V3](https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3), [SQL Create Context Instruction](https://huggingface.co/datasets/bugdaryan/sql-create-context-instruction), and [Self-OSS-Instruct-SC2](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), plus single- and multi-turn IBM synthetic datasets, including a set of datasets generated via the evol-instruct method.
+ * Math: [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA), [StackMathQA](https://huggingface.co/datasets/math-ai/StackMathQA), and [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct).
+ * Tools: [xlam-function-calling](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), [Glaive Function Calling V2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2), [Hermes Function Calling V1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1), and IBM synthetic API data.
+ * Safety: [SimpleSafetyTests](https://huggingface.co/datasets/Bertievidgen/SimpleSafetyTests), [HarmBench Behaviors](https://github.com/centerforaisafety/HarmBench/blob/main/data/behavior_datasets/harmbench_behaviors_text_all.csv), [Strong Reject](https://github.com/alexandrasouly/strongreject/blob/main/strongreject_dataset/strongreject_dataset.csv), [AdvBench](https://huggingface.co/datasets/walledai/AdvBench), [MistralGuard](https://huggingface.co/datasets/natolambert/xstest-v2-copy), [Do-Not-Answer](https://huggingface.co/datasets/LibrAI/do-not-answer), and IBM synthetic data for safety.
  <!-- ### Instruction Datasets
  * Language Instruction Datasets: We include high-quality datasets such as [TO DO: List of datasets]
  * Synthetic Instruction Datasets: [TO DO: paragraph about synthetic data]
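
Reviewer note: the added Training Data text says the datasets were annotated and filtered so that only high-quality instances reach the final mixture, but it does not say how. The sketch below is purely illustrative and is not IBM's pipeline. The `Instance` record, the `quality` score, the `min_quality` threshold, and the domain labels are all hypothetical names chosen for this example. It shows one simple way a per-domain mixture could be assembled from annotated records.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One annotated training example (hypothetical schema)."""
    domain: str     # e.g. "english", "code", "math" (illustrative labels)
    text: str
    quality: float  # hypothetical annotation score in [0, 1]


def build_mixture(instances, min_quality=0.8):
    """Keep instances scoring at or above min_quality, grouped by domain."""
    mixture = {}
    for inst in instances:
        if inst.quality >= min_quality:
            mixture.setdefault(inst.domain, []).append(inst)
    return mixture


if __name__ == "__main__":
    # Toy records standing in for annotated dataset rows.
    sample = [
        Instance("math", "Solve 2x + 3 = 7.", 0.93),
        Instance("math", "asdf ???", 0.12),
        Instance("code", "Write a SQL query that counts rows.", 0.85),
    ]
    for domain, kept in build_mixture(sample).items():
        print(domain, len(kept))
```

A threshold on a single score is the simplest possible filter. A real pipeline would likely combine several annotations (license, language, deduplication, safety) before mixing.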