This task format is based on the Natural Language Inference task (NLI).
The task is so universal that any classification task can be reformulated into this task.

## Training data

The model is trained on two types of fully commercially friendly data:

1. Synthetic data generated with [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
I first created a list of 500+ diverse text classification tasks for 25 professions in conversations with Mistral-large. The data was manually curated.
I then used this as seed data to generate several hundred thousand texts for the different tasks with Mixtral-8x7B-Instruct-v0.1.
The final dataset used is available in the [synthetic_zeroshot_mixtral_v0.1](https://huggingface.co/datasets/MoritzLaurer/synthetic_zeroshot_mixtral_v0.1) dataset
in the subset `mixtral_written_text_for_tasks_v4` (a loading sketch follows this list). Data curation was done in multiple iterations, and I will release more information on this process soon.
2. Two commercially friendly NLI datasets: [MNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) and [FEVER-NLI](https://huggingface.co/datasets/fever).
These datasets were added to increase generalization. Datasets like ANLI were excluded due to their non-commercial license.
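
For reference, here is a minimal sketch for loading and inspecting the synthetic subset with the Hugging Face `datasets` library. The repository and subset names come from the links above; the split and column names are not documented in this section, so the sketch simply prints whatever is available:

```python
from datasets import load_dataset

# Load the Mixtral-generated synthetic data; repo and subset names are taken
# from the model card above.
dataset = load_dataset(
    "MoritzLaurer/synthetic_zeroshot_mixtral_v0.1",
    name="mixtral_written_text_for_tasks_v4",
)
print(dataset)  # inspect the available splits and column names
```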

Note that compared to other NLI models, this model predicts two classes (`entailment` vs. `not_entailment`)
as opposed to three classes (entailment/neutral/contradiction).
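
To make the two-class setup concrete, here is a hedged sketch of the underlying NLI call. The model ID is a hypothetical placeholder, and the label order should be read from `model.config.id2label` rather than assumed:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "MoritzLaurer/<this-model>"  # hypothetical placeholder; use this model's Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# The classification question "is this text about politics?" reformulated as an NLI pair:
premise = "Angela Merkel is a politician in Germany and leader of the CDU"
hypothesis = "This text is about politics"

inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# This model outputs two classes (entailment vs. not_entailment) instead of the usual three.
probs = torch.softmax(logits, dim=-1)[0]
print({model.config.id2label[i]: round(float(p), 3) for i, p in enumerate(probs.tolist())})
```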

The model was only trained on English data. I will release a multilingual version of this model soon.
For __multilingual use cases__,
I recommend machine translating texts to English with libraries like [EasyNMT](https://github.com/UKPLab/EasyNMT) in the meantime.
English-only models tend to perform better than multilingual models, and
validation with English data can be easier if you don't speak all the languages in your corpus.

### How to use the model
#### Simple zero-shot classification pipeline
```python
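# The code in this section was truncated; below is a minimal usage sketch.
# Assumption: the model ID is a hypothetical placeholder; replace it with
# this model's actual Hugging Face Hub ID.
from transformers import pipeline

zeroshot_classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/<this-model>",  # placeholder
)

text = "Angela Merkel is a politician in Germany and leader of the CDU"
candidate_labels = ["politics", "economy", "entertainment", "environment"]
hypothesis_template = "This text is about {}"  # each label is inserted into {}

output = zeroshot_classifier(
    text,
    candidate_labels,
    hypothesis_template=hypothesis_template,
    multi_label=False,  # set True if multiple labels can apply at once
)
print(output)
```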