MoritzLaurer committed · verified
Commit b978fbd · 1 Parent(s): 81a6fdb

Update README.md

Files changed (1)
1. README.md +14 -15
README.md CHANGED
@@ -18,28 +18,27 @@ This task format is based on the Natural Language Inference task (NLI).
  The task is so universal that any classification task can be reformulated into this task.

  ## Training data
- The model was trained on a mixture of __33 datasets and 387 classes__ that have been reformatted into this universal format.
- 1. Five NLI datasets with ~885k texts: "mnli", "anli", "fever", "wanli", "ling"
- 2. 28 classification tasks reformatted into the universal NLI format. ~51k cleaned texts were used to avoid overfitting:
- 'amazonpolarity', 'imdb', 'appreviews', 'yelpreviews', 'rottentomatoes',
- 'emotiondair', 'emocontext', 'empathetic',
- 'financialphrasebank', 'banking77', 'massive',
- 'wikitoxic_toxicaggregated', 'wikitoxic_obscene', 'wikitoxic_threat', 'wikitoxic_insult', 'wikitoxic_identityhate',
- 'hateoffensive', 'hatexplain', 'biasframes_offensive', 'biasframes_sex', 'biasframes_intent',
- 'agnews', 'yahootopics',
- 'trueteacher', 'spam', 'wellformedquery',
- 'manifesto', 'capsotu'.
-
- See details on each dataset here: https://github.com/MoritzLaurer/zeroshot-classifier/blob/main/datasets_overview.csv
+
+ The model is trained on two types of fully commercially-friendly data:
+ 1. Synthetic data generated with [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
+ I first created a list of 500+ diverse text classification tasks for 25 professions in conversations with Mistral-large. The data was manually curated.
+ I then used this as seed data to generate several hundred thousand texts for the different tasks with Mixtral-8x7B-Instruct-v0.1.
+ The final dataset used is available in the [synthetic_zeroshot_mixtral_v0.1](https://huggingface.co/datasets/MoritzLaurer/synthetic_zeroshot_mixtral_v0.1) dataset
+ in the subset `mixtral_written_text_for_tasks_v4`. Data curation was done in multiple iterations and I will release more information on this process soon.
+ 2. Two commercially-friendly NLI datasets: ([MNLI](https://huggingface.co/datasets/nyu-mll/multi_nli), [FEVER-NLI](https://huggingface.co/datasets/fever)).
+ These datasets were added to increase generalization. Datasets like ANLI were excluded due to their non-commercial license.

  Note that compared to other NLI models, this model predicts two classes (`entailment` vs. `not_entailment`)
  as opposed to three classes (entailment/neutral/contradiction).

- The model was only trained on English data. For __multilingual use-cases__,
- I recommend machine translating texts to English with libraries like [EasyNMT](https://github.com/UKPLab/EasyNMT).
+ The model was only trained on English data. I will release a multilingual version of this model soon.
+ For __multilingual use-cases__,
+ I alternatively recommend machine translating texts to English with libraries like [EasyNMT](https://github.com/UKPLab/EasyNMT).
  English-only models tend to perform better than multilingual models and
  validation with English data can be easier if you don't speak all languages in your corpus.

+
+
  ### How to use the model
  #### Simple zero-shot classification pipeline
  ```python
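# A minimal sketch of the simple zero-shot classification pipeline, assuming
# the standard Hugging Face `transformers` API. The model id, example text,
# hypothesis template, and candidate labels below are illustrative
# placeholders, not taken from this commit.
from transformers import pipeline

# Hypothetical model id: substitute the actual model repo this README describes.
zeroshot_classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0",
)

text = "Angela Merkel is a politician in Germany and leader of the CDU"

# Each candidate label is verbalized into an NLI hypothesis via the template;
# the model then scores `entailment` vs. `not_entailment` for the
# (text, hypothesis) pair of every label.
hypothesis_template = "This text is about {}"
classes_verbalized = ["politics", "economy", "entertainment", "environment"]

output = zeroshot_classifier(
    text,
    classes_verbalized,
    hypothesis_template=hypothesis_template,
    multi_label=False,
)
print(output)
```

With `multi_label=False`, the pipeline normalizes the entailment scores across the candidate labels so the returned probabilities sum to 1; with `multi_label=True`, each label is scored independently, which suits texts that can belong to several classes at once.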