This task format is based on the Natural Language Inference task (NLI).
The task is so universal that any classification task can be reformulated into this task.

## Training data

The model is trained on two types of fully commercially friendly data:

1. Synthetic data generated with [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
I first created a list of 500+ diverse text classification tasks for 25 professions in conversations with Mistral-large. The data was manually curated.
I then used this as seed data to generate several hundred thousand texts for the different tasks with Mixtral-8x7B-Instruct-v0.1.
The final dataset used is available in the [synthetic_zeroshot_mixtral_v0.1](https://huggingface.co/datasets/MoritzLaurer/synthetic_zeroshot_mixtral_v0.1) dataset
in the subset `mixtral_written_text_for_tasks_v4` (a loading sketch follows this list). Data curation was done in multiple iterations, and I will release more information on this process soon.
2. Two commercially friendly NLI datasets: [MNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) and [FEVER-NLI](https://huggingface.co/datasets/fever).
These datasets were added to increase generalization. Datasets like ANLI were excluded due to their non-commercial license.
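
For reference, here is a minimal sketch for loading and inspecting the synthetic subset with the Hugging Face `datasets` library. The repository and subset names come from the links above; the split and column names are not documented in this section, so the sketch simply prints whatever is available:

```python
from datasets import load_dataset

# Load the Mixtral-generated synthetic data; repo and subset names are taken
# from the model card above.
dataset = load_dataset(
    "MoritzLaurer/synthetic_zeroshot_mixtral_v0.1",
    name="mixtral_written_text_for_tasks_v4",
)
print(dataset)  # inspect the available splits and column names
```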

Note that compared to other NLI models, this model predicts two classes (`entailment` vs. `not_entailment`)
as opposed to three classes (entailment/neutral/contradiction).
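
To make the two-class setup concrete, here is a hedged sketch of the underlying NLI call. The model ID is a hypothetical placeholder, and the label order should be read from `model.config.id2label` rather than assumed:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "MoritzLaurer/<this-model>"  # hypothetical placeholder; use this model's Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# The classification question "is this text about politics?" reformulated as an NLI pair:
premise = "Angela Merkel is a politician in Germany and leader of the CDU"
hypothesis = "This text is about politics"

inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# This model outputs two classes (entailment vs. not_entailment) instead of the usual three.
probs = torch.softmax(logits, dim=-1)[0]
print({model.config.id2label[i]: round(float(p), 3) for i, p in enumerate(probs.tolist())})
```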

The model was only trained on English data. I will release a multilingual version of this model soon.
For __multilingual use cases__,
I recommend machine translating texts to English with libraries like [EasyNMT](https://github.com/UKPLab/EasyNMT) in the meantime.
English-only models tend to perform better than multilingual models, and
validation with English data can be easier if you don't speak all the languages in your corpus.

### How to use the model
#### Simple zero-shot classification pipeline
```python
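# The code in this section was truncated; below is a minimal usage sketch.
# Assumption: the model ID is a hypothetical placeholder; replace it with
# this model's actual Hugging Face Hub ID.
from transformers import pipeline

zeroshot_classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/<this-model>",  # placeholder
)

text = "Angela Merkel is a politician in Germany and leader of the CDU"
candidate_labels = ["politics", "economy", "entertainment", "environment"]
hypothesis_template = "This text is about {}"  # each label is inserted into {}

output = zeroshot_classifier(
    text,
    candidate_labels,
    hypothesis_template=hypothesis_template,
    multi_label=False,  # set True if multiple labels can apply at once
)
print(output)
```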