Update README.md
README.md
CHANGED
@@ -57,14 +57,15 @@ The combined dataset [GIZ/policy_qa_v0_1](https://huggingface.co/datasets/GIZ/policy_qa_v0_1)
The pre-processing operations used to produce the final training dataset were as follows:

1. The dataset is filtered on the 'medium' value in the 'strategy' column (sequence length = 85).
2. For ClimateWatch, all rows are removed, as they were assessed to have no taxonomical alignment with the IKITracs labels inherent to the dataset.
3. For IKITracs, labels are assigned based on 'parameter' values, which correspond to human annotators' assessments of transport-related GHG targets (see the first sketch after this list). The specific assignments are as follows:
> - 'GHG': target_labels_ghg_yes = ['T_Transport_Unc','T_Transport_C']
> - 'NOT_GHG': target_labels_ghg_no = ['T_Adaptation_Unc', 'T_Adaptation_C', 'T_Transport_O_Unc', 'T_Transport_O_C']
> - 'NEGATIVE': random sample of other labeled data omitting the above labels
4. If 'context_translated' is available and the 'language' is not English, 'context' is replaced with 'context_translated'.
5. The dataset is "exploded", i.e., the text samples in the 'context' column, which are lists, are converted into separate rows, and the labels are merged so they stay aligned with the associated samples (steps 4-6 are covered in the second sketch below).
6. 'match_onanswer' and 'answerWordcount' are used together to select high-quality samples: a high percentage of word matches in 'match_onanswer' is preferred, but samples with a lower match rate are kept when 'answerWordcount' is high.
7. Data is then augmented using sentence shuffle from the `albumentations` library and NLP-based insertions using `nlpaug` (third sketch below).
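As a rough illustration of steps 1-3, the pandas sketch below assumes a DataFrame with 'strategy', 'source', and 'parameter' columns; the 'source' column name and the negative sampling fraction are assumptions for illustration, not details taken from the actual training code.

```python
# Sketch of steps 1-3, assuming a pandas DataFrame with 'strategy',
# 'source', and 'parameter' columns ('source' is an assumed column name).
import pandas as pd

target_labels_ghg_yes = ['T_Transport_Unc', 'T_Transport_C']
target_labels_ghg_no = ['T_Adaptation_Unc', 'T_Adaptation_C',
                        'T_Transport_O_Unc', 'T_Transport_O_C']

def assign_label(parameter: str) -> str:
    # Step 3: map an IKITracs 'parameter' value to a classification label.
    if parameter in target_labels_ghg_yes:
        return 'GHG'
    if parameter in target_labels_ghg_no:
        return 'NOT_GHG'
    return 'NEGATIVE'

def filter_and_label(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: keep only rows whose 'strategy' value is 'medium'
    # (sequence length = 85).
    df = df[df['strategy'] == 'medium']
    # Step 2: drop every ClimateWatch row (no taxonomical alignment).
    df = df[df['source'] != 'ClimateWatch']
    # Step 3: assign 'GHG' / 'NOT_GHG' / 'NEGATIVE' labels.
    df = df.assign(label=df['parameter'].map(assign_label))
    # 'NEGATIVE' keeps only a random sample of the remaining labeled
    # rows; the fraction is illustrative.
    neg = df[df['label'] == 'NEGATIVE'].sample(frac=0.2, random_state=0)
    return pd.concat([df[df['label'] != 'NEGATIVE'], neg],
                     ignore_index=True)
```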
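Steps 4-6 can be sketched in the same style; the 'en' language code and both selection thresholds are illustrative assumptions rather than the values used for the released model.

```python
# Sketch of steps 4-6, continuing from the filtered DataFrame above.
import pandas as pd

def substitute_translation(df: pd.DataFrame) -> pd.DataFrame:
    # Step 4: where a translation exists and the source language is not
    # English, replace 'context' with 'context_translated'
    # ('en' is an assumed language code).
    df = df.copy()
    mask = df['context_translated'].notna() & (df['language'] != 'en')
    df.loc[mask, 'context'] = df.loc[mask, 'context_translated']
    return df

def explode_and_select(df: pd.DataFrame,
                       match_threshold: float = 0.9,
                       wordcount_floor: int = 20) -> pd.DataFrame:
    # Step 5: 'context' holds lists of text samples; explode() converts
    # each list element into its own row, repeating the other columns
    # (including 'label') so labels stay aligned with the samples.
    df = df.explode('context').reset_index(drop=True)
    # Step 6: prefer a high share of word matches in 'match_onanswer',
    # but keep lower-scoring rows when 'answerWordcount' is high.
    # Both thresholds are illustrative.
    keep = ((df['match_onanswer'] >= match_threshold)
            | (df['answerWordcount'] >= wordcount_floor))
    return df[keep]
```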
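For step 7, the sketch below pairs a plain-Python sentence shuffle (standing in for the `albumentations` transform named above) with `nlpaug`'s contextual word augmenter for the NLP-based insertions; the BERT checkpoint and the call pattern around it are assumed configuration.

```python
# Sketch of step 7: sentence shuffle plus NLP-based insertions.
import random

import nlpaug.augmenter.word as naw

def shuffle_sentences(text: str, rng: random.Random) -> str:
    # Plain-Python stand-in for the sentence-shuffle transform; splits
    # naively on '. ' for illustration.
    sentences = [s for s in text.split('. ') if s]
    rng.shuffle(sentences)
    return '. '.join(sentences)

# NLP-based insertions: a masked language model proposes extra words.
# The BERT checkpoint is an assumed choice.
inserter = naw.ContextualWordEmbsAug(model_path='bert-base-uncased',
                                     action='insert')

def augment_contexts(texts, seed=0):
    rng = random.Random(seed)
    augmented = []
    for text in texts:
        augmented.append(shuffle_sentences(text, rng))
        out = inserter.augment(text)  # recent nlpaug returns a list
        augmented.append(out[0] if isinstance(out, list) else out)
    return augmented
```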
## Training procedure