XNLI in Chinese is basically corrupted - that's why you got unreliable results

#1
by ZhangRC - opened

This is what XNLI in Chinese looks like:

[Image: image.png]

First, Chinese doesn't have spaces between words. Adding spaces between words is unacceptable.

Even after removing those spaces, the sentences remain incomprehensible, and some words are left untranslated, which is also unacceptable.
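The two surface problems described above (word spacing and leftover untranslated words) can be checked mechanically. The following is a minimal sketch, not the method used by anyone in this thread: the function names are hypothetical, the character ranges cover only the basic CJK ideograph block plus common CJK punctuation and full-width forms, and the sample sentences in the usage note are made up for illustration rather than taken from XNLI.

```python
import re

# Assumption: basic CJK Unified Ideographs, CJK punctuation (excluding the
# ideographic space U+3000), and full-width forms are enough for a rough check.
CJK = r"\u4e00-\u9fff\u3001-\u303f\uff00-\uffef"

# Whitespace that sits directly between two CJK characters.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{CJK}])\s+(?=[{CJK}])")

# Runs of two or more ASCII letters, a crude proxy for untranslated words.
_LATIN_RUN = re.compile(r"[A-Za-z]{2,}")


def has_word_spacing(text: str) -> bool:
    """Return True if whitespace appears between two CJK characters."""
    return _SPACE_BETWEEN_CJK.search(text) is not None


def remove_cjk_spacing(text: str) -> str:
    """Drop whitespace between CJK characters; spaces next to Latin text stay."""
    return _SPACE_BETWEEN_CJK.sub("", text)


def latin_runs(text: str) -> list[str]:
    """List ASCII-letter runs that may be untranslated words."""
    return _LATIN_RUN.findall(text)
```

For example, `remove_cjk_spacing("我 喜欢 吃 苹果 。")` collapses the sentence to `"我喜欢吃苹果。"`, and `latin_runs("他 去 了 store 买 东西")` reports `["store"]`. A check like this only catches formatting issues; it cannot tell whether the remaining text is meaningful Chinese, which still needs review by a fluent reader.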

Overall, the words look fragmented and stacked together, with little relation to one another, no clear meaning, and no sensible grammar.

From my perspective as a Chinese speaker, this is not Chinese at all. This shouldn't be used to train or test any model.

HuggingFaceFW org
edited Oct 24

Hi,
Indeed, you are right. That document looks absolutely terrible!
The task is not included in FineTasks because its signal metrics were very poor. Instead, we use OCNLI for NLI evaluation.
Since you are a native Chinese speaker, do you have any comments on the tasks we selected? Do any of them show similar problems?

Note: We are aware of the issues with evaluation in Chinese. We therefore took special care when creating the Chinese prompts, so our evals use correct punctuation and don't use word spacing.
image.png

Thank you for the clarification! Here are some of my suggestions:

  • Avoid uncommon terms, especially those related to Western cultures and religions. This includes religious terms (those related to Christianity can be removed completely), Western history (you can add Chinese history to compensate), and Western policy. These never appear in our daily communication.
  • Add some math and logic, as in the Gaokao and civil servant exams. Math and logic make up a large proportion of these exams, and Chinese people are known to be good at them.
  • Consider testing more on the Chinese language itself. Chinese has a long history and a rich vocabulary of phrases, idioms, and poems. These aspects can never be tested by verbatim translation from other languages.
HuggingFaceFW org
edited Oct 24

Hi

Avoid uncommon terms, especially those related to Western cultures and religions. This includes religious terms (those related to Christianity can be removed completely), Western history (you can add Chinese history to compensate), and Western policy. These never appear in our daily communication.

Yeah, I agree this is a problem. We try to minimize it by preferring native Chinese tasks. For example, we use CMMLU instead of a translated MMLU.

Add some math and logic, as in the Gaokao and civil servant exams. Math and logic make up a large proportion of these exams, and Chinese people are known to be good at them.

FineTasks contains the Chinese subset of AGIEval, which covers the tasks you recommend.

Consider testing more on the Chinese language itself. Chinese has a long history and a rich vocabulary of phrases, idioms, and poems. These aspects can never be tested by verbatim translation from other languages.

Unfortunately, we didn't find any task that targets Chinese grammar directly. If you know of one, we would be happy for you to contribute it so that we can include it in the next iteration: https://github.com/huggingface/lighteval/wiki/Contributing-to-multilingual-evaluations

