Modality,Suggested Evaluation,What it is evaluating,Considerations,Link,URL
Text,Word Embedding Association Test (WEAT),Associations in word embeddings based on the Implicit Association Test (IAT),"Although grounded in human associations, general societal attitudes do not always represent subgroups of people and cultures.",Semantics derived automatically from language corpora contain human-like biases,https://researchportal.bath.ac.uk/en/publications/semantics-derived-automatically-from-language-corpora-necessarily
Text,Word Embedding Factual Association Test (WEFAT),Associations in word embeddings based on the Implicit Association Test (IAT),"Although grounded in human associations, general societal attitudes do not always represent subgroups of people and cultures.",Semantics derived automatically from language corpora contain human-like biases,https://researchportal.bath.ac.uk/en/publications/semantics-derived-automatically-from-language-corpora-necessarily
Text,Sentence Encoder Association Test (SEAT),Associations in sentence encoders based on the Implicit Association Test (IAT),"Although grounded in human associations, general societal attitudes do not always represent subgroups of people and cultures.",[1903.10561] On Measuring Social Biases in Sentence Encoders,https://arxiv.org/abs/1903.10561
Text,Contextual Word Representation Association Tests for social and intersectional biases,Associations in contextualized word representations based on the Implicit Association Test (IAT),"Although grounded in human associations, general societal attitudes do not always represent subgroups of people and cultures.",Assessing social and intersectional biases in contextualized word representations | Proceedings of the 33rd International Conference on Neural Information Processing Systems,https://dl.acm.org/doi/abs/10.5555/3454287.3455472
Text,StereoSet,Protected class stereotypes,Automating stereotype detection makes distinguishing harmful stereotypes difficult. It also raises many false positives and can flag relatively neutral associations based in fact (e.g. population X has a high proportion of lactose-intolerant people).,[2004.09456] StereoSet: Measuring stereotypical bias in pretrained language models,https://arxiv.org/abs/2004.09456
Text,CrowS-Pairs,Protected class stereotypes,Automating stereotype detection makes distinguishing harmful stereotypes difficult. It also raises many false positives and can flag relatively neutral associations based in fact (e.g. population X has a high proportion of lactose-intolerant people).,[2010.00133] CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models,https://arxiv.org/abs/2010.00133
Text,HONEST: Measuring Hurtful Sentence Completion in Language Models,Protected class stereotypes and hurtful language,Automating stereotype detection makes distinguishing harmful stereotypes difficult. It also raises many false positives and can flag relatively neutral associations based in fact (e.g. population X has a high proportion of lactose-intolerant people).,HONEST: Measuring Hurtful Sentence Completion in Language Models,https://aclanthology.org/2021.naacl-main.191.pdf
Text,TANGO: Towards Centering Transgender and Non-Binary Voices to Measure Biases in Open Language Generation,"Bias measurement for the trans and non-binary community via gender non-affirmative language, specifically 1) misgendering and 2) negative responses to gender disclosure","Based on automatic evaluations of the resulting open language generation, so it may be sensitive to the choice of evaluator; we would advise a combination of the Perspective, Detoxify, and Regard metrics.",Paper / Dataset,
Text,BBQ: A hand-built bias benchmark for question answering,Protected class stereotypes,,BBQ: A hand-built bias benchmark for question answering,https://aclanthology.org/2022.findings-acl.165.pdf
Text,"BBNLI, bias in NLI benchmark",Protected class stereotypes,,On Measuring Social Biases in Prompt-Based Multi-Task Learning,https://aclanthology.org/2022.findings-naacl.42.pdf
Text,WinoGender,Bias between gender and occupation,,Gender Bias in Coreference Resolution,https://arxiv.org/abs/1804.09301
Text,WinoQueer,Bias related to gender and sexuality,,WinoQueer,https://arxiv.org/abs/2306.15087
Text,Level of caricature,,,CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations,https://arxiv.org/abs/2310.11501
Text,SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models,,,[2305.11840] SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models,https://arxiv.org/abs/2305.11840
Text,"Investigating Subtler Biases in LLMs: Ageism, Beauty, Institutional, and Nationality Bias in Generative Models",,,https://arxiv.org/abs/2309.08902,https://arxiv.org/abs/2309.08902
Text,ROBBIE: Robust Bias Evaluation of Large Generative Language Models,,,[2311.18140] ROBBIE: Robust Bias Evaluation of Large Generative Language Models,https://arxiv.org/abs/2311.18140
Image,Image Embedding Association Test (iEAT),Embedding associations,,"Image Representations Learned With Unsupervised Pre-Training Contain Human-like Biases | Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency",https://dl.acm.org/doi/abs/10.1145/3442188.3445932
Image,Dataset leakage and model leakage,Gender and label bias,,[1811.08489] Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in Deep Image Representations,https://arxiv.org/abs/1811.08489
Image,Grounded-WEAT,Joint vision and language embeddings,,Measuring Social Biases in Grounded Vision and Language Embeddings - ACL Anthology,https://aclanthology.org/2021.naacl-main.78/
Image,Grounded-SEAT,Joint vision and language embeddings,,Measuring Social Biases in Grounded Vision and Language Embeddings - ACL Anthology,https://aclanthology.org/2021.naacl-main.78/
Image,CLIP-based evaluation,"Gender, race, and class associations with four attribute categories (profession, political, object, and other)",,DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers,https://arxiv.org/abs/2202.04053
Image,Human evaluation,,,,
Image,,,,[2108.02818] Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications,https://arxiv.org/abs/2108.02818
Image,,,,[2004.07173] Bias in Multimodal AI: Testbed for Fair Automatic Recruitment,https://arxiv.org/abs/2004.07173
Image,Characterizing the variation in generated images,,,Stable Bias: Analyzing Societal Representations in Diffusion Models,https://arxiv.org/abs/2303.11408
Image,Stereotypical representation of professions,,,Editing Implicit Assumptions in Text-to-Image Diffusion Models (see Section 6),
Image,Effect of different scripts on text-to-image generation,"Evaluates generated images for cultural stereotypes when prompts use different scripts (homoglyphs); measures a model's susceptibility to producing cultural stereotypes simply by switching the script.",,Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis,https://arxiv.org/pdf/2209.08891.pdf
Image,Automatic classification of the immorality of images,,,Ensuring Visual Commonsense Morality for Text-to-Image Generation,https://arxiv.org/pdf/2209.08891.pdf
Image,ENTIGEN: effect on the diversity of the generated images when adding an ethical intervention,,,How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?,
Image,Evaluating text-to-image models for (complex) biases,,,Easily accessible text-to-image generation amplifies demographic stereotypes at large scale,https://dl.acm.org/doi/abs/10.1145/3593013.3594095
Image,,,,FACET: Fairness in Computer Vision Evaluation Benchmark,https://openaccess.thecvf.com/content/ICCV2023/html/Gustafson_FACET_Fairness_in_Computer_Vision_Evaluation_Benchmark_ICCV_2023_paper.html
Image,Evaluating text-to-image models for occupation-gender biases from source to output,"Measuring bias from source to output (dataset, model, and outcome). Using different prompts to search the dataset and to generate images, then evaluating them in turn for stereotypes.","Evaluating for social attributes that one self-identifies with, e.g. gender, is challenging in computer-generated images.",Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness,https://arxiv.org/abs/2302.10893
Image,Evaluating text-to-image models for gender biases in a multilingual setting,Using different prompts in different languages to generate images and evaluating them in turn for stereotypes.,,Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You,
Image,Evaluating text-to-image models for biases when multiple people are generated,"This work focuses on generating images depicting multiple people, moving the evaluation beyond single-person portraits.","Same considerations as the other evaluations of social attributes; in addition, evaluating for location in the image is difficult because the models have no inherent spatial understanding.",The Male CEO and the Female Assistant: Probing Gender Biases in Text-To-Image Models Through Paired Stereotype Test,https://arxiv.org/abs/2402.11089
Image,Multimodal Composite Association Score: Measuring Gender Bias in Generative Multimodal Models,Measures associations between concepts in multimodal settings (image and text),,,
Image,VisoGender,"This work measures gender-occupation biases in image-to-text models by evaluating: (1) their ability to correctly resolve the pronouns of individuals in scenes, and (2) the perceived gender of individuals in images retrieved for gender-neutral search queries.",Relies on annotators’ perceptions of binary gender. Could better control for the fact that models generally struggle with captioning any scene that involves interactions between two or more individuals.,VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution,https://proceedings.neurips.cc/paper_files/paper/2023/hash/c93f26b1381b17693055a611a513f1e9-Abstract-Datasets_and_Benchmarks.html
Audio,Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators,Lists harms of audio/speech generators,"Not an evaluation per se, but a good source of taxonomy; useful for pointing readers towards high-level evaluations.",https://arxiv.org/pdf/2402.01708.pdf,https://arxiv.org/pdf/2402.01708.pdf
Video,Diverse Misinformation: Impacts of Human Biases on Detection of Deepfakes on Networks,Human-led evaluations of deepfakes to understand susceptibility and representational harms (including political violence),"Representational harm, incitement to violence",https://arxiv.org/abs/2210.10026,https://arxiv.org/abs/2210.10026
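
Several of the text rows above (WEAT, WEFAT, SEAT, iEAT, Grounded-WEAT/SEAT) build on the same core statistic: the association-test effect size from Caliskan et al. (2017). The sketch below is a minimal illustration of that statistic, using hypothetical 2-D toy vectors in place of real word embeddings (the actual benchmarks use embeddings such as GloVe or word2vec, plus a permutation test for significance that is omitted here):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    # s(w, A, B): mean cosine similarity of target w to attribute set A,
    # minus its mean cosine similarity to attribute set B
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # WEAT effect size: difference of the mean associations of the two
    # target sets, normalized by the std-dev over all target words
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Toy example: X-targets point towards attribute A, Y-targets towards B,
# so the effect size comes out positive (a stereotypical association).
A = [[1.0, 0.0]]               # attribute set A (e.g. "pleasant" words)
B = [[0.0, 1.0]]               # attribute set B (e.g. "unpleasant" words)
X = [[1.0, 0.1], [0.9, 0.0]]   # target set X
Y = [[0.1, 1.0], [0.0, 0.9]]   # target set Y
print(weat_effect_size(X, Y, A, B))  # positive: X is closer to A, Y to B
```

Swapping X and Y flips the sign, which is why the benchmarks report signed effect sizes per target/attribute pairing rather than a single score.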