inoki-giskard/scan-report-temp · Report for siebert/sentiment-roberta-large-english

Hey Team!🤗✨
We’re thrilled to share some amazing evaluation results that’ll make your day!🎉📊

We have identified 13 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).

👉Robustness issues (2)

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Robustness	major 🔴	—	Fail rate = 0.104	Transform to uppercase	91/872 tested samples (10.44%) changed prediction after perturbation

🔍✨Examples

When feature “text” is perturbed with the transformation “Transform to uppercase”, the model changes its prediction in 10.44% of the cases. We expected the predictions not to be affected by this transformation.

	text	Transform to uppercase(text)	Original prediction	Prediction after perturbation
1	unflinchingly bleak and desperate	UNFLINCHINGLY BLEAK AND DESPERATE	POSITIVE (p = 0.99)	NEGATIVE (p = 1.00)
6	a sometimes tedious film .	A SOMETIMES TEDIOUS FILM .	NEGATIVE (p = 1.00)	POSITIVE (p = 0.99)
20	pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins .	PUMPKIN TAKES AN ADMIRABLE LOOK AT THE HYPOCRISY OF POLITICAL CORRECTNESS , BUT IT DOES SO WITH SUCH AN UNEVEN TONE THAT YOU NEVER KNOW WHEN HUMOR ENDS AND TRAGEDY BEGINS .	NEGATIVE (p = 1.00)	POSITIVE (p = 1.00)
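
As a rough illustration of how a fail rate like the one above can be computed, here is a minimal sketch. It is not Giskard's implementation: `perturbation_fail_rate` and `toy_predict` are hypothetical names, and `toy_predict` is a deliberately case-sensitive stand-in for the real model so that the example runs on its own.

```python
from typing import Callable, Sequence


def perturbation_fail_rate(
    predict: Callable[[str], str],
    texts: Sequence[str],
    transform: Callable[[str], str],
) -> float:
    """Share of texts whose predicted label changes after `transform`."""
    changed = sum(predict(t) != predict(transform(t)) for t in texts)
    return changed / len(texts)


def toy_predict(text: str) -> str:
    """Hypothetical stand-in model: a case-sensitive keyword rule."""
    negative_cues = ("bleak", "tedious")
    return "NEGATIVE" if any(cue in text for cue in negative_cues) else "POSITIVE"


samples = [
    "unflinchingly bleak and desperate",
    "a sometimes tedious film .",
    "fun , flip and terribly hip bit of cinematic entertainment .",
]
toy_rate = perturbation_fail_rate(toy_predict, samples, str.upper)

# The figure reported above: 91 changed predictions out of 872 tested samples.
reported_rate = 91 / 872
print(f"toy fail rate: {toy_rate:.3f}, reported fail rate: {reported_rate:.4f}")
```

To try this against the real model, `toy_predict` would be swapped for a function that returns the model's predicted label for a single text.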

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Robustness	medium 🟡	—	Fail rate = 0.073	Add typos	59/808 tested samples (7.3%) changed prediction after perturbation

🔍✨Examples

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 7.3% of the cases. We expected the predictions not to be affected by this transformation.

	text	Add typos(text)	Original prediction	Prediction after perturbation
58	manages to be both repulsively sadistic and mundane .	manages to be oth repulsively sadistic and mundane .	POSITIVE (p = 0.98)	NEGATIVE (p = 1.00)
62	the primitive force of this film seems to bubble up from the vast collective memory of the combatants .	the lrimitive force f this film swems to bubble up frim tje vast cjollective memory lf thse ombatants .	NEGATIVE (p = 1.00)	POSITIVE (p = 1.00)
64	the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow .	the script kicks in , and mr. hartley 's fdistended pace and foot-draginf rhythms follow .	NEGATIVE (p = 1.00)	POSITIVE (p = 1.00)
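
The report does not say how the “Add typos” transformation is implemented. The sketch below is one plausible way to generate similar perturbations, assuming random character deletions, substitutions, and insertions on letters only. The function name `add_typos`, the `rate` parameter, and the seeding scheme are all assumptions made for illustration.

```python
import random
import string


def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly delete, substitute, or insert letters (hypothetical sketch)."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    out = []
    for ch in text:
        # Only letters are perturbed; spaces and punctuation are kept as-is.
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(("delete", "substitute", "insert"))
            if op == "delete":
                continue
            if op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
                continue
            # insert: keep the original letter and add a random one after it
            out.append(ch)
            out.append(rng.choice(string.ascii_lowercase))
            continue
        out.append(ch)
    return "".join(out)


original = "manages to be both repulsively sadistic and mundane ."
perturbed = add_typos(original, rate=0.1, seed=42)
print(perturbed)
```

Feeding `perturbed` and `original` to the model and comparing the predicted labels gives the same kind of check as the table above.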

👉Performance issues (11)

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	major 🔴	`text_length(text)` < 63.500 AND `text_length(text)` >= 53.500	Precision = 0.714	—	-22.36% than global

🔍✨Examples

For records in the dataset where `text_length(text)` < 63.500 AND `text_length(text)` >= 53.500, the Precision is 22.36% lower than the global Precision.

	text	text_length(text)	label	Predicted `label`
21	the iditarod lasts for days - this just felt like it did .	59	NEGATIVE	POSITIVE (p = 1.00)
58	manages to be both repulsively sadistic and mundane .	54	NEGATIVE	POSITIVE (p = 0.98)
92	you wo n't like roger , but you will quickly recognize him .	61	NEGATIVE	POSITIVE (p = 1.00)
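
For readers who want to reproduce this kind of slice comparison, here is a minimal sketch on made-up records. It assumes that POSITIVE is the class used for Precision and that the deviation is the relative difference between the slice metric and the global one. Neither assumption is stated in the report, and all names and data below are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, str, str]  # (text, true label, predicted label)


def precision(records: List[Record], positive: str = "POSITIVE") -> float:
    """True positives divided by all records predicted as `positive`."""
    predicted_pos = [r for r in records if r[2] == positive]
    true_pos = [r for r in predicted_pos if r[1] == positive]
    return len(true_pos) / len(predicted_pos)


def in_length_slice(text: str, low: float = 53.5, high: float = 63.5) -> bool:
    """Slice condition from the table: low <= text_length(text) < high."""
    return low <= len(text) < high


def relative_deviation(slice_metric: float, global_metric: float) -> float:
    """Assumed definition: (slice - global) / global."""
    return (slice_metric - global_metric) / global_metric


# Made-up toy records; the texts only matter through their length.
records: List[Record] = [
    ("x" * 55, "NEGATIVE", "POSITIVE"),
    ("x" * 60, "POSITIVE", "POSITIVE"),
    ("x" * 30, "POSITIVE", "POSITIVE"),
    ("x" * 100, "POSITIVE", "POSITIVE"),
    ("x" * 120, "NEGATIVE", "NEGATIVE"),
]
slice_records = [r for r in records if in_length_slice(r[0])]

global_precision = precision(records)
slice_precision = precision(slice_records)
deviation = relative_deviation(slice_precision, global_precision)
print(f"global={global_precision:.3f} slice={slice_precision:.3f} deviation={deviation:+.2%}")
```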

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	major 🔴	`avg_word_length(text)` >= 4.632 AND `avg_word_length(text)` < 4.726	Recall = 0.769	—	-17.50% than global

🔍✨Examples

For records in the dataset where `avg_word_length(text)` >= 4.632 AND `avg_word_length(text)` < 4.726, the Recall is 17.5% lower than the global Recall.

	text	avg_word_length(text)	label	Predicted `label`
87	jaglom ... put ( s ) the audience in the privileged position of eavesdropping on his characters	4.64706	POSITIVE	NEGATIVE (p = 1.00)
282	while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer	4.72414	POSITIVE	NEGATIVE (p = 1.00)
546	on the heels of the ring comes a similarly morose and humorless horror movie that , although flawed , is to be commended for its straight-ahead approach to creepiness .	4.63333	POSITIVE	NEGATIVE (p = 1.00)

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	major 🔴	`avg_whitespace(text)` < 0.178 AND `avg_whitespace(text)` >= 0.175	Recall = 0.769	—	-17.50% than global

🔍✨Examples

For records in the dataset where `avg_whitespace(text)` < 0.178 AND `avg_whitespace(text)` >= 0.175, the Recall is 17.5% lower than the global Recall.

	text	avg_whitespace(text)	label	Predicted `label`
87	jaglom ... put ( s ) the audience in the privileged position of eavesdropping on his characters	0.177083	POSITIVE	NEGATIVE (p = 1.00)
282	while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer	0.174699	POSITIVE	NEGATIVE (p = 1.00)
546	on the heels of the ring comes a similarly morose and humorless horror movie that , although flawed , is to be commended for its straight-ahead approach to creepiness .	0.177515	POSITIVE	NEGATIVE (p = 1.00)
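
The report does not define the `text_length`, `avg_word_length`, and `avg_whitespace` features. The sketch below uses assumed definitions that happen to reproduce the values shown for record 87, provided the sentence ends with a single trailing space (which the report does not display). Treat it as a plausible reconstruction rather than the scanner's actual code.

```python
def text_length(text: str) -> int:
    """Number of characters, whitespace included."""
    return len(text)


def avg_word_length(text: str) -> float:
    """Mean length of the whitespace-separated tokens."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)


def avg_whitespace(text: str) -> float:
    """Fraction of characters that are whitespace."""
    return sum(ch.isspace() for ch in text) / len(text)


# Record 87 from the tables above, with an assumed trailing space.
sentence = (
    "jaglom ... put ( s ) the audience in the privileged position "
    "of eavesdropping on his characters "
)

print(text_length(sentence))
print(round(avg_word_length(sentence), 5))
print(round(avg_whitespace(sentence), 6))
```

Under these definitions the two features are closely tied to each other, which may explain why the `avg_word_length` and `avg_whitespace` slices flag the same records with identical metrics.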

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	major 🔴	`text_length(text)` >= 163.500 AND `text_length(text)` < 179.500	Recall = 0.812	—	-12.86% than global

🔍✨Examples

For records in the dataset where `text_length(text)` >= 163.500 AND `text_length(text)` < 179.500, the Recall is 12.86% lower than the global Recall.

	text	text_length(text)	label	Predicted `label`
166	characters still need to function according to some set of believable and comprehensible impulses , no matter how many drugs they do or how much artistic license avary employs .	178	NEGATIVE	POSITIVE (p = 0.99)
266	a coda in every sense , the pinochet case splits time between a minute-by-minute account of the british court 's extradition chess game and the regime 's talking-head survivors .	179	POSITIVE	NEGATIVE (p = 0.95)
282	while there 's something intrinsically funny about sir anthony hopkins saying ` get in the car , bitch , ' this jerry bruckheimer production has little else to offer	166	POSITIVE	NEGATIVE (p = 1.00)

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	medium 🟡	`idx` < 802.500 AND `idx` >= 735.500	Precision = 0.854	—	-7.21% than global

🔍✨Examples

For records in the dataset where `idx` < 802.500 AND `idx` >= 735.500, the Precision is 7.21% lower than the global Precision.

	idx	label	Predicted `label`
736	736	NEGATIVE	POSITIVE (p = 1.00)
741	741	POSITIVE	NEGATIVE (p = 0.99)
752	752	NEGATIVE	POSITIVE (p = 0.99)

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	medium 🟡	`text_length(text)` < 93.500 AND `text_length(text)` >= 86.500	Precision = 0.857	—	-6.83% than global

🔍✨Examples

For records in the dataset where `text_length(text)` < 93.500 AND `text_length(text)` >= 86.500, the Precision is 6.83% lower than the global Precision.

	text	text_length(text)	label	Predicted `label`
102	does paint some memorable images ... , but makhmalbaf keeps her distance from the characters	93	POSITIVE	NEGATIVE (p = 1.00)
115	sam mendes has become valedictorian at the school for soft landings and easy ways out .	88	NEGATIVE	POSITIVE (p = 1.00)
519	moretti 's compelling anatomy of grief and the difficult process of adapting to loss .	87	NEGATIVE	POSITIVE (p = 1.00)

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	medium 🟡	`text_length(text)` >= 140.500 AND `text_length(text)` < 154.500	Precision = 0.862	—	-6.30% than global

🔍✨Examples

For records in the dataset where `text_length(text)` >= 140.500 AND `text_length(text)` < 154.500, the Precision is 6.3% lower than the global Precision.

	text	text_length(text)	label	Predicted `label`
95	this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .	146	NEGATIVE	POSITIVE (p = 1.00)
147	the talented and clever robert rodriguez perhaps put a little too much heart into his first film and did n't reserve enough for his second .	141	NEGATIVE	POSITIVE (p = 0.98)
494	it showcases carvey 's talent for voices , but not nearly enough and not without taxing every drop of one 's patience to get to the good stuff .	145	NEGATIVE	POSITIVE (p = 0.98)

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	medium 🟡	`text_length(text)` < 53.500 AND `text_length(text)` >= 46.500	Recall = 0.875	—	-6.16% than global

🔍✨Examples

For records in the dataset where `text_length(text)` < 53.500 AND `text_length(text)` >= 46.500, the Recall is 6.16% lower than the global Recall.

	text	text_length(text)	label	Predicted `label`
295	jones ... does offer a brutal form of charisma .	49	POSITIVE	NEGATIVE (p = 0.99)
436	trite , banal , cliched , mostly inoffensive .	47	NEGATIVE	POSITIVE (p = 0.99)
602	instead , he shows them the respect they are due .	51	POSITIVE	NEGATIVE (p = 1.00)

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	medium 🟡	`idx` < 518.500 AND `idx` >= 463.500	Recall = 0.879	—	-5.75% than global

🔍✨Examples

For records in the dataset where `idx` < 518.500 AND `idx` >= 463.500, the Recall is 5.75% lower than the global Recall.

	idx	label	Predicted `label`
464	464	POSITIVE	NEGATIVE (p = 1.00)
481	481	POSITIVE	NEGATIVE (p = 1.00)
494	494	NEGATIVE	POSITIVE (p = 0.98)

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	medium 🟡	`avg_word_length(text)` >= 4.509 AND `avg_word_length(text)` < 4.632	Precision = 0.871	—	-5.33% than global

🔍✨Examples

For records in the dataset where `avg_word_length(text)` >= 4.509 AND `avg_word_length(text)` < 4.632, the Precision is 5.33% lower than the global Precision.

	text	avg_word_length(text)	label	Predicted `label`
95	this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .	4.61538	NEGATIVE	POSITIVE (p = 1.00)
218	all that 's missing is the spontaneity , originality and delight .	4.58333	NEGATIVE	POSITIVE (p = 0.95)
300	fun , flip and terribly hip bit of cinematic entertainment .	4.54545	POSITIVE	NEGATIVE (p = 1.00)

Vulnerability	Level	Data slice	Metric	Transformation	Deviation
Performance	medium 🟡	`avg_whitespace(text)` < 0.182 AND `avg_whitespace(text)` >= 0.178	Precision = 0.871	—	-5.33% than global

🔍✨Examples

For records in the dataset where `avg_whitespace(text)` < 0.182 AND `avg_whitespace(text)` >= 0.178, the Precision is 5.33% lower than the global Precision.

	text	avg_whitespace(text)	label	Predicted `label`
95	this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .	0.178082	NEGATIVE	POSITIVE (p = 1.00)
218	all that 's missing is the spontaneity , originality and delight .	0.179104	NEGATIVE	POSITIVE (p = 0.95)
300	fun , flip and terribly hip bit of cinematic entertainment .	0.180328	POSITIVE	NEGATIVE (p = 1.00)
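
The report never states the global Precision and Recall, but they can be estimated from the rows above. The sketch below assumes each deviation is relative, so that slice = global × (1 − deviation), and back-computes the global value from every finding. The `findings` dictionary simply copies the metric and deviation columns from the tables; `implied_global` is a hypothetical helper.

```python
# (slice metric, percent lower than global), copied from the tables above.
findings = {
    "Precision": [
        (0.714, 22.36),
        (0.854, 7.21),
        (0.857, 6.83),
        (0.862, 6.30),
        (0.871, 5.33),
        (0.871, 5.33),
    ],
    "Recall": [
        (0.769, 17.50),
        (0.769, 17.50),
        (0.812, 12.86),
        (0.875, 6.16),
        (0.879, 5.75),
    ],
}


def implied_global(slice_metric: float, pct_lower: float) -> float:
    """Invert slice = global * (1 - pct_lower / 100) under the assumption above."""
    return slice_metric / (1 - pct_lower / 100)


implied = {
    name: [implied_global(metric, pct) for metric, pct in rows]
    for name, rows in findings.items()
}

for name, values in implied.items():
    print(f"{name}: implied global between {min(values):.4f} and {max(values):.4f}")
```

The back-computed values agree closely within each metric, at roughly 0.92 for Precision and 0.93 for Recall, which is consistent with the relative-deviation reading but does not prove it.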

Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.

💡 What's Next?

Check out the Giskard Space and improve your model.
The Giskard community is always buzzing with ideas. 🐢🤔 What do you want to see next? Your feedback is our favorite fuel, so drop your thoughts in the community forum! 🗣️💬 Together, we're building something extraordinary.

🙌 Big Thanks!

We're grateful to have you on this adventure with us. 🚀🌟 Here's to more breakthroughs, laughter, and code magic! 🥂✨ Keep hugging that code and spreading the love! 💻 #Giskard #Huggingface #AISafety 🌈👏 Your enthusiasm, feedback, and contributions are what we seek. 🌟 Keep being awesome!