POC Real-Time Moderation Bot
I'm creating a proof-of-concept Discord moderation bot.
It's an early concept, but I plan on improving it a lot so that the AI model is involved in the decision to automatically delete a message, based on a severity level defined by the moderation team.
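Roughly, the flow looks like this (a minimal sketch of the idea, not my actual bot code; classify() is a placeholder for the call into the moderation model, and the threshold is an arbitrary example):

```python
# Minimal sketch of the flow using discord.py. classify() is a placeholder
# for whatever moderation model is used (e.g. Llama Guard served locally),
# and the severity threshold is an arbitrary example.
import discord

intents = discord.Intents.default()
intents.message_content = True  # needed to read message text
client = discord.Client(intents=intents)

DELETE_THRESHOLD = 7  # hypothetical cut-off defined by the moderation team

def classify(text: str) -> tuple[str, int]:
    """Placeholder: call the moderation model, return (category, severity 1-10)."""
    return ("safe", 0)

@client.event
async def on_message(message: discord.Message):
    if message.author.bot:
        return
    category, severity = classify(message.content)
    if category == "safe":
        return
    if severity >= DELETE_THRESHOLD:
        await message.delete()  # auto-delete only the worst cases
    else:
        # lower severity: leave the message up and flag it for manual review
        await message.channel.send(
            f"Flagged for review: {category} (severity {severity})"
        )

client.run("YOUR_BOT_TOKEN")  # placeholder token
```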
So far, running the model for almost a full day, it has caught many things slipping through the cracks of Discord's AutoMod, though those catches require manual review for now.
I will be releasing the code base later down the line, along with how I got it working and the systems I have in place; it even runs on a lower-powered spare laptop when I'm not using my main system.
I will also be exploring other possibilities, like using it with live chat from game servers that support it and sending automated player reports to sysadmins/game masters.
That being said, nice work on the model, and hopefully later down the line we could get a severity score system so it can estimate how bad something actually is. For example, "I will slap you with a stick" can be flagged as S1: Violent Crimes but read as playful banter and not that violent, so maybe a score of 0.3, whereas "I will slap you with a stick and shove it in your eye sockets" could be a 0.7, since it doesn't seem like a banter thing to say.
It could keep the same S1, S2, S3 system and just append a score, like S1.01-S1.10 (a scale from x.01 to x.10, using the decimal place for a score where 1 is the lowest and 10 is the highest).
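A tiny sketch of what I mean; the combined "S1.07"-style label is purely hypothetical (the model doesn't output this today), it just shows how a bot could parse and act on it:

```python
# Sketch of a combined category + severity label, e.g. "S1.03" or "S1.07".
# This label format is hypothetical; current output only gives the category.
from dataclasses import dataclass

@dataclass
class Verdict:
    category: str  # e.g. "S1" (Violent Crimes)
    score: int     # 1 (lowest) .. 10 (highest)

def parse_label(label: str) -> Verdict:
    """Parse a hypothetical label like 'S1.07' into category + severity score."""
    category, score = label.split(".")
    return Verdict(category=category, score=int(score))

def action_for(v: Verdict) -> str:
    """Map severity to a moderation action; the thresholds are arbitrary examples."""
    if v.score >= 7:
        return "delete_and_alert_mods"
    if v.score >= 4:
        return "flag_for_review"
    return "log_only"

print(action_for(parse_label("S1.03")))  # log_only  ("I will slap you with a stick")
print(action_for(parse_label("S1.07")))  # delete_and_alert_mods
```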
The model is great, but it's not ready for mass use in the open field; it does seem to work for lower-priority things, though human intervention is still required.
Maybe it could even be chained (e.g. with LangChain) to another model with better reasoning skills to give a second opinion based on chat context/history, maybe the last 20-30 messages, like the following (a rough sketch of this escalation step follows the example):
"we gaming tonight bois, i'm going to slaughter you all"
------- offtopic messages between -------
"I will destroy you in the halo game tonight"
------- offtopic messages between -------
"nah i will murder you" - is not safe
"nah i will murder you" on its own is deemed not safe by Llama Guard
--- Llama Guard flag triggers the chat-history gathering script ---
--- goes into LangChain mode and hands the previous message history to another reasoning LLM ---
--- Automated scripted reply: "{message} was deemed {harm type} {harm descriptor}, please read this conversation and provide a second opinion: {message history}" ---
"we gaming tonight bois, i'm going to slaughter you all"
------- offtopic messages between -------
"I will destroy you in the halo game tonight"
------- offtopic messages between -------
"nah i will murder you" - is safe with the context, per the second LLM's opinion