# Qwen3-8B Hanabi RLVR Fine-tuned (Step 110)
This is a Qwen3-8B model fine-tuned using Reinforcement Learning with Verifiable Rewards (RLVR) on Hanabi game tasks.
## Model Details
- Base Model: Qwen/Qwen3-8B
- Training Method: RLVR (Reinforcement Learning with Verifiable Rewards)
- Task: Hanabi cooperative card game
- Training Steps: 110
- Architecture: Transformer with flash attention (see the loading sketch after this list)
- Training Framework: Prime-Rel RL framework
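Since the card lists flash attention as part of the setup, the snippet below is a minimal sketch of loading the checkpoint with FlashAttention-2 at inference time. It assumes the `flash-attn` package and the `accelerate` package are installed and the GPU supports them; otherwise drop the `attn_implementation` argument (and `device_map`).

```python
import torch
from transformers import AutoModelForCausalLM

# Optional: load with FlashAttention-2. Assumes flash-attn is installed and the
# GPU supports it; device_map="auto" additionally requires accelerate.
model = AutoModelForCausalLM.from_pretrained(
    "Anonymouslolol/qwen3-8B-hanabi-step110",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```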
## Training Details
- Fine-tuned with reinforcement learning on Hanabi game scenarios, with a verifier scoring each proposed move to provide the reward signal (a hypothetical sketch is shown after this list)
- Trained on a multi-GPU setup (8x A100 GPUs)
- Optimized for strategic decision making in cooperative gameplay
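The training code is not included in this repository, so the following is only a minimal sketch of what a verifier-based reward for a single proposed move could look like. `HanabiState`, `is_legal_move`, and `apply` are assumed interfaces for illustration, not part of this repository or the Prime-Rel framework.

```python
from dataclasses import dataclass

# Hypothetical sketch only: HanabiState and its methods are assumed interfaces.
@dataclass
class HanabiState:
    score: int   # cards successfully played so far
    lives: int   # fuse tokens remaining

    def is_legal_move(self, move: str) -> bool: ...
    def apply(self, move: str) -> "HanabiState": ...

def verifier_reward(state: HanabiState, proposed_move: str) -> float:
    """Score a model-proposed move so it can serve as an RL reward."""
    if not state.is_legal_move(proposed_move):
        return -1.0                                   # illegal or unparsable move
    next_state = state.apply(proposed_move)
    reward = 0.1                                      # small bonus for a legal move
    reward += float(next_state.score - state.score)   # +1 per card successfully played
    reward -= 0.5 * (state.lives - next_state.lives)  # penalty for losing a life
    return reward
```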
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Anonymouslolol/qwen3-8B-hanabi-step110"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use for Hanabi game reasoning and strategic planning
```
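The prompt format used during training is not documented in this card. As a rough illustration only (the game-state text below is an assumption, not the training format), the loaded model can be queried for a move recommendation through the chat template:

```python
# Hypothetical prompt: the exact game-state encoding used during RLVR training
# is not documented here, so treat this formatting as an assumption.
game_state = (
    "Fireworks: R1 Y0 G2 W0 B1 | Hint tokens: 4 | Lives: 2\n"
    "Partner's hand: R3, Y1, G4, W5, B2\n"
    "Discards: R1, Y4\n"
    "What is your next move, and why?"
)
messages = [{"role": "user", "content": game_state}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```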
## Performance
- Specialized for Hanabi game strategy and reasoning
- Trained to make optimal moves in cooperative card games
- Fine-tuned for understanding game state and planning moves
## Citation

Trained using the Prime-Rel RL framework on the Hanabi benchmark.
## Files

- `pytorch_model.bin`: Main model weights (16GB)
- `config.json`: Model configuration
- `tokenizer.json`: Tokenizer data
- `vocab.json`, `merges.txt`: BPE tokenizer files
- `generation_config.json`: Generation settings
- `special_tokens_map.json`: Special token mappings