Qwen3-8B Hanabi RLVR Fine-tuned (Step 110)

This is a Qwen3-8B model fine-tuned using Reinforcement Learning with Verifiable Rewards (RLVR) on Hanabi game tasks.

Model Details

  • Base Model: Qwen/Qwen3-8B
  • Training Method: RLVR (Reinforcement Learning with Verifiable Rewards)
  • Task: Hanabi cooperative card game
  • Training Steps: 110
  • Architecture: Transformer with flash attention
  • Training Framework: prime-rl

Training Details

  • Fine-tuned on Hanabi game scenarios, with rewards computed by a programmatic verifier (a hypothetical verifier is sketched below)
  • Trained on a multi-GPU setup (8x A100 GPUs)
  • Optimized for strategic decision-making in cooperative gameplay
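
The exact reward function is not published with this card; the sketch below only illustrates the verifier-reward idea, and the move grammar, legality check, and reward values are illustrative assumptions.

import re

# Hypothetical verifier reward for Hanabi moves -- not the training code.
# The RLVR idea: a programmatic checker scores each completion, so no
# learned reward model is required.
MOVE_PATTERN = re.compile(
    r"\b(play|discard) slot [1-5]\b"
    r"|\bhint (red|green|blue|yellow|white|[1-5]) to player [1-4]\b"
)

def verify_move(completion: str, legal_moves: set[str]) -> float:
    """Return a scalar reward for the move named in a model completion."""
    match = MOVE_PATTERN.search(completion.lower())
    if match is None:
        return -1.0   # no recognizable move in the output
    if match.group(0) in legal_moves:
        return 1.0    # recognizable and legal in this game state
    return 0.0        # recognizable but illegal

# Example: score one completion against this turn's legal moves.
reward = verify_move(
    "I will play slot 2, since the hint implies it is the red 2.",
    legal_moves={"play slot 2", "discard slot 1", "hint red to player 1"},
)
print(reward)  # 1.0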

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Anonymouslolol/qwen3-8B-hanabi-step110")
tokenizer = AutoTokenizer.from_pretrained("Anonymouslolol/qwen3-8B-hanabi-step110")

# Use for Hanabi game reasoning and strategic planning
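
A fuller generation example is sketched below. The dtype/device options and the game-state prompt are assumptions; the card does not document the prompt format used during training.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Anonymouslolol/qwen3-8B-hanabi-step110"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # place layers on available GPUs automatically
)

# Hypothetical game-state prompt; the training prompt format is not documented here.
messages = [{
    "role": "user",
    "content": (
        "Hanabi state: fireworks R1 G0 B2 Y1 W0, 6 hint tokens, 2 lives left. "
        "Your hand: five cards, slot 2 is known to be red. "
        "What move do you make, and why?"
    ),
}]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))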

Performance

  • Specialized for Hanabi strategy and game-state reasoning
  • Trained to choose strong moves in cooperative card play
  • Fine-tuned for reading the current game state and planning the next move

Citation

Trained with the prime-rl framework on the Hanabi benchmark.

Files

  • pytorch_model.bin: Main model weights (16GB)
  • config.json: Model configuration
  • tokenizer.json: Tokenizer data
  • vocab.json, merges.txt: BPE tokenizer files
  • generation_config.json: Generation settings
  • special_tokens_map.json: Special token mappings
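
To fetch the files above without loading the model (for example, to serve the weights with a different runtime), a standard huggingface_hub download works; this is a generic sketch rather than a card-specific requirement.

from huggingface_hub import snapshot_download

# Download the full repository (weights, config, and tokenizer files) into the local cache.
local_path = snapshot_download("Anonymouslolol/qwen3-8B-hanabi-step110")
print(local_path)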