narcolepticchicken/occ-stack

narcolepticchicken committed 23 days ago

Commit 2fc31b2 (verified) · 1 Parent(s): 2957d72

Upload design.md

Files changed (1)

design.md +153 -0

design.md ADDED

	@@ -0,0 +1,153 @@

+# OCC Design Document
+## 1. Core Principles
+1. **Verified Impact First**: Credits are earned only after an oracle verifies marginal value.
+2. **Non-Transferable Credits**: Agents cannot launder credits through others.
+3. **Decaying Credits**: Hoarding is discouraged; use-it-or-lose-it dynamics.
+4. **Capability-Based Rights**: Rights are per-resource, not blanket access.
+5. **Auditable Accounting**: Every credit change has provenance.
+## 2. Impact Oracle
+### Scoring Modes
+**Code Tasks**
+- `unit_test_pass`: binary pass/fail
+- `pass_at_k`: fraction passing among k samples
+- `regression`: does the new state break prior passing tests?
+- `compute_comparison`: score normalized by tokens/FLOPs used
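The `pass_at_k` mode above is described as the fraction of k samples that pass. The pass@k metric commonly reported with HumanEval is a different quantity: the estimated probability that at least one of k samples passes, computed as 1 - C(n - c, k) / C(n, k) from n samples with c passes. A minimal sketch of both follows; the function names are hypothetical and the document does not say which variant the oracle uses.

```python
from math import comb


def fraction_passing(results: list[bool]) -> float:
    """Fraction of sampled completions that pass (the definition in the bullet)."""
    return sum(results) / len(results) if results else 0.0


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n = samples drawn, c = samples that passed, k = evaluation budget.
    """
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The two agree at k = 1 when all n samples are counted, but diverge for larger k, so the choice affects how Benchmark 1 numbers compare with published results.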
+**Retrieval QA Tasks**
+- `answer_correctness`: exact / fuzzy match to gold
+- `evidence_support`: NLI entailment check on retrieved evidence
+- `hallucination`: NLI contradiction or unsupported claims
+- `abstention_utility`: correct abstention on unanswerable questions
+- `calibration`: Brier score / ECE on confidence predictions
+- `proper_score`: proper scoring rule reward
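The `calibration` mode names the Brier score and ECE. A minimal sketch of both for binary correctness outcomes is shown here; the equal-width binning and the default of 10 bins are assumptions, not choices stated in this document.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared error between stated confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(
    confidences: list[float], outcomes: list[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

The Brier score is itself a proper scoring rule, so it could also back the `proper_score` mode; whether the oracle reuses it that way is not specified here.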
+**Multi-Agent Debate Tasks**
+- `decision_quality`: final answer correctness
+- `influence_efficiency`: marginal contribution per token/compute
+- `throughput`: decisions per compute unit
+### Reward Formula
+```
+reward = verified_task_score
+       + abstention_utility
+       + calibration_bonus
+       - hallucination_penalty
+       - confident_wrong_penalty
+       - compute_cost_penalty
+       - gaming_penalty
+where:
+  verified_task_score ∈ [0, 1]   (pass/fail or accuracy)
+  abstention_utility ∈ {-1, 0, +1}  (+1 for correct abstain, -1 for incorrect abstain, 0 otherwise)
+  calibration_bonus = (1 - brier_score) * 0.2
+  hallucination_penalty = contradiction_score * 0.5
+  confident_wrong_penalty = confidence * (1 - correct) * 0.3
+  compute_cost_penalty = (cost / budget) * 0.2
+  gaming_penalty = detected_pattern_penalty (see below)
+```
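The formula above translates directly into code. The sketch below uses the coefficients from the `where:` block; the `OracleSignals` container and the `compute_reward` name are hypothetical, and `gaming_penalty` is taken as a precomputed input because its exact form is defined by the detection rules rather than by this formula.

```python
from dataclasses import dataclass


@dataclass
class OracleSignals:
    verified_task_score: float   # in [0, 1]
    abstention_utility: int      # -1, 0, or +1
    brier_score: float
    contradiction_score: float
    confidence: float
    correct: float               # 1.0 if the answer was correct, else 0.0
    cost: float
    budget: float
    gaming_penalty: float = 0.0  # supplied by the gaming detector


def compute_reward(s: OracleSignals) -> float:
    """Combine oracle signals using the weights given in the reward formula."""
    if s.budget <= 0:
        raise ValueError("budget must be positive")
    calibration_bonus = (1 - s.brier_score) * 0.2
    hallucination_penalty = s.contradiction_score * 0.5
    confident_wrong_penalty = s.confidence * (1 - s.correct) * 0.3
    compute_cost_penalty = (s.cost / s.budget) * 0.2
    return (
        s.verified_task_score
        + s.abstention_utility
        + calibration_bonus
        - hallucination_penalty
        - confident_wrong_penalty
        - compute_cost_penalty
        - s.gaming_penalty
    )
```

For example, a correct, perfectly calibrated answer that uses half its budget scores 1.0 + 0.2 - 0.1 = 1.1, while a fully confident, contradicted, wrong answer that exhausts its budget scores -0.5 - 0.3 - 0.2 = -1.0.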
+### Gaming Detection
+- **Spam**: repeated low-value actions within short window → penalty
+- **Hoarding**: credit balance above threshold for N epochs → decay acceleration
+- **Transfer**: indirect credit laundering via coordinated task submission → ban
+- **Judge exploitation**: output distribution shift toward weak-judge preferences → KL penalty
+- **Over-abstention**: abstention rate > threshold → negative reward
+- **Verbose padding**: impact per token below threshold → penalty
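Two of the detectors above reduce to simple counters. The sketch below covers the spam and over-abstention rules only; the class and function names, the window length, the cutoffs, and the penalty magnitude are all hypothetical placeholders, since the document leaves the thresholds unspecified.

```python
from collections import deque


class SpamDetector:
    """Flags repeated low-value actions inside a sliding time window."""

    def __init__(self, window_seconds: float = 60.0, max_low_value: int = 3,
                 low_value_cutoff: float = 0.1):
        self.window_seconds = window_seconds
        self.max_low_value = max_low_value
        self.low_value_cutoff = low_value_cutoff
        self._times: deque = deque()  # timestamps of recent low-value actions

    def observe(self, timestamp: float, oracle_score: float) -> bool:
        """Record one action; return True if the spam pattern is detected."""
        if oracle_score < self.low_value_cutoff:
            self._times.append(timestamp)
        # Drop low-value actions that have aged out of the window.
        while self._times and timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_low_value


def over_abstention_penalty(abstained: list[bool], max_rate: float = 0.5,
                            penalty: float = 1.0) -> float:
    """Return a fixed penalty when the abstention rate exceeds max_rate."""
    if not abstained:
        return 0.0
    rate = sum(abstained) / len(abstained)
    return penalty if rate > max_rate else 0.0
```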
+## 3. Credit Ledger
+### Schema
+Each entry: `(agent_id, task_id, action_id, earned, spent, decayed, remaining, reason, oracle_score, compute_cost, timestamp, capability_scope)`
+### Rules
+1. **Non-transferable**: `transfer(from, to, amount)` always returns `False`.
+2. **Decay**: `remaining *= exp(-lambda * delta_t)` each evaluation cycle.
+3. **Task scope**: credits earned in task A cannot fund task B unless explicitly pooled.
+4. **Capability scope**: credits for "retrieval" cannot fund "file_write".
+5. **Revocation**: negative outcomes can revoke credits retroactively within a window.
+6. **Provenance**: every entry references an oracle decision hash.
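A minimal in-memory sketch of rules 1, 2, 4, and 6 follows. The `CreditLedger` class, its method names other than `transfer`, the dictionary-based entries, and the default decay rate are assumptions for illustration; task scoping and revocation (rules 3 and 5) are omitted to keep the example short.

```python
import math
from dataclasses import dataclass, field


@dataclass
class CreditLedger:
    decay_rate: float = 0.1  # lambda in the decay rule; the default is an assumption
    balances: dict = field(default_factory=dict)  # (agent_id, capability_scope) -> remaining
    entries: list = field(default_factory=list)   # append-only audit trail

    def earn(self, agent_id: str, scope: str, amount: float, oracle_hash: str) -> None:
        """Credit an agent within one capability scope, recording provenance."""
        key = (agent_id, scope)
        self.balances[key] = self.balances.get(key, 0.0) + amount
        self.entries.append({"agent_id": agent_id, "capability_scope": scope,
                             "earned": amount, "oracle_hash": oracle_hash})

    def spend(self, agent_id: str, scope: str, amount: float) -> bool:
        """Spend only from the matching scope; other scopes cannot fund it."""
        key = (agent_id, scope)
        if self.balances.get(key, 0.0) < amount:
            return False
        self.balances[key] -= amount
        self.entries.append({"agent_id": agent_id, "capability_scope": scope,
                             "spent": amount})
        return True

    def decay(self, delta_t: float) -> None:
        """Apply remaining *= exp(-lambda * delta_t) to every balance."""
        factor = math.exp(-self.decay_rate * delta_t)
        for key in self.balances:
            self.balances[key] *= factor

    def transfer(self, from_agent: str, to_agent: str, amount: float) -> bool:
        """Credits are non-transferable, so this always refuses."""
        return False
```

Keying balances by `(agent_id, capability_scope)` is one way to make rule 4 hold by construction: a "retrieval" balance is simply never consulted when the request is for "file_write".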
+## 4. Resource Broker
+### Decision Matrix
+| Condition | Decision |
+|-----------|----------|
+| credit >= threshold, low risk | `allow` |
+| credit < threshold, low risk | `deny` |
+| credit >= threshold, high risk | `require_approval` |
+| credit >= threshold, suspicious pattern | `downgrade` or `escalate` |
+| emergency override | `escalate` |
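The matrix above can be sketched as a single function. The table does not state how rows are prioritized or what happens when credit is below threshold at high risk, so the ordering below, the `deny` result for that uncovered case, and the rule for choosing between `downgrade` and `escalate` are assumptions; the `decide` name and its parameters are hypothetical.

```python
def decide(credit: float, threshold: float, high_risk: bool = False,
           suspicious: bool = False, emergency_override: bool = False) -> str:
    """Map broker conditions to a decision string from the matrix."""
    if emergency_override:
        return "escalate"
    if credit < threshold:
        # The matrix only covers the low-risk case; denying at high risk
        # as well is an assumption.
        return "deny"
    if suspicious:
        # The matrix allows either outcome; escalating only when the request
        # is also high risk is an assumption.
        return "escalate" if high_risk else "downgrade"
    if high_risk:
        return "require_approval"
    return "allow"
```

Checking `emergency_override` first means an override always reaches a human, even for an agent with no credit; a deployment might prefer a different precedence.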
+### Resources
+- `model_call_small` / `model_call_large`
+- `retrieval_call`
+- `verifier_call`
+- `debate_turn`
+- `file_write`
+- `shell_execute`
+- `memory_write`
+- `human_escalation`
+## 5. GRPO Hook
+We implement a reward function compatible with TRL's GRPOTrainer that maps Oracle outputs to per-group rewards. Since full training may be compute-limited, we provide:
+1. `reward_fn(completions, oracle_scores)` — returns tensor of rewards
+2. `GRPOHook` class — wraps Oracle + Ledger + Broker for online evaluation
+3. `OfflineComparator` — compares policies using saved trajectories when training is infeasible
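A dependency-free sketch of the first item follows. It returns a plain list of floats rather than a tensor, on the understanding that TRL's custom reward functions return one float per completion; confirm this against the TRL version in use. The second function shows the group-relative normalization that GRPO applies to such rewards, (r - group mean) / (group std + eps). The `group_relative_advantages` name, the use of population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def reward_fn(completions, oracle_scores, **kwargs):
    """Return one reward per completion from precomputed Oracle scores."""
    if len(completions) != len(oracle_scores):
        raise ValueError("need exactly one oracle score per completion")
    return [float(score) for score in oracle_scores]


def group_relative_advantages(rewards, group_size, eps=1e-4):
    """Normalize rewards within each group of completions for one prompt."""
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("rewards must split evenly into groups")
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mu = mean(group)
        sigma = pstdev(group)  # population std; an assumption
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages
```

A group whose completions all receive the same reward yields zero advantages, so the Oracle only produces a learning signal when it separates completions within a group.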
+## 6. Benchmarks
+### Benchmark 1: Code Compute Allocation
+- Dataset: `openai/openai_humaneval` or `evalplus/humanevalplus`
+- Baselines: fixed compute, verifier retries, OCC allocation
+- Metrics: pass@1, pass@k, tokens used, model calls, cost, compute saved at iso-accuracy
+### Benchmark 2: Retrieval QA
+- Dataset: synthetic grounded QA + adversarial evidence
+- Baselines: direct answer, RAG, RAG+verifier, OCC
+- Metrics: correctness, hallucination rate, abstention utility, ECE, retrieval calls, cost
+### Benchmark 3: Multi-Agent Debate
+- Dataset: synthetic factual disputes + code debates
+- Baselines: equal turns, majority vote, confidence-weighted, OCC
+- Metrics: decision quality, compute used, quality per GPU-second, bad-agent containment
+## 7. Ablations
+1. No credit ledger (oracle score used directly)
+2. Transferable credits
+3. Non-decaying credits
+4. No abstention reward
+5. No calibration bonus
+6. No cost penalty
+7. No anti-gaming penalty
+8. No broker (oracle score only)
+9. Broker with static rules
+10. Broker with learned/score-based rights
+## 8. Anti-Gaming Tests
+- Spam low-value actions
+- Hoard credits
+- Transfer credit indirectly
+- Exploit weak judge
+- Verbose but low-value debate turns
+- Over-abstention
+- Overuse retrieval
+- Manipulate confidence
+- Optimize for unit tests while breaking hidden tests
+- Collude in multi-agent debate
+Measure: gaming success rate, credit leakage, robustness under judge replacement, quality degradation, broker containment.