narcolepticchicken committed
Commit 2fc31b2 · verified · 1 Parent(s): 2957d72

Upload design.md

Files changed (1)
  1. design.md +153 -0
design.md ADDED
@@ -0,0 +1,153 @@
+ # OCC Design Document
+
+ ## 1. Core Principles
+
+ 1. **Verified Impact First**: Credits are earned only after an oracle verifies marginal value.
+ 2. **Non-Transferable Credits**: Credits cannot move between agents, so agents cannot launder them through others.
+ 3. **Decaying Credits**: Balances decay over time, which discourages hoarding (use it or lose it).
+ 4. **Capability-Based Rights**: Rights are granted per resource rather than as blanket access.
+ 5. **Auditable Accounting**: Every credit change records its provenance.
+
+ ## 2. Impact Oracle
+
+ ### Scoring Modes
+
+ **Code Tasks**
+ - `unit_test_pass`: binary pass/fail
+ - `pass_at_k`: fraction passing among k samples
+ - `regression`: whether the new state breaks previously passing tests
+ - `compute_comparison`: score normalized by tokens/FLOPs used
+
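The code-task scores above can be sketched in a few lines. This is an illustrative sketch, not the project's implementation: `fraction_passing` follows the `pass_at_k` description in this document, `unbiased_pass_at_k` is the conventional pass@k estimator (probability that at least one of k samples passes) shown for comparison, and `compute_normalized` with its `reference_tokens` parameter is a hypothetical normalization chosen only for illustration.

```python
from math import comb


def fraction_passing(results: list[bool]) -> float:
    """Fraction of the k sampled completions that pass, as `pass_at_k` is described above."""
    if not results:
        raise ValueError("need at least one sample")
    return sum(results) / len(results)


def unbiased_pass_at_k(n: int, c: int, k: int) -> float:
    """Conventional pass@k: chance that at least one of k samples, drawn
    without replacement from n samples of which c pass, is a passing one."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def compute_normalized(score: float, tokens_used: int, reference_tokens: int) -> float:
    """Hypothetical `compute_comparison`: scale the score by how the tokens
    used compare with a reference budget (an assumed normalization)."""
    if tokens_used <= 0:
        raise ValueError("tokens_used must be positive")
    return score * reference_tokens / tokens_used
```

The two pass-rate functions answer different questions, so a benchmark report should state which one it uses.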
+ **Retrieval QA Tasks**
+ - `answer_correctness`: exact or fuzzy match against the gold answer
+ - `evidence_support`: NLI entailment check on retrieved evidence
+ - `hallucination`: NLI contradiction or unsupported claims
+ - `abstention_utility`: correct abstention on unanswerable questions
+ - `calibration`: Brier score / ECE on confidence predictions
+ - `proper_score`: reward from a proper scoring rule
+
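The `calibration` metrics can be computed as below. This is a minimal sketch assuming binary outcomes (1 for a correct answer, 0 otherwise) and equal-width confidence bins; the function names and the `n_bins` default are illustrative choices, not taken from the design.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(
    confidences: list[float], outcomes: list[int], n_bins: int = 10
) -> float:
    """ECE: bin-size-weighted average of |mean confidence - accuracy| per bin."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("inputs must be non-empty and the same length")
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_confidence - accuracy)
    return ece
```

For example, two answers both given at confidence 0.75, one right and one wrong, fall in the same bin with accuracy 0.5, so the ECE is 0.25.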
+ **Multi-Agent Debate Tasks**
+ - `decision_quality`: final answer correctness
+ - `influence_efficiency`: marginal contribution per token/compute
+ - `throughput`: decisions per compute unit
+
+ ### Reward Formula
+
+ ```
+ reward = verified_task_score
+        + abstention_utility
+        + calibration_bonus
+        - hallucination_penalty
+        - confident_wrong_penalty
+        - compute_cost_penalty
+        - gaming_penalty
+
+ where:
+   verified_task_score ∈ [0, 1] (pass/fail or accuracy)
+   abstention_utility ∈ {-1, 0, +1} (+1 for correct abstain, -1 for incorrect abstain, 0 otherwise)
+   calibration_bonus = (1 - brier_score) * 0.2
+   hallucination_penalty = contradiction_score * 0.5
+   confident_wrong_penalty = confidence * (1 - correct) * 0.3
+   compute_cost_penalty = (cost / budget) * 0.2
+   gaming_penalty = detected_pattern_penalty (see below)
+ ```
+
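The formula above translates directly into code. The sketch below keeps the document's term names and coefficients; the `OracleSignals` container and the `reward` function name are hypothetical, and `gaming_penalty` is passed in as a precomputed number because its exact value is defined by the gaming detector.

```python
from dataclasses import dataclass


@dataclass
class OracleSignals:
    """Hypothetical container for the inputs of the reward formula."""
    verified_task_score: float   # in [0, 1]
    abstention_utility: int      # -1, 0, or +1
    brier_score: float
    contradiction_score: float
    confidence: float
    correct: bool
    cost: float
    budget: float
    gaming_penalty: float = 0.0  # supplied by gaming detection


def reward(s: OracleSignals) -> float:
    """Combine the oracle signals using the coefficients from the formula above."""
    if s.budget <= 0:
        raise ValueError("budget must be positive")
    calibration_bonus = (1.0 - s.brier_score) * 0.2
    hallucination_penalty = s.contradiction_score * 0.5
    confident_wrong_penalty = s.confidence * (1 - int(s.correct)) * 0.3
    compute_cost_penalty = (s.cost / s.budget) * 0.2
    return (
        s.verified_task_score
        + s.abstention_utility
        + calibration_bonus
        - hallucination_penalty
        - confident_wrong_penalty
        - compute_cost_penalty
        - s.gaming_penalty
    )
```

As a worked case, a correct answer at confidence 0.8 (Brier score 0.04) that uses half its budget earns 1.0 + 0.192 - 0.1 = 1.092, while a wrong answer at confidence 0.9 (Brier score 0.81) with contradiction score 0.6 and the full budget spent earns 0.038 - 0.3 - 0.27 - 0.2 = -0.732.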
+ ### Gaming Detection
+
+ - **Spam**: repeated low-value actions within a short window → penalty
+ - **Hoarding**: credit balance above threshold for N epochs → decay acceleration
+ - **Transfer**: indirect credit laundering via coordinated task submission → ban
+ - **Judge exploitation**: output distribution shift toward weak-judge preferences → KL penalty
+ - **Over-abstention**: abstention rate > threshold → negative reward
+ - **Verbose padding**: impact per token below threshold → penalty
+
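Three of the patterns above (spam, over-abstention, verbose padding) can be checked from an agent's action history alone. The sketch below is an assumption-heavy illustration: `ActionRecord`, `detect_gaming`, and every threshold default are hypothetical, and the design does not specify how the flags map to penalty sizes.

```python
from dataclasses import dataclass


@dataclass
class ActionRecord:
    """Hypothetical per-action record used only for this sketch."""
    timestamp: float  # seconds
    abstained: bool
    tokens: int
    impact: float     # oracle-verified value of the action


def detect_gaming(
    history: list[ActionRecord],
    *,
    window_s: float = 60.0,             # assumed "short window"
    spam_count: int = 5,                # assumed repeat count for spam
    low_value: float = 0.05,            # assumed low-value cutoff
    max_abstention_rate: float = 0.5,   # assumed threshold
    min_impact_per_token: float = 0.001,  # assumed threshold
) -> list[str]:
    """Return the names of the gaming patterns found in the history."""
    flags: list[str] = []
    if not history:
        return flags
    latest = max(r.timestamp for r in history)
    recent_low = [
        r for r in history
        if latest - r.timestamp <= window_s and r.impact < low_value
    ]
    if len(recent_low) >= spam_count:
        flags.append("spam")
    abstention_rate = sum(r.abstained for r in history) / len(history)
    if abstention_rate > max_abstention_rate:
        flags.append("over_abstention")
    total_tokens = sum(r.tokens for r in history)
    total_impact = sum(r.impact for r in history)
    if total_tokens > 0 and total_impact / total_tokens < min_impact_per_token:
        flags.append("verbose_padding")
    return flags
```

Hoarding, transfer, and judge exploitation need ledger state or output distributions, so they are left out of this sketch.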
+ ## 3. Credit Ledger
+
+ ### Schema
+
+ Each entry: `(agent_id, task_id, action_id, earned, spent, decayed, remaining, reason, oracle_score, compute_cost, timestamp, capability_scope)`
+
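One way to represent an entry is an immutable record whose fields follow the schema tuple in order. The field names come from the schema above; the types and the sample values are assumptions made for this sketch.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LedgerEntry:
    """One ledger row; field order matches the schema tuple. Types are assumed."""
    agent_id: str
    task_id: str
    action_id: str
    earned: float
    spent: float
    decayed: float
    remaining: float
    reason: str
    oracle_score: float
    compute_cost: float
    timestamp: float
    capability_scope: str


# Column names derived from the dataclass, useful for CSV or table headers.
SCHEMA_COLUMNS = tuple(f.name for f in fields(LedgerEntry))

# Illustrative values only.
entry = LedgerEntry(
    agent_id="agent-1", task_id="task-A", action_id="act-7",
    earned=1.0, spent=0.0, decayed=0.0, remaining=1.0,
    reason="unit_test_pass", oracle_score=1.0, compute_cost=0.2,
    timestamp=0.0, capability_scope="retrieval",
)
```

Freezing the dataclass means a balance change is recorded as a new entry rather than an in-place edit, which fits an append-only, auditable ledger.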
+ ### Rules
+
+ 1. **Non-transferable**: `transfer(from, to, amount)` always returns `False`.
+ 2. **Decay**: `remaining *= exp(-lambda * delta_t)` each evaluation cycle.
+ 3. **Task scope**: credits earned in task A cannot fund task B unless explicitly pooled.
+ 4. **Capability scope**: credits for "retrieval" cannot fund "file_write".
+ 5. **Revocation**: negative outcomes can revoke credits retroactively within a window.
+ 6. **Provenance**: every entry references an oracle decision hash.
+
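Rules 1 to 4 can be sketched with balances keyed by agent, task, and capability scope. This is an illustrative sketch rather than the project's ledger: the `CreditLedger` class, its method names other than `transfer`, and the keying scheme are assumptions, and revocation, pooling, and provenance hashes are omitted.

```python
import math


class CreditLedger:
    """Minimal ledger: balances are keyed by (agent_id, task_id, capability_scope)."""

    def __init__(self, decay_rate: float) -> None:
        self.decay_rate = decay_rate  # lambda in rule 2
        self._balances: dict[tuple[str, str, str], float] = {}

    def earn(self, agent_id: str, task_id: str, scope: str, amount: float) -> None:
        key = (agent_id, task_id, scope)
        self._balances[key] = self._balances.get(key, 0.0) + amount

    def balance(self, agent_id: str, task_id: str, scope: str) -> float:
        return self._balances.get((agent_id, task_id, scope), 0.0)

    def transfer(self, sender: str, receiver: str, amount: float) -> bool:
        """Rule 1: transfers are never performed."""
        return False

    def decay(self, delta_t: float) -> None:
        """Rule 2: remaining *= exp(-lambda * delta_t) for every balance."""
        factor = math.exp(-self.decay_rate * delta_t)
        for key in self._balances:
            self._balances[key] *= factor

    def spend(self, agent_id: str, task_id: str, scope: str, amount: float) -> bool:
        """Rules 3 and 4: only credits with the same task and scope can be spent."""
        key = (agent_id, task_id, scope)
        if self._balances.get(key, 0.0) < amount:
            return False
        self._balances[key] -= amount
        return True
```

With `decay_rate = ln 2`, a balance halves for every unit of `delta_t`, so 4.0 credits become 2.0 after one cycle of length 1.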
+ ## 4. Resource Broker
+
+ ### Decision Matrix
+
+ | Condition | Decision |
+ |-----------|----------|
+ | credit >= threshold, low risk | `allow` |
+ | credit < threshold, low risk | `deny` |
+ | credit >= threshold, high risk | `require_approval` |
+ | credit >= threshold, suspicious pattern | `downgrade` or `escalate` |
+ | emergency override | `escalate` |
+
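The matrix can be read as a small decision function. The sketch below is one possible reading under stated assumptions: the precedence of the checks is assumed, a suspicious pattern is mapped to `downgrade` (the matrix also permits `escalate`), and a request with credit below threshold and high risk, which the matrix does not list, is denied here.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"
    DOWNGRADE = "downgrade"
    ESCALATE = "escalate"


def decide(
    credit: float,
    threshold: float,
    *,
    high_risk: bool = False,
    suspicious: bool = False,
    emergency_override: bool = False,
) -> Decision:
    """Map a request to a broker decision following the matrix above."""
    if emergency_override:
        return Decision.ESCALATE
    if credit < threshold:
        # Assumption: the matrix only lists the low-risk case; deny regardless of risk.
        return Decision.DENY
    if suspicious:
        # Assumption: pick `downgrade`; the matrix also allows `escalate`.
        return Decision.DOWNGRADE
    if high_risk:
        return Decision.REQUIRE_APPROVAL
    return Decision.ALLOW
```

For example, `decide(5.0, 3.0, high_risk=True)` yields `Decision.REQUIRE_APPROVAL`, and the same request with credit 1.0 yields `Decision.DENY`.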
+ ### Resources
+
+ - `model_call_small` / `model_call_large`
+ - `retrieval_call`
+ - `verifier_call`
+ - `debate_turn`
+ - `file_write`
+ - `shell_execute`
+ - `memory_write`
+ - `human_escalation`
+
+ ## 5. GRPO Hook
+
+ We implement a reward function compatible with TRL's GRPOTrainer that maps oracle outputs to rewards for each group of sampled completions. Because full training may be compute-limited, we provide:
+
+ 1. `reward_fn(completions, oracle_scores)`: returns a tensor of rewards
+ 2. `GRPOHook` class: wraps Oracle + Ledger + Broker for online evaluation
+ 3. `OfflineComparator`: compares policies using saved trajectories when training is infeasible
+
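The first item can be sketched without importing TRL. The sketch assumes the oracle scores arrive already computed, one per completion, and it returns plain floats rather than a tensor to stay dependency-free; how `oracle_scores` reaches the function inside a trainer is not specified here. `group_advantages` is a hypothetical helper that shows GRPO-style normalization within each group (subtract the group mean, divide by the group standard deviation) and is not a claim about TRL's exact internals.

```python
from statistics import mean, pstdev


def reward_fn(completions: list[str], oracle_scores: list[float], **kwargs) -> list[float]:
    """Return one reward per completion, taken from the oracle scores."""
    if len(completions) != len(oracle_scores):
        raise ValueError("need exactly one oracle score per completion")
    return [float(score) for score in oracle_scores]


def group_advantages(rewards: list[float], group_size: int, eps: float = 1e-4) -> list[float]:
    """Normalize rewards within consecutive groups of `group_size` completions."""
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("rewards must split evenly into groups")
    advantages: list[float] = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        center = mean(group)
        spread = pstdev(group)  # population standard deviation of the group
        advantages.extend((r - center) / (spread + eps) for r in group)
    return advantages
```

A group whose completions all receive the same reward gets zero advantage for every member, so it contributes no learning signal; this is one reason the oracle's scores need enough resolution to separate completions within a group.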
+ ## 6. Benchmarks
+
+ ### Benchmark 1: Code Compute Allocation
+ - Dataset: `openai/openai_humaneval` or `evalplus/humanevalplus`
+ - Baselines: fixed compute, verifier retries, OCC allocation
+ - Metrics: pass@1, pass@k, tokens used, model calls, cost, compute saved at iso-accuracy
+
+ ### Benchmark 2: Retrieval QA
+ - Dataset: synthetic grounded QA + adversarial evidence
+ - Baselines: direct answer, RAG, RAG+verifier, OCC
+ - Metrics: correctness, hallucination rate, abstention utility, ECE, retrieval calls, cost
+
+ ### Benchmark 3: Multi-Agent Debate
+ - Dataset: synthetic factual disputes + code debates
+ - Baselines: equal turns, majority vote, confidence-weighted, OCC
+ - Metrics: decision quality, compute used, quality per GPU-second, bad-agent containment
+
+ ## 7. Ablations
+
+ 1. No credit ledger (oracle score used directly)
+ 2. Transferable credits
+ 3. Non-decaying credits
+ 4. No abstention reward
+ 5. No calibration bonus
+ 6. No cost penalty
+ 7. No gaming penalty
+ 8. No broker (oracle score only)
+ 9. Broker with static rules
+ 10. Broker with learned/score-based rights
+
+ ## 8. Anti-Gaming Tests
+
+ - Spam low-value actions
+ - Hoard credits
+ - Transfer credit indirectly
+ - Exploit weak judge
+ - Verbose but low-value debate turns
+ - Over-abstention
+ - Overuse retrieval
+ - Manipulate confidence
+ - Optimize for unit tests while breaking hidden tests
+ - Collude in multi-agent debate
+
+ Measure: gaming success rate, credit leakage, robustness under judge replacement, quality degradation, broker containment.