narcolepticchicken committed on
Commit
adf7987
·
verified ·
1 Parent(s): a4cbe91

Upload design.md

Files changed (1)
  1. design.md +249 -95
design.md CHANGED
@@ -1,132 +1,286 @@
1
- # OCC Design Document
2
 
3
- ## Philosophy
4
 
5
- Compute is the scarce resource in modern agent systems. Every action (a tool call, a retrieval, a debate turn, or a verification pass) costs tokens, GPU seconds, or API dollars. Current systems allocate compute upfront (fixed budget) or reactively (retry on failure). OCC proposes a **proactive, earned-compute** model where agents must demonstrate marginal value before receiving more resources.
6
 
7
- ## Core Thesis
8
 
9
- > "Agents should earn compute, not spend it."
10
 
11
- The system is named Oracle-Credit-Compute because every compute decision flows through three stages:
12
- 1. **Oracle:** Score the marginal impact of an action
13
- 2. **Credit:** Update the agent's credit balance based on that score
14
- 3. **Compute:** The broker decides whether to grant the requested resource
15
 
16
- ## Architecture
17
 
18
- ### Impact Oracle
19
 
20
- **Design principle:** Rule-based, auditable, resistant to reward hacking.
 
 
 
 
 
21
 
22
- Neural reward models are vulnerable to Goodhart's Law and reward hacking (Gao et al., 2023; Skalse et al., 2022). A neural RM can be optimized to produce high scores without producing correct answers. OCC uses rule-based scoring with the following properties:
23
 
24
- - **Verifiable outcomes:** Code correctness is checked by running tests. QA correctness is checked against gold answers. Debate quality is checked against ground truth.
25
- - **Cost-adjusted scores:** Every score subtracts a compute cost penalty. This prevents agents from achieving correctness through brute-force token spending.
26
- - **Proper scoring rules:** Calibration bonus via Brier score encourages well-calibrated confidence, not just correctness.
27
- - **Anti-gaming detectors:** Explicit checks for hidden-test gaming, spam, collusion, and over-abstention.
28
 
29
- ### Credit Ledger
30
-
31
- **Design principle:** Non-transferable, decaying, capability-scoped.
32
-
33
- - **Non-transferable:** `transfer()` always returns `False`. This prevents colluding agents from pooling credits or laundering them through intermediaries.
34
- - **Exponential decay:** Idle credits decay at rate λ per time step. This prevents hoarding and encourages agents to use credits or lose them.
35
- - **Capability-scoped:** Credits are scoped to specific capabilities (`retrieval`, `model_call`, `file_write`). An agent that is good at retrieval should not automatically get dangerous write permissions.
36
- - **Full provenance:** Every entry has an oracle score, compute cost, timestamp, and reason. This enables auditing and debugging.
37
 
38
- ### Resource Broker
 
 
 
 
39
 
40
- **Design principle:** Risk-adjusted, capability-based, dynamic.
41
 
42
- Resources are classified by risk:
43
- - **Low:** `retrieval_call`, `debate_turn`: threshold 0.5 credits
- - **Medium:** `model_call`, `verifier_call`, `memory_write`: threshold 2.0 credits
- - **High:** `file_write`, `shell_execute`, `human_escalation`: threshold 5.0 credits, may require approval
46
 
47
- The broker can make six decisions:
48
- - `ALLOW`: credits ≥ threshold, no flags
- - `DENY`: credits < threshold × 0.5
- - `REQUIRE_APPROVAL`: high-risk + high risk score
- - `DOWNGRADE`: credits between 0.5× and 1.0× threshold → downgrade to cheaper resource
52
- - `ESCALATE`: repeated denials from same agent
53
- - `ASK_JUSTIFICATION`: credits insufficient but agent has some history
54
 
55
- ### GRPO/RL Hook
 
 
 
 
 
 
56
 
57
- **Design principle:** Reward = verified impact - compute cost.
58
 
59
- The reward function wraps the Impact Oracle and produces a scalar reward per completion. It is designed to be passed directly to TRL's `GRPOTrainer` as `reward_funcs`.
 
60
 
61
- The offline comparator allows policy comparison without training:
62
- 1. Generate trajectories from two policies on the same test set
63
- 2. Score both with the same reward hook
64
- 3. Compare mean rewards, win rates, and failure rates
 
65
 
66
- ## Reward Formula
67
 
68
  ```
69
- reward =
70
- verified_task_score
71
- + abstention_utility
72
- + calibration_bonus
73
- - hallucination_penalty
74
- - confident_wrong_penalty
75
- - compute_cost_penalty
76
- - gaming_penalty
77
 
78
- Where:
79
- verified_task_score = correctness * weight_correctness
80
- abstention_utility = +1.0 if correct abstain, -1.0 if wrong abstain
81
- calibration_bonus = (1 - brier_score) * weight_calibration
82
- brier_score = (confidence - outcome)^2
83
- hallucination_penalty = 2.0 if entailment < 0.5 and contradiction > 0.5
84
- confident_wrong_penalty = 3.0 if confidence > 0.8 and correctness < 0.5
85
- compute_cost_penalty = compute_cost * 0.0001
86
- gaming_penalty = 2.0 if hidden_tests fail while public pass
87
  ```
88
 
89
- ## Anti-Gaming Design
90
 
91
- ### Spam Attacks
92
- - Detection: Repeated low-value actions (compute > 100, raw_score < 0.5)
93
- - Containment: Oracle subtracts gaming_penalty. Ledger can revoke all credits on explicit detection.
94
 
95
- ### Hidden-Test Gaming
96
- - Detection: `public_pass=True` but `hidden_pass=False`
97
- - Containment: Immediate gaming_penalty=2.0 subtracted from raw score.
 
 
98
 
99
- ### Credit Laundering
100
- - Prevention: `transfer()` returns `False` unconditionally.
101
 
102
- ### Credit Hoarding
103
- - Prevention: Exponential decay on idle credits.
104
 
105
- ### Over-Abstention
106
- - Detection: Agent abstains on answerable questions.
107
- - Containment: Wrong abstentions get -abstention_bonus (-1.0).
108
 
109
- ### Confidence Manipulation
110
- - Detection: Brier score in calibration bonus.
111
- - Containment: Overconfident wrong answers get confident_wrong_penalty=3.0.
 
 
 
 
 
 
 
 
 
112
 
113
- ## Compute Budgeting
114
 
115
- The system assumes a fixed compute budget per task. The broker enforces this by:
116
- 1. Tracking cumulative compute cost in the ledger entries
117
- 2. Denying requests when the agent's credit balance is below the threshold
118
- 3. Downgrading to cheaper resources when balance is marginal
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
119
 
120
- ## Failure Modes
121
 
122
- 1. **Oracle brittleness:** If the scoring rules are incomplete, agents will find and exploit the gaps.
123
- 2. **Broker conservatism:** If thresholds are too high, agents cannot act even when they should.
124
- 3. **Decay too aggressive:** If λ is too high, agents lose credits before completing multi-step tasks.
125
- 4. **Scope explosion:** Capability-scoped credits multiply the state space.
 
 
 
 
 
 
 
126
 
127
- ## Future Extensions
128
 
129
- 1. **Hierarchical broker:** Nested capability scopes (e.g., `model_call/code` vs `model_call/qa`).
130
- 2. **Dynamic thresholds:** Learn thresholds from historical data rather than hardcoding.
131
- 3. **Peer review:** Multiple oracles vote on controversial actions.
132
- 4. **Human-in-the-loop:** Escalate high-risk decisions to human reviewers with credit incentives.
 
1
+ # OCC: Formal System Definition
2
 
3
+ ## Overview
4
 
5
+ OCC (Oracle-Credit-Compute) is a mechanism-design layer that governs agent access to compute, retrieval, debate turns, tool execution, and other resources. It treats compute allocation as a security boundary rather than a performance optimization.
6
 
7
+ ## Core Insight
8
 
9
+ In multi-agent systems, compute is not neutral. Extra turns, tokens, and tool calls can amplify adversarial influence unless access to deliberation is governed by verified marginal contribution. OCC makes agent compute scarce, earned, scoped, decaying, and auditable.
10
 
11
+ ---
 
 
 
12
 
13
+ ## Formal Definition
14
 
15
+ ### Entities
16
 
17
+ Let:
18
+ - **A** = {a₁, a₂, ..., aₙ} be a set of agents
+ - **T** = {t₁, t₂, ..., tₘ} be a set of tasks
+ - **R** = {r₁, r₂, ..., rₖ} be a set of resource types (model calls, retrieval, debate turns, tool execution, file writes, etc.)
+ - **C** = {c₁, c₂, ..., cₗ} be a set of capability scopes
+ - **O** be an Impact Oracle that maps (action, context, outcome) → score ∈ [−1, 1]
23
 
24
+ ### Credit State
25
 
26
+ Each agent a has a credit vector at time step t:
 
 
 
27
 
28
+ ```
+ credit[a, t] ∈ ℝ₊ (non-negative real)
+ ```
 
 
 
 
 
31
 
32
+ Credits are:
33
+ - **Non-transferable**: ∀a,b ∈ A, a≠b, credit[b,t] cannot increase from credit[a,t]
+ - **Decaying**: credit[a, t+1] = decay(credit[a,t]) where decay(x) = x · δ, δ ∈ (0,1)
+ - **Task-scoped**: credits can be bound to a specific task τ
+ - **Capability-scoped**: credits can be earmarked for capability scope c
37
 
38
+ ### Earning Function
39
 
40
+ ```
+ earn(a, action, oracle_score, compute_cost) → Δ ∈ ℝ
 
+ Δ = f(oracle_score, compute_cost, calibration, abstention_utility)
+ ```
 
 
 
 
 
45
 
46
+ Where f must satisfy:
+ - oracle_score < 0 ⇒ Δ ≤ 0 (negative contribution yields ≤ 0 credit)
+ - oracle_score = 0 ⇒ Δ = 0 (neutral action neither earns nor loses)
+ - oracle_score > 0 ⇒ Δ > 0 (positive contribution earns credit)
+ - compute_cost > 0 reduces Δ proportionally
+ - calibration_error > threshold reduces Δ
+ - confident_wrong action (high confidence + oracle_score < 0) ⇒ Δ < 0 (penalty)
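
The constraints on f leave its exact shape open. As a rough illustration only, the sketch below shows one function that satisfies all six conditions by scaling the oracle score multiplicatively, so a positive score can shrink but never flip sign. The class `EarnParams`, every threshold and factor in it, and the choice of multiplicative scaling are assumptions made for this example and are not taken from the OCC specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EarnParams:
    """Hypothetical tuning knobs; none of these values come from the spec."""
    cost_scale: float = 0.01             # how strongly compute cost shrinks a gain
    calibration_threshold: float = 0.25  # calibration error above this is penalized
    calibration_factor: float = 0.5      # multiplier applied when poorly calibrated
    confident_threshold: float = 0.8     # confidence above this counts as "high"
    confident_wrong_factor: float = 2.0  # extra multiplier for confident-wrong actions


def earn(oracle_score: float, compute_cost: float, calibration_error: float,
         confidence: float, p: EarnParams = EarnParams()) -> float:
    """Return the credit delta for one scored action (illustrative f)."""
    if oracle_score == 0.0:
        return 0.0  # neutral action neither earns nor loses
    poorly_calibrated = calibration_error > p.calibration_threshold
    if oracle_score > 0.0:
        # Dividing by (1 + cost) shrinks the gain with cost but keeps it > 0.
        delta = oracle_score / (1.0 + p.cost_scale * compute_cost)
        if poorly_calibrated:
            delta *= p.calibration_factor
        return delta
    # Negative contribution: cost and miscalibration make the loss larger.
    delta = oracle_score * (1.0 + p.cost_scale * compute_cost)
    if poorly_calibrated:
        delta /= p.calibration_factor
    if confidence > p.confident_threshold:
        delta *= p.confident_wrong_factor  # confident and wrong: larger penalty
    return delta
```

With these assumed defaults, a score of 0.5 at a compute cost of 100 earns 0.25 credits, while a score of −0.5 reported with confidence 0.9 costs a full credit.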
53
 
54
+ ### Spend Function
55
 
56
+ ```
+ spend(a, resource_type, capability_scope) → {allow, deny, downgrade, escalate, require_approval}
 
+ allow if: credit[a,t] ≥ cost(resource_type, capability_scope)
+           AND a has capability_scope_policy[scope]
+           AND credit_decay_rate[a] ≤ max_decay
+           AND gaming_score[a] ≤ gaming_threshold
+ ```
64
 
65
+ ### Decay Schedule
66
 
67
  ```
+ decay(credit[t]) = credit[t] · δ
 
+ where:
+ δ = 0.995 (per-turn decay, ~5% per 10 turns)
+ Or task-scoped: δ = 1.0 until task completion, then δ = 0.0 (credits expire)
  ```
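
To make the schedule concrete, here is a minimal sketch of both decay modes, using the document's default δ = 0.995. The function names (`decay`, `decay_n`, `half_life`, `task_scoped_delta`) are hypothetical helpers introduced for this example.

```python
import math


def decay(credit: float, delta: float = 0.995) -> float:
    """One per-turn decay step: credit[t+1] = credit[t] * delta."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return credit * delta


def decay_n(credit: float, turns: int, delta: float = 0.995) -> float:
    """Closed form for `turns` idle steps: credit * delta ** turns."""
    return credit * delta ** turns


def half_life(delta: float = 0.995) -> float:
    """Number of idle turns after which half of the balance has decayed."""
    return math.log(0.5) / math.log(delta)


def task_scoped_delta(task_complete: bool) -> float:
    """Task-scoped schedule: hold credits until completion, then expire them."""
    return 0.0 if task_complete else 1.0


balance = 100.0
print(f"after 10 idle turns: {decay_n(balance, 10):.2f}")  # about 95.11
print(f"half-life in turns:  {half_life():.0f}")           # about 138
```

Ten idle turns leave roughly 95.1% of the balance, which matches the "~5% per 10 turns" note, and an idle balance halves after about 138 turns.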
74
 
75
+ ### Credit Caps
76
 
77
+ ```
+ credit[a,t] ≤ credit_cap(capability_scope)
 
+ credit_cap translates to maximum resource access:
+ Model calls: credit_cap / cost_per_call
+ Retrieval calls: credit_cap / cost_per_retrieval
+ Debate turns: credit_cap / cost_per_turn
+ ```
85
 
86
+ ### Oracle Scoring
 
87
 
88
+ ```
+ oracle_score = α₁ · correctness(a, t, outcome)
+              + α₂ · evidence_support(a, t, evidence)
+              + α₃ · improvement_over_prior(a, t, prior_state)
+              + α₄ · calibration(a, t, prediction, outcome)
+              + α₅ · abstention_utility(a, t, decision_to_abstain)
+              − β₁ · hallucination(a, t, evidence)
+              − β₂ · confident_wrong(a, t, prediction, outcome, confidence)
+              − β₃ · wasteful_compute(a, t, compute_used, value_produced)
+              − β₄ · gaming_suspicion(a, t, action_pattern)
+
+ where:
+   correctness: 1 if correct, 0 if incorrect, −1 if harmful
+   evidence_support: 1 if evidence fully supports, 0 if neutral, −1 if contradicts
+   improvement: + if better than prior, 0 if same, − if worse
+   calibration: + if well-calibrated, − if overconfident
+   abstention_utility: + if abstaining was correct, − if it was evasive but answerable
+   hallucination: − if generated claim contradicts evidence
+   confident_wrong: − if high confidence AND incorrect (larger penalty than regular wrong)
+   wasteful_compute: − if compute used ≫ value produced
+   gaming_suspicion: − if action pattern matches known gaming signatures
+
+ Default weights (tunable):
+   α = [0.30, 0.15, 0.10, 0.10, 0.15]
+   β = [0.20, 0.25, 0.15, 0.20]
+ ```
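
The weighted sum above can be sketched directly with the default weights. This is a sketch under two assumptions that the formula does not state explicitly: each penalty term is passed as a non-negative magnitude in [0, 1] (so that subtracting β · term lowers the score), and the result is clamped to [−1, 1] to match the oracle's declared range. The dictionary keys are hypothetical names chosen for the example.

```python
from typing import Mapping

# Default weights from the formula above.
ALPHA = {
    "correctness": 0.30,
    "evidence_support": 0.15,
    "improvement_over_prior": 0.10,
    "calibration": 0.10,
    "abstention_utility": 0.15,
}
BETA = {
    "hallucination": 0.20,
    "confident_wrong": 0.25,
    "wasteful_compute": 0.15,
    "gaming_suspicion": 0.20,
}


def oracle_score(components: Mapping[str, float]) -> float:
    """Weighted gains minus weighted penalties, clamped to [-1, 1].

    Missing components default to 0.0 (assumed to mean "not applicable").
    """
    gains = sum(w * components.get(name, 0.0) for name, w in ALPHA.items())
    penalties = sum(w * components.get(name, 0.0) for name, w in BETA.items())
    return max(-1.0, min(1.0, gains - penalties))


# A correct, well-supported answer with no penalties.
good = oracle_score({"correctness": 1.0, "evidence_support": 1.0})
# An incorrect answer that is both hallucinated and confidently stated.
bad = oracle_score({"hallucination": 1.0, "confident_wrong": 1.0})
print(round(good, 2), round(bad, 2))
```

Because the α weights sum to 0.80, the unclamped score tops out at 0.80 under these assumptions; the clamp only matters at the negative end, where gains and penalties can both push the score down.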
114
 
115
+ ### Reward Function (for RL/GRPO)
 
 
116
 
117
+ ```
+ reward(a, action, context, outcome) =
+     oracle_score(a, action, context, outcome)
+   + abstention_utility
+   + calibration_bonus
+   − hallucination_penalty
+   − confident_wrong_penalty
+   − compute_cost · cost_multiplier
+   − gaming_penalty(a, history)
+
+ Constrained to [−1, 1].
+ ```
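
A minimal sketch of the reward as a plain function may help when checking magnitudes. It follows the formula term by term and clips at the end. The default `cost_multiplier` of 0.0001 is borrowed from the previous revision's `compute_cost_penalty = compute_cost * 0.0001`; whether the new version keeps that value is an assumption, as is the flat keyword-argument signature.

```python
def reward(
    oracle_score: float,
    abstention_utility: float = 0.0,
    calibration_bonus: float = 0.0,
    hallucination_penalty: float = 0.0,
    confident_wrong_penalty: float = 0.0,
    compute_cost: float = 0.0,
    gaming_penalty: float = 0.0,
    cost_multiplier: float = 1e-4,  # assumed; taken from the earlier revision
) -> float:
    """Sum the reward terms and clip the result to [-1, 1]."""
    raw = (
        oracle_score
        + abstention_utility
        + calibration_bonus
        - hallucination_penalty
        - confident_wrong_penalty
        - compute_cost * cost_multiplier
        - gaming_penalty
    )
    return max(-1.0, min(1.0, raw))


# A good answer whose calibration bonus is offset by 1000 units of compute.
print(round(reward(0.45, calibration_bonus=0.1, compute_cost=1000.0), 4))
```

Clipping means that stacked penalties saturate at −1, so downstream code cannot distinguish "bad" from "very bad" once the floor is reached; logging the unclipped value alongside the reward is one way to keep that information.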
129
 
130
+ ---
131
+
132
+ ## System Invariants
133
+
134
+ 1. **Non-transferability**: ∀a,b ∈ A, a≠b: Δcredit[b] from a's action = 0
+ 2. **Positive decay**: ∀a: credit[a, t+1] ≤ credit[a, t] unless earned
+ 3. **Capability scoping**: access(r) requires scope_policy[r] AND credit ≥ cost(r)
+ 4. **External verification**: oracle_score depends only on oracle O, not on a
+ 5. **Append-only ledger**: credit events are immutable once recorded
+ 6. **Oracle separation**: spending agent cannot directly influence oracle O
+ 7. **Negative contribution**: oracle_score < 0 ⇒ Δ ≤ 0
+ 8. **Credit ≠ identity trust**: high credit does not imply trusted access to all resources
+ 9. **Reversal possible**: credit can be retroactively reduced on new evidence
+ 10. **Bounded credit**: credit[a,t] ≤ credit_cap(scope) always
144
+
145
+ ---
146
+
147
+ ## Ledger Event Schema
148
+
149
+ Every credit mutation produces an immutable event:
150
+
151
+ | Event | Fields |
+ |-------|--------|
+ | CREDIT_GRANTED | agent_id, amount, reason, oracle_score, task_id, timestamp |
+ | CREDIT_DECAYED | agent_id, amount_decayed, new_balance, timestamp |
+ | CREDIT_SPENT | agent_id, amount, resource_type, capability_scope, task_id, timestamp |
+ | TURN_DENIED | agent_id, reason (insufficient_credit/wrong_scope/gaming_threshold), timestamp |
+ | ORACLE_SCORE_RECORDED | agent_id, action_id, score, confidence, evidence_ref, timestamp |
+ | CAPABILITY_SCOPE_CHANGED | agent_id, old_scope, new_scope, reason, timestamp |
+ | AGENT_PENALIZED | agent_id, penalty_amount, reason, evidence, timestamp |
+ | VERIFICATION_REVERSED | original_event_hash, new_score, reason, timestamp |
+ | POOL_EXHAUSTED | task_id, remaining_credit, timestamp |
+ | POLICY_UPDATED | parameter_changes, reason, timestamp |
163
+
164
+ Each event includes:
165
+ - event_hash: SHA-256 of (previous_event_hash + event_data)
166
+ - parent_event_hash: chain to previous event
167
+ - agent_id
168
+ - task_id
169
+ - timestamp (UTC ISO 8601)
170
+ - capability_scope
171
+ - oracle_id
172
+ - score (if applicable)
173
+ - credit_delta
174
+ - reason (human-readable)
175
+ - evidence_pointer (URI or hash to evidence)
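
The hash-chaining rule (event_hash is the SHA-256 of the previous hash plus the event data) can be sketched in a few lines. This is an illustration under assumptions, not the package's ledger: the all-zero `GENESIS_HASH`, the canonical JSON serialization, and the helper names `make_event` and `verify_chain` are choices made for this example.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # assumed parent hash for the first event in a chain


def _digest(parent_hash: str, body: dict) -> str:
    """SHA-256 over the parent hash followed by canonical JSON of the event."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((parent_hash + payload).encode("utf-8")).hexdigest()


def make_event(parent_hash: str, event_type: str, **fields) -> dict:
    """Build an event that is chained to `parent_hash`."""
    event = {
        "event_type": event_type,
        "parent_event_hash": parent_hash,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC ISO 8601
        **fields,
    }
    event["event_hash"] = _digest(parent_hash, event)
    return event


def verify_chain(events: list) -> bool:
    """Recompute every hash; any edited or reordered event breaks the chain."""
    parent = GENESIS_HASH
    for event in events:
        body = {k: v for k, v in event.items() if k != "event_hash"}
        if event["parent_event_hash"] != parent:
            return False
        if event["event_hash"] != _digest(parent, body):
            return False
        parent = event["event_hash"]
    return True


granted = make_event(GENESIS_HASH, "CREDIT_GRANTED", agent_id="a1",
                     amount=2.0, reason="verified answer", task_id="t1")
spent = make_event(granted["event_hash"], "CREDIT_SPENT", agent_id="a1",
                   amount=1.0, resource_type="model_call_small", task_id="t1")
print(verify_chain([granted, spent]))
```

Editing any field of an earlier event changes its recomputed hash, so `verify_chain` fails from that event onward. A retroactive correction would therefore be appended as a new `VERIFICATION_REVERSED` event rather than written over the original.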
176
+
177
+ ---
178
+
179
+ ## Resource Broker Decision Model
180
+
181
+ For each request (agent a, resource r, scope c):
182
 
183
+ ```
+ function decide(a, r, c):
+     if not has_scope(a, c):
+         return DENY(reason="missing capability scope")
+
+     if credit[a] < cost(r, c):
+         if credit[a] >= cost(downgraded(r), c):
+             return DOWNGRADE(alternative=downgraded(r), reason="insufficient credit for requested tier")
+         return DENY(reason="insufficient credit")
+
+     if gaming_score[a] > GAMING_THRESHOLD:
+         return REQUIRE_APPROVAL(reason="gaming suspicion")
+
+     if risk(r, a, c) > RISK_THRESHOLD:
+         return REQUIRE_APPROVAL(reason="high-risk action")
+
+     if credit[a] < cost(r, c) * 2:  # running low
+         return ALLOW_WITH_WARNING(reason="low credit warning")
+
+     return ALLOW
+ ```
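
For readers who want to exercise the decision order, the pseudocode translates almost line for line into Python. The sketch below takes costs and scopes from the Resource Types and Costs table in this document. Everything else is assumed for illustration: the `DOWNGRADED` mapping, the per-resource `RISK` values, both thresholds, and the simplification that each resource's scope is looked up from the table instead of being passed in as `c`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    ALLOW = "allow"
    ALLOW_WITH_WARNING = "allow_with_warning"
    DOWNGRADE = "downgrade"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"


@dataclass(frozen=True)
class Decision:
    kind: Kind
    reason: str = ""
    alternative: Optional[str] = None


# resource -> (base cost, capability scope), from the document's cost table.
RESOURCES = {
    "model_call_small": (1, "basic_inference"),
    "model_call_large": (5, "premium_inference"),
    "retrieval_call": (2, "retrieval"),
    "verifier_call": (3, "verification"),
    "debate_turn": (3, "deliberation"),
    "file_write": (5, "tool_execution"),
    "shell_exec": (8, "tool_execution"),
    "memory_write": (2, "memory"),
    "human_escalation": (20, "escalation"),
}
# Hypothetical policy values below; the spec does not fix any of them.
DOWNGRADED = {"model_call_large": "model_call_small"}
RISK = {"file_write": 0.8, "shell_exec": 0.9}
GAMING_THRESHOLD = 0.5
RISK_THRESHOLD = 0.7


@dataclass
class Agent:
    credit: float
    scopes: frozenset
    gaming_score: float = 0.0


def decide(agent: Agent, resource: str) -> Decision:
    """Apply the broker checks in the same order as the pseudocode."""
    cost, scope = RESOURCES[resource]
    if scope not in agent.scopes:
        return Decision(Kind.DENY, "missing capability scope")
    if agent.credit < cost:
        alt = DOWNGRADED.get(resource)
        if alt is not None and agent.credit >= RESOURCES[alt][0]:
            return Decision(Kind.DOWNGRADE, "insufficient credit for requested tier", alt)
        return Decision(Kind.DENY, "insufficient credit")
    if agent.gaming_score > GAMING_THRESHOLD:
        return Decision(Kind.REQUIRE_APPROVAL, "gaming suspicion")
    if RISK.get(resource, 0.0) > RISK_THRESHOLD:
        return Decision(Kind.REQUIRE_APPROVAL, "high-risk action")
    if agent.credit < cost * 2:  # running low
        return Decision(Kind.ALLOW_WITH_WARNING, "low credit warning")
    return Decision(Kind.ALLOW)
```

As in the pseudocode, a downgrade is offered without re-checking the scope of the cheaper resource; a stricter broker might run the alternative through `decide` again before returning it.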
204
 
205
+ ### Resource Types and Costs
206
+
207
+ | Resource | Base Cost | Capability Scope |
+ |----------|-----------|-----------------|
+ | model_call_small | 1 | basic_inference |
+ | model_call_large | 5 | premium_inference |
+ | retrieval_call | 2 | retrieval |
+ | verifier_call | 3 | verification |
+ | debate_turn | 3 | deliberation |
+ | file_write | 5 | tool_execution |
+ | shell_exec | 8 | tool_execution |
+ | memory_write | 2 | memory |
+ | human_escalation | 20 | escalation |
218
+
219
+ ---
220
+
221
+ ## When To Use OCC
222
+
223
+ | OCC is valuable when | OCC is overkill when |
+ |---------------------|---------------------|
+ | Agents have heterogeneous reliability | Single-agent tasks suffice |
+ | Long-running tasks need budget discipline | Ground truth is immediate and cheap |
+ | Debate/collaboration can be poisoned | Adversarial participation is impossible |
+ | Compute is expensive | All agents have equal trust and capability |
+ | Auditability matters | Task budget is tiny (a few calls) |
+ | Agents can earn durable authority | Latency matters more than robustness |
+ | Post-hoc accountability required | Verifier/oracle cost exceeds saved compute |
+ | Agents can game naive allocation | There are no bad actors in the system |
233
+
234
+ ---
235
+
236
+ ## Threat Model
237
+
238
+ | Attack | What Adversary Controls | Success Condition | OCC Defense | Residual Risk |
+ |--------|------------------------|-------------------|-------------|---------------|
+ | Credit farming | Task selection | Accumulates budget via easy tasks | Decay + credit caps | Slow gaming over many cheap tasks |
+ | Collusion | Multiple agent identities | Transfers influence between agents | Non-transferability | Vote-ring behavior (same answer) |
+ | Oracle spoofing | Persuasive but wrong answers | Earns false credit | Verifier separation from spender | Judge hacking via prompt injection |
+ | Griefing | Burns others' budget | Lowers group accuracy | Capability-scoped spend | Indirect poisoning via bad data |
+ | Sandbagging | Hides competence | Manipulates future allocation | Decay + exploration bonus | Hard to detect without history |
+ | Identity laundering | Resets agent identity | Escapes penalties | Identity binding to account | Account churn (rate-limited) |
+ | Sybil agents | Many weak agents | Captures compute pool | Admission control | Deployment-specific new-account policy |
+ | Strategic abstention | Avoids penalties | Hoards credit | Reward shaping for participation | Conservatism bias |
+ | Verbosity gaming | Produces long but vacuous responses | Appears high-quality to weak oracle | Token-cost multiplier in reward | Requires quality oracle |
+ | Confidence manipulation | Overstates certainty | Earns calibration bonus deceptively | Proper scoring rules | Hard to calibrate perfectly |
250
+
251
+ ---
252
+
253
+ ## Relationship to Prior Work
254
+
255
+ OCC builds on:
256
+ - **AI safety debate** (Irving, Christiano, Amodei 2018): Debate as a mechanism for surfacing truth. OCC adds: debate turns are not free speech; they are auditable compute privileges.
257
+ - **GRPO/RLVR** (Shao et al. 2024): Group-relative policy optimization. OCC provides the reward function that makes GRPO converge to allocation policies.
258
+ - **Proper scoring rules**: OCC's calibration and abstention rewards are proper scoring rule implementations.
259
+ - **Capability-based security**: OCC's broker follows OS capability-system principles applied to agent API access.
260
+
261
+ OCC departs from:
262
+ - **Budget-aware reasoning** (e.g., token-budget RL): OCC is not about *minimizing* compute; it is about *governing* compute access.
263
+ - **Adaptive inference** (early exit, cascade): OCC governs *who* gets compute, not *when* to stop computing.
264
+ - **Multi-agent debate for accuracy**: OCC does not claim debate improves accuracy. It claims debate *without allocation control* amplifies adversarial influence.
265
+
266
+ ---
267
+
268
+ ## Implementation Reference
269
+
270
+ Python package at: https://huggingface.co/narcolepticchicken/occ-stack
271
 
272
+ ```
+ /occ
+   /oracle     → oracle.py (Impact Oracle: scoring, marginal impact, proper scoring)
+   /ledger     → ledger.py (Credit Ledger: non-transferable, decaying, scoped credits)
+   /broker     → broker.py (Resource Broker: capability-based access control)
+   /rl         → reward.py (Reward function combining oracle + anti-gaming)
+               → grpo_hook.py (TRL GRPOTrainer integration)
+   /benchmarks → benchmark_debate.py, benchmark_code.py, benchmark_retrieval_qa.py
+   /configs    → YAML configurations for experiments
+   /reports    → results, analysis, final report
+ ```
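
The contents of `grpo_hook.py` are not shown here, so the following is only a guess at the general shape of such a hook. TRL's `GRPOTrainer` accepts custom reward functions that receive `prompts` and `completions` (plus dataset columns as keyword arguments) and return one float per completion. The exact-match scorer standing in for the Impact Oracle, the `answer` dataset column, and the whitespace-token cost proxy are all hypothetical.

```python
def score_completion(text: str, gold: str) -> float:
    """Hypothetical stand-in for the Impact Oracle: exact-match correctness."""
    return 1.0 if text.strip() == gold.strip() else -0.5


def occ_reward(prompts, completions, answer=None, **kwargs):
    """Reward = verified score minus a compute-cost penalty, clipped to [-1, 1].

    `answer` is an assumed dataset column holding gold answers.
    """
    golds = answer if answer is not None else [""] * len(completions)
    rewards = []
    for completion, gold in zip(completions, golds):
        # Conversational datasets yield a list of messages; use the last one.
        text = completion if isinstance(completion, str) else completion[-1]["content"]
        cost_penalty = 1e-4 * len(text.split())  # crude proxy for token cost
        raw = score_completion(text, gold) - cost_penalty
        rewards.append(max(-1.0, min(1.0, raw)))
    return rewards


print(occ_reward(prompts=["What is 6 * 7?"], completions=["42"], answer=["42"]))
```

A function of this shape would be handed to the trainer through its `reward_funcs` argument; the real hook presumably calls the package's oracle and anti-gaming checks in place of the exact-match scorer used here.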
283
 
284
+ ---
285
 
286
+ *Last updated: May 8, 2026. Version: 1.0.*