narcolepticchicken committed on
Commit 745d481 · verified · 1 Parent(s): b4d00e5

Upload README.md

Files changed (1)
  1. README.md +150 -29
README.md CHANGED
@@ -1,48 +1,169 @@
1
- # Oracle-Credit-Compute (OCC) Stack
2
 
3
- A minimal, open-source system for **agentic compute allocation** via verified marginal impact.
4
 
5
- ## Core Thesis
6
 
7
- Modern agent systems waste test-time compute because every agent, tool call, debate turn, or verifier pass consumes resources without proving marginal value. OCC allocates compute, retrieval, write privileges, and debate bandwidth toward actions that measurably improve task outcomes.
8
 
9
- ## Components
10
 
11
- | Component | Purpose |
12
- |-----------|---------|
13
- | `oracle/` | Impact Oracle - scores whether an action produced measurable marginal value |
14
- | `ledger/` | Credit Ledger - non-transferable, decaying credits based on verified impact |
15
- | `broker/` | Resource Broker - capability-based rights based on credits, task state, and risk |
16
- | `rl/` | GRPO-compatible reward hook using the Oracle as reward |
17
- | `benchmarks/` | Tight, verifiable benchmarks: code, retrieval QA, multi-agent debate |
18
- | `configs/` | Experiment configurations |
19
- | `reports/` | Results, ablations, anti-gaming tests |
20
 
21
  ## Quick Start
22
 
23
  ```bash
24
- pip install -r requirements.txt
25
- python -m benchmarks.benchmark_code # Code compute allocation
26
- python -m benchmarks.benchmark_retrieval_qa # Retrieval QA
27
- python -m benchmarks.benchmark_debate # Multi-agent debate
28
- python -m eval_runner # Run all ablations
29
  ```
30
 
31
- ## Design
32
 
33
- See [design.md](design.md) for architecture, reward formulas, and anti-gaming mechanisms.
34
 
35
- ## Literature Review
36
 
37
- See [reports/literature_review.md](reports/literature_review.md) for prior work analysis.
38
 
39
- ## Results Summary
40
 
41
- - **Code compute allocation**: OCC achieves **66.8% compute reduction** at iso- or higher accuracy versus fixed-budget baseline.
42
- - **Retrieval QA**: OCC shows lower confident-wrong rates and smart retrieval stopping.
43
- - **Multi-agent debate**: OCC matches equal-turns accuracy with 12.4% less compute.
44
- - **Anti-gaming**: Spam, hidden-test gaming, and over-abstention are all contained.
45
 
46
  ## License
47
 
48
- MIT
 
1
+ # OCC: Oracle-Credit-Compute System
2
 
3
+ **A minimal open-source stack for cost-aware, compute-efficient agent systems.**
4
 
5
+ ## What is OCC?
6
 
7
+ Modern agent systems waste test-time compute because every tool call, retrieval, debate turn, or verification pass consumes resources without proving marginal value. OCC treats compute as a **budgeted, non-transferable resource** that agents must earn through verified impact.
8
 
9
+ ## Core Architecture
10
 
11
+ ```
12
+ ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
13
+ │  Impact Oracle  │────▶│  Credit Ledger  │────▶│ Resource Broker │
14
+ │ (score action)  │     │  (earn/spend)   │     │  (allow/deny)   │
15
+ └─────────────────┘     └─────────────────┘     └─────────────────┘
16
+          │                                               │
17
+          └───────────────────────┬───────────────────────┘
18
+                                  ▼
19
+                         ┌─────────────────┐
20
+                         │  GRPO/RL Hook   │
21
+                         │  (reward func)  │
22
+                         └─────────────────┘
23
+ ```
24
+
25
+ ### 1. Impact Oracle (`oracle/`)
26
+
27
+ Rule-based scoring for:
28
+ - **Code tasks**: unit tests, pass@k, regression detection, hidden-test gaming
29
+ - **Retrieval QA**: answer correctness, evidence NLI (entailment/contradiction), abstention utility, calibration bonus (Brier score)
30
+ - **Multi-agent debate**: decision quality, marginal contribution, influence efficiency
31
+
32
+ All scores are cost-adjusted: `reward = verified_impact - compute_cost * penalty_rate`
33
+
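The cost-adjusted formula above can be sketched in a few lines. This is only an illustration of the stated formula: the function name, the default `penalty_rate`, and the example costs are assumptions, not values taken from the `oracle/` package.

```python
def cost_adjusted_reward(
    verified_impact: float,
    compute_cost: float,
    penalty_rate: float = 0.001,  # assumed value, not from the repository
) -> float:
    """Return verified_impact - compute_cost * penalty_rate."""
    if compute_cost < 0 or penalty_rate < 0:
        raise ValueError("compute_cost and penalty_rate must be non-negative")
    return verified_impact - compute_cost * penalty_rate


# Same verified impact, different compute spent (hypothetical token counts).
cheap_reward = cost_adjusted_reward(verified_impact=1.0, compute_cost=200)
costly_reward = cost_adjusted_reward(verified_impact=1.0, compute_cost=1200)
```

With these illustrative numbers the cheap action keeps a positive reward while the expensive one goes negative, which is the pressure toward cheaper actions that the formula is meant to create.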
34
+ ### 2. Credit Ledger (`ledger/`)
35
+
36
+ - **Non-transferable** credits (laundering prevention)
37
+ - **Exponential decay** on idle credits (hoarding prevention)
38
+ - **Capability-scoped** rights (retrieval credits ≠ file-write credits)
39
+ - **Full provenance** with oracle hash and reason
40
+
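The four ledger properties listed above can be illustrated with a small in-memory model. This is a hedged sketch, not the `ledger/` implementation: the class names, the half-life parameterisation of the exponential decay, and the explicit `now` timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CreditEntry:
    """Provenance record for one credit grant (hypothetical schema)."""
    capability: str
    amount: float
    oracle_hash: str
    reason: str
    timestamp: float


@dataclass
class CreditLedger:
    agent_id: str
    half_life: float = 100.0  # assumed decay half-life, in arbitrary time units
    balances: dict = field(default_factory=dict)
    last_update: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def _decay(self, capability: str, now: float) -> None:
        # Exponential decay of idle credits since the last update.
        elapsed = max(0.0, now - self.last_update.get(capability, now))
        factor = 0.5 ** (elapsed / self.half_life)
        self.balances[capability] = self.balances.get(capability, 0.0) * factor
        self.last_update[capability] = now

    def earn(self, capability, amount, oracle_hash, reason, now) -> None:
        self._decay(capability, now)
        self.balances[capability] += amount
        self.history.append(CreditEntry(capability, amount, oracle_hash, reason, now))

    def spend(self, capability: str, amount: float, now: float) -> bool:
        # Credits are scoped: only the balance for this capability is checked.
        self._decay(capability, now)
        if self.balances[capability] < amount:
            return False
        self.balances[capability] -= amount
        return True

    def balance(self, capability: str, now: float) -> float:
        self._decay(capability, now)
        return self.balances[capability]


ledger = CreditLedger(agent_id="agent-a", half_life=10.0)
ledger.earn("retrieval_call", 8.0, oracle_hash="abc123", reason="answer verified", now=0.0)
balance_after_idle = ledger.balance("retrieval_call", now=10.0)  # one half-life later
can_write = ledger.spend("file_write", 1.0, now=10.0)  # no file-write credits earned
```

Non-transferability is modelled here simply by not offering any transfer method: a ledger belongs to one `agent_id` and credits can only be earned or spent.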
41
+ ### 3. Resource Broker (`broker/`)
42
+
43
+ Capability-based access control:
44
+ - Low risk: `retrieval_call`, `debate_turn`
45
+ - Medium risk: `model_call`, `verifier_call`, `memory_write`
46
+ - High risk: `file_write`, `shell_execute`, `human_escalation`
47
+
48
+ Decisions: `allow`, `deny`, `require_approval`, `downgrade`, `escalate`, `ask_justification`
49
+
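As a rough illustration of how risk tiers and credit balances might map to broker decisions, consider the sketch below. The capability names and decision labels come from the lists above; the credit thresholds and the specific mapping rules are assumptions, and `downgrade` and `escalate` are defined but left unused for brevity.

```python
from enum import Enum


class Decision(str, Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"
    DOWNGRADE = "downgrade"
    ESCALATE = "escalate"
    ASK_JUSTIFICATION = "ask_justification"


RISK_TIER = {
    "retrieval_call": "low",
    "debate_turn": "low",
    "model_call": "medium",
    "verifier_call": "medium",
    "memory_write": "medium",
    "file_write": "high",
    "shell_execute": "high",
    "human_escalation": "high",
}

# Assumed minimum credit balance per tier; the real broker may differ.
MIN_CREDITS = {"low": 1.0, "medium": 5.0, "high": 20.0}


def decide(capability: str, credits: float) -> Decision:
    """Map a requested capability and a credit balance to a decision."""
    tier = RISK_TIER.get(capability)
    if tier is None:
        return Decision.DENY  # unknown capabilities are refused
    if credits < MIN_CREDITS[tier]:
        # Low-risk requests get a chance to justify; others are refused.
        return Decision.ASK_JUSTIFICATION if tier == "low" else Decision.DENY
    if tier == "high":
        return Decision.REQUIRE_APPROVAL  # enough credits, but still gated
    return Decision.ALLOW
```

For example, `decide("retrieval_call", 2.0)` allows the call, while `decide("shell_execute", 50.0)` still requires approval under these assumed rules.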
50
+ ### 4. GRPO/RL Hook (`rl/`)
51
+
52
+ TRL-compatible reward function wrapping the Impact Oracle. Includes an offline policy comparator for ablation studies without GPU training.
53
+
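A reward function in the shape that TRL's `GRPOTrainer` accepts for custom rewards (keyword arguments `prompts` and `completions`, returning one float per completion) could look like the sketch below. It assumes the standard, non-conversational format where completions are plain strings. `make_oracle_reward`, `toy_score`, and the whitespace token count are hypothetical stand-ins for the actual `rl/` hook and oracle.

```python
from typing import Callable, List


def make_oracle_reward(
    score_fn: Callable[[str, str], float],
    penalty_rate: float = 0.001,  # assumed value
) -> Callable[..., List[float]]:
    """Wrap an oracle scoring function as a per-completion reward function."""

    def oracle_reward(prompts: List[str], completions: List[str], **kwargs) -> List[float]:
        rewards = []
        for prompt, completion in zip(prompts, completions):
            impact = score_fn(prompt, completion)
            cost = len(completion.split())  # crude whitespace token count
            rewards.append(impact - cost * penalty_rate)
        return rewards

    return oracle_reward


def toy_score(prompt: str, completion: str) -> float:
    # Stand-in oracle: full impact only if the expected answer appears.
    return 1.0 if "42" in completion else 0.0


reward_fn = make_oracle_reward(toy_score)
rewards = reward_fn(
    prompts=["What is 6 * 7?", "What is 6 * 7?"],
    completions=["the answer is 42", "no idea"],
)
```

The returned callable could then be passed in the `reward_funcs` argument of `GRPOTrainer`; running it offline like this, without a trainer, is one way to compare policies without GPU training.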
54
+ ## Installation
55
+
56
+ ```bash
57
+ pip install -e .
58
+ # For NLI evidence scoring:
59
+ pip install sentence-transformers
60
+ # For real LLM inference:
61
+ pip install transformers datasets
62
+ # For GRPO training:
63
+ pip install trl accelerate
64
+ ```
65
 
66
  ## Quick Start
67
 
68
  ```bash
69
+ # Run all benchmarks and ablations
70
+ python -m benchmarks.eval_runner
71
+
72
+ # Run individual benchmarks
73
+ python -m benchmarks.benchmark_code
74
+ python -m benchmarks.benchmark_retrieval_qa
75
+ python -m benchmarks.benchmark_debate
76
+
77
+ # Run with real NLI model (requires sentence-transformers)
78
+ python -m benchmarks.benchmark_retrieval_qa_nli
79
+
80
+ # Adversarial debate benchmark
81
+ python -m benchmarks.benchmark_debate_adversarial
82
+
83
+ # GRPO offline demonstrator
84
+ python -m rl.grpo_train_demo
85
  ```
86
 
87
+ ## Benchmark Results
88
+
89
+ ### Code Compute Allocation (Simulated)
90
+
91
+ | Strategy | pass@1 | Compute | Savings |
92
+ |----------|--------|---------|---------|
93
+ | Fixed (expensive agent) | 0.780 | 17,500 | - |
94
+ | Verifier-guided retries | 0.980 | 26,600 | -52% |
95
+ | **OCC tiered escalation** | **0.780** | **8,350** | **52.3%** |
96
+
97
+ OCC tries cheap agents first and escalates only on failure. At iso-accuracy (0.780 pass@1), it reduces compute by 52.3% (8,350 vs. 17,500).
98
+
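The tiered escalation strategy described above can be sketched with a deterministic toy model. Everything here is assumed for illustration: the `Agent` fields, the rule that an agent solves a task when its skill is at least the task difficulty, and the costs. It does not reproduce the benchmark numbers in the table.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Agent:
    name: str
    cost: int     # compute charged per attempt (hypothetical units)
    skill: float  # toy proxy for capability


def tiered_escalation(agents: List[Agent], attempt: Callable[[Agent], bool]) -> Tuple[bool, int]:
    """Try agents from cheapest to most expensive; stop at the first success."""
    spent = 0
    for agent in sorted(agents, key=lambda a: a.cost):
        spent += agent.cost
        if attempt(agent):
            return True, spent
    return False, spent


small = Agent("small", cost=10, skill=0.4)
large = Agent("large", cost=100, skill=0.9)
difficulties = [0.2, 0.3, 0.8, 0.95]  # toy task set

fixed_cost = fixed_solved = 0
tiered_cost = tiered_solved = 0
for difficulty in difficulties:
    solves = lambda agent, d=difficulty: agent.skill >= d
    # Fixed baseline: always call the expensive agent once.
    fixed_cost += large.cost
    fixed_solved += solves(large)
    # Tiered: cheap agent first, escalate only on failure.
    ok, spent = tiered_escalation([small, large], solves)
    tiered_cost += spent
    tiered_solved += ok

savings = 1 - tiered_cost / fixed_cost
```

In this toy setup both strategies solve the same tasks, because the expensive agent is always available as the last tier; the savings come entirely from the easy tasks that the cheap agent handles on its own.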
99
+ ### Code Compute Allocation (Real LLM - Qwen2.5-Coder-0.5B)
100
+
101
+ Results pending: the GPU job is running on a T4. Script: `jobs/run_real_llm_standalone.py`
102
+
103
+ ### Retrieval QA (with real NLI - cross-encoder/nli-deberta-v3-xsmall)
104
+
105
+ | Strategy | Accuracy | ECE | Retrievals |
106
+ |----------|----------|-----|------------|
107
+ | Direct answer | 0.580 | 0.226 | 0 |
108
+ | RAG baseline | 0.750 | 0.167 | 338 |
109
+ | RAG + verifier | 0.790 | 0.151 | 344 |
110
+ | OCC baseline | 0.710 | 0.201 | 227 |
111
+ | **OCC + real NLI** | *needs calibration* | - | 220 |
112
+
113
+ Note: OCC + NLI shows stronger evidence quality but broker thresholds are too conservative on neutral evidence. Needs tuning for production use.
114
+
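For readers unfamiliar with the ECE column, the sketch below computes expected calibration error with equal-width confidence bins, alongside the Brier score mentioned in the oracle description. The README does not say how the benchmark bins its predictions, so the bin count and binning scheme are assumptions, and the sample data is made up.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean |confidence - accuracy| over equal-width confidence bins."""
    n = len(confidences)
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - float(ok)) ** 2 for c, ok in zip(confidences, correct)) / len(confidences)


# Made-up predictions: overconfident at 0.9, underconfident at 0.6.
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [True, False, True, True]
ece = expected_calibration_error(confidences, correct)
brier = brier_score(confidences, correct)
```

Lower is better for both metrics; a system that answers with 0.9 confidence but is right only half the time contributes a large gap to ECE.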
115
+ ### Multi-Agent Debate
116
 
117
+ With 50% adversarial agents:
118
 
119
+ | Strategy | Accuracy | Quality/Compute |
120
+ |----------|----------|-----------------|
121
+ | Equal turns | 0.760 | 0.001275 |
122
+ | Confidence-weighted | **0.560** | 0.000924 |
123
+ | **OCC credit allocation** | **0.760** | **0.001196** |
124
 
125
+ OCC matches equal-turns accuracy (0.760) and contains the adversarial agents, while confidence-weighted voting collapses to 0.560 because bad agents exploit high confidence.
126
 
127
+ ### Anti-Gaming
128
 
129
+ | Attack | Detection | Containment |
130
+ |--------|-----------|-------------|
131
+ | Spam low-value actions | 100% credit exhaustion | Credits = 0 |
132
+ | Hidden-test gaming | 100% oracle detection | Immediate penalty |
133
+ | Over-abstention | 70% oracle penalization | Wrong abstentions punished |
134
+
135
+ ## Project Structure
136
+
137
+ ```
138
+ /occ
139
+ /oracle - Impact Oracle implementation
140
+ /ledger - Credit Ledger with decay and provenance
141
+ /broker - Capability-based Resource Broker
142
+ /rl - GRPO reward hooks and offline comparator
143
+ /benchmarks - Code, QA, and debate benchmarks
144
+ /jobs - GPU job scripts for real LLM inference
145
+ /reports - Evaluation results (JSON)
146
+ /configs - Configuration files
147
+ ```
148
+
149
+ ## Limitations & Next Steps
150
+
151
+ 1. **Retrieval QA** needs better NLI calibration. Real NLI scores are strong but broker thresholds are too aggressive on neutral evidence.
152
+ 2. **All benchmarks use simulated agents** for tractability. A real-LLM inference script (`jobs/run_real_llm_standalone.py`) has been submitted as a GPU job.
153
+ 3. The **GRPO training** hook is implemented but has not yet been used to train on real data. The offline comparator validates the reward design.
154
+ 4. The **cost model** counts tokens only. A realistic cost should also include model size, latency, and API pricing.
155
+
156
+ ## Citation
157
+
158
+ ```bibtex
159
+ @software{occ_stack,
160
+ title = {OCC: Oracle-Credit-Compute System for Agentic Compute Allocation},
161
+ author = {narcolepticchicken},
162
+ year = {2026},
163
+ url = {https://huggingface.co/narcolepticchicken/occ-stack}
164
+ }
165
+ ```
166
 
167
  ## License
168
 
169
+ Apache 2.0