narcolepticchicken/occ-stack

ml-intern

narcolepticchicken committed 23 days ago

Commit 745d481 (verified) · 1 parent: b4d00e5

Upload README.md

README.md +150 -29

@@ -1,48 +1,169 @@
-# Oracle-Credit-Compute (OCC) Stack
-A minimal, open-source system for **agentic compute allocation** via verified marginal impact.
-## Core Thesis
-Modern agent systems waste test-time compute because every agent, tool call, debate turn, or verifier pass consumes resources without proving marginal value. OCC allocates compute, retrieval, write privileges, and debate bandwidth toward actions that measurably improve task outcomes.
-## Components
-| Component | Purpose |
-|-----------|---------|
-| `oracle/` | Impact Oracle — scores whether an action produced measurable marginal value |
-| `ledger/` | Credit Ledger — non-transferable, decaying credits based on verified impact |
-| `broker/` | Resource Broker — capability-based rights based on credits, task state, and risk |
-| `rl/` | GRPO-compatible reward hook using the Oracle as reward |
-| `benchmarks/` | Tight, verifiable benchmarks: code, retrieval QA, multi-agent debate |
-| `configs/` | Experiment configurations |
-| `reports/` | Results, ablations, anti-gaming tests |
 ## Quick Start
 ```bash
-pip install -r requirements.txt
-python -m benchmarks.benchmark_code       # Code compute allocation
-python -m benchmarks.benchmark_retrieval_qa # Retrieval QA
-python -m benchmarks.benchmark_debate       # Multi-agent debate
-python -m eval_runner                       # Run all ablations
 ```
-## Design
-See [design.md](design.md) for architecture, reward formulas, and anti-gaming mechanisms.
-## Literature Review
-See [reports/literature_review.md](reports/literature_review.md) for prior work analysis.
-## Results Summary
-- **Code compute allocation**: OCC achieves **66.8% compute reduction** at iso- or higher accuracy versus fixed-budget baseline.
-- **Retrieval QA**: OCC shows lower confident-wrong rates and smart retrieval stopping.
-- **Multi-agent debate**: OCC matches equal-turns accuracy with 12.4% less compute.
-- **Anti-gaming**: Spam, hidden-test gaming, and over-abstention are all contained.
 ## License
-MIT

+# OCC: Oracle-Credit-Compute System
+**A minimal open-source stack for cost-aware, compute-efficient agent systems.**
+## What is OCC?
+Modern agent systems waste test-time compute because every tool call, retrieval, debate turn, or verification pass consumes resources without proving marginal value. OCC treats compute as a **budgeted, non-transferable resource** that agents must earn through verified impact.
+## Core Architecture
+```
+┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
+│  Impact Oracle  │────▶│  Credit Ledger  │────▶│ Resource Broker │
+│  (score action) │     │  (earn/spend)   │     │ (allow/deny)    │
+└─────────────────┘     └─────────────────┘     └─────────────────┘
+         │                                              │
+         └──────────────────┬───────────────────────────┘
+                            ▼
+                     ┌───────────────┐
+                     │ GRPO/RL Hook  │
+                     │ (reward func) │
+                     └───────────────┘
+```
+### 1. Impact Oracle (`oracle/`)
+Rule-based scoring for:
+- **Code tasks**: unit tests, pass@k, regression detection, hidden-test gaming
+- **Retrieval QA**: answer correctness, evidence NLI (entailment/contradiction), abstention utility, calibration bonus (Brier score)
+- **Multi-agent debate**: decision quality, marginal contribution, influence efficiency
+All scores are cost-adjusted: `reward = verified_impact - compute_cost * penalty_rate`
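The cost-adjusted formula above can be sketched in a few lines. The function name `cost_adjusted_reward`, the default `penalty_rate`, and the example numbers are illustrative assumptions, not values taken from `oracle/`.

```python
def cost_adjusted_reward(verified_impact: float,
                         compute_cost: float,
                         penalty_rate: float = 0.001) -> float:
    """reward = verified_impact - compute_cost * penalty_rate."""
    if compute_cost < 0 or penalty_rate < 0:
        raise ValueError("compute_cost and penalty_rate must be non-negative")
    return verified_impact - compute_cost * penalty_rate


# Same verified impact, different cost: the cheaper action scores higher.
cheap = cost_adjusted_reward(verified_impact=1.0, compute_cost=100)
expensive = cost_adjusted_reward(verified_impact=1.0, compute_cost=500)
print(cheap, expensive)
```

With a fixed `penalty_rate`, two actions with equal verified impact are ranked purely by their compute cost, which is the behaviour the formula is meant to encourage.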
+### 2. Credit Ledger (`ledger/`)
+- **Non-transferable** credits (laundering prevention)
+- **Exponential decay** on idle credits (hoarding prevention)
+- **Capability-scoped** rights (retrieval credits ≠ file-write credits)
+- **Full provenance** with oracle hash and reason
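A minimal sketch of how these four ledger properties might fit together, assuming a half-life decay model and SHA-256 for the oracle hash. The class names, field names, and the `half_life_s` parameter are hypothetical and are not taken from `ledger/`.

```python
import hashlib
import math
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    agent_id: str
    capability: str
    amount: float
    reason: str
    oracle_hash: str  # provenance: hash of the oracle output that justified the credit


@dataclass
class CreditLedger:
    half_life_s: float = 3600.0                    # hypothetical decay half-life
    balances: dict = field(default_factory=dict)   # (agent, capability) -> (credits, last_ts)
    history: list = field(default_factory=list)

    def _decayed(self, key, now):
        credits, last = self.balances.get(key, (0.0, now))
        elapsed = max(0.0, now - last)
        # Exponential decay: credits halve every half_life_s seconds of idleness.
        return credits * math.exp(-math.log(2) * elapsed / self.half_life_s)

    def balance(self, agent_id, capability, now):
        return self._decayed((agent_id, capability), now)

    def earn(self, agent_id, capability, amount, reason, oracle_payload, now):
        key = (agent_id, capability)
        oracle_hash = hashlib.sha256(oracle_payload.encode()).hexdigest()
        self.balances[key] = (self._decayed(key, now) + amount, now)
        self.history.append(LedgerEntry(agent_id, capability, amount, reason, oracle_hash))

    def spend(self, agent_id, capability, amount, now):
        # Credits are scoped per (agent, capability); there is deliberately no
        # transfer method, so credits cannot move between agents or scopes.
        key = (agent_id, capability)
        available = self._decayed(key, now)
        if amount > available:
            return False
        self.balances[key] = (available - amount, now)
        return True


ledger = CreditLedger(half_life_s=100.0)
ledger.earn("agent-a", "retrieval_call", 10.0, "tests passed", "oracle-output", now=0.0)
after_half_life = ledger.balance("agent-a", "retrieval_call", now=100.0)
can_write = ledger.spend("agent-a", "file_write", 1.0, now=100.0)
```

In this sketch, retrieval credits earned by `agent-a` decay to half after one half-life, and they cannot pay for a `file_write` because that capability has its own (empty) balance.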
+### 3. Resource Broker (`broker/`)
+Capability-based access control:
+- Low risk: `retrieval_call`, `debate_turn`
+- Medium risk: `model_call`, `verifier_call`, `memory_write`
+- High risk: `file_write`, `shell_execute`, `human_escalation`
+Decisions: `allow`, `deny`, `require_approval`, `downgrade`, `escalate`, `ask_justification`
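The risk tiers and decision names above can be combined into a toy policy. The mapping from tier and credit balance to a decision is an assumption made for illustration; the actual rules in `broker/` may differ, and only three of the six decisions are exercised here.

```python
# Risk tiers as listed in this README.
RISK_TIERS = {
    "retrieval_call": "low", "debate_turn": "low",
    "model_call": "medium", "verifier_call": "medium", "memory_write": "medium",
    "file_write": "high", "shell_execute": "high", "human_escalation": "high",
}


def decide(capability: str, credits: float, cost: float) -> str:
    """Hypothetical broker policy: stricter handling as risk increases."""
    tier = RISK_TIERS.get(capability)
    if tier is None:
        return "deny"  # unknown capabilities are denied by default
    if tier == "high":
        # High-risk actions never run unattended in this sketch.
        return "require_approval" if credits >= cost else "deny"
    if credits >= cost:
        return "allow"
    # Medium-risk actions get a chance to justify; low-risk ones are simply denied.
    return "ask_justification" if tier == "medium" else "deny"


print(decide("retrieval_call", credits=5, cost=1))
print(decide("file_write", credits=5, cost=1))
print(decide("model_call", credits=0, cost=1))
```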
+### 4. GRPO/RL Hook (`rl/`)
+TRL-compatible reward function wrapping the Impact Oracle. Includes an offline policy comparator for running ablation studies without GPU training.
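As a rough sketch of what a TRL-compatible reward function looks like: TRL's `GRPOTrainer` accepts custom reward functions that take `prompts` and `completions` (plus dataset columns as keyword arguments) and return one float per completion. Everything else here is assumed for illustration: `score_with_oracle` is a hypothetical stand-in for the Impact Oracle, `reference` is an assumed dataset column, and whitespace token counting is a crude cost proxy.

```python
def score_with_oracle(completion: str, reference: str) -> float:
    """Hypothetical stand-in for the Impact Oracle: 1.0 on exact match, else 0.0."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def occ_reward(prompts, completions, reference, penalty_rate=0.001, **kwargs):
    """Cost-adjusted reward, one float per completion (plain-text completions assumed)."""
    rewards = []
    for completion, ref in zip(completions, reference):
        impact = score_with_oracle(completion, ref)
        cost = len(completion.split())  # crude token-count proxy
        rewards.append(impact - cost * penalty_rate)
    return rewards


rewards = occ_reward(
    prompts=["What is 6 * 7?", "What is 6 * 7?"],
    completions=["42", "a much longer wrong answer"],
    reference=["42", "42"],
)
```

A function of this shape could then be passed as `reward_funcs=occ_reward` when constructing a `GRPOTrainer`, provided the training dataset has a `reference` column.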
+## Installation
+```bash
+pip install -e .
+# For NLI evidence scoring:
+pip install sentence-transformers
+# For real LLM inference:
+pip install transformers datasets
+# For GRPO training:
+pip install trl accelerate
+```
 ## Quick Start
 ```bash
+# Run all benchmarks and ablations
+python -m benchmarks.eval_runner
+# Run individual benchmarks
+python -m benchmarks.benchmark_code
+python -m benchmarks.benchmark_retrieval_qa
+python -m benchmarks.benchmark_debate
+# Run with real NLI model (requires sentence-transformers)
+python -m benchmarks.benchmark_retrieval_qa_nli
+# Adversarial debate benchmark
+python -m benchmarks.benchmark_debate_adversarial
+# GRPO offline demonstrator
+python -m rl.grpo_train_demo
 ```
+## Benchmark Results
+### Code Compute Allocation (Simulated)
+| Strategy | pass@1 | Compute | Savings |
+|----------|--------|---------|---------|
+| Fixed (expensive agent) | 0.780 | 17,500 | — |
+| Verifier-guided retries | 0.980 | 26,600 | -52% |
+| **OCC tiered escalation** | **0.780** | **8,350** | **52.3%** |
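+OCC tries cheap agents first and escalates only on failure. At iso-accuracy (0.780 pass@1), it uses 8,350 compute units versus 17,500 for the fixed baseline, a 52.3% reduction. Verifier-guided retries reach 0.980 pass@1 but use 52% more compute than the fixed baseline.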
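The cheapest-first escalation pattern can be sketched with toy agents. The agents, costs, and verifier below are made up for illustration and are unrelated to the simulated benchmark; the last line only re-derives the 52.3% figure from the table's compute numbers.

```python
def tiered_escalation(task, tiers, verify):
    """Try agents cheapest-first; stop at the first verified answer."""
    total_cost = 0
    for agent, cost in tiers:
        total_cost += cost
        answer = agent(task)
        if verify(task, answer):
            return answer, total_cost
    return None, total_cost


def cheap_agent(task):       # toy: only solves even-numbered tasks
    return task * 2 if task % 2 == 0 else None


def expensive_agent(task):   # toy: solves every task
    return task * 2


def verify(task, answer):
    return answer == task * 2


tiers = [(cheap_agent, 10), (expensive_agent, 100)]
tiered_cost = sum(tiered_escalation(t, tiers, verify)[1] for t in range(10))
fixed_cost = 10 * 100                      # always calling the expensive agent
savings = 1 - tiered_cost / fixed_cost

# Savings implied by the table above: 8,350 vs 17,500 compute units.
table_savings = 1 - 8350 / 17500
print(f"toy savings: {savings:.1%}, table savings: {table_savings:.1%}")
```

The savings depend on how often the cheap tier succeeds: every failed cheap attempt adds its cost on top of the expensive call.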
+### Code Compute Allocation (Real LLM - Qwen2.5-Coder-0.5B)
+Results pending: the GPU job is running on a T4. Script: `jobs/run_real_llm_standalone.py`
+### Retrieval QA (with real NLI - cross-encoder/nli-deberta-v3-xsmall)
+| Strategy | Accuracy | ECE | Retrievals |
+|----------|----------|-----|------------|
+| Direct answer | 0.580 | 0.226 | 0 |
+| RAG baseline | 0.750 | 0.167 | 338 |
+| RAG + verifier | 0.790 | 0.151 | 344 |
+| OCC baseline | 0.710 | 0.201 | 227 |
+| **OCC + real NLI** | *needs calibration* | — | 220 |
+Note: OCC + NLI shows stronger evidence quality but broker thresholds are too conservative on neutral evidence. Needs tuning for production use.
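For readers unfamiliar with the ECE column, the following is a sketch of the standard equal-width-bin expected calibration error. The README does not state how ECE was computed for the table, so the binning scheme and `n_bins=10` are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin_size / N) * |accuracy(bin) - mean_confidence(bin)|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Four answers given at 0.9 confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(ece, 3))
```

Lower is better: a value of 0 means stated confidence matches observed accuracy in every bin.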
+### Multi-Agent Debate
+With 50% adversarial agents:
+| Strategy | Accuracy | Quality/Compute |
+|----------|----------|-----------------|
+| Equal turns | 0.760 | 0.001275 |
+| Confidence-weighted | **0.560** | 0.000924 |
+| **OCC credit allocation** | **0.760** | **0.001196** |
+OCC contains adversarial agents, matching equal-turns accuracy (0.760), while confidence-weighted voting collapses to 0.560 because bad agents exploit high confidence.
+### Anti-Gaming
+| Attack | Detection | Containment |
+|--------|-----------|-------------|
+| Spam low-value actions | 100% credit exhaustion | Credits = 0 |
+| Hidden-test gaming | 100% oracle detection | Immediate penalty |
+| Over-abstention | 70% oracle penalization | Wrong abstentions punished |
+## Project Structure
+```
+/occ
+  /oracle        - Impact Oracle implementation
+  /ledger        - Credit Ledger with decay and provenance
+  /broker        - Capability-based Resource Broker
+  /rl            - GRPO reward hooks and offline comparator
+  /benchmarks    - Code, QA, and debate benchmarks
+  /jobs          - GPU job scripts for real LLM inference
+  /reports       - Evaluation results (JSON)
+  /configs       - Configuration files
+```
+## Limitations & Next Steps
+1. **Retrieval QA** needs better NLI calibration. Real NLI scores are strong, but broker thresholds are too conservative on neutral evidence.
+2. **All benchmarks use simulated agents** for tractability. The real LLM inference script (`jobs/run_real_llm_standalone.py`) has been submitted as a GPU job.
+3. The **GRPO training** hook is implemented but has not been trained on real data. The offline comparator validates the reward design.
+4. **Cost model** is token-count only. Real cost should include model size, latency, and API pricing.
+## Citation
+```bibtex
+@software{occ_stack,
+  title = {OCC: Oracle-Credit-Compute System for Agentic Compute Allocation},
+  author = {narcolepticchicken},
+  year = {2026},
+  url = {https://huggingface.co/narcolepticchicken/occ-stack}
+}
+```
 ## License
+Apache 2.0