How We Cut AI Compute Costs by 95%

The fastest way to save money on AI compute is to not spend it in the first place.

That sounds glib, but it is the actual lesson of the last six months of building. Every dollar we save is a dollar we never sent to an API. Every answer we produce in 25 milliseconds is an answer we never paid a cloud provider 2,500 milliseconds to compute. The architecture has to earn its cost savings, not pray for a price cut from a vendor we do not control.

Here is the number that matters: a full eight-agent intelligence analysis, the kind that used to cost us between eight and twelve dollars in API spend and take four minutes, now costs us under sixty cents and finishes in under thirty seconds. That is roughly a 95 percent reduction in cost and close to an order of magnitude improvement in latency, produced by changing how the pipeline is structured, not by negotiating with any vendor.

This is how we got there.

The Realization: Most "Reasoning" Is Actually Lookup

When we instrumented our early agent pipelines and looked at what the language model was actually doing on each turn, an uncomfortable pattern emerged. Most of the time the model was not reasoning at all. It was retrieving.

"What does this term mean?" "What category does this fall into?" "Is this related to that?" "What comes next in this sequence?" These are not reasoning questions. They are dictionary questions dressed up in a language model wrapper. And we were paying full API rates for every single one.

Most "AI reasoning" is really structured lookup. Graphs handle it in milliseconds for zero cost.

The first architectural move was to stop paying for lookup. We built a knowledge graph layer — actually, we built a library of them, one per domain — and pushed every retrieval-style question down into graph traversal. When the pipeline needs to know that reentrancy is a subtype of smart-contract vulnerability, or that a particular failure mode causes another, it does not ask an LLM. It walks edges in a local file.

That is not a small optimization. On our intelligence pipelines, pushing lookup down to the graph reduced API calls by roughly two thirds. It also changed the latency profile dramatically: graph traversal finishes in tens of milliseconds, while a single API round-trip takes two to three seconds. Stacking those savings across a multi-agent analysis moves the total wall-clock time from minutes to seconds.
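A minimal sketch of the kind of subtype lookup described above, assuming the graph is held as a plain dictionary of typed edges. The `EDGES` data, the relation names, and the `is_a` helper are all hypothetical illustrations, not our production schema.

```python
from collections import deque

# Hypothetical toy graph: (source, relation) -> list of targets.
# The node and relation names are illustrative assumptions.
EDGES = {
    ("reentrancy", "subtype_of"): ["smart_contract_vulnerability"],
    ("smart_contract_vulnerability", "subtype_of"): ["software_vulnerability"],
}


def is_a(graph, node, ancestor, relation="subtype_of"):
    """Return True if `ancestor` is reachable from `node` via `relation` edges."""
    seen = {node}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        if current == ancestor:
            return True
        for nxt in graph.get((current, relation), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Answered by walking two edges in memory, with no model call.
print(is_a(EDGES, "reentrancy", "software_vulnerability"))
```

The breadth-first walk visits each node at most once, so the cost of a lookup scales with the size of the reachable subgraph rather than with any token count.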

The Pipeline: Graph First, Model Second

With lookup pushed out of the model, the pipeline inverted. The default is graph traversal. The language model is only called when the graph genuinely does not have the answer — and specifically, when the problem requires judgment rather than retrieval.

In practice this looks like a two-layer pipeline. The first layer is a set of deterministic agents that do nothing but walk the graph. Each one specializes in a specific kind of question: containment, ordering, identity, counting, transitive reasoning, cross-domain analogy. If the question can be answered by walking edges, one of these agents produces the answer in under 25 milliseconds. No tokens are consumed, no API cost is incurred, and no model is in the loop to hallucinate.
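To make the first layer concrete, here is a hedged sketch of two such specialists, containment and counting, behind a small dispatch table. The `CONTAINS` data, the agent functions, and the `ask` helper are assumptions made up for illustration; a real deployment would have more agents and a richer graph.

```python
# Hypothetical toy data: parent -> directly contained children.
CONTAINS = {
    "pipeline": ["graph_layer", "guided_layer"],
    "graph_layer": ["containment_agent", "counting_agent"],
    "guided_layer": [],
}


def descendants(graph, node):
    """All nodes reachable from `node` through containment edges."""
    found, stack = [], list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(graph.get(current, []))
    return found


def containment_agent(graph, outer, inner):
    """Does `outer` contain `inner`, directly or transitively?"""
    return inner in descendants(graph, outer)


def counting_agent(graph, outer):
    """How many things does `outer` contain, directly or transitively?"""
    return len(descendants(graph, outer))


# One deterministic agent per kind of question (only two shown here).
AGENTS = {"containment": containment_agent, "counting": counting_agent}


def ask(kind, *args):
    agent = AGENTS.get(kind)
    if agent is None:
        return None  # no structural agent covers this kind of question
    return agent(CONTAINS, *args)
```

Returning `None` for an unknown question kind is a deliberate choice in this sketch: it gives a later stage an unambiguous signal that the graph layer could not reach an answer.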

We had to stop treating the language model as the reasoner and start treating it as the referee. It only gets called when the structural agents cannot agree or cannot reach.

The second layer is a small set of guided-reasoning agents that each wrap a local language model. These handle the tasks that structural traversal genuinely cannot: evaluating whether a plan's assumptions will hold, proposing remediation when a failure mode has no pre-encoded fix, synthesizing a narrative from heterogeneous findings. They cost more per invocation than the graph agents — we are back to real tokens now — but they only fire on the small subset of questions that need them.

The critical engineering detail is that the second layer runs entirely on local hardware. When we say "the model gets called," we do not mean an API call sent to a cloud provider. We mean a 3-billion-parameter model sitting in VRAM on the same laptop the graph agents are running on. Local inference at that size is fast enough to feel instant and incurs no per-query API charge.
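The "referee" arrangement in this section can be sketched as follows, assuming each structural agent returns either a finding or `None` and the local model is passed in as an ordinary callable. The names `answer`, `agents`, and `local_model` are hypothetical, and the sketch deliberately does not bind to any particular inference library.

```python
from typing import Callable, List, Optional, Tuple

# Assumed shapes: a structural agent maps a question to a finding or None;
# the local model takes the question plus whatever findings exist.
StructuralAgent = Callable[[str], Optional[str]]
LocalModel = Callable[[str, List[str]], str]


def answer(question: str,
           agents: List[StructuralAgent],
           local_model: LocalModel) -> Tuple[str, str]:
    """Return (answer, source); the model runs only when the graph layer fails."""
    findings = [agent(question) for agent in agents]
    reached = [f for f in findings if f is not None]
    # Fast path: at least one agent reached an answer and all of them agree.
    if reached and len(set(reached)) == 1:
        return reached[0], "graph"
    # Agents disagreed or none could reach: hand off to the local model.
    return local_model(question, reached), "model"
```

Passing the conflicting findings to the model, rather than the bare question, keeps the model in the referee role: it adjudicates between structural results instead of starting from scratch.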

Deterministic First, Probabilistic Second

There is a second reason to run the pipeline this way, and it is not about money. It is about trust.

Most of the time, when a cloud language model gives a wrong answer, you never find out. The answer looks plausible, it sounds confident, and there is no way to audit how it got there. You get a black box that occasionally produces bad outputs, and the failure mode is silent.

Deterministic traversal is fully auditable. Every conclusion traces back to specific edges in a specific graph.

Graph traversal is the opposite. When one of our structural agents returns a finding, we can show exactly which nodes it visited, in what order, under which rules. The audit trail is not a separate logging system. The audit trail is the computation. If the finding is wrong, you can look at the edges it walked and see exactly where the error was — either the graph had a bad edge, or the rule was wrong, or the question was framed incorrectly. It is a fixable kind of wrong.

That matters for every customer we work with, but it matters most for defense and regulated industries. A finding that cannot be explained is a finding that cannot be used. We decided early that explainability would be a property of the architecture, not a feature bolted on top of it.
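One way to picture "the audit trail is the computation" is a traversal whose return value is the list of edges it walked. This is a minimal sketch under assumed names: `trace_path`, the triple format, and the example edges are hypothetical and stand in for whatever a real graph store provides.

```python
from collections import deque


def trace_path(edges, start, goal):
    """Breadth-first search over (source, relation, target) triples.

    Returns the ordered list of edges walked from `start` to `goal`,
    or None if the graph does not connect them.
    """
    parents = {start: None}  # node -> the edge that first reached it
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            trail = []
            while parents[node] is not None:
                edge = parents[node]
                trail.append(edge)
                node = edge[0]  # step back to the edge's source
            return list(reversed(trail))
        for src, rel, dst in edges:
            if src == node and dst not in parents:
                parents[dst] = (src, rel, dst)
                queue.append(dst)
    return None


# Hypothetical edges, for illustration only.
EXAMPLE_EDGES = [
    ("missing_guard", "causes", "reentrancy"),
    ("reentrancy", "causes", "fund_drain"),
]

for edge in trace_path(EXAMPLE_EDGES, "missing_guard", "fund_drain"):
    print(edge)
```

If a finding turns out to be wrong, the returned trail points at the specific edge to inspect, which is the "fixable kind of wrong" described above.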

The Cost Breakdown, Specifically

Here is the actual cost picture on our current flagship pipeline, a six-agent deep research workflow that produces the kind of deliverable a senior analyst would otherwise spend a half day on.

Old pipeline (cloud API, six sequential model calls, no graph layer): $8–12 per run, 3–5 minute latency, no audit trail, cannot run offline.

Current pipeline (graph layer plus three local guided agents plus one cloud call for final synthesis): $0.30–$0.75 per run, 20–40 second latency, full audit trail, can run offline when the synthesis step is moved to a local model.

The remaining cloud call — the one we still pay for — is the final synthesis step that stitches six specialists' outputs into a single deliverable. We kept it on a cloud model because synthesis is where judgment matters most and we wanted the largest available reasoning model in the loop for that specific job. Everything else runs locally.

If we wanted zero cloud spend, we could drop that synthesis call too and use a local model for it. The output would be a little less polished, but we have already validated that the pipeline works at $0 per run when we need it to. That is the option we pull out for classified or air-gapped environments where cloud access is simply not available.
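For readers who want to check the arithmetic, the per-run figures above imply the following ranges. The numbers are the ones quoted in this section; the helper names `reduction` and `speedup` are just illustrative.

```python
def reduction(old, new):
    """Fractional saving when a cost drops from `old` to `new`."""
    return 1 - new / old


def speedup(old_seconds, new_seconds):
    """How many times faster the new latency is."""
    return old_seconds / new_seconds


# Per-run figures quoted above, in dollars and seconds.
OLD_COST = (8.00, 12.00)
NEW_COST = (0.30, 0.75)
OLD_LATENCY = (180, 300)   # 3 to 5 minutes
NEW_LATENCY = (20, 40)

# Least favorable pairing: cheapest old run against most expensive new run.
worst_saving = reduction(OLD_COST[0], NEW_COST[1])
# Most favorable pairing: most expensive old run against cheapest new run.
best_saving = reduction(OLD_COST[1], NEW_COST[0])

slowest_gain = speedup(OLD_LATENCY[0], NEW_LATENCY[1])
fastest_gain = speedup(OLD_LATENCY[1], NEW_LATENCY[0])

print(f"cost saving: {worst_saving:.1%} to {best_saving:.1%}")
print(f"latency gain: {slowest_gain:.1f}x to {fastest_gain:.1f}x")
```

On these figures the saving falls between about 90.6 and 97.5 percent, and the latency gain between 4.5x and 15x, depending on which ends of the ranges are paired.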

Why This Does Not Break

The usual objection to hybrid pipelines is that they are brittle. You build a graph, you curate a set of rules, and then reality hits — a new domain, a new threat, a new terminology — and your carefully assembled knowledge base no longer covers the question. The language model approach handles the novelty gracefully; the structured approach falls off a cliff.

Our answer is a routing layer that knows the difference. Every incoming task is classified against a similarity threshold: does the knowledge graph have coverage, or does it not? When the graph has coverage, we use the fast path. When it does not, the router falls through to the guided reasoning layer, and on genuinely hard questions it falls through again to a more expensive deliberation process. The pipeline never gets stuck; it just gets more expensive as the problem gets harder.
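The fall-through behavior can be sketched as a three-tier router. Everything named here is an assumption for illustration: the `coverage` scoring function, the `0.8` threshold, and the convention that a tier returns `None` when it cannot answer are hypothetical, not a description of the production router.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Route:
    tier: str     # "graph", "guided", or "deliberation"
    answer: str


def route(task: str,
          coverage: Callable[[str], float],
          graph_answer: Callable[[str], Optional[str]],
          guided_answer: Callable[[str], Optional[str]],
          deliberate: Callable[[str], str],
          threshold: float = 0.8) -> Route:
    """Try the cheapest tier first and fall through as the task gets harder."""
    # Tier 1: only attempt the graph when similarity says it has coverage.
    if coverage(task) >= threshold:
        result = graph_answer(task)
        if result is not None:
            return Route("graph", result)
    # Tier 2: guided reasoning on a local model.
    result = guided_answer(task)
    if result is not None:
        return Route("guided", result)
    # Tier 3: the expensive path always produces an answer.
    return Route("deliberation", deliberate(task))
```

Because the last tier always returns something, the router cannot dead-end; the only thing that varies with difficulty is which tier ends up paying for the answer.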

Dynamic routing means the fast path handles 80 percent of traffic at near-zero cost; the expensive path handles the 20 percent that actually needs it.

The cost structure reflects that. Easy questions cost nothing. Medium questions cost fractions of a cent. Hard questions that escalate to deliberation cost somewhere between fifty cents and a dollar fifty. Across a realistic mixture of incoming work, the weighted average comes in under a dollar per complex analysis — and the median is effectively zero because most questions never leave the graph layer.
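A quick sanity check on the weighted average, using the 80/20 split and the $1.50 ceiling quoted above. The split between medium and hard questions inside the 20 percent is not stated here, so this sketch assumes the most pessimistic case, in which every escalated task costs the full $1.50. Treating each routed task as one unit of cost is also an assumption.

```python
from statistics import median


def expected_cost(mix):
    """Weighted average cost for a list of (traffic_share, cost) pairs."""
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost for share, cost in mix)


# Pessimistic assumption: all traffic that leaves the fast path is "hard"
# and costs the top of the quoted range.
WORST_CASE_MIX = [(0.80, 0.00), (0.20, 1.50)]
ceiling = expected_cost(WORST_CASE_MIX)

# Ten representative tasks under the same assumption.
sample = [0.00] * 8 + [1.50] * 2
typical = median(sample)

print(f"worst-case average: ${ceiling:.2f} per task")
print(f"median: ${typical:.2f} per task")
```

Even under that pessimistic mix the average stays near thirty cents per task, and the median is zero because more than half of the tasks never leave the graph layer.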

The Part Nobody Talks About

Building this architecture took more engineering work than plugging into a cloud API. That is the honest tradeoff. Wrapping a language model in a pretty UI is a weekend. Building a routing layer, twenty-plus specialist agents, a graph curation pipeline, a consolidation cycle that turns past results into reusable knowledge, and a metacognitive controller that monitors and falls back gracefully — that is months of work before you have anything to show a customer.

But the tradeoff runs in our favor as soon as the system is deployed. We are paying close to zero marginal cost per query, and so are our customers. The system gets faster as the graph grows, not slower. The economics flip from "linear cost in usage" to "fixed cost in engineering," and the payback period on the engineering is measured in months, not years.

The right question is not "how cheap can your cloud bill get." It is "what would you build if cloud cost were zero and latency were free." Because that is what a graph-first local pipeline actually delivers.

Where This Goes Next

We are pushing the cost number down further. The current target is under ten cents per complex analysis, achieved by moving the synthesis step fully onto a local model we have been fine-tuning on our own previous outputs. We are also building a consolidation loop that mines our historical analyses for reusable patterns and folds them back into the graph layer, which means every hard question that escalates makes future hard questions cheaper.
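The consolidation idea can be illustrated with a deliberately simplified sketch: when a question escalates, its result is written back as an edge so the same question is answered from the graph next time. This is closer to memoization than to real pattern mining, and the `ConsolidatingGraph` class and its methods are hypothetical names chosen for the example.

```python
class ConsolidatingGraph:
    """Toy graph that learns edges from escalated answers (illustrative only)."""

    def __init__(self):
        self.edges = {}        # (subject, relation) -> answer
        self.escalations = 0   # how many times the expensive path ran

    def lookup(self, subject, relation):
        return self.edges.get((subject, relation))

    def answer(self, subject, relation, expensive):
        """Answer from the graph if possible; otherwise escalate and remember."""
        known = self.lookup(subject, relation)
        if known is not None:
            return known, "graph"
        self.escalations += 1
        result = expensive(subject, relation)
        # Fold the result back in so the next identical question is a lookup.
        self.edges[(subject, relation)] = result
        return result, "escalated"
```

In this sketch the expensive path runs once per distinct question, which is the mechanism by which escalations today reduce cost tomorrow.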

Our bet is that two years from now, a full intelligence pipeline will cost effectively nothing per run, finish in single-digit seconds, run entirely offline, and produce audit trails detailed enough to pass a compliance review without a human in the loop. The architecture that gets you there is not a bigger cloud bill. It is a different shape.

If you are wrestling with AI compute costs on a real program — not a demo, not a proof of concept, but something that has to scale — we are happy to walk through how this works on your problem specifically.
