AI in production — the boring infrastructure questions that nobody asks

There is an entire industry of AI demos that look great in a browser tab and never make it any further. The transition from “it works on my machine” to “it works at 3 AM under load with full audit trails” is where most AI projects quietly die.

The questions that nobody puts on the slide deck:

What happens at 100x load

A demo that takes 800ms per inference is fine for a single user. At 100 simultaneous users, a normal Tuesday morning for a Filipino retail brand, the same setup either hits the provider’s rate limits or pays a 100x premium for throughput. We design for this from day one: caching where it earns its keep, queuing where it does not, and a graceful degradation path for when the inference provider is slow.
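A minimal sketch of that shape, in TypeScript. Everything here is illustrative: callModel and cannedReply are placeholder stubs for your provider client and fallback, and the limits are starting points, not recommendations.

```typescript
// Sketch only: cache, concurrency cap, and a deterministic fallback around
// one inference call. callModel, cannedReply, and every limit are placeholders.

const cache = new Map<string, { value: string; expires: number }>();
const CACHE_TTL_MS = 5 * 60_000; // serve identical prompts from cache for 5 minutes
const MAX_IN_FLIGHT = 20;        // past this, requests wait in a queue
const TIMEOUT_MS = 3_000;        // past this, degrade instead of showing a spinner

let inFlight = 0;
const waiters: Array<() => void> = [];

async function infer(prompt: string): Promise<string> {
  const hit = cache.get(prompt);
  if (hit && hit.expires > Date.now()) return hit.value; // caching where it earns its keep

  if (inFlight >= MAX_IN_FLIGHT) {
    await new Promise<void>((resolve) => waiters.push(resolve)); // queuing where it does not
  }
  inFlight++;
  try {
    const result = await Promise.race([
      callModel(prompt),
      new Promise<string>((_, reject) =>
        setTimeout(() => reject(new Error("inference timeout")), TIMEOUT_MS),
      ),
    ]);
    cache.set(prompt, { value: result, expires: Date.now() + CACHE_TTL_MS });
    return result;
  } catch {
    return cannedReply(prompt); // graceful degradation when the provider is slow
  } finally {
    inFlight--;
    waiters.shift()?.(); // release the next queued request, if any
  }
}

async function callModel(prompt: string): Promise<string> {
  // Stand-in for the real provider SDK call.
  return `model answer for: ${prompt}`;
}

function cannedReply(prompt: string): string {
  // Deterministic fallback: less impressive, always available.
  return "We are handling high demand right now. Here is our standard answer...";
}
```

The point of the sketch is the ordering: check the cache before touching the provider, bound concurrency before the provider bounds it for you, and make the failure branch return something a customer can read.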

Who gets paged when latency spikes

“Latency spikes” is engineer-speak for “your customers are watching a spinner.” The on-call rotation, the alerting thresholds, the runbook for “the inference provider is having a bad day” — these are not glamorous, but their absence is what produces 4 AM phone calls.
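What an alerting threshold looks like when it is actually written down, as a toy sketch: pageOnCall and both numbers are placeholders for a real paging service and whatever threshold you agree on.

```typescript
// Sketch only: rolling p95 latency check. pageOnCall and the numbers are
// placeholders for a real paging integration and an agreed threshold.

const WINDOW = 200;         // look at the last 200 requests
const P95_LIMIT_MS = 2_000; // "customers are watching a spinner" territory
const samples: number[] = [];

function recordLatency(ms: number): void {
  samples.push(ms);
  if (samples.length > WINDOW) samples.shift();
  if (samples.length < WINDOW) return; // not enough data to alert on yet

  const sorted = [...samples].sort((a, b) => a - b);
  const p95 = sorted[Math.floor(sorted.length * 0.95)];
  if (p95 > P95_LIMIT_MS) {
    pageOnCall(`p95 latency ${p95}ms over last ${WINDOW} requests (limit ${P95_LIMIT_MS}ms)`);
  }
}

function pageOnCall(message: string): void {
  // Stand-in for PagerDuty, Opsgenie, or a phone that rings at 4 AM.
  console.error(`[PAGE] ${message}`);
}
```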

How we cap monthly spend before it triples

Token-based pricing scales faster than anyone budgets for. A bug in a prompt-construction routine can multiply your bill by ten without anyone noticing for a week. We build hard-stop budgets at the gateway level: if monthly spend hits an agreed ceiling, the system degrades to a deterministic fallback, the dashboard turns red, and humans get notified before any more money is spent.
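A minimal sketch of that gateway check. The ceiling, the in-memory counter, and every function name here are placeholders; in practice the spend figure comes from your billing or usage source, and the notifications go wherever humans actually look.

```typescript
// Sketch only: hard-stop budget at the gateway. The ceiling, the billing
// source, and every function name here are placeholders.

const MONTHLY_CEILING_USD = 500; // the agreed ceiling
const WARN_FRACTION = 0.8;       // turn the dashboard red before the hard stop

let monthlySpendUSD = 0; // in practice, read from your billing/usage source

async function gatedInfer(prompt: string): Promise<string> {
  if (monthlySpendUSD >= MONTHLY_CEILING_USD) {
    notifyHumans(`Monthly ceiling of $${MONTHLY_CEILING_USD} reached; serving fallback only.`);
    return deterministicFallback(prompt); // degrade instead of spending more
  }
  if (monthlySpendUSD >= MONTHLY_CEILING_USD * WARN_FRACTION) {
    notifyHumans(`Spend at ${Math.round((monthlySpendUSD / MONTHLY_CEILING_USD) * 100)}% of budget.`);
  }
  const { text, costUSD } = await callModelWithCost(prompt);
  monthlySpendUSD += costUSD; // every call is counted against the ceiling
  return text;
}

async function callModelWithCost(prompt: string): Promise<{ text: string; costUSD: number }> {
  // Stand-in for a provider call that reports (or lets you compute) token cost.
  return { text: `model answer for: ${prompt}`, costUSD: 0.002 };
}

function deterministicFallback(prompt: string): string {
  return "Our assistant is paused for the month; a human will follow up shortly.";
}

function notifyHumans(message: string): void {
  // Stand-in for a Slack webhook, email, or dashboard flag someone will see.
  console.warn(`[BUDGET] ${message}`);
}
```

The design choice worth arguing about is the hard stop itself: a ceiling that only warns is a suggestion, and suggestions do not survive a runaway prompt loop.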

What the handoff looks like

Production engagements end with the same artifacts every time: source code in your repo, deployment scripts you can run, and runbooks the on-call can read at 3 AM, plus 30 days of side-by-side support before the final handover. We have never asked a client to keep us on retainer just to operate a system we built. If that is a deal-breaker for an agency, that is the wrong agency.