LLMSecOps - The art of securely exposing LLMs to the world

Learn how to secure self-hosted LLM APIs with auth, budget controls, bot protection, and output safety.

Most APIs can only break at specific stages. LLM-based APIs? They can break at every stage. Imagine a user typing 6 words - "Write a 5000-word essay on cats". Without the right safeguards, this simple query could drain your usage for the week.

That's the math behind every LLM endpoint you put on the public internet. The normal API playbook doesn't account for it.

This article is based on my work and a talk I gave at DevOps Leeds. You can read my slides here.

What the actual problems are

A normal API has a fixed contract. You know what goes in, what comes out, and how to size your infrastructure. An LLM endpoint breaks that contract at every level.

The inputs are completely open because there's no boundary between "instruction" and "data" in a prompt. Whatever the user sends is, in effect, the program the model runs. A user pastes in something from the internet and your model treats every word in it the same way as your system prompt, because it genuinely can't tell the difference. There's no schema to validate against and no type system to enforce.

Because the inputs are open, the outputs are unpredictable. The model decides what to say, and you're liable for whatever it produces, whether that's personal information, hallucinations, training-data regurgitation, or anything offensive enough to get screenshotted onto Twitter. You don't get to pre-approve any of it.

And because the outputs are unpredictable, the costs are too. Most AI products bill in characters or requests but spend in GPU-seconds or output tokens. Take our example of "Write a 5000-word essay on cats" - with character-based pricing, the user pays almost nothing while you pay everything. A six-word prompt can pin a GPU at full load and produce thousands of output tokens, up to 16,000 if nothing caps it. Per-request rate limits don't help here because they bound the wrong unit.
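To put rough numbers on that asymmetry, here is a minimal sketch. The tokens-per-word ratio is a common rule of thumb for English text, not a measurement, and the function and field names are made up for illustration.

```typescript
// Rough rule of thumb for English text (an assumption, not a measurement).
const TOKENS_PER_WORD = 1.3;

interface CostEstimate {
  promptChars: number;   // what character-based pricing bills
  outputTokens: number;  // what the GPU actually has to generate
  amplification: number; // output tokens generated per prompt character billed
}

// Estimates how much work a prompt asks for relative to what it is billed for.
function estimateAmplification(prompt: string, requestedWords: number): CostEstimate {
  const promptChars = prompt.length;
  const outputTokens = Math.round(requestedWords * TOKENS_PER_WORD);
  return { promptChars, outputTokens, amplification: outputTokens / promptChars };
}

const estimate = estimateAmplification("Write a 5000-word essay on cats", 5000);
console.log(estimate);
```

Under these assumptions the cats prompt is 31 characters of billable input asking for roughly 6,500 tokens of output, so every character billed buys about 200 tokens of generation.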

The rest of this post is about what you can actually do about each of these issues.

The surface area is wider than you think

When you self-host, the attack surface is the whole stack:

  • The public API (REST or gRPC endpoint) is what your users hit
  • The application layer handles request processing, retrieval, tool calls, and agent loops
  • The gateway sits between your app and the model, and this is the chokepoint
  • The inference server (vLLM, TGI, Triton, llama.cpp) has its own attack surface and CVE history
  • The model weights themselves are a supply-chain concern - where did they come from and who signed them?
  • The GPU host at the bottom of the stack may be shared with other tenants or other jobs

The pattern that matters here is funnelling everything model-bound through a single service, an AI gateway. One place to apply policy, one place to swap models, one place to instrument. The gateway is the chokepoint because it's the only layer every model-bound request converges on, regardless of which feature triggered it. Apply a control there once, whether that's rate limits, bot protection, or redaction, and it covers the entire surface area.

The alternative is scattering the same controls across every product API. That works fine until your fifth team ships an endpoint that forgot the new bot-check, and now you have an open vulnerability.
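As a sketch of what "one place to apply policy" can look like, the gateway can run every model-bound call through the same ordered list of policies. Everything here (the `ModelCall` shape, the `Policy` type, the two toy policies and their thresholds) is hypothetical and only illustrates the shape.

```typescript
// All names here are hypothetical; this only illustrates the "one chokepoint" shape.
interface ModelCall {
  tenantId: string;
  feature: string; // which product feature triggered the call
  model: string;
  prompt: string;
  maxTokens: number;
}

type Verdict = { allow: true } | { allow: false; status: number; reason: string };
type Policy = (call: ModelCall) => Verdict;

// Runs every policy in order and stops at the first rejection.
function runPolicies(policies: Policy[], call: ModelCall): Verdict {
  for (const policy of policies) {
    const verdict = policy(call);
    if (!verdict.allow) return verdict;
  }
  return { allow: true };
}

// Two toy policies. Real ones would cover auth, budgets, bot checks, and redaction.
const requireTenant: Policy = (call) =>
  call.tenantId !== "" ? { allow: true } : { allow: false, status: 401, reason: "missing tenant" };

const capMaxTokens: Policy = (call) =>
  call.maxTokens <= 1024 ? { allow: true } : { allow: false, status: 400, reason: "max tokens too high" };

const gatewayPolicies: Policy[] = [requireTenant, capMaxTokens];
```

A new control becomes one more entry in `gatewayPolicies`, and every feature picks it up without its team having to remember anything.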

Stack your auth

The typical API safeguard is a single API key check at the front door, and for most APIs that's fine. For an LLM endpoint it's nowhere near enough because a single stolen credential gives the attacker full access to your model.

Think of it as Zero Trust applied to the model boundary. Every request gets asked a stack of independent questions before it touches inference:

  • Who are you? Authenticated identity - an API key, a JWT, a session token
  • Is your account allowed to use this feature at all? Account-level restriction, blocklist checks
  • Are you on the right plan or tier? Entitlement check - free plans don't get the big model
  • Does this specific credential have the scope for this endpoint? Fine-grained access right - a leaked translate-only key shouldn't be able to call generate

In code, the pattern looks like this, with each check catching a different failure:

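// ModelRequest and Identity are placeholder types: the parsed request and the resolved caller.
async function authorizeModelRequest(req: ModelRequest): Promise<Identity> {
  const identity = await authenticateApiKey(req);         // who are you?
  assertAccountNotRestricted(identity.accountId);         // allowed to use this feature?
  assertTierMatchesDeployment(identity.tier, req.model);  // on the right plan?
  assertHasApiScope(identity.scopes, req.endpoint);       // credential has scope?
  return identity; // each assert throws on failure, so reaching here means all four checks passed
}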
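For something you can actually run, here is a self-contained sketch of the same four checks. The in-memory key store, blocklist, and tier table are stand-ins I made up for illustration, and the 401/403 split is one reasonable convention rather than the only one.

```typescript
// Hypothetical in-memory stand-ins for the key store, blocklist, and plan table.
type Tier = "free" | "pro";
interface Identity { accountId: string; tier: Tier; scopes: string[] }

class AuthError extends Error {
  readonly status: 401 | 403;
  constructor(status: 401 | 403, message: string) {
    super(message);
    this.status = status;
  }
}

const apiKeys = new Map<string, Identity>([
  ["key-translate", { accountId: "acct-1", tier: "free", scopes: ["translate"] }],
]);
const restrictedAccounts = new Set<string>();
const modelsByTier: Record<Tier, string[]> = { free: ["small"], pro: ["small", "large"] };

function authorize(apiKey: string, model: string, endpoint: string): Identity {
  const identity = apiKeys.get(apiKey);
  if (!identity) throw new AuthError(401, "unknown API key"); // who are you?
  if (restrictedAccounts.has(identity.accountId)) {
    throw new AuthError(403, "account restricted"); // allowed to use this feature?
  }
  if (!modelsByTier[identity.tier].includes(model)) {
    throw new AuthError(403, "model not available on this tier"); // on the right plan?
  }
  if (!identity.scopes.includes(endpoint)) {
    throw new AuthError(403, "credential lacks scope for this endpoint"); // credential has scope?
  }
  return identity;
}
```

With this setup, `authorize("key-translate", "small", "generate")` throws a 403 even though the key itself is valid, which is exactly the leaked translate-only key case.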

For self-hosted, Zero Trust extends inward. The inference server itself should never be on a public IP, the same way you'd never expose a database directly. The gateway becomes the trust boundary.

Budget what actually costs money

Per-request rate limits don't help because one request can be cheap and the next can burn your entire weekly budget. You have to budget on the unit that maps to your actual spend.

Two things to get right:

  • Count what actually costs money. For self-hosted that's GPU-seconds or output tokens, for managed it's per-provider tokens. The unit you defend needs to be the unit that YOU pay for.
  • Check the budget of the request before the model runs. An overused quota should cost you a 403, not an inference call. Reject in the gateway before paying compute.

The pre-flight check:

const remaining = await user.getRemainingBudget();
if (remaining !== null && !usageLimiter.isWithinLimit(remaining)) {
  return new Response("Quota exceeded", { status: 403 });
}
// only now does the inference server get called
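One way to back a check like that is a per-tenant token ledger that reserves the worst case before inference and refunds the difference afterwards. This is a minimal sketch under my own assumptions: the `TokenBudget` class is hypothetical, it lives in memory, and a real gateway would likely need a shared store and a reset schedule.

```typescript
// Hypothetical in-memory ledger; a real gateway would likely keep this in a shared store.
class TokenBudget {
  private used = new Map<string, number>();
  private readonly limitPerTenant: number;

  constructor(limitPerTenant: number) {
    this.limitPerTenant = limitPerTenant;
  }

  remaining(tenantId: string): number {
    return this.limitPerTenant - (this.used.get(tenantId) ?? 0);
  }

  // Reserve the worst case (maxTokens) up front; returns false if it does not fit.
  tryReserve(tenantId: string, maxTokens: number): boolean {
    if (maxTokens > this.remaining(tenantId)) return false;
    this.used.set(tenantId, (this.used.get(tenantId) ?? 0) + maxTokens);
    return true;
  }

  // After inference, refund the part of the reservation that was not generated.
  settle(tenantId: string, reserved: number, actualTokens: number): void {
    const refund = Math.max(0, reserved - actualTokens);
    this.used.set(tenantId, (this.used.get(tenantId) ?? 0) - refund);
  }
}
```

Reserving up front means two concurrent requests can't both pass the check and then jointly overshoot the quota.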

Always cap the maximum number of output tokens on every call since a request with no max-tokens limit is a denial-of-service vector.

For self-hosted, you're preventing GPU bottlenecks, where one tenant's long-running generation blocks every other tenant. Per-tenant token budgets plus hard caps on max-tokens are the difference between a fair system and one that grinds to a halt under one bad actor.
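The cap itself can be a few lines in the gateway. In this sketch the ceiling of 2048 is an arbitrary placeholder and `effectiveMaxTokens` is a hypothetical helper. The point is that a missing or oversized value never reaches the inference server unchanged.

```typescript
const HARD_MAX_OUTPUT_TOKENS = 2048; // assumed ceiling, tune for your models

// Returns the max-tokens value to send to the inference server, or null to reject.
function effectiveMaxTokens(requested: number | undefined, tenantRemaining: number): number | null {
  if (tenantRemaining <= 0) return null; // nothing left: reject before inference
  const wanted = requested ?? HARD_MAX_OUTPUT_TOKENS; // a missing limit gets the cap, never "unlimited"
  if (!Number.isInteger(wanted) || wanted <= 0) return null; // reject nonsense values
  return Math.min(wanted, HARD_MAX_OUTPUT_TOKENS, tenantRemaining);
}
```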

Bot protection and what you log

Public LLM endpoints are magnets for credential-stuffed access, leaked-key resale, and free-tier abuse. Just think about the recent Chipotle AI access hack. For an API surface (no browser, no JS), "bot protection" doesn't mean CAPTCHAs - there's no browser to render one. It means per-key anomaly scoring at the gateway:

  • Is the rate within the key's typical bounds, or a sudden burst?
  • Is the geography consistent with the key's previous patterns, or showing up in a new region?
  • Does the request pattern look human-driven, or are the output tokens maxed out on every call?
  • Has the key been seen in a public leak corpus (GitHub, npm, Docker images)?

If any signal looks anomalous, revoke the key, notify the owner, and return 403.
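The four questions above map fairly directly onto code. This is a sketch only: the baseline and observation shapes are hypothetical, the thresholds are placeholders you'd tune on your own traffic, and where the leak-corpus flag comes from is out of scope here.

```typescript
// Hypothetical per-key baseline and per-window observation; field names are illustrative.
interface KeyBaseline {
  typicalRequestsPerMinute: number;
  knownCountries: string[];
}

interface KeyObservation {
  requestsPerMinute: number;
  country: string;
  maxedOutputRatio: number; // share of recent calls that hit the max-tokens cap (0..1)
  seenInLeakCorpus: boolean;
}

// Returns the names of the signals that fired; thresholds are placeholders.
function anomalySignals(baseline: KeyBaseline, seen: KeyObservation): string[] {
  const signals: string[] = [];
  if (seen.requestsPerMinute > baseline.typicalRequestsPerMinute * 5) signals.push("rate-burst");
  if (!baseline.knownCountries.includes(seen.country)) signals.push("new-geography");
  if (seen.maxedOutputRatio > 0.9) signals.push("maxed-output");
  if (seen.seenInLeakCorpus) signals.push("leaked-key");
  return signals;
}
```

Returning the signal names rather than a bare boolean gives you something concrete to put in the notification to the key's owner.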

Logging policy, on the other hand, is a real fork in the road. You have two options and neither is obviously correct:

  • Don't log prompts - GDPR-friendly, nothing in your logs to leak, but you can't audit an incident
  • Log prompts with redaction - auditable, but you've created a new sensitive data store you now have to protect

What matters is that the choice is deliberate. Whichever you pick, log enough metadata (latency, output-token counts, error codes, request IDs, GPU host, tenant ID) that you can still investigate.
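Here's a sketch of a log record that works under either policy. The record shape is hypothetical, and the two redaction patterns are deliberately simple assumptions: they catch obvious emails and long digit runs and nothing cleverer.

```typescript
interface InferenceLogRecord {
  requestId: string;
  tenantId: string;
  gpuHost: string;
  latencyMs: number;
  outputTokens: number;
  errorCode: string | null;
  redactedPrompt?: string; // omitted entirely under the "don't log prompts" policy
}

// Deliberately simple patterns (assumptions): obvious emails and long digit runs only.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const LONG_DIGITS = /\b\d{9,}\b/g;

function redact(text: string): string {
  return text.replace(EMAIL, "[EMAIL]").replace(LONG_DIGITS, "[NUMBER]");
}

// Builds the record; the prompt is only included (redacted) when the policy allows it.
function buildLogRecord(
  meta: Omit<InferenceLogRecord, "redactedPrompt">,
  prompt: string,
  logPrompts: boolean,
): InferenceLogRecord {
  return logPrompts ? { ...meta, redactedPrompt: redact(prompt) } : { ...meta };
}
```

Making `logPrompts` an explicit argument keeps the policy visible at every call site instead of buried in a logger config.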

Own the safety stack

If you use a managed provider, you inherit their content filter. You catch the content-filter errors and fail the request, and that defence comes for free.

If you self-host, there's no vendor filter and you own the entire output validation surface. Three patterns to consider with their honest trade-offs:

  • Rule-based output filters (regex for credit cards, PII patterns, blocklists) are cheap and fast but brittle. They catch the obvious and miss anything creative
  • A classifier model as judge, where a small second model scores outputs for safety, hallucination, or policy violation, is more accurate but adds roughly 100-500ms of latency plus compute cost. You also now have two models to monitor
  • Structured-output enforcement (JSON schemas, grammar-constrained decoding) eliminates entire classes of "the model went off-script" failures but requires an API redesign because your endpoint can no longer return free-form text
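To make the first pattern concrete, here is a minimal rule-based filter for card numbers. It's a sketch under my own assumptions: a regex finds candidate digit runs and a Luhn checksum cuts down false positives. It will still miss anything formatted creatively, which is the brittleness mentioned above.

```typescript
// Luhn checksum: filters out digit runs that cannot be valid card numbers.
function passesLuhn(digits: string): boolean {
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
const CARD_CANDIDATE = /\b\d(?:[ -]?\d){12,18}\b/g;

// Replaces Luhn-valid candidates and reports how many were redacted.
function filterOutput(text: string): { text: string; redactions: number } {
  let redactions = 0;
  const filtered = text.replace(CARD_CANDIDATE, (match) => {
    const digits = match.replace(/[ -]/g, "");
    if (!passesLuhn(digits)) return match;
    redactions++;
    return "[REDACTED CARD]";
  });
  return { text: filtered, redactions };
}
```

The redaction count is worth exporting as a metric, since a sudden spike is a signal in its own right.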

Open-source guardrail tools like Llama Guard, NeMo Guardrails, and Guardrails AI exist but they're not drop-in. Treat them as a starting point to build your own safety stack.

What's still hard

Everything above has a concrete answer you can ship this quarter. However, there are still a lot of unknowns in the field of LLMSecOps.

  1. Model and weight integrity: Most teams pull model weights from Hugging Face in their CI pipeline and don't verify signatures, because there usually aren't any.
  2. Privilege escalation on inference servers: Inference servers (vLLM, TGI, Triton, whatever you're running) generally have root-level access to your GPUs.
  3. Multi-tenancy bottlenecks: GPU multi-tenancy is a real side-channel surface that doesn't have a clean answer.
  4. Prompt injection: Prompt injection is still mostly unsolved.
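None of these have a clean fix, but for the first item there is a partial stopgap when signatures don't exist: pin a SHA-256 digest for each weight file when you first review it and refuse anything that doesn't match. To be clear, this is checksum pinning, not signature verification, and it only tells you the file hasn't changed since you pinned it. The manifest shape is hypothetical, and real weight files are large enough that you'd hash them as a stream rather than from memory as this sketch does.

```typescript
import { createHash } from "node:crypto";

// Hypothetical pinned manifest: file name -> expected SHA-256 hex digest,
// recorded once when the weights were first reviewed and approved.
type WeightManifest = Record<string, string>;

function sha256Hex(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Returns the names of files that are missing or do not match their pinned digest.
function findTamperedFiles(manifest: WeightManifest, files: Map<string, Uint8Array>): string[] {
  const bad: string[] = [];
  for (const [name, expected] of Object.entries(manifest)) {
    const data = files.get(name);
    if (!data || sha256Hex(data) !== expected.toLowerCase()) bad.push(name);
  }
  return bad;
}
```

A CI step can fail the build whenever `findTamperedFiles` returns a non-empty list, so a silently swapped file never reaches the inference server.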

Every team I've talked to is working through these issues in real time. The best teams are honest about which problems they've solved and which they still have to solve. That honesty is what helps your on-call team when an alert fires at 3AM.

What to take away

  • Stack your auth and keep the inference server private. Each layer should answer a different question so a single breach doesn't open everything.
  • Budget GPU-seconds or output tokens, not requests per second. Check the budget before the model runs.
  • Own the safety stack. When you self-host, no vendor catches your mistakes. Know what you've built, what you've deferred, and what you're not building yet.

If you're implementing any of this in your own stack or just want to talk shop, reach out on LinkedIn. I'd love to hear what's working for you and what isn't.