Most production RAG and agent systems already have a retrieval or routing step. That step is good at choosing candidates, but it usually leaves repeated facts, stale tool output, boilerplate, and half-useful surrounding text in the final prompt.

The compression hop should sit after retrieval/reranking and before the expensive model call. It takes the query plus the context you were about to send, removes low-value text, and returns a smaller prompt with a receipt.

Retrieve → Rerank → Compress → Call model → Measure

Why not just retrieve less?

You can, and often should. The problem is that retrieval is a blunt boundary. A chunk can contain one answer-bearing sentence and 800 tokens of wrapper text. Dropping the whole chunk loses the answer; keeping the whole chunk wastes money and attention.

Compression lets the retriever stay generous while the final prompt gets stricter. That is useful for support copilots, agent traces, policy lookups, long tickets, and anything with duplicated context.
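To make the "one answer-bearing sentence in 800 tokens of wrapper" case concrete, here is a deliberately naive sketch that keeps only the sentences in a chunk that share terms with the query. It is an illustration of the idea, not how Rose 1 or any production compressor works; `keepRelevantSentences`, `tokenize`, and the `minOverlap` threshold are hypothetical names chosen for this example.

```typescript
// Toy illustration only: keyword overlap is a stand-in for a real relevance model.

// Lowercase the text and split it into alphanumeric terms.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Keep sentences that share at least `minOverlap` distinct terms with the query.
function keepRelevantSentences(
  query: string,
  chunk: string,
  minOverlap: number = 2
): string {
  const queryTerms = new Set(tokenize(query));

  // Split on sentence-ending punctuation; good enough for a sketch.
  const sentences = (chunk.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const kept = sentences.filter((sentence) => {
    const shared = new Set(tokenize(sentence).filter((t) => queryTerms.has(t)));
    return shared.size >= minOverlap;
  });

  return kept.join(" ");
}

const chunk =
  "Welcome to the weekly ops digest. " +
  "Checkout latency rose after the cache deploy caused a spike in misses. " +
  "Please contact support for help.";

const trimmed = keepRelevantSentences("Why did checkout latency spike?", chunk);
console.log(trimmed);
```

The retriever still returns the whole chunk; only the final prompt gets the narrower text. A real compressor needs a much better relevance signal than term overlap, which misses paraphrases and synonyms.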

What to measure

Token savings are not enough. A production compression layer should report the original token count, output token count, saved tokens, latency, compression ratio, and risk flags for every request. Without that receipt, it is hard to know whether a cheaper prompt is actually safe.
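A receipt with those fields might look like the sketch below. The field names, the risk-flag labels, and the 0.1 threshold are assumptions made for illustration, not Adola's actual response schema. The ratio is taken here as output tokens divided by original tokens, and the whitespace token count is a crude placeholder for your model's real tokenizer.

```typescript
// Hypothetical receipt shape; not an official schema.
interface CompressionReceipt {
  requestId: string;
  originalTokens: number;
  outputTokens: number;
  savedTokens: number;
  compressionRatio: number; // assumed definition: outputTokens / originalTokens
  latencyMs: number;
  riskFlags: string[];
}

// Crude whitespace estimate; swap in the tokenizer for your target model.
function estimateTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function buildReceipt(
  requestId: string,
  original: string,
  output: string,
  latencyMs: number
): CompressionReceipt {
  const originalTokens = estimateTokens(original);
  const outputTokens = estimateTokens(output);
  const compressionRatio =
    originalTokens === 0 ? 1 : outputTokens / originalTokens;

  // Illustrative flags; pick thresholds from your own evaluation data.
  const riskFlags: string[] = [];
  if (originalTokens > 0 && outputTokens === 0) riskFlags.push("empty_output");
  if (compressionRatio < 0.1) riskFlags.push("aggressive_compression");
  if (outputTokens > originalTokens) riskFlags.push("output_larger_than_input");

  return {
    requestId,
    originalTokens,
    outputTokens,
    savedTokens: originalTokens - outputTokens,
    compressionRatio,
    latencyMs,
    riskFlags,
  };
}

const receipt = buildReceipt("req-1", "a b c d e f g h i j", "a b c", 12);
console.log(JSON.stringify(receipt));
```

Logging one such record per request, keyed by request ID, is what makes a later quality regression traceable to a specific compression decision.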

  • Run the full-context baseline first.
  • Compress only the context you would have sent anyway.
  • Track output quality, not only token reduction.
  • Keep receipts so failures can be audited by request.
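One way to act on this checklist is to score the same requests twice, once with full context and once with compressed context, then compare. The sketch below assumes you already have a quality score per answer (from graders, an eval set, or human review); `EvalRun`, `summarize`, and the `maxScoreDrop` tolerance are hypothetical names for this example.

```typescript
// One request evaluated with full context (baseline) and with compressed context.
interface EvalRun {
  requestId: string;
  baselineTokens: number;
  compressedTokens: number;
  baselineScore: number;   // quality score in [0, 1], however you measure it
  compressedScore: number;
}

interface EvalSummary {
  meanTokenReduction: number; // average fraction of tokens removed
  meanScoreDelta: number;     // average (compressed - baseline) quality
  regressions: string[];      // request IDs whose quality dropped too far
}

function summarize(runs: EvalRun[], maxScoreDrop: number = 0.05): EvalSummary {
  if (runs.length === 0) {
    throw new Error("summarize needs at least one run");
  }

  let reductionSum = 0;
  let deltaSum = 0;
  const regressions: string[] = [];

  for (const run of runs) {
    reductionSum +=
      run.baselineTokens === 0
        ? 0
        : 1 - run.compressedTokens / run.baselineTokens;

    const delta = run.compressedScore - run.baselineScore;
    deltaSum += delta;

    // Keep the request ID so the failure can be audited against its receipt.
    if (delta < -maxScoreDrop) regressions.push(run.requestId);
  }

  return {
    meanTokenReduction: reductionSum / runs.length,
    meanScoreDelta: deltaSum / runs.length,
    regressions,
  };
}

const summary = summarize([
  { requestId: "a", baselineTokens: 1000, compressedTokens: 300, baselineScore: 0.9, compressedScore: 0.9 },
  { requestId: "b", baselineTokens: 1000, compressedTokens: 500, baselineScore: 0.8, compressedScore: 0.6 },
]);
console.log(JSON.stringify(summary));
```

Reporting regressions by request ID rather than only as an average matters because a healthy mean can hide a few requests where compression removed the answer.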

Where Adola fits

Adola runs Rose 1 as this pre-model compression API. Send the query and retrieved context to Adola, then pass the returned text to OpenAI, Anthropic, DeepSeek, a local model, or your own model router.

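// `adola` and `model` are client instances you have already configured;
// `retrievedContext` is the text your retrieval/reranking step produced.

// Step 1: compress the retrieved context against the user's query.
const compressed = await adola.compress({
  query: "Why did checkout latency spike?",
  input: retrievedContext,
  compression: { target_ratio: 0.3 }
});

// Step 2: send the compressed text to the model instead of the full context.
const answer = await model.responses.create({
  model: "your-model",
  input: compressed.output
});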