Most LLM cost work starts with model choice. That helps, but it does not fix systems where every request carries a large bundle of retrieved docs, chat history, tool output, policies, and duplicated context.

A compression layer sits in front of the model call, whether that call goes to OpenAI, Anthropic, DeepSeek, or a local model. It keeps the integration provider-neutral and returns a smaller prompt plus a receipt showing original tokens, output tokens, saved tokens, ratio, latency, and risk flags.
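To make the receipt concrete, here is a minimal Python sketch of how its fields could relate to one another. The class name, the field names, the millisecond unit, and the definition of ratio as output tokens divided by original tokens are all assumptions for illustration; the actual receipt schema may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CompressionReceipt:
    """Hypothetical receipt shape; names are illustrative, not the real schema."""

    original_tokens: int
    output_tokens: int
    latency_ms: float
    risk_flags: List[str] = field(default_factory=list)

    @property
    def saved_tokens(self) -> int:
        # Tokens removed by compression.
        return self.original_tokens - self.output_tokens

    @property
    def ratio(self) -> float:
        # Assumed definition: fraction of the original prompt that remains.
        if self.original_tokens == 0:
            return 1.0
        return self.output_tokens / self.original_tokens


receipt = CompressionReceipt(original_tokens=12_000, output_tokens=4_200, latency_ms=85.0)
print(receipt.saved_tokens, round(receipt.ratio, 2))  # prints: 7800 0.35
```

Deriving saved tokens and ratio from the two raw counts keeps the receipt internally consistent, which matters if the numbers are later aggregated for cost reporting.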

Retrieve → Assemble context → Compress → Call model → Log receipt
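The five stages can be sketched as one function that takes each stage as a plain callable. This is only an outline of the ordering: `run_pipeline`, `fake_compress`, and the lambda stand-ins are hypothetical names, and the placeholder compressor simply drops words, which is not how a real compression model behaves.

```python
from typing import Callable, Dict, List, Tuple


def run_pipeline(
    query: str,
    retrieve: Callable[[str], List[str]],
    compress: Callable[[str, str], Tuple[str, Dict]],
    call_model: Callable[[str], str],
    log_receipt: Callable[[Dict], None],
) -> str:
    # 1. Retrieve candidate documents for the query.
    docs = retrieve(query)
    # 2. Assemble one context string from the retrieved documents.
    context = "\n\n".join(docs)
    # 3. Compress; the stage returns the smaller text plus a receipt.
    compressed, receipt = compress(query, context)
    # 4. Call the model with the compressed context.
    answer = call_model(f"{compressed}\n\nQuestion: {query}")
    # 5. Log the receipt so savings can be audited later.
    log_receipt(receipt)
    return answer


def fake_compress(query: str, context: str) -> Tuple[str, Dict]:
    # Placeholder only: keeps the first half of the words and counts words, not tokens.
    words = context.split()
    kept = words[: max(1, len(words) // 2)]
    return " ".join(kept), {"original_words": len(words), "output_words": len(kept)}


logged: List[Dict] = []
answer = run_pipeline(
    "What caused the issue?",
    retrieve=lambda q: ["disk full on node a", "retry storm after deploy"],
    compress=fake_compress,
    call_model=lambda prompt: f"model saw {len(prompt.split())} words",
    log_receipt=logged.append,
)
```

Because every stage is injected, the model call can point at any provider without touching the compression step, which is the provider-neutral property described above.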

Where compression fits with other cost controls

Route

Send easy requests to cheaper models when quality allows.

Cache

Avoid recomputing identical or near-identical outputs.

Retrieve

Fetch fewer documents when recall is already high.

Compress

Trim oversized context before the expensive model call.
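As a rough illustration of why the input side matters, the arithmetic below compares per-request input cost before and after compression. The price and the token counts are placeholders chosen for the example, not measured results or real provider pricing.

```python
def input_cost(tokens: int, usd_per_million_tokens: float) -> float:
    # Input cost scales linearly with the number of prompt tokens.
    return tokens / 1_000_000 * usd_per_million_tokens


PRICE = 3.0  # hypothetical USD per million input tokens

before = input_cost(12_000, PRICE)  # uncompressed prompt
after = input_cost(4_200, PRICE)    # same prompt at an assumed 0.35 ratio

print(f"{before:.4f} -> {after:.4f} USD per request")
```

This ignores the cost and latency of the compression step itself, so the net saving depends on how that step is priced.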

Best workloads

  • RAG search that retrieves broad chunks for safety.
  • AI agents that carry long tool traces between steps.
  • Support copilots with repeated tickets, policies, and account notes.
  • Prompt gateways that need cost controls without changing model providers.

Try it without an account

The public Rose 1 demo is intentionally capped, but it is enough to test one real RAG chunk, support ticket, or agent trace before creating a workspace.

curl -s https://api.adola.app/v1/demo/compress \
  -H 'content-type: application/json' \
  --data '{
    "model": "rose-1",
    "query": "What caused the issue and what should happen next?",
    "input": "Long retrieved context or agent trace...",
    "compression": { "target_ratio": 0.35 }
  }'
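The same request can be built from Python with only the standard library. This sketch mirrors the curl payload above field for field; the helper names `build_demo_request` and `send`, the 30-second timeout, and the assumption that the response body is a JSON object are illustrative, since the response schema is not described here.

```python
import json
import urllib.request

DEMO_URL = "https://api.adola.app/v1/demo/compress"


def build_demo_request(
    query: str, context: str, target_ratio: float = 0.35
) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    payload = {
        "model": "rose-1",
        "query": query,
        "input": context,
        "compression": {"target_ratio": target_ratio},
    }
    return urllib.request.Request(
        DEMO_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    # Assumes a JSON object in the response; it is returned unchanged.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_demo_request(
    query="What caused the issue and what should happen next?",
    context="Long retrieved context or agent trace...",
)
# result = send(request)  # uncomment to call the capped public demo
```

Separating request construction from sending makes it easy to inspect or log the exact payload before spending any of the demo's capped quota.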