RAG token reduction before the final LLM call

Retrieval should favor recall. The expensive generation call should favor useful context. Those are different jobs. Adola sits between your retriever or reranker and the downstream LLM, reducing the final context block before the model call.

Retrieve → Rerank → Compress → Generate → Log receipt

Where tokens usually leak

The best first test is a production-ish RAG prompt that already works but feels too expensive to run at scale. In prompts like that, tokens usually leak through:

Repeated snippets across retrieved chunks
Boilerplate headers and footers
Low-signal metadata fields
Prior turns copied into every answer
Long evidence blocks for short user questions
Tool traces that only need selected facts
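Some of these leaks can be trimmed with plain preprocessing before any compression step. The sketch below removes paragraphs that already appeared in an earlier chunk, which covers repeated snippets and repeated headers or footers. The helper name `dedupeParagraphs`, the `{ id, text }` chunk shape, and the sample data are assumptions made for illustration; they are not part of Adola.

```javascript
// Hypothetical helper: drop paragraphs that already appeared in an earlier chunk.
// Assumes each chunk looks like { id, text } with paragraphs separated by blank lines.
function dedupeParagraphs(chunks) {
  const seen = new Set();
  const result = [];
  for (const chunk of chunks) {
    const kept = [];
    for (const paragraph of chunk.text.split(/\n{2,}/)) {
      // Normalize case and whitespace so trivially different copies still match.
      const key = paragraph.trim().toLowerCase().replace(/\s+/g, " ");
      if (key === "" || seen.has(key)) continue;
      seen.add(key);
      kept.push(paragraph.trim());
    }
    // Chunks that contained only repeated paragraphs are dropped entirely.
    if (kept.length > 0) result.push({ ...chunk, text: kept.join("\n\n") });
  }
  return result;
}

// Made-up sample data: the same header line repeats in every chunk.
const sampleChunks = [
  { id: "a", text: "Acme Docs | Confidential\n\nRefunds are issued within 14 days." },
  { id: "b", text: "Acme Docs | Confidential\n\nRefunds require a receipt." },
  { id: "c", text: "Acme Docs | Confidential" },
];

const deduped = dedupeParagraphs(sampleChunks);
console.log(deduped.map((chunk) => chunk.id));
```

Exact-match deduplication only catches verbatim repeats. Near-duplicates and low-signal metadata need a different filter or the compression step itself.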

How to test without fooling yourself

Do not benchmark on a cherry-picked question. Take a small mixed set of real questions, run the full-context baseline, then run the compressed-context version with the same model, temperature, and answer checks.

Run the same RAG question with full context and compressed context.
Compare answer correctness, citations, and refusal behavior.
Record original tokens, output tokens, latency, and risk flags.
Protect instructions, policies, and must-cite passages before compressing.
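One way to keep the comparison honest is to store one record per question per variant and compute the differences mechanically. The sketch below assumes a hypothetical record shape (`inputTokens`, `outputTokens`, `latencyMs`, and a `checks` object of booleans) that you would fill from your own runs; the numbers in the example are invented placeholders, not measurements.

```javascript
// Hypothetical record shape, one per question per variant:
// { question, inputTokens, outputTokens, latencyMs, checks: { name: boolean } }
function compareVariants(baseline, compressed) {
  return {
    question: baseline.question,
    inputTokensSaved: baseline.inputTokens - compressed.inputTokens,
    inputReduction: 1 - compressed.inputTokens / baseline.inputTokens,
    latencyDeltaMs: compressed.latencyMs - baseline.latencyMs,
    // A regression is a check that passed with full context but fails with compressed context.
    regressions: Object.keys(baseline.checks).filter(
      (name) => baseline.checks[name] && !compressed.checks[name]
    ),
  };
}

// Aggregate over the whole question set so a single easy question cannot carry the result.
function summarize(comparisons) {
  const total = comparisons.reduce((sum, c) => sum + c.inputReduction, 0);
  return {
    questions: comparisons.length,
    meanInputReduction: total / comparisons.length,
    questionsWithRegressions: comparisons.filter((c) => c.regressions.length > 0).length,
  };
}

// Invented placeholder records for two questions.
const comparisons = [
  compareVariants(
    { question: "q1", inputTokens: 4000, outputTokens: 180, latencyMs: 2100,
      checks: { correct: true, cited: true, refusalOk: true } },
    { question: "q1", inputTokens: 1600, outputTokens: 170, latencyMs: 1500,
      checks: { correct: true, cited: false, refusalOk: true } }
  ),
  compareVariants(
    { question: "q2", inputTokens: 3000, outputTokens: 90, latencyMs: 1800,
      checks: { correct: true, cited: true, refusalOk: true } },
    { question: "q2", inputTokens: 1500, outputTokens: 95, latencyMs: 1300,
      checks: { correct: true, cited: true, refusalOk: true } }
  ),
];

const summary = summarize(comparisons);
console.log(summary);
```

Reporting regressions per question, rather than only an average score, makes it harder for a large token saving to hide a lost citation.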

// Join the text of every retrieved chunk into one context block.
const context = retrievedChunks.map((chunk) => chunk.text).join("\n\n");

// Compress the context block against the user's question.
const compressed = await adola.compress({
  query: userQuestion,
  input: context,
  compression: {
    target_ratio: 0.4,
    preserve_order: true
  }
});

// Send the question and the compressed context to the downstream model.
const answer = await model.responses.create({
  model: "your-rag-model",
  input: [
    { role: "system", content: "Answer with citations from the provided context." },
    { role: "user", content: userQuestion },
    { role: "user", content: compressed.output }
  ]
});
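
The snippet above compresses every chunk. To protect instructions, policies, and must-cite passages, one option is to keep them out of the compressor entirely and rejoin them afterwards. The sketch below is a minimal version of that idea under stated assumptions: the `protected` flag on chunks, the helper names, and the injected `compressFn` are all hypothetical, and the stub compressor in the demo only truncates text so the example runs without any SDK.

```javascript
// Hypothetical convention: your retriever or policy layer sets chunk.protected = true
// on passages that must reach the model verbatim.
function partitionChunks(chunks) {
  return {
    protectedChunks: chunks.filter((chunk) => chunk.protected === true),
    compressible: chunks.filter((chunk) => chunk.protected !== true),
  };
}

function joinText(chunks) {
  return chunks.map((chunk) => chunk.text).join("\n\n");
}

// compressFn is injected: any async function that takes a string and returns a string.
async function buildContext(chunks, compressFn) {
  const { protectedChunks, compressible } = partitionChunks(chunks);
  const compressedText =
    compressible.length > 0 ? await compressFn(joinText(compressible)) : "";
  // Protected text goes first, so the original interleaving of chunks is not preserved.
  return [joinText(protectedChunks), compressedText]
    .filter((part) => part !== "")
    .join("\n\n");
}

// Made-up demo data and a stub compressor that simply truncates.
const demoChunks = [
  { text: "Policy: never quote prices without a citation.", protected: true },
  { text: "Long background passage about the product history and roadmap." },
];
const stubCompress = async (text) => text.slice(0, 24);

buildContext(demoChunks, stubCompress).then((context) => console.log(context));
```

In a real pipeline, `compressFn` could wrap the `adola.compress` call shown above and return `compressed.output`. If chunk order matters for your citations, reassemble by original position instead of placing protected text first.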