Retrieval should favor recall. The expensive generation call should favor useful context. Those are different jobs. Adola sits between your retriever or reranker and the downstream LLM, reducing the final context block before the model call.
Retrieve → Rerank → Compress → Generate → Log receipt
Where tokens usually leak
The best first test is a production-ish RAG prompt that already works but feels too expensive to run at scale.
- Repeated snippets across retrieved chunks
- Boilerplate headers and footers
- Low-signal metadata fields
- Prior turns copied into every answer
- Long evidence blocks for short user questions
- Tool traces that only need selected facts
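Some of these leaks can be trimmed before any compression step runs. The following is a minimal sketch of the first item, repeated snippets across retrieved chunks. It assumes a hypothetical chunk shape of `{ text }` and treats paragraphs as duplicates when they match after case and whitespace normalization. The function names and sample data are illustrative and are not part of Adola.

```javascript
// Normalizes case and whitespace so near-identical snippets compare equal.
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Drops paragraphs that already appeared in an earlier chunk,
// and drops chunks that end up with no text left.
// Assumed (hypothetical) chunk shape: { text: string, ...otherFields }.
function dedupeChunks(chunks) {
  const seen = new Set();
  const result = [];
  for (const chunk of chunks) {
    const kept = [];
    for (const paragraph of chunk.text.split(/\n{2,}/)) {
      const key = normalize(paragraph);
      if (key === "" || seen.has(key)) continue;
      seen.add(key);
      kept.push(paragraph.trim());
    }
    if (kept.length > 0) {
      result.push({ ...chunk, text: kept.join("\n\n") });
    }
  }
  return result;
}

// Illustrative data: a repeated header and a near-duplicate sentence.
const retrievedChunks = [
  { text: "Acme Docs | Billing\n\nRefunds are issued within 5 days." },
  { text: "Acme Docs | Billing\n\nRefunds require an order ID." },
  { text: "refunds are issued  within 5 days." },
];
const cleaned = dedupeChunks(retrievedChunks);
console.log(cleaned.map((chunk) => chunk.text));
```

Exact-match deduplication like this only catches verbatim repeats such as headers and copied snippets. Paraphrased overlap still needs the compression step.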
How to test without fooling yourself
Do not benchmark on a cherry-picked question. Take a small mixed set of real questions, run the full-context baseline, then run the compressed-context version with the same model, temperature, and answer checks.
- Run the same RAG question with full context and compressed context.
- Compare answer correctness, citations, and refusal behavior.
- Record original tokens, output tokens, latency, and risk flags.
- Protect instructions, policies, and must-cite passages before compressing.
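One way to protect instructions, policies, and must-cite passages is to keep them out of the compressor entirely and compress only the text between them. The sketch below assumes a hypothetical segment shape of `{ text, protect }` and takes the compressor as a function argument, so it makes no claim about how Adola itself handles protected spans. The `firstSentenceOnly` compressor is a stand-in for demonstration only.

```javascript
// Compresses only unprotected segments and keeps protected ones verbatim,
// preserving the original order of the context.
// `compressFn` is any async function from text to shorter text.
async function compressWithProtection(segments, compressFn) {
  const output = [];
  let buffer = [];

  // Sends the accumulated unprotected text through the compressor.
  const flush = async () => {
    if (buffer.length === 0) return;
    output.push(await compressFn(buffer.join("\n\n")));
    buffer = [];
  };

  for (const segment of segments) {
    if (segment.protect) {
      await flush();
      output.push(segment.text); // passed through untouched
    } else {
      buffer.push(segment.text);
    }
  }
  await flush();
  return output.join("\n\n");
}

// Stand-in compressor for the demo: keeps text up to the first period.
const firstSentenceOnly = async (text) => {
  const end = text.indexOf(".");
  return end === -1 ? text : text.slice(0, end + 1);
};

// Illustrative segments; the `protect` flag is a hypothetical convention.
const segments = [
  { text: "POLICY: Never reveal account numbers.", protect: true },
  { text: "Refunds take 5 days. They are issued to the original card.", protect: false },
  { text: "Shipping is free over $50. Returns need a label.", protect: false },
  { text: "MUST CITE: Terms of Service, section 4.2.", protect: true },
];

compressWithProtection(segments, firstSentenceOnly).then((result) => console.log(result));
```

With this split, a regression in the compressed run can be traced to the compressible evidence rather than to a policy or citation that was silently shortened.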
// Join the retrieved (and reranked) chunks into one context block.
const context = retrievedChunks.map((chunk) => chunk.text).join("\n\n");

// Compress the context against the user's question before the model call.
const compressed = await adola.compress({
  query: userQuestion,
  input: context,
  compression: {
    target_ratio: 0.4,
    preserve_order: true
  }
});

// Generate the answer from the compressed context instead of the full one.
const answer = await model.responses.create({
  model: "your-rag-model",
  input: [
    { role: "system", content: "Answer with citations from the provided context." },
    { role: "user", content: userQuestion },
    { role: "user", content: compressed.output }
  ]
});
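The full-context versus compressed-context comparison described above can be sketched as a small harness. Everything here is an assumption for illustration: the `buildContext`, `generate`, and `check` hooks are hypothetical and caller-supplied, the token count is a rough estimate of four characters per token, and the "compressed" variant is a naive truncation stand-in rather than a real compression call.

```javascript
// Rough token estimate (about 4 characters per token); swap in your model's tokenizer.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Runs one question through one pipeline variant and records the metrics.
async function runCase(question, fullContext, variant, hooks) {
  const started = Date.now();
  const context = await hooks.buildContext(question, fullContext);
  const answer = await hooks.generate(question, context);
  return {
    variant,
    question: question.text,
    contextTokens: estimateTokens(context),
    outputTokens: estimateTokens(answer),
    latencyMs: Date.now() - started,
    correct: hooks.check(question, answer),
  };
}

// Runs every case through both variants with the same generate and check hooks.
async function compare(cases, baseline, compressed) {
  const rows = [];
  for (const { question, context } of cases) {
    rows.push(await runCase(question, context, "full", baseline));
    rows.push(await runCase(question, context, "compressed", compressed));
  }
  return rows;
}

// Stand-ins for the demo: a fake model that answers only if the fact is in context.
const generate = async (question, context) =>
  context.includes(question.expected) ? `Answer: ${question.expected}` : "I don't know.";
const check = (question, answer) => answer.includes(question.expected);
const baseline = { buildContext: async (_q, ctx) => ctx, generate, check };
const compressed = {
  // Naive truncation to 40% of the characters, standing in for real compression.
  buildContext: async (_q, ctx) => ctx.slice(0, Math.ceil(ctx.length * 0.4)),
  generate,
  check,
};

const filler = "Lorem ipsum filler text. ".repeat(10);
const cases = [
  { question: { text: "How long do refunds take?", expected: "5 days" },
    context: "Refunds take 5 days. " + filler },
  { question: { text: "When is support open?", expected: "9 to 5" },
    context: filler + "Support hours are 9 to 5." },
];

compare(cases, baseline, compressed).then((rows) => console.table(rows));
```

In this demo the truncating stand-in keeps the first answer but loses the second, because the needed fact sits at the end of the context. That is the kind of regression a mixed question set surfaces and a single cherry-picked question hides.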