LlamaIndex prompt compression for RAG context

LlamaIndex applications often retrieve more context than the answer needs. That is useful for recall, but it means repeated headers, neighboring chunks, old memory, and low-value tool notes can reach every expensive model call.

Adola gives you a pre-model compression hop. Send the user query plus the context you would normally pass to the response synthesizer. The Rose 1 model ("rose-1" in the request body) returns the compressed context as plain text, plus a receipt that reports token savings and latency.

Retrieve nodes -> Assemble context -> Compress -> Synthesize answer -> Log receipt

Where to insert it

Retriever output

Compress the final joined node text after retrieval and reranking, before synthesis.

Chat engines

Reduce retrieved memory, tool notes, and prior context before the response model call.

Query engines

Keep your index unchanged and add compression only at the prompt assembly boundary.
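
The three insertion points share one step: joining everything that would reach the model into a single string for the input field of the compress request. A minimal sketch of that step follows. It assumes nodes expose a text property, as in the server-side example on this page; assembleContext, memory, and toolNotes are hypothetical names for illustration, not part of LlamaIndex or Adola.

```javascript
// Hypothetical helper: builds the single string sent as `input`.
// Assumes each node has a `text` property; memory and toolNotes are arrays of strings.
function assembleContext({ nodes = [], memory = [], toolNotes = [] }) {
  const sections = [];

  // Label nodes by their retrieval position so labels still map back to the retriever output.
  nodes.forEach((node, index) => {
    const text = String(node.text || "").trim();
    if (text) sections.push(`[node ${index + 1}] ${text}`);
  });

  // Chat-engine extras are labeled separately so they are not mistaken for retrieved nodes.
  memory.forEach((entry, index) => {
    const text = String(entry).trim();
    if (text) sections.push(`[memory ${index + 1}] ${text}`);
  });

  toolNotes.forEach((note, index) => {
    const text = String(note).trim();
    if (text) sections.push(`[tool ${index + 1}] ${text}`);
  });

  // Blank lines between sections, matching the join used in the server-side example.
  return sections.join("\n\n");
}
```

Empty or whitespace-only entries are skipped, and node labels keep their original retrieval index, so a skipped node leaves a gap in the numbering rather than renumbering the rest.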

Minimal server-side pattern

Keep the Adola API key on your server. The pattern works whether your LlamaIndex app uses a query engine, chat engine, or custom retriever pipeline.

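// Compress retrieved context with Adola, then synthesize the answer.
async function answerWithCompressedLlamaIndex({ query, nodes, llm }) {
  // Join the retrieved nodes into one labeled context string.
  const context = nodes
    .map((node, index) => `[node ${index + 1}] ${node.text}`)
    .join("\n\n");

  // Pre-model compression hop. The API key is read from the server environment.
  const response = await fetch("https://api.adola.app/v1/compress", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${process.env.ADOLA_API_KEY}`
    },
    body: JSON.stringify({
      model: "rose-1",
      query,
      input: context,
      compression: { target_ratio: 0.35, preserve_order: true }
    })
  });

  // Surface HTTP failures instead of passing an error body to the model.
  if (!response.ok) {
    throw new Error(`Adola compress request failed with status ${response.status}`);
  }
  const compressed = await response.json();

  // Send only the compressed context to the response model.
  const answer = await llm.complete({
    prompt: `Question: ${query}\n\nContext:\n${compressed.output}`
  });

  // Return the receipt so the caller can log token savings and latency.
  return { answer, compressionReceipt: compressed.receipt };
}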
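
Because the compression hop sits in front of every answer, it can help to decide explicitly what the response model sees when the compress response is missing its text. The sketch below is one way to do that, assuming the compressed text arrives in an output field as in the example above. chooseContext is a hypothetical helper, and falling back to the uncompressed context is a design choice, not documented Adola behavior.

```javascript
// Hypothetical guard: picks the context string that reaches the response model.
// Assumes the compress response carries the compressed text in `output`;
// a missing, non-string, or blank `output` is treated as a failed compression.
function chooseContext(originalContext, compressed) {
  const output =
    compressed && typeof compressed.output === "string"
      ? compressed.output.trim()
      : "";

  if (output.length === 0) {
    // Fall back to the full context so the answer is still grounded.
    return { context: originalContext, usedCompression: false };
  }
  return { context: output, usedCompression: true };
}
```

In the function above, the prompt would then use chooseContext(context, compressed).context in place of compressed.output, and usedCompression can be logged next to the receipt.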

Try one retrieved context block

The capped demo endpoint does not need a key. Use it to test one real set of retrieved nodes before creating a workspace.

curl -s https://api.adola.app/v1/demo/compress \
  -H 'content-type: application/json' \
  --data '{
    "model": "rose-1",
    "query": "Which retrieved node answers the user question?",
    "input": "Long LlamaIndex retrieved node text, memory, or tool context...",
    "compression": { "target_ratio": 0.35, "preserve_order": true }
  }'
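
To judge a demo result against your own data, a rough local check is to compare the size of the returned text with the size of what you sent. The sketch below uses character counts as a stand-in for tokens and assumes target_ratio means output size relative to input size, which this page does not state. observedRatio is a hypothetical helper; the receipt remains the authoritative source for token savings.

```javascript
// Hypothetical helper: rough size ratio of compressed output to original input.
// Uses character counts, not tokens, so treat the result as an approximation.
function observedRatio(input, output) {
  if (typeof input !== "string" || input.length === 0) {
    throw new Error("input must be a non-empty string");
  }
  if (typeof output !== "string") {
    throw new Error("output must be a string");
  }
  return output.length / input.length;
}
```

A value near 0.35 would suggest the request's target was roughly met on that block; a much higher value may mean the retrieved context had little redundancy to remove.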