LangChain prompt compression before model calls

LangChain apps often start with generous context: retrieved documents, tool results, memory, policies, previous messages, and intermediate reasoning traces. Sending all of that downstream is simple, but it burns input tokens and can bury the answer-bearing text.

Put Adola immediately before the model node. Send the user task plus the context block you were going to pass to the LLM. Rose 1 returns compressed text plus a receipt for debugging and cost accounting.

Retrieve/tools -> Assemble prompt -> Compress -> Call LLM -> Log receipt

Where to insert it

Retrieval chains

Compress the joined documents after retrieval and reranking, before the answer model sees them.

LangGraph agents

Compress tool output, prior turns, and scratchpad state before the next planning or response node.
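A minimal sketch of assembling that block before the compression call. The state shape (messages, toolResults, scratchpad) and the helper name buildAgentContext are assumptions for illustration; they are not LangGraph or Adola APIs, so adapt the field names to your own graph state.

```javascript
// Hypothetical helper: flattens agent state into one labeled text block
// that can be sent as the `input` of a compression request.
// The state shape used here is an assumption, not a LangGraph type.
function buildAgentContext(state) {
  const sections = [];

  // Prior turns in their original order.
  for (const message of state.messages ?? []) {
    sections.push(`[${message.role}] ${message.content}`);
  }

  // Tool output, labeled with the name of the tool that produced it.
  for (const result of state.toolResults ?? []) {
    sections.push(`[tool:${result.name}] ${result.output}`);
  }

  // Scratchpad notes go last, if there are any.
  if (state.scratchpad) {
    sections.push(`[scratchpad] ${state.scratchpad}`);
  }

  return sections.join("\n\n");
}

const exampleContext = buildAgentContext({
  messages: [{ role: "user", content: "Find the refund policy." }],
  toolResults: [{ name: "search", output: "Refunds are accepted within 30 days." }],
  scratchpad: "Check the policy page first."
});
console.log(exampleContext);
```

Labeling each section keeps the source of every passage visible after compression, which helps when reading the receipt during debugging.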

Model routers

Keep compression provider-neutral so the same reduced prompt can go to OpenAI, Anthropic, DeepSeek, or a local model.
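One way to keep the hop provider-neutral, sketched under assumptions: build the reduced prompt once as plain role/content messages, then hand it to whichever client the router selects. The names buildMessages and routeToModel are hypothetical, and each client is assumed to be any object exposing an async invoke(messages) method.

```javascript
// Hypothetical helper: builds the prompt once from the compressed context,
// with no provider-specific fields.
function buildMessages(question, compressedContext) {
  return [
    {
      role: "system",
      content: "Answer from the compressed context. Say when context is insufficient."
    },
    {
      role: "user",
      content: `Question: ${question}\n\nContext:\n${compressedContext}`
    }
  ];
}

// Hypothetical router: `clients` maps a route name to any object with an
// async invoke(messages) method (an assumption about your model wrappers).
async function routeToModel(route, clients, question, compressedContext) {
  const client = clients[route];
  if (!client) {
    throw new Error(`No model client registered for route "${route}"`);
  }
  return client.invoke(buildMessages(question, compressedContext));
}
```

Because compression happens before routing, switching providers changes only the entry chosen from clients, not the prompt that was compressed.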

Minimal pattern

Keep the Adola key on your server. The important part is not the framework wrapper; it is the placement: compress after context assembly and before the expensive model call.

async function compressBeforeModel({ question, documents, llm }) {
  const context = documents
    .map((doc, index) => `[doc ${index + 1}] ${doc.pageContent}`)
    .join("\n\n");

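  // Send the user task and the assembled context to the compression endpoint.
  const response = await fetch("https://api.adola.app/v1/compress", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${process.env.ADOLA_API_KEY}`
    },
    body: JSON.stringify({
      model: "rose-1",
      query: question,
      input: context,
      compression: { target_ratio: 0.35, preserve_order: true }
    })
  });

  // Fail loudly on HTTP errors instead of passing an error body to the model.
  if (!response.ok) {
    throw new Error(`Compression request failed with status ${response.status}`);
  }
  const compressed = await response.json();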

  const answer = await llm.invoke([
    ["system", "Answer from the compressed context. Say when context is insufficient."],
    ["human", `Question: ${question}\n\nContext:\n${compressed.output}`]
  ]);

  return { answer, compressionReceipt: compressed.receipt };
}
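For the logging step, a small sketch of recording the receipt next to a locally computed size comparison. The receipt's fields are not described here, so this stores it as-is without reading into it. The helper name summarizeCompression is hypothetical, and character counts are only a rough stand-in for tokens.

```javascript
// Hypothetical helper: pairs the receipt with sizes measured locally.
// Character counts are an assumption-level proxy for token counts.
function summarizeCompression(originalText, compressedText, receipt) {
  const inputChars = originalText.length;
  const outputChars = compressedText.length;
  return {
    inputChars,
    outputChars,
    // Fraction of the original characters that survived compression.
    keptRatio: inputChars === 0 ? 1 : outputChars / inputChars,
    // Stored untouched; its structure is whatever the API returned.
    receipt
  };
}

// Example with placeholder values standing in for a real request.
const summary = summarizeCompression(
  "x".repeat(100),
  "x".repeat(35),
  { note: "placeholder receipt" }
);
console.log(JSON.stringify(summary));
```

Inside a function like the one above, this could be called with the assembled context, the compressed output, and the returned receipt just before returning the answer.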

Try the hop without a key

The public demo endpoint is capped, but it is enough to test one real retrieved context block or agent trace before creating a workspace.

curl -s https://api.adola.app/v1/demo/compress \
  -H 'content-type: application/json' \
  --data '{
    "model": "rose-1",
    "query": "What should the assistant answer?",
    "input": "Long LangChain retrieval context, tool output, or graph state...",
    "compression": { "target_ratio": 0.35, "preserve_order": true }
  }'