API Reference
Compress long context for LLM inference in one pre-model API call.
Quickstart
Create a project key, install an SDK, and call Adola before your model provider. The public packages contain only the SDK clients, not the application codebase.
pip install adola
npm install adola

from adola import Adola
client = Adola(api_key="rose_...")
result = client.compress(
    input=open("retrieved_context.txt").read(),  # the long context to reduce
    query="Which incident caused latency?",      # the downstream task, used for query-aware selection
    compression={"target_ratio": 0.3},           # aim for roughly 30% of the original tokens
    include_spans=False,
)
compressed = result["output"]
receipt = result["receipt"]

1. Open the dashboard, create a workspace, and copy a scoped bearer key.
2. Send Adola the user query plus the full retrieved context you would normally pass downstream.
3. Pass the returned output to OpenAI, Anthropic, DeepSeek, a local model, or your own router.
4. Store token counts, ratio, latency, and risk flags next to the original model request, as in the sketch below.
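A minimal sketch of steps 3 and 4, assuming the OpenAI Python SDK as the downstream provider and continuing from the quickstart variables above; the downstream model name and the log_record shape are illustrative, not part of the Adola API.

from openai import OpenAI

llm = OpenAI()

# Step 3: forward the compressed context to the downstream model.
answer = llm.chat.completions.create(
    model="gpt-4o-mini",  # any downstream model works; this choice is illustrative
    messages=[
        {"role": "system", "content": "Answer the question using only the provided context."},
        {"role": "user", "content": f"Context:\n{compressed}\n\nQuestion: Which incident caused latency?"},
    ],
)
answer_text = answer.choices[0].message.content

# Step 4: store the receipt next to the downstream request for usage, billing, and evals.
log_record = {
    "downstream_model": "gpt-4o-mini",
    "original_tokens": receipt["original_tokens"],
    "output_tokens": receipt["output_tokens"],
    "compression_ratio": receipt["compression_ratio"],
    "latency_ms": receipt["latency_ms"],
    "risk": receipt["risk"],
}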
API endpoint
POST https://api.adola.app/v1/compress

Authentication
Add a project-scoped bearer key to every production request. Keys can be revoked without affecting other projects in the same workspace.
Authorization: Bearer rose_...

Request body
Adola is query-aware. The request should include the task your model will answer and the full context you want reduced.
model (string): Compression model. Use rose-1 for the production route.
query (string): The question or task the downstream model needs to answer.
input (string): The long context to reduce: RAG chunks, tickets, logs, transcripts, docs, or memory.
compression.target_ratio (number): Target output ratio. Defaults to 0.3 when omitted.
include_spans (boolean): Return selected source spans when span export is enabled for the deployment.
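For callers that hit the endpoint directly instead of using an SDK, a minimal sketch with the requests library; reading the key from an ADOLA_API_KEY environment variable and the 30-second timeout are assumptions, not documented requirements.

import os
import requests

resp = requests.post(
    "https://api.adola.app/v1/compress",
    headers={"Authorization": f"Bearer {os.environ['ADOLA_API_KEY']}"},  # project-scoped key
    json={
        "model": "rose-1",
        "query": "Which incident caused latency?",
        "input": open("retrieved_context.txt").read(),
        "compression": {"target_ratio": 0.3},
        "include_spans": False,
    },
    timeout=30,  # assumption; pick a timeout that fits your latency budget
)
resp.raise_for_status()
body = resp.json()
compressed, receipt = body["output"], body["receipt"]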
Response receipt
The response is plain text plus token accounting. Most teams pass output to the next model call and log receipt for usage, billing, and evals.
{
"model": "rose-1",
"output": "The relevant incident context...",
"receipt": {
"original_tokens": 4200,
"output_tokens": 980,
"tokens_saved": 3220,
"compression_ratio": 0.233,
"latency_ms": 4.8,
"risk": { "level": "low", "flags": [] }
}
}

original_tokens: Tokens received by Adola before Rose 1 compression.
output_tokens: Tokens returned after query-aware selection and cleanup.
tokens_saved: Difference between original and output token counts.
compression_ratio: Output tokens divided by original tokens.
risk.level: Low, medium, or high based on omitted protected spans and request checks.
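Because the receipt fields are arithmetically related, a logging pipeline can sanity-check them before storing; a small sketch, where the rounding tolerance is an assumption about how compression_ratio is rounded in the response.

def check_receipt(receipt: dict) -> None:
    # tokens_saved is the difference between original and output token counts.
    assert receipt["tokens_saved"] == receipt["original_tokens"] - receipt["output_tokens"]
    # compression_ratio is output tokens divided by original tokens.
    expected = receipt["output_tokens"] / receipt["original_tokens"]
    assert abs(receipt["compression_ratio"] - expected) < 0.01  # tolerance for rounding (assumption)

check_receipt(receipt)
if receipt["risk"]["level"] != "low":
    # Surface medium or high risk before trusting the compressed context downstream.
    print("risk flags:", receipt["risk"]["flags"])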
Batch compression
Use batch jobs for eval sets, large retrieval reprocessing, and asynchronous backfills. Batch responses use the same receipt shape as synchronous compression.
{
"requests": [
{
"model": "rose-1",
"query": "Which incident caused latency?",
"input": "Long retrieved context...",
"compression": { "target_ratio": 0.3 }
}
]
}
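A sketch of assembling that body from an eval set in Python; the eval_set list and the second query are illustrative stand-ins for your own data, and no batch endpoint path is assumed here.

# Each eval item pairs a query with the retrieved context to reduce.
eval_set = [
    {"query": "Which incident caused latency?", "context": "Long retrieved context..."},
    {"query": "What was the customer impact?", "context": "Another long context..."},
]

batch_body = {
    "requests": [
        {
            "model": "rose-1",
            "query": item["query"],
            "input": item["context"],
            "compression": {"target_ratio": 0.3},
        }
        for item in eval_set
    ]
}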
Errors
Error responses include a stable status, message, and request id for support and log correlation.
400: Malformed JSON, missing input, or invalid compression options.
401: Missing, revoked, or malformed bearer key.
402: Workspace quota exceeded or billing disabled.
429: Project rate limit exceeded.
500: Unexpected server error.
503: Compression worker unavailable. Retry with backoff.
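Since 429 and 503 are retryable, a simple exponential-backoff wrapper is usually enough; a sketch reusing the direct requests call from the Request body section, with the attempt count and delay schedule as illustrative choices.

import time
import requests

def compress_with_retry(payload: dict, headers: dict, max_attempts: int = 4) -> dict:
    for attempt in range(max_attempts):
        resp = requests.post("https://api.adola.app/v1/compress",
                             headers=headers, json=payload, timeout=30)
        # 429 (rate limit) and 503 (worker unavailable) are worth retrying with backoff.
        if resp.status_code in (429, 503) and attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s... (illustrative schedule)
            continue
        resp.raise_for_status()  # other errors (400, 401, 402, 500) surface immediately
        return resp.json()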