API Reference
Compress long context for LLM inference in one pre-model API call.
Quickstart
Create a project key, install an SDK, and call Adola before your model provider. The public packages contain only the SDK clients, not the application codebase.
pip install adola
npm install adola

from adola import Adola
client = Adola(api_key="rose_...")
result = client.compress(
    input=open("retrieved_context.txt").read(),  # the long context to reduce
    query="Which incident caused latency?",      # the downstream task, used for query-aware selection
    compression={"target_ratio": 0.3},           # aim for roughly 30% of the original tokens
    include_spans=False,
)
compressed = result["output"]
receipt = result["receipt"]

1. Open the dashboard, create a workspace, and copy a scoped bearer key.
2. Send Adola the user query plus the full retrieved context you would normally pass downstream.
3. Pass the returned output to OpenAI, Anthropic, DeepSeek, a local model, or your own router.
4. Store token counts, ratio, latency, and risk flags next to the original model request, as in the sketch below.
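A minimal sketch of steps 3 and 4, assuming the OpenAI Python SDK as the downstream provider and continuing from the quickstart variables above; the downstream model name and the log_record shape are illustrative, not part of the Adola API.

from openai import OpenAI

llm = OpenAI()

# Step 3: forward the compressed context to the downstream model.
answer = llm.chat.completions.create(
    model="gpt-4o-mini",  # any downstream model works; this choice is illustrative
    messages=[
        {"role": "system", "content": "Answer the question using only the provided context."},
        {"role": "user", "content": f"Context:\n{compressed}\n\nQuestion: Which incident caused latency?"},
    ],
)
answer_text = answer.choices[0].message.content

# Step 4: store the receipt next to the downstream request for usage, billing, and evals.
log_record = {
    "downstream_model": "gpt-4o-mini",
    "original_tokens": receipt["original_tokens"],
    "output_tokens": receipt["output_tokens"],
    "compression_ratio": receipt["compression_ratio"],
    "latency_ms": receipt["latency_ms"],
    "risk": receipt["risk"],
}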
API endpoint
POST https://api.adola.app/v1/compress

Authentication
Add a project-scoped bearer key to every production request. Keys can be revoked without affecting other projects in the same workspace.
Authorization: Bearer rose_...

Request body
Adola is query-aware. The request should include the task your model will answer and the full context you want reduced.
model (string): Compression model. Use rose-1 for the production route.
query (string): The question or task the downstream model needs to answer.
input (string): The long context to reduce: RAG chunks, tickets, logs, transcripts, docs, or memory.
compression.target_ratio (number): Target output ratio. Defaults to 0.3 when omitted.
include_spans (boolean): Return selected source spans when span export is enabled for the deployment.
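For callers that hit the endpoint directly instead of using an SDK, a minimal sketch with the requests library; reading the key from an ADOLA_API_KEY environment variable and the 30-second timeout are assumptions, not documented requirements.

import os
import requests

resp = requests.post(
    "https://api.adola.app/v1/compress",
    headers={"Authorization": f"Bearer {os.environ['ADOLA_API_KEY']}"},  # project-scoped key
    json={
        "model": "rose-1",
        "query": "Which incident caused latency?",
        "input": open("retrieved_context.txt").read(),
        "compression": {"target_ratio": 0.3},
        "include_spans": False,
    },
    timeout=30,  # assumption; pick a timeout that fits your latency budget
)
resp.raise_for_status()
body = resp.json()
compressed, receipt = body["output"], body["receipt"]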
Response receipt
The response is plain text plus token accounting. Most teams pass output to the next model call and log receipt for usage, billing, and evals.
{
"model": "rose-1",
"output": "The relevant incident context...",
"receipt": {
"original_tokens": 4200,
"output_tokens": 980,
"tokens_saved": 3220,
"compression_ratio": 0.233,
"latency_ms": 4.8,
"risk": { "level": "low", "flags": [] }
}
}

original_tokens: Tokens received by Adola before Rose 1 compression.
output_tokens: Tokens returned after query-aware selection and cleanup.
tokens_saved: Difference between original and output token counts.
compression_ratio: Output tokens divided by original tokens.
risk.level: Low, medium, or high based on omitted protected spans and request checks.
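Because the receipt fields are arithmetically related, a logging pipeline can sanity-check them before storing; a small sketch, where the rounding tolerance is an assumption about how compression_ratio is rounded in the response.

def check_receipt(receipt: dict) -> None:
    # tokens_saved is the difference between original and output token counts.
    assert receipt["tokens_saved"] == receipt["original_tokens"] - receipt["output_tokens"]
    # compression_ratio is output tokens divided by original tokens.
    expected = receipt["output_tokens"] / receipt["original_tokens"]
    assert abs(receipt["compression_ratio"] - expected) < 0.01  # tolerance for rounding (assumption)

check_receipt(receipt)
if receipt["risk"]["level"] != "low":
    # Surface medium or high risk before trusting the compressed context downstream.
    print("risk flags:", receipt["risk"]["flags"])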
Batch compression
Use batch jobs for eval sets, large retrieval reprocessing, and asynchronous backfills. Batch responses use the same receipt shape as synchronous compression.
{
"requests": [
{
"model": "rose-1",
"query": "Which incident caused latency?",
"input": "Long retrieved context...",
"compression": { "target_ratio": 0.3 }
}
]
}
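A sketch of assembling that body from an eval set in Python; the eval_set list and the second query are illustrative stand-ins for your own data, and no batch endpoint path is assumed here.

# Each eval item pairs a query with the retrieved context to reduce.
eval_set = [
    {"query": "Which incident caused latency?", "context": "Long retrieved context..."},
    {"query": "What was the customer impact?", "context": "Another long context..."},
]

batch_body = {
    "requests": [
        {
            "model": "rose-1",
            "query": item["query"],
            "input": item["context"],
            "compression": {"target_ratio": 0.3},
        }
        for item in eval_set
    ]
}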
Errors
Error responses include a stable status, message, and request id for support and log correlation.
400: Malformed JSON, missing input, or invalid compression options.
401: Missing, revoked, or malformed bearer key.
402: Workspace quota exceeded or billing disabled.
429: Project rate limit exceeded.
500: Unexpected server error.
503: Compression worker unavailable. Retry with backoff.
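Since 429 and 503 are retryable, a simple exponential-backoff wrapper is usually enough; a sketch reusing the direct requests call from the Request body section, with the attempt count and delay schedule as illustrative choices.

import time
import requests

def compress_with_retry(payload: dict, headers: dict, max_attempts: int = 4) -> dict:
    for attempt in range(max_attempts):
        resp = requests.post("https://api.adola.app/v1/compress",
                             headers=headers, json=payload, timeout=30)
        # 429 (rate limit) and 503 (worker unavailable) are worth retrying with backoff.
        if resp.status_code in (429, 503) and attempt < max_attempts - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s... (illustrative schedule)
            continue
        resp.raise_for_status()  # other errors (400, 401, 402, 500) surface immediately
        return resp.json()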