
API Reference

Compress long context for LLM inference in one pre-model API call. Start with the free playground, then move the same request into your service.

Run live demo

Test the RAG example before you wire up the API.

The playground opens with a retrieved-context sample, no key required. Run it once, then copy the same request shape into the production endpoint below.

Quickstart

Create a project key, install an SDK, and call Adola before your model provider. The public packages contain only the SDK clients, not the application codebase. No credit card is required for the first workspace.

Python: pip install adola
JavaScript: npm install adola
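from adola import Adola

client = Adola(api_key="rose_...")

# Read the retrieved context you would otherwise send to the model as-is.
with open("retrieved_context.txt", encoding="utf-8") as f:
    context = f.read()

result = client.compress(
    input=context,
    query="Which incident caused latency?",
    compression={"target_ratio": 0.3},
)

compressed = result["output"]  # plain text for the downstream model
receipt = result["receipt"]    # token counts, ratio, latency, risk flags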
1. Create a project key

Open the dashboard, create a workspace, and copy a scoped bearer key.

2. Compress before the model

Send Adola the user query plus the full retrieved context you would normally pass downstream.

3. Use ordinary text output

Pass the returned output to OpenAI, Anthropic, DeepSeek, a local model, or your own router.

4. Keep the receipt

Store token counts, ratio, latency, and risk flags next to the original model request.
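
Step 4 can be sketched as a small helper that flattens the receipt into one log row. This is an illustrative sketch, not part of the Adola SDK: `build_log_record` and the `model_request_id` field are hypothetical names, and the sample result simply reuses the response shape shown in this reference.

```python
import json


def build_log_record(model_request_id, result):
    """Flatten an Adola result into one row for your own request log."""
    receipt = result["receipt"]
    return {
        # Hypothetical ID of the downstream model request in your system.
        "model_request_id": model_request_id,
        "original_tokens": receipt["original_tokens"],
        "output_tokens": receipt["output_tokens"],
        "compression_ratio": receipt["compression_ratio"],
        "latency_ms": receipt["latency_ms"],
        "risk_level": receipt["risk"]["level"],
        "risk_flags": list(receipt["risk"]["flags"]),
    }


# Sample result in the documented response shape.
sample_result = {
    "model": "rose-1",
    "output": "The relevant incident context...",
    "receipt": {
        "original_tokens": 4200,
        "output_tokens": 980,
        "tokens_saved": 3220,
        "compression_ratio": 0.233,
        "latency_ms": 4.8,
        "risk": {"level": "low", "flags": []},
    },
}

record = build_log_record("req-123", sample_result)
print(json.dumps(record))
```

Keeping the row flat makes it easy to join against the model provider's own usage logs later.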

API endpoint

POST https://api.adola.app/v1/compress

For no-key testing, use POST https://api.adola.app/v1/demo/compress. The terminal quickstart has a copyable curl path, and machine-readable API metadata is available at /openapi.json.
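
For services that do not use an SDK, the no-key demo call might be prepared with the standard library as below. This is a sketch under assumptions: `build_demo_request` is a hypothetical helper, and it assumes the demo route accepts the same JSON body as the production route. The request is only constructed here, not sent.

```python
import json
import urllib.request

DEMO_URL = "https://api.adola.app/v1/demo/compress"


def build_demo_request(body):
    """Prepare (but do not send) a POST to the no-key demo endpoint."""
    return urllib.request.Request(
        DEMO_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_demo_request(
    {
        "model": "rose-1",
        "query": "Which incident caused latency?",
        "input": "Long retrieved context...",
        "compression": {"target_ratio": 0.3},
    }
)

# To actually send it:
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       result = json.load(resp)
```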

Authentication

Add a project-scoped bearer key to every production request. Keys can be revoked without affecting other projects in the same workspace.

Authorization: Bearer rose_...
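
One way to keep the key out of source code is to read it from the environment when building headers. The helper name `auth_headers` and the environment variable `ADOLA_API_KEY` are assumptions for illustration, not names defined by Adola.

```python
import os


def auth_headers(api_key=None):
    """Build request headers with a bearer key.

    Falls back to the (hypothetical) ADOLA_API_KEY environment variable
    when no key is passed explicitly.
    """
    key = api_key if api_key is not None else os.environ.get("ADOLA_API_KEY", "")
    if not key:
        raise ValueError("No API key provided")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }


headers = auth_headers("rose_example_key")
```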

Request body

Adola is query-aware. The request should include the task your model will answer and the full context you want reduced.

model (string)

Compression model. Use rose-1 for the production route.

query (string)

The question or task the downstream model needs to answer.

input (string)

The long context to reduce: RAG chunks, tickets, logs, transcripts, docs, or memory.

compression.target_ratio (number)

Target ratio of output tokens to original tokens. Defaults to 0.3 when omitted.
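
The four fields above can be assembled with a small builder like the one below. This is a sketch, not SDK code: `build_compress_body` is a hypothetical helper, and the check that `target_ratio` lies in (0, 1] is an assumption, since this reference does not state the accepted range. Omitting `target_ratio` leaves the `compression` object out so the documented default applies.

```python
def build_compress_body(query, context, target_ratio=None, model="rose-1"):
    """Assemble a /v1/compress request body from the documented fields."""
    if not context:
        # The Errors section lists "missing input" as a 400.
        raise ValueError("input must be a non-empty string")
    body = {"model": model, "query": query, "input": context}
    if target_ratio is not None:
        # Assumed valid range; the reference does not document one.
        if not 0 < target_ratio <= 1:
            raise ValueError("target_ratio must be in (0, 1]")
        body["compression"] = {"target_ratio": target_ratio}
    return body


default_body = build_compress_body("Which incident caused latency?", "Long context...")
explicit_body = build_compress_body(
    "Which incident caused latency?", "Long context...", target_ratio=0.25
)
```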

Response receipt

The response is a JSON object containing the compressed plain-text output plus token accounting. Most teams pass output to the next model call and log receipt for usage, billing, and evals.

Response (application/json)
{
  "model": "rose-1",
  "output": "The relevant incident context...",
  "receipt": {
    "original_tokens": 4200,
    "output_tokens": 980,
    "tokens_saved": 3220,
    "compression_ratio": 0.233,
    "latency_ms": 4.8,
    "risk": { "level": "low", "flags": [] }
  }
}
original_tokens

Tokens received by Adola before Rose 1 compression.

output_tokens

Tokens returned after compression.

tokens_saved

Difference between original and output token counts.

compression_ratio

Output tokens divided by original tokens.

risk.level

One of low, medium, or high, based on request checks.
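
The field definitions above imply two identities that can be checked before a receipt is logged: tokens_saved equals original_tokens minus output_tokens, and compression_ratio equals output_tokens divided by original_tokens. A minimal sketch follows; `check_receipt` and its rounding tolerance are illustrative assumptions, not part of the API.

```python
def check_receipt(receipt, tolerance=1e-3):
    """Return a list of inconsistencies found in a receipt (empty if none)."""
    problems = []
    original = receipt["original_tokens"]
    output = receipt["output_tokens"]

    if receipt["tokens_saved"] != original - output:
        problems.append("tokens_saved != original_tokens - output_tokens")

    # The ratio in the sample is rounded to three decimals, so compare loosely.
    if original > 0 and abs(receipt["compression_ratio"] - output / original) > tolerance:
        problems.append("compression_ratio != output_tokens / original_tokens")

    if receipt["risk"]["level"] not in {"low", "medium", "high"}:
        problems.append("unexpected risk.level")

    return problems


sample_receipt = {
    "original_tokens": 4200,
    "output_tokens": 980,
    "tokens_saved": 3220,
    "compression_ratio": 0.233,
    "latency_ms": 4.8,
    "risk": {"level": "low", "flags": []},
}

print(check_receipt(sample_receipt))  # prints []
```

For the sample receipt, 4200 - 980 = 3220 and 980 / 4200 ≈ 0.2333, so both identities hold.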

Batch compression

Use batch jobs for eval sets, large retrieval reprocessing, and asynchronous backfills. Batch responses use the same receipt shape as synchronous compression.

POST /v1/batch/compress (json)
{
  "requests": [
    {
      "model": "rose-1",
      "query": "Which incident caused latency?",
      "input": "Long retrieved context...",
      "compression": { "target_ratio": 0.3 }
    }
  ]
}
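
A batch body in the shape above can be generated from a list of query and context pairs, for example when replaying an eval set. `build_batch_payload` is a hypothetical helper written for illustration; it only constructs the payload and makes no assumption about batch size limits or the response envelope.

```python
def build_batch_payload(items, target_ratio=0.3, model="rose-1"):
    """Turn (query, context) pairs into a /v1/batch/compress request body."""
    return {
        "requests": [
            {
                "model": model,
                "query": query,
                "input": context,
                "compression": {"target_ratio": target_ratio},
            }
            for query, context in items
        ]
    }


eval_set = [
    ("Which incident caused latency?", "Long retrieved context..."),
    ("Who approved the rollback?", "Another long context..."),
]
payload = build_batch_payload(eval_set)
```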

Errors

Error responses include a stable status, a message, and a request ID for support and log correlation.

400

Malformed JSON, missing input, or invalid compression options.

401

Missing, revoked, or malformed bearer key.

402

Workspace quota exceeded or billing disabled.

429

Project rate limit exceeded.

500

Unexpected server error.

503

Compression worker unavailable. Retry with backoff.
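
The "retry with backoff" advice for 503 might be implemented along these lines. This is a sketch under assumptions: the helper names are hypothetical, the delays and retry count are arbitrary, and treating 429 as retryable is an assumption (only 503 is documented as such). The transport is passed in as a callable, so the example runs against a fake instead of the network.

```python
import time

# 503 is documented as retryable; retrying 429 as well is an assumption.
RETRYABLE_STATUSES = {429, 503}


def backoff_delays(max_retries, base=0.5, cap=8.0):
    """Exponential delays in seconds: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * (2 ** i)) for i in range(max_retries)]


def call_with_retry(send, max_retries=3, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out.

    send is any zero-argument callable returning (status_code, body).
    """
    for delay in backoff_delays(max_retries):
        status, body = send()
        if status not in RETRYABLE_STATUSES:
            return status, body
        sleep(delay)
    # Final attempt: return the result whatever its status.
    return send()


# Fake transport: two 503s, then success. Sleeps are recorded, not performed.
responses = iter([(503, "unavailable"), (503, "unavailable"), (200, "ok")])
slept = []
status, body = call_with_retry(lambda: next(responses), sleep=slept.append)
print(status, slept)  # prints: 200 [0.5, 1.0]
```

Errors such as 400, 401, and 402 are returned immediately because repeating the same request would not change the outcome.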