Looking Glass

Evaluate

Evaluate retrieval quality, monitor performance, and backtest your models to drive improvements. If you haven’t integrated yet, start with Quickstart.


Overview

When an operator begins a task, Looking Glass calls your answers endpoint with task ids that you provide (for example, customer_123). Use these task ids to locate and return the correct record for the task.

Operators review and evaluate the results from your endpoint. Evaluations are recorded in the Evals dashboard so you can track retrieval performance over time and iterate.
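
For reference, the sketch below shows a minimal answers endpoint in Python with Flask. The route path, the task_id request field, and the response shape are illustrative assumptions, not the Looking Glass spec; use the request and response format from your Quickstart integration.

# Example answers endpoint (Python / Flask sketch; field names are assumptions)
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for your datastore, keyed by the task ids you provide.
RECORDS = {
    "customer_123": {"name": "John Smith", "age": 30},
}

@app.post("/answers")
def answers():
    task_id = request.get_json().get("task_id")  # assumed request field name
    record = RECORDS.get(task_id)
    if record is None:
        return jsonify({"error": "unknown task_id"}), 404
    return jsonify(record)

if __name__ == "__main__":
    app.run(port=8000)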

Dashboard

Use the Evals dashboard to monitor retrieval accuracy, identify regressions, and iterate on prompts, tools, and backend logic.

Backtesting

Retrieve evaluation data to analyze and backtest your retrieval pipeline.

Use the Backtest API to evaluate a candidate retrieval endpoint against previously collected ground-truth evaluations. Provide the URL of the endpoint you want to test and a list of task_ids from those evaluations.

# Initiate a backtest (example)
POST https://<LOOKING_GLASS_API_URL>/backtest
Content-Type: application/json

{
  "endpoint": "<YOUR_ENDPOINT_URL>",
  "task_ids": [
    "123",
    "456",
    "789"
  ]
}

The response is a URL to a dashboard where you can view and export the backtest results.
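
As a sketch, you can also initiate a backtest from a script. The example below uses Python with the requests library; the base URL and endpoint placeholders mirror the request above and should be replaced with your own values.

# Initiate a backtest from Python (sketch; placeholders are illustrative)
import requests

LOOKING_GLASS_API_URL = "https://<LOOKING_GLASS_API_URL>"  # your API base URL

payload = {
    "endpoint": "<YOUR_ENDPOINT_URL>",   # candidate endpoint to test
    "task_ids": ["123", "456", "789"],   # previously collected task ids
}

resp = requests.post(f"{LOOKING_GLASS_API_URL}/backtest", json=payload)
resp.raise_for_status()

# The response is a URL to a dashboard with the backtest results.
print(resp.text)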

Export evaluation datasets

Export session data as JSONL for offline analysis. Each line uses the following schema:

{
    "session_id": "123",
    "task_id": "456",
    "field": {
        "id": "789",
        "name": "name",
        "type": "text",
        "suggestion": "John Doe",
        "judgment": "correct" | "incorrect" | "unknown",
        "correction": "John Smith" | null,
    },
    "task_info": {
        "name": "John Smith",
        "age": 30
    },
    "timestamp": "2021-01-01T00:00:00Z"
}
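
Once exported, the file can be analyzed offline. The sketch below tallies the judgment field from each line to estimate suggestion accuracy; the file name evals.jsonl is a placeholder for your export.

# Compute suggestion accuracy from an exported JSONL file (sketch)
import json
from collections import Counter

counts = Counter()
with open("evals.jsonl") as f:
    for line in f:
        record = json.loads(line)
        counts[record["field"]["judgment"]] += 1  # "correct" | "incorrect" | "unknown"

judged = counts["correct"] + counts["incorrect"]
accuracy = counts["correct"] / judged if judged else 0.0
print(counts, f"accuracy={accuracy:.2%}")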
Next: How it works →