Evaluate
Evaluate retrieval quality, monitor performance, and backtest your models to drive improvements. If you haven’t integrated yet, start with Quickstart.
Overview
When an operator begins a task, Looking Glass calls your answers endpoint with the task ids that you provide (for example, customer_123). Use these task ids to locate and return the correct record for the task.
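For reference, here is a minimal sketch of an answers endpoint in Python (Flask), assuming Looking Glass sends a POST request whose JSON body includes a task_id and expects the matching record back. The request and response field names are illustrative, not the official contract.

# Minimal answers endpoint sketch. The "task_id" request field and the
# response shape are assumptions for illustration.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical store keyed by the task ids you provided (e.g. "customer_123").
RECORDS = {
    "customer_123": {"name": "John Smith", "age": 30},
}

@app.post("/answers")
def answers():
    payload = request.get_json(force=True)
    task_id = payload.get("task_id")  # assumed field name
    record = RECORDS.get(task_id)
    if record is None:
        return jsonify({"error": "unknown task_id"}), 404
    return jsonify({"task_id": task_id, "record": record})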
Operators review and evaluate the results from your endpoint. Evaluations are recorded in the Evals dashboard so you can track retrieval performance over time and iterate.
Dashboard
Use the Evals dashboard to monitor retrieval accuracy, identify regressions, and iterate on prompts, tools, and backend logic.
Backtesting
Retrieve evaluation data to analyze and backtest your retrieval pipeline.
Use the Backtest API to evaluate a candidate retrieval endpoint against previously collected ground-truth evaluations. Provide the endpoint URL you want to test and a list of previously collected task_ids.
# Initiate a backtest (example)
POST https://<LOOKING_GLASS_API_URL>/backtest
Content-Type: application/json

{
  "endpoint": "<YOUR_ENDPOINT_URL>",
  "task_ids": [
    "123",
    "456",
    "789"
  ]
}
The response will be a URL to a dashboard where you can view and export the results of the backtest.
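If you prefer to initiate backtests from code, the request above can be sent with any HTTP client. Here is a sketch using Python's requests library, with placeholder URLs and no authentication (add whatever auth your deployment requires):

# Initiate a backtest from Python (sketch; URLs are placeholders).
import requests

resp = requests.post(
    "https://<LOOKING_GLASS_API_URL>/backtest",
    json={
        "endpoint": "<YOUR_ENDPOINT_URL>",
        "task_ids": ["123", "456", "789"],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.text)  # the response points to a dashboard with the backtest results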
Export evaluation datasets
Export session data as JSONL for offline analysis. Each line uses the following schema:
{
  "session_id": "123",
  "task_id": "456",
  "field": {
    "id": "789",
    "name": "name",
    "type": "text",
    "suggestion": "John Doe",
    "judgment": "correct" | "incorrect" | "unknown",
    "correction": "John Smith" | null
  },
  "task_info": {
    "name": "John Smith",
    "age": 30
  },
  "timestamp": "2021-01-01T00:00:00Z"
}
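As an example of offline analysis, the sketch below reads an exported JSONL file and computes the share of suggestions judged correct, skipping "unknown" judgments. The file name is a placeholder.

# Compute retrieval accuracy from an exported JSONL file (file name assumed).
import json

judged = correct = 0
with open("evals.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        judgment = row["field"]["judgment"]
        if judgment == "unknown":
            continue  # skip fields operators could not judge
        judged += 1
        correct += judgment == "correct"

accuracy = correct / judged if judged else float("nan")
print(f"retrieval accuracy: {accuracy:.1%} over {judged} judged fields")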