At TerraFour, we ship AI features weekly. The question we kept hitting: how do we know if a change made things better or worse? Research-style benchmarks are too slow and often don’t map to real user experience. Here’s how we set up lightweight evals that actually guide product decisions.

Why traditional evals fall short

Standard ML evaluation pipelines are built for model development: aggregate scores on static datasets, long-running jobs, and metrics that often don’t correlate with user satisfaction. When you’re iterating on prompts, retrieval logic, or UI flows, you need something faster and more aligned with what users actually care about.

We needed evals that could run in minutes, surface regressions before they hit production, and answer questions like “Does this new summarization prompt preserve key facts?” or “Do our comp suggestions still feel relevant?”

Core principles: lightweight and product-aligned

We adopted three principles that made our eval system useful:

  • Product-quality metrics first: We define metrics that match user outcomes—e.g., “Does the output include the required fields?” or “Would a user trust this answer?”—rather than generic coherence or perplexity scores.
  • Incremental and fast: Every eval runs in under 5 minutes on a small, curated set of cases. We run them on every PR and nightly on a larger set.
  • Human-labeled seeds: We started with ~50–100 examples per feature, labeled by the team. As we ship, we add edge cases from support and bug reports.
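To make the first principle concrete, here is a minimal deterministic scorer in the spirit of "does the output include the required fields?". The field names and the scoring scheme (fraction of required fields present) are illustrative assumptions, not our exact implementation:

```python
# Illustrative required fields for a property brief (assumed, not our real schema).
REQUIRED_FIELDS = ["address", "price", "sqft"]

def score_required_fields(output: dict, required=REQUIRED_FIELDS) -> float:
    """Return the fraction of required fields that are present and non-empty."""
    present = [field for field in required if output.get(field)]
    return len(present) / len(required)
```

A check like this is cheap enough to run on every PR, and its failures are unambiguous, which is exactly what you want from the fast tier of an eval suite.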

What we track week over week

For each major AI capability, we maintain a small eval suite. For SkyeScraper’s property briefs, we measure: completeness (did we include agent-requested sections?), factual accuracy (no hallucinations on address, price, sqft), and format correctness (client-ready structure). For hAIring’s screening summaries, we check: key qualifications surfaced, no PII leakage, and consistent structure.

We don’t chase 100% on every metric. We chase stability—if a change drops any metric by more than 5%, we investigate before merging. That threshold has caught more regressions than any manual QA pass.
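The 5% gate can be sketched as a comparison between a baseline run and the current run. This sketch assumes metrics are scores in [0, 1] and treats the threshold as an absolute drop in points; the function name is illustrative:

```python
def check_regression(baseline: dict, current: dict, threshold: float = 0.05) -> dict:
    """Return metrics that dropped by more than `threshold` versus baseline.

    Maps metric name -> (baseline_score, current_score) for each regression.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        current_score = current.get(metric, 0.0)
        if base_score - current_score > threshold:
            regressions[metric] = (base_score, current_score)
    return regressions
```

In CI, a non-empty return value would block the merge and point reviewers at exactly which metric moved.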


Implementing the pipeline

Our eval pipeline is Python-based, runs in CI, and writes results to a simple JSON artifact. Each eval defines input fixtures (e.g., a sample property listing), expected outputs or rubrics, and a scorer. We use LLM-as-judge sparingly—only when the rubric is clear and we’ve validated it against human judgments. For structured criteria (e.g., “contains field X”), we use deterministic checks.
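A minimal sketch of the pipeline's shape, assuming fixtures and outputs are plain dicts and each scorer returns a float. Names like `EvalCase` and `run_suite` are illustrative, not our actual API:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    fixture: dict                     # e.g., a sample property listing
    scorer: Callable[[dict], float]   # deterministic check or validated LLM judge

def run_suite(cases, generate, artifact_path="eval_results.json"):
    """Run `generate` on each fixture, score the output, write a JSON artifact."""
    results = [
        {"case": case.name, "score": case.scorer(generate(case.fixture))}
        for case in cases
    ]
    with open(artifact_path, "w") as f:
        json.dump(results, f, indent=2)
    return results
```

The JSON artifact is deliberately boring: one record per case, so CI can diff it against the previous run and a dashboard can aggregate it without a database.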

Turning evals into decisions

The real unlock is packaging these results for product and eng. Every week we review a short dashboard: pass/fail by eval, trend over the last 4 weeks, and any new failures. That 15-minute sync has replaced hours of “did we break something?” speculation. When we see a dip, we either roll back the offending change or fix it before users notice.
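The "any new failures" column of that dashboard can be derived from two weekly artifacts. This sketch assumes each artifact reduces to a dict of eval name to score and that there is a fixed passing bar; both assumptions are illustrative:

```python
def new_failures(last_week: dict, this_week: dict, passing: float = 0.9) -> list:
    """Return evals that passed last week but fall below the bar this week."""
    return sorted(
        name
        for name, score in this_week.items()
        if score < passing and last_week.get(name, 0.0) >= passing
    )
```

Surfacing only the deltas keeps the weekly sync short: a long-green suite produces an empty list and takes zero discussion time.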

We’ve also started using evals to prioritize work. If a feature’s eval suite has been green for months, we deprioritize further tuning. If a new use case keeps failing, we know where to invest.

Bottom line

Lightweight, product-aligned evals don’t replace human judgment—they give you a safety net and a shared language for “did we improve things?” Ship them early, keep them fast, and make them part of your weekly rhythm. Your product decisions will thank you.
