We ship AI features every week. Not demos—production features that real users rely on. The only way that works is with a pragmatic MLOps setup: versioned data, controlled prompts, and rollout practices that let us iterate fast without breaking things.

Why MLOps matters for LLM products

Traditional software has a simple deploy flow: change code, run tests, ship. AI products add more moving parts: prompt changes, retrieval updates, model upgrades, and data drift. A “small” prompt tweak can tank accuracy. A retrieval index refresh can surface the wrong documents. Without discipline, you’re debugging production at 2 a.m.

We treat MLOps as the infrastructure that makes weekly shipping safe. It’s not about big platforms—it’s about clear patterns for data, prompts, and rollout.

Data: version it, don’t ad-hoc it

Every AI feature depends on data: training examples, few-shot prompts, retrieval corpora, eval fixtures. Early on we had spreadsheets and one-off scripts. That led to “which version of the comps data did we use?” confusion. We moved to a simple data versioning approach: all datasets live in a structured store with timestamps and hashes. Every deploy pins to a specific data version. If something regresses, we can diff data versions and roll back.
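A minimal version of that registration step can be sketched as follows. The function and manifest names here are hypothetical, not our actual tooling; the point is just the pattern: content hash plus timestamp, appended to a manifest that deploys can pin against.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def register_dataset(path: str, manifest_path: str = "data_versions.json") -> str:
    """Hash a dataset file and append an entry to the version manifest.

    Returns the short content hash, which a deploy can pin to.
    """
    content = Path(path).read_bytes()
    digest = hashlib.sha256(content).hexdigest()[:12]

    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else []
    manifest.append({
        "dataset": path,
        "hash": digest,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    })
    manifest_file.write_text(json.dumps(manifest, indent=2))
    return digest
```

Because the hash is derived from content, re-registering an unchanged file yields the same version, and “which version of the comps data did we use?” becomes a lookup instead of an archaeology project.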

For retrieval, we version our indexes. A new index build gets a new version; we A/B test index versions before full rollout. That’s saved us from multiple “search got worse overnight” incidents.
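The A/B split over index versions can be done with deterministic user bucketing, roughly like this (the version IDs and fraction are made-up examples, not our production values):

```python
import hashlib

# Hypothetical index builds: a stable version and a candidate under test.
INDEX_VERSIONS = {"stable": "idx-2024-05-01", "candidate": "idx-2024-05-08"}
CANDIDATE_FRACTION = 0.10  # route ~10% of users to the candidate index

def index_for_user(user_id: str) -> str:
    """Deterministically assign a user to an index version.

    Hash-based bucketing means the same user always hits the same index,
    so a session never flip-flops between builds mid-experiment.
    """
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    arm = "candidate" if bucket < CANDIDATE_FRACTION * 100 else "stable"
    return INDEX_VERSIONS[arm]
```

If the candidate’s search metrics regress, shrinking `CANDIDATE_FRACTION` to zero reverts everyone to the stable build without touching the index itself.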

Prompts: treat them as config

Prompts are code. We store them in version control, use the same branching and review process as application code, and never edit them directly in production. We have a prompt registry: each named prompt has a version, and we can roll back to a previous version with a single config change.
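The registry shape is simple: append-only version history per prompt name, plus a pointer to the active version. This sketch (class and method names are illustrative, not our actual API) shows why rollback is a one-line operation:

```python
from dataclasses import dataclass, field

@dataclass
class PromptRegistry:
    """Named prompts with append-only version history and an active pointer."""
    _versions: dict = field(default_factory=dict)  # name -> list of templates
    _active: dict = field(default_factory=dict)    # name -> active version (1-indexed)

    def publish(self, name: str, template: str) -> int:
        """Add a new version and make it active. Returns the version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        self._active[name] = len(versions)
        return len(versions)

    def rollback(self, name: str, version: int) -> None:
        """Point the active version back to an earlier one; nothing is deleted."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self._active[name] = version

    def get(self, name: str) -> str:
        """Resolve a prompt name to its currently active template."""
        return self._versions[name][self._active[name] - 1]
```

Because old versions are never deleted, rollback is just moving the pointer, and the full history stays diffable in version control.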

We also run prompt evals on every change. If a prompt edit drops eval performance, we catch it before merge. That’s reduced “why did summarization break?” incidents to near zero.
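The merge gate itself can be as small as this. The scoring function here is a stand-in (a real eval would call the model and grade outputs against fixtures); the tolerance value is a hypothetical default, not a universal threshold:

```python
def run_eval(prompt: str, fixtures: list, model) -> float:
    """Fraction of eval fixtures the model answers correctly with this prompt.

    `model` is any callable (prompt, input) -> output; `fixtures` is a list
    of (input, expected_output) pairs.
    """
    correct = sum(model(prompt, q) == expected for q, expected in fixtures)
    return correct / len(fixtures)

def gate_prompt_change(baseline_score: float, candidate_score: float,
                       tolerance: float = 0.02) -> bool:
    """CI gate: allow merge only if the candidate prompt doesn't regress
    the eval score beyond a small tolerance for noise."""
    return candidate_score >= baseline_score - tolerance
```

Wiring `gate_prompt_change` into CI is what turns “why did summarization break?” from a production incident into a failed build.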

Rollout: staged and observable

We never flip a switch from 0% to 100% for a new AI behavior. We use staged rollouts: 1% → 5% → 25% → 100%, with observability at each stage. We track latency, error rates, and product-specific metrics (e.g., brief completion rate, screening accuracy). If any metric moves the wrong way, we pause and fix.
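The two mechanical pieces of that process are bucketing users into the current rollout fraction and gating each stage advance on guardrail metrics. A sketch of both, with made-up metric names and thresholds:

```python
import hashlib

STAGES = [0.01, 0.05, 0.25, 1.00]  # staged rollout fractions

def in_rollout(user_id: str, fraction: float) -> bool:
    """Deterministic bucketing: the same user stays in (or out of) the rollout
    as the fraction only ever grows."""
    bucket = int(hashlib.sha1(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < fraction * 10_000

def should_advance(metrics: dict, baseline: dict,
                   max_latency_regress: float = 0.10,
                   max_error_rate: float = 0.02) -> bool:
    """Advance to the next stage only if guardrail metrics hold:
    p95 latency within 10% of baseline and error rate under 2%."""
    latency_ok = (metrics["p95_latency_ms"]
                  <= baseline["p95_latency_ms"] * (1 + max_latency_regress))
    errors_ok = metrics["error_rate"] <= max_error_rate
    return latency_ok and errors_ok
```

Product-specific metrics (brief completion rate, screening accuracy) would be additional checks in `should_advance`; any failed check pauses the rollout at the current stage rather than advancing.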

We also support instant rollback. Every deployment can be reverted with a single config change. No redeploys, no waiting. That’s been critical when we’ve discovered edge cases in production.
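Config-driven rollback works because the serving layer resolves versions from a config read at request time rather than baked into the deploy artifact. A minimal sketch (file name and feature keys are hypothetical):

```python
import json
from pathlib import Path

def set_active(config_path: str, feature: str, version: str) -> None:
    """Point a feature at a version. The serving layer reads this config,
    so reverting is a single write — no redeploy."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    config[feature] = version
    path.write_text(json.dumps(config, indent=2))

def active_version(config_path: str, feature: str) -> str:
    """Resolve which version of a feature is currently live."""
    return json.loads(Path(config_path).read_text())[feature]
```

In practice the config would live in a shared store rather than a local file, but the contract is the same: rollback is one write, and takes effect on the next request.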


Keeping it pragmatic

We’ve deliberately avoided over-engineering. We don’t run a full ML platform. We use a mix of scripts, small services, and existing CI/CD. The key is consistency: every AI change goes through the same pipeline—data versioned, prompts in config, evals run, rollout staged.

What we ship weekly

With this setup, we regularly ship: prompt improvements, new retrieval logic, model upgrades (e.g., moving to a newer base model), and new eval suites. Each change is incremental and traceable. If something breaks, we know exactly what changed and can revert in minutes.

Lessons learned

  • Start simple: A basic data versioning and prompt registry beats a complex platform you never finish building.
  • Evals are non-negotiable: Run them on every change. They catch more bugs than manual QA.
  • Observability is part of rollout: Don’t roll out without metrics. You need to see regressions before users complain.

Bottom line

Strong MLOps doesn’t mean slow iteration. With versioned data, config-driven prompts, and staged rollouts, you can ship weekly and sleep well. Changes don’t break users—they get better, one small step at a time.
