LLM evals are the new unit tests — and most teams are skipping them

Leading the agentic evals team at CEGID has been a crash course in what actually matters when you’re putting LLMs into production accounting software.

The core insight: evals are not optional. They’re the only way to know if your model update broke something your users care about.

Our current eval stack:

  1. Golden dataset — A curated set of accounting scenarios with known-correct outputs, maintained by CPAs
  2. Automated regression — Every model/prompt change runs against the golden dataset in CI
  3. Behavioral evals — Does the agent take the right action, not just produce the right text?
  4. Human-in-the-loop checkpoints — For high-stakes actions (filing, payments), we require human approval and log the decision
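The regression step (2) can be sketched in a few lines. This is a minimal illustration, not CEGID's actual framework: `GoldenCase`, `run_regression`, and the stub model are all hypothetical names, and a real harness would use semantic comparison rather than exact string matching.

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    # One curated scenario with a CPA-approved expected output.
    prompt: str
    expected: str

def run_regression(cases, model_fn):
    """Run every golden case through the model; collect mismatches."""
    failures = []
    for case in cases:
        actual = model_fn(case.prompt)
        if actual.strip() != case.expected.strip():
            failures.append((case.prompt, case.expected, actual))
    return failures

# Stub standing in for the real LLM call, so the example is runnable.
def stub_model(prompt: str) -> str:
    answers = {"What VAT rate applies to invoice A-1042?": "20%"}
    return answers.get(prompt, "unknown")

cases = [GoldenCase("What VAT rate applies to invoice A-1042?", "20%")]
failures = run_regression(cases, stub_model)
assert not failures  # in CI, any failure blocks the model/prompt change
```

The point of wiring this into CI is that a prompt tweak which silently changes one answer shows up as a failing build, not as a user-reported bug.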

The hardest part isn’t building the eval framework — it’s convincing the team to keep investing in the golden dataset as the product evolves. That maintained dataset is the real moat.

If you’re building production agents and not running evals: start now, before your users find the bugs for you.


Back to AI Bytes