LLM evals are the new unit tests — and most teams are skipping them
Leading the agentic evals team at CEGID has been a crash course in what actually matters when you’re putting LLMs into production accounting software.
The core insight: evals are not optional. They’re the only way to know if your model update broke something your users care about.
Our current eval stack:
- Golden dataset — A curated set of accounting scenarios with known-correct outputs, maintained by CPAs
- Automated regression — Every model/prompt change runs against the golden dataset in CI
- Behavioral evals — Does the agent take the right action, not just produce the right text?
- Human-in-the-loop checkpoints — For high-stakes actions (filing, payments), we require human approval and log the decision
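The automated regression step above can be sketched in a few lines: run every golden case through the agent and fail CI on any drift. This is a minimal illustration only — the dataset entries, the `call_agent` stub, and the exact-match scoring are assumptions for the sketch, not our actual pipeline (real accounting outputs usually need fuzzier, field-level comparison).

```python
# Minimal golden-dataset regression check (illustrative sketch).
# GOLDEN, call_agent, and exact-match scoring are placeholders,
# not the production implementation.

GOLDEN = [
    # each case: input scenario + CPA-approved expected output
    {"id": "vat-20pct-01",
     "prompt": "Compute VAT at 20% on a 123.45 EUR invoice.",
     "expected": "24.69"},
    {"id": "late-fee-02",
     "prompt": "Is a 31-day-overdue invoice subject to late fees?",
     "expected": "yes"},
]

def call_agent(prompt: str) -> str:
    """Stand-in for the real model call; swap in your LLM client."""
    canned = {
        "Compute VAT at 20% on a 123.45 EUR invoice.": "24.69",
        "Is a 31-day-overdue invoice subject to late fees?": "yes",
    }
    return canned[prompt]

def run_regression(dataset):
    """Return the cases where the agent's answer drifts from golden."""
    failures = []
    for case in dataset:
        got = call_agent(case["prompt"]).strip().lower()
        want = case["expected"].strip().lower()
        if got != want:
            failures.append({"id": case["id"], "got": got, "want": want})
    return failures

if __name__ == "__main__":
    failures = run_regression(GOLDEN)
    print(f"{len(GOLDEN) - len(failures)}/{len(GOLDEN)} passed")
    # In CI: a nonzero exit code blocks the model/prompt change.
    raise SystemExit(1 if failures else 0)
```

The key design choice is that the check runs on every change and exits nonzero on any regression, so a broken model update never merges silently.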
The hardest part isn’t building the eval framework — it’s convincing the team to invest in maintaining the golden dataset. That’s the real moat.
If you’re building production agents and not running evals: start now, before your users find the bugs for you.