What I like about this project is that it stayed honest.

It did not start as an attempt to build a grand AI product. It started as a real operational annoyance in my own workflow, and I kept pushing it until it became something I could actually trust.

A lot of small automation projects look good in a demo, then fall apart once they meet real inputs, partial failure, and messy downstream decisions. I did not want this to be one of those.

Working on this reminded me that the useful part of AI systems engineering is rarely the model call by itself. It is the surrounding structure: the retrieval window, the retry boundary, the rollback rule, the best-effort boundary, and the discipline of checking behavior with evals before trusting the system more broadly.
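
To make those boundaries concrete, here is a minimal Python sketch of how they can sit around a single model call. It is an illustration under stated assumptions, not the code from this service: `process_message`, `with_retry`, `TransientError`, `Result`, and the callables passed in (`classify`, `apply_action`, `notify`) are all hypothetical names, and the in-memory list stands in for whatever store a real system would use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


@dataclass
class Result:
    label: str
    committed: bool
    warnings: List[str] = field(default_factory=list)


def with_retry(fn: Callable[[], str], attempts: int = 3) -> str:
    """Retry boundary: retry transient failures only, a fixed number of times."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except TransientError as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def process_message(
    text: str,
    classify: Callable[[str], str],
    store: List[dict],
    apply_action: Callable[[dict], None],
    notify: Callable[[dict], None],
) -> Result:
    # Retry boundary: wraps the model call only, never the writes below.
    label = with_retry(lambda: classify(text))

    record = {"text": text, "label": label}
    store.append(record)
    try:
        # Required step: if this fails, the message must not look processed.
        apply_action(record)
    except Exception:
        # Rollback rule: undo the log write made just above.
        store.pop()
        return Result(label, committed=False,
                      warnings=["action failed; rolled back"])

    result = Result(label, committed=True)
    try:
        # Best-effort boundary: a failed notification is recorded,
        # but it does not undo the committed work.
        notify(record)
    except Exception as exc:
        result.warnings.append(f"notify failed: {exc}")
    return result
```

The point of the sketch is where each boundary is drawn: the retry covers only the step that is safe to repeat, the rollback covers the step that must be all-or-nothing, and the notification is allowed to fail without taking the rest down with it.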

That is what turned this from a prompt-driven experiment into a backend service I could actually use.

Even though I built this for one narrow workflow, the structure is reusable. The same pattern could be adapted to other operational inbox flows where messages need to be classified, logged, and turned into careful downstream actions.

If you take one thing from this case study, let it be this: do not just ask whether the model output looks good. Ask whether the whole system is shaped well enough to survive real use.