Once the system started doing more than just classifying emails, I needed a better way to check whether it was actually doing its job correctly.

That was the point where evals became important.

I did not want to rely on a few successful runs in my real inbox and assume the system was good enough. I wanted a way to run known examples through the pipeline and compare the output against what I expected.

So I set up a simple CSV-based evaluation workflow.

Each dataset category has its own folder with three files:

The idea is simple: the eval runner loads each dataset, runs its emails through the same pipeline that handles my real inbox, writes the actual results, and then compares those results against the annotated file.
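To make the load, process, write, compare loop concrete, here is a minimal sketch of what such a runner could look like. It is not the actual implementation: the file names (`input.csv`, `annotated.csv`, `actual.csv`), the `email_id` key column, and the `pipeline` callable are all assumptions made for illustration.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical file names; the real dataset folders may use different ones.
INPUT_FILE = "input.csv"
ANNOTATED_FILE = "annotated.csv"
ACTUAL_FILE = "actual.csv"

Row = Dict[str, str]


def read_rows(path: Path) -> List[Row]:
    """Read a CSV file into a list of dicts keyed by column name."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def write_rows(path: Path, rows: List[Row]) -> None:
    """Write rows to CSV, using the first row's keys as the header."""
    if not rows:
        path.write_text("", encoding="utf-8")
        return
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def run_eval(
    dataset_dir: Path,
    pipeline: Callable[[Row], Row],
    key: str = "email_id",  # assumed ID column shared by all three files
) -> Dict[str, object]:
    """Run one dataset through the pipeline and diff it against the annotations."""
    emails = read_rows(dataset_dir / INPUT_FILE)
    expected = {row[key]: row for row in read_rows(dataset_dir / ANNOTATED_FILE)}

    # Process every email and persist the actual results next to the inputs.
    actual_rows = [{key: email[key], **pipeline(email)} for email in emails]
    write_rows(dataset_dir / ACTUAL_FILE, actual_rows)

    # Compare only the columns the pipeline produced against the annotated values.
    mismatches = []
    for row in actual_rows:
        want = expected.get(row[key], {})
        diff = {
            col: (want.get(col), got)
            for col, got in row.items()
            if col != key and want.get(col) != got
        }
        if diff:
            mismatches.append({"id": row[key], "diff": diff})

    return {
        "total": len(actual_rows),
        "passed": len(actual_rows) - len(mismatches),
        "mismatches": mismatches,
    }
```

Calling `run_eval(Path("datasets/some_category"), my_pipeline)` once per dataset folder (both names are placeholders) would give a pass count and a list of per-email differences, each recorded as an `(expected, actual)` pair, which is usually enough to spot where the pipeline disagrees with the annotations.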

I also arranged the datasets in a useful sequence:

That order made it easier to see the workflow build up in a realistic way. Generic updates usually come earlier in a job application process, while assessments, interviews, and rejections come later.