Once the system started doing more than just classifying emails, I needed a better way to check whether it was actually doing the right job.
That was the point where evals became important.
I did not want to rely on a few successful runs in my real inbox and assume the system was good enough. I wanted a way to run known examples through the pipeline and compare the output against what I expected.
So I set up a simple CSV-based evaluation workflow.
Each dataset category has its own folder with three files:
graph-emails.csv
annotated-emails.csv
processed-emails.csv

The idea is simple:

graph-emails.csv contains the input emails
annotated-emails.csv contains the expected status
processed-emails.csv is generated by running the pipeline

The eval runner loads each dataset, processes the emails through the same pipeline, writes the actual results, and then compares those results against the annotated file.
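
A minimal sketch of what a runner like this could look like is shown here. The column names `id` and `status`, the `run_eval` function, and the `classify` callback standing in for the pipeline are all assumptions made for illustration; the real schema and pipeline entry point may differ.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List


def read_rows(path: Path) -> List[Dict[str, str]]:
    """Read a CSV file into a list of dicts keyed by the header row."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def run_eval(
    dataset_dir: Path,
    classify: Callable[[Dict[str, str]], str],
) -> Dict[str, object]:
    """Run one dataset folder through the pipeline and compare with annotations.

    `classify` is a hypothetical stand-in for the real pipeline: it takes one
    input email row and returns a status string.
    """
    emails = read_rows(dataset_dir / "graph-emails.csv")
    # Assumed schema: both files share an `id` column, and the annotated
    # file carries the expected value in a `status` column.
    expected = {
        row["id"]: row["status"]
        for row in read_rows(dataset_dir / "annotated-emails.csv")
    }

    # Write the actual results so they can be inspected after the run.
    processed_path = dataset_dir / "processed-emails.csv"
    with processed_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "status"])
        writer.writeheader()
        for email in emails:
            writer.writerow({"id": email["id"], "status": classify(email)})

    # Compare what the pipeline produced against the annotated expectations.
    actual = {row["id"]: row["status"] for row in read_rows(processed_path)}
    mismatches = [
        {"id": email_id, "expected": want, "actual": actual.get(email_id)}
        for email_id, want in expected.items()
        if actual.get(email_id) != want
    ]
    return {
        "total": len(expected),
        "passed": len(expected) - len(mismatches),
        "mismatches": mismatches,
    }
```

Calling `run_eval(Path("some-dataset-folder"), classify)` once per dataset folder returns a small summary, and the `mismatches` list points directly at the emails whose actual status differs from the annotated one. Keeping the processed file on disk, rather than comparing in memory only, makes it easy to open both CSVs side by side when a result looks wrong.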
I also arranged the datasets in a useful sequence:
That order made it easier to see the workflow build up in a realistic way. Generic updates usually arrive early in a job application process, while assessments, interviews, and rejections come later.