Evaluation Approach

Once the system started doing more than just classifying emails, I needed a better way to check whether it was actually doing its job correctly.

That was the point where evals became important.

I did not want to rely on a few successful runs in my real inbox and assume the system was good enough. I wanted a way to run known examples through the pipeline and compare the output against what I expected.

So I set up a simple CSV-based evaluation workflow.

Each dataset category has its own folder with three files:

graph-emails.csv
annotated-emails.csv
processed-emails.csv

The idea is simple:

graph-emails.csv contains the input emails
annotated-emails.csv contains the expected status for each email
processed-emails.csv contains the actual results, generated by running the pipeline

The eval runner loads each dataset, runs its emails through the same pipeline I use on real mail, writes the actual results to processed-emails.csv, and then compares those results against the annotated file.
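To make that loop concrete, here is a minimal sketch of what such a runner could look like in Python. It is an illustration under assumptions, not the actual implementation. The three file names come from the workflow above, but the column names (`id`, `status`), the `run_dataset` and `read_rows` helpers, and the shape of the `pipeline` callable are all hypothetical.

```python
import csv
from pathlib import Path
from typing import Callable

# File names are taken from the post; the "id" and "status" columns are assumptions.
INPUT_FILE = "graph-emails.csv"
EXPECTED_FILE = "annotated-emails.csv"
ACTUAL_FILE = "processed-emails.csv"


def read_rows(path: Path) -> list[dict[str, str]]:
    """Read a CSV file with a header row into a list of dicts."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def run_dataset(
    folder: Path, pipeline: Callable[[dict[str, str]], str]
) -> list[dict[str, str]]:
    """Run one dataset folder and return the rows where actual != expected.

    `pipeline` is a hypothetical stand-in for the real processing step:
    it takes one email row and returns the status it assigns.
    """
    emails = read_rows(folder / INPUT_FILE)
    expected = {row["id"]: row["status"] for row in read_rows(folder / EXPECTED_FILE)}

    # Write the actual results so they can be inspected after the run.
    results = [{"id": email["id"], "status": pipeline(email)} for email in emails]
    with (folder / ACTUAL_FILE).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "status"])
        writer.writeheader()
        writer.writerows(results)

    # Compare the actual results against the annotated file.
    return [
        {"id": r["id"], "expected": expected.get(r["id"], ""), "actual": r["status"]}
        for r in results
        if expected.get(r["id"]) != r["status"]
    ]
```

An empty return value means every email in that dataset got the expected status. Returning the mismatched rows rather than a single pass/fail flag makes it easier to see which emails the pipeline got wrong.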

I also arranged the datasets in a useful sequence:

generic update
assessment
interview invitation
rejection
irrelevant

That order made it easier to see the workflow build up realistically. Generic updates usually come earlier in a job application process, while assessments, interviews, and rejections come later.
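One way to keep that sequence stable across runs is to encode it explicitly instead of relying on how the file system happens to list folders. The sketch below assumes each category maps to a hyphenated folder name. Those slugs and the `order_datasets` helper are hypothetical, chosen only for illustration.

```python
# Dataset order from the post; the folder-name slugs are assumptions.
DATASET_ORDER = [
    "generic-update",
    "assessment",
    "interview-invitation",
    "rejection",
    "irrelevant",
]


def order_datasets(found: list[str]) -> list[str]:
    """Return known datasets in workflow order, then unknown ones alphabetically."""
    rank = {name: i for i, name in enumerate(DATASET_ORDER)}
    # Unknown names share the rank after the last known one, so they sort to the end.
    return sorted(found, key=lambda name: (rank.get(name, len(DATASET_ORDER)), name))
```

Putting unknown folders at the end rather than raising an error means a newly added dataset still runs, just after the established ones.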