Several parts of the system became more reliable only after I tightened the system boundaries instead of just adding more logic.
One of the first important decisions was that retries should not happen on top of dirty state.
That mattered because without rollback, a retry is not really recovery. It is just repeated work on top of dirty state.
The service does not treat all failures as operationally equal.
The retry path is intentionally limited to failures another attempt might realistically fix, such as:
Failures like permission problems are allowed to fail immediately.
That keeps the system from wasting time and API calls on deterministic failures.
The service processes emails one at a time.
That was a deliberate tradeoff rather than an omission. Each email can trigger work across Microsoft Graph, OpenAI, Airtable, and the downstream resolver branch. Adding concurrency earlier would have increased throughput, but it also would have made rate limits, rollback, and downstream write coordination harder to reason about.
For a small scheduled backend service, stability was more important than maximizing throughput.
Marking emails as read is useful, but it is not the core system outcome.
The core outcome is: