Several parts of the system became more reliable only after I tightened the system boundaries instead of just adding more logic.

Rollback before retry

One of the first important decisions was that retries should not happen on top of dirty state.

That mattered because without rollback, a retry is not really recovery. It is just repeated work on top of dirty state.

Retry only for failures that are actually retryable

The service does not treat all failures as operationally equal.

The retry path is intentionally limited to failures another attempt might realistically fix, such as:

Failures like permission problems are allowed to fail immediately.

That keeps the system from wasting time and API calls on deterministic failures.

Sequential processing over concurrency

The service processes emails one at a time.

That was a deliberate tradeoff rather than an omission. Each email can trigger work across Microsoft Graph, OpenAI, Airtable, and the downstream resolver branch. Adding concurrency earlier would have increased throughput, but it also would have made rate limits, rollback, and downstream write coordination harder to reason about.

For a small scheduled backend service, stability was more important than maximizing throughput.

Mark-as-read as a best-effort side effect

Marking emails as read is useful, but it is not the core system outcome.

The core outcome is: