Reliability and Failure Handling

Although the service is small, it still has to behave like a real backend system when external dependencies misbehave.

That matters here because the workflow depends on several external systems, each of which can fail in its own way:

Microsoft Graph
Airtable
OpenAI structured outputs
Airtable MCP
Telegram

This section describes what the service does when those failures happen at runtime.

Safe run windows

The service stores a last_successful_run_at value and computes the next retrieval window from it, starting the window before that timestamp so that consecutive windows overlap.

That overlap reduces the chance of missing emails that arrive near run boundaries. The processed-email ledger keeps emails fetched a second time from being processed twice.
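
The window calculation can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the 10-minute OVERLAP value and the names next_window and select_unprocessed are assumptions, and the ledger is modelled as a plain set of message IDs.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical overlap size; the real value is not stated in this document.
OVERLAP = timedelta(minutes=10)


def next_window(last_successful_run_at, now, overlap=OVERLAP):
    """Return (start, end) for the next retrieval window.

    The window starts `overlap` before the last successful run, so emails
    that arrived near the previous run boundary are fetched again.
    """
    return last_successful_run_at - overlap, now


def select_unprocessed(message_ids, ledger):
    """Drop message IDs that the processed-email ledger already contains."""
    return [m for m in message_ids if m not in ledger]


if __name__ == "__main__":
    last_run = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    now = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
    start, end = next_window(last_run, now)
    print(start.isoformat(), end.isoformat())
    print(select_unprocessed(["a", "b", "c"], ledger={"b"}))
```

The overlap alone would cause repeated work, which is why it is paired with the ledger check: re-fetching is cheap, re-processing is not.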

Selective rollback

If processing fails after a new processed-email record has been created but before the attempt is complete, that record is deleted before the failure bubbles up.

If processing fails on an already-existing processed-email record, the service does not delete it.

The runtime rule is simple: only clean up state created by the failed attempt itself.
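
A minimal sketch of that rule, assuming a hypothetical in-memory ledger in place of the real processed-email store. InMemoryLedger, get_or_create, and process_with_rollback are illustrative names, not the service's API.

```python
class InMemoryLedger:
    """Hypothetical stand-in for the processed-email store."""

    def __init__(self):
        self.records = {}

    def get_or_create(self, email_id):
        """Return (record, created); created is True only for a new record."""
        if email_id in self.records:
            return self.records[email_id], False
        record = {"email_id": email_id}
        self.records[email_id] = record
        return record, True

    def delete(self, email_id):
        self.records.pop(email_id, None)


def process_with_rollback(ledger, email_id, handler):
    """Run handler; on failure, remove only a record this attempt created."""
    record, created = ledger.get_or_create(email_id)
    try:
        return handler(record)
    except Exception:
        if created:
            # The record did not exist before this attempt, so remove it.
            ledger.delete(email_id)
        # Pre-existing records are left untouched; the failure still propagates.
        raise
```

Tracking the created flag at the point of creation is what makes the cleanup selective: the failure handler never has to guess who owns the record.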

Selective retry

Retries are limited to failures that might plausibly succeed on another attempt.
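
One way to express this is a predicate that decides whether a failure is worth retrying, wrapped by a small retry loop. The sketch below is illustrative only: UpstreamError, is_retryable, call_with_retry, the attempt limit, and the exponential backoff are all assumptions rather than the service's actual implementation. It treats HTTP 429 (rate limiting) and 5xx statuses as retryable.

```python
import time


class UpstreamError(Exception):
    """Hypothetical error carrying the HTTP status of a failed call."""

    def __init__(self, status):
        super().__init__(f"upstream returned {status}")
        self.status = status


def is_retryable(exc):
    """Rate limits (429) and 5xx responses may succeed on a later attempt."""
    return isinstance(exc, UpstreamError) and (
        exc.status == 429 or 500 <= exc.status < 600
    )


def call_with_retry(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying only retryable failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on any non-retryable failure.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Anything the predicate rejects, such as a 400 caused by a malformed request, fails immediately, since repeating the same call would produce the same error.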

Examples include:

rate-limit responses
5xx failures