Topic 13: System Walkthroughs
Design a Notification Service
Fan out reliably to millions of users across multiple channels.
Design a system that sends push notifications, emails, and SMS at scale. The challenges are delivering reliably across channels, handling failures gracefully, and not overwhelming downstream providers.
Requirements
Scoping the notification problem.
- ›Channels: push notification (iOS/Android), email, SMS
- ›10M notifications/day across all channels
- ›At-least-once delivery — retries on failure
- ›User preferences: users can opt out of channels or notification types
- ›Not latency-critical: a few seconds of delay is acceptable
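To size the system, it helps to turn the daily volume above into a per-second rate. The short calculation below is a back-of-the-envelope sketch; the 10x peak factor is an assumption for illustration, not a figure from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(per_day: int) -> float:
    """Average notifications per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


daily_volume = 10_000_000  # from the requirements above
peak_factor = 10           # assumption: bursts reach about 10x the average

avg = average_rate(daily_volume)
peak = avg * peak_factor
print(f"average ~{avg:.0f}/s, assumed peak ~{peak:.0f}/s")
# prints: average ~116/s, assumed peak ~1157/s
```

The average is modest; the burst rate is what the queue and worker pool need to absorb.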
Core architecture
How the system is structured.
- ›Notification service API — accepts notification requests from upstream services
- ›Message queue (SQS/Kafka) — decouples ingestion from delivery, absorbs spikes
- ›Worker pool — pulls from queue, determines channel, routes to provider
- ›Channel providers: APNs (iOS), FCM (Android), SendGrid (email), Twilio (SMS)
- ›Notification log DB — persists every notification with delivery status
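The components above can be sketched as a single in-process pipeline. This is a minimal illustration, assuming an in-memory `queue.Queue` in place of SQS/Kafka, a dict in place of the notification log DB, and stub functions in place of real APNs, FCM, SendGrid, and Twilio clients. All names here (`Notification`, `accept`, `drain`, `make_stub`) are hypothetical.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class Notification:
    user_id: str
    channel: str  # "ios", "android", "email", or "sms"
    body: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def make_stub(provider_name):
    """Return a fake provider client; a real one would call the provider's API."""
    def send(notification):
        return True  # pretend the provider accepted the message
    return send


PROVIDERS = {
    "ios": make_stub("APNs"),
    "android": make_stub("FCM"),
    "email": make_stub("SendGrid"),
    "sms": make_stub("Twilio"),
}

work_queue = queue.Queue()  # stand-in for SQS/Kafka
notification_log = {}       # notification id -> delivery status


def accept(notification):
    """API layer: record the request, enqueue it, and return immediately."""
    notification_log[notification.id] = "queued"
    work_queue.put(notification)
    return notification.id


def drain():
    """Worker loop: pull each message, route by channel, record the outcome."""
    processed = 0
    while True:
        try:
            notification = work_queue.get_nowait()
        except queue.Empty:
            return processed
        send = PROVIDERS.get(notification.channel)
        delivered = send(notification) if send else False
        notification_log[notification.id] = "delivered" if delivered else "failed"
        processed += 1


first = accept(Notification("u1", "email", "Your order shipped"))
second = accept(Notification("u2", "fax", "Unsupported channel"))
drain()
print(notification_log[first], notification_log[second])  # delivered failed
```

The key property is that `accept` never calls a provider: upstream services get a fast response, and delivery speed is governed by how many workers drain the queue.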
Reliability patterns
How to handle failures without losing notifications.
- ›Dead letter queue — failed notifications go here for inspection and retry
- ›Exponential backoff — retry with increasing delays: 1s, 2s, 4s, 8s...
- ›Idempotency key — prevent duplicate delivery if message is processed twice
- ›Provider fallback — if primary email provider fails, route to backup
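The four patterns above compose naturally into one delivery function. The sketch below is illustrative only: the in-memory `delivered_keys` set and `dead_letters` list stand in for a shared store and a real dead letter queue, and the function and field names are hypothetical. Note that recording the key after a successful send still leaves a small window for a duplicate if a worker crashes between the two steps.

```python
import time
from typing import Callable, Sequence


def backoff_delays(base: float = 1.0, attempts: int = 4) -> list:
    """Delays before each retry: 1s, 2s, 4s, 8s with the defaults."""
    return [base * 2 ** i for i in range(attempts)]


delivered_keys = set()  # idempotency keys that were already delivered
dead_letters = []       # stand-in for a dead letter queue


def deliver(message: dict,
            providers: Sequence[Callable[[dict], None]],
            delays: Sequence[float] = (1.0, 2.0, 4.0, 8.0),
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Try each provider in order, retrying with backoff, then dead-letter."""
    key = message["idempotency_key"]
    if key in delivered_keys:
        return "duplicate"  # processed before, so skip the send

    last_error = None
    for attempt in range(len(delays) + 1):
        for send in providers:  # primary first, then any backups
            try:
                send(message)
            except Exception as exc:
                last_error = exc
                continue
            delivered_keys.add(key)
            return "delivered"
        if attempt < len(delays):
            sleep(delays[attempt])  # wait longer after each failed round

    dead_letters.append({"message": message, "error": repr(last_error)})
    return "dead_lettered"


def primary(message):
    raise ConnectionError("primary provider unavailable")


def backup(message):
    pass  # pretend the backup provider accepted the message


msg = {"idempotency_key": "order-42-shipped", "body": "Your order shipped"}
print(deliver(msg, [primary, backup]))  # delivered
print(deliver(msg, [primary, backup]))  # duplicate
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting. Production systems often add random jitter to the delays so that many workers do not retry in lockstep.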
User preferences and throttling
Don't spam users.
- ›Preferences table: user_id + channel + notification_type + opted_in
- ›Check preferences at enqueue time to avoid wasted work, or at send time to honor opt-outs made while the message was queued
- ›Rate limiting: max N notifications per user per day per channel
- ›Quiet hours: respect user's local time zone
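The checks above can be combined into a single gate that runs before a notification is sent. This is a sketch under stated assumptions: the dict and counter stand in for the preferences table and a shared rate-limit store, a missing preference row is treated as opted in, the limit of 3 and the 22:00 to 08:00 quiet window are illustrative values, and all function names are hypothetical. The example uses a fixed UTC offset to stay self-contained; a real system would more likely pass a `zoneinfo.ZoneInfo` for the user's region so that daylight saving time is handled.

```python
from collections import Counter
from datetime import datetime, time, timedelta, timezone, tzinfo

# (user_id, channel, notification_type) -> opted_in, mirroring the table above
preferences = {("u1", "email", "marketing"): False}

DAILY_LIMIT = 3         # the "N" in the rate limit; illustrative value
sent_today = Counter()  # (user_id, channel, UTC date) -> notifications sent


def is_opted_in(user_id, channel, notification_type):
    # Assumption: users with no stored preference are opted in.
    return preferences.get((user_id, channel, notification_type), True)


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo,
                   start: time = time(22, 0), end: time = time(8, 0)) -> bool:
    """True if the user's local time falls inside the quiet window."""
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end  # window wraps past midnight


def under_rate_limit(user_id, channel, now_utc):
    """Count this send against the daily limit; False once the limit is hit."""
    key = (user_id, channel, now_utc.date())
    if sent_today[key] >= DAILY_LIMIT:
        return False
    sent_today[key] += 1
    return True


def should_send(user_id, channel, notification_type, user_tz, now_utc):
    if not is_opted_in(user_id, channel, notification_type):
        return False
    if in_quiet_hours(now_utc, user_tz):
        return False
    return under_rate_limit(user_id, channel, now_utc)


utc_minus_5 = timezone(timedelta(hours=-5))
noon_local = datetime(2024, 1, 15, 17, 0, tzinfo=timezone.utc)  # 12:00 local
late_local = datetime(2024, 1, 15, 4, 0, tzinfo=timezone.utc)   # 23:00 local

print(should_send("u1", "email", "marketing", utc_minus_5, noon_local))  # False
print(should_send("u2", "sms", "alert", utc_minus_5, late_local))        # False
print(should_send("u2", "sms", "alert", utc_minus_5, noon_local))        # True
```

The rate-limit check runs last so that notifications blocked by an opt-out or quiet hours do not consume the user's daily allowance.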
Interview tips
- ✓Lead with queue-based design — synchronous delivery won't scale
- ✓Address retry logic and dead letter queues explicitly
- ✓Mention idempotency — interviewers test this in delivery-critical systems
- ✓User preferences and opt-outs are a product requirement, not an afterthought
Follow-up questions to expect
- ?How do you guarantee exactly-once delivery?
- ?How do you handle provider outages (APNs down for 30 minutes)?
- ?How do you prioritize urgent notifications over marketing ones?
TLDR
- ›Queue-based architecture decouples ingestion from delivery
- ›At-least-once delivery + idempotency key = reliable delivery with duplicates suppressed
- ›Dead letter queue for visibility into failed deliveries
- ›Check user preferences before sending — never spam
- ›Separate workers per channel for independent scaling