Topic 13: System Walkthroughs

Design a Notification Service

Fan-out to millions of users across multiple channels reliably.

Design a system that sends push notifications, emails, and SMS at scale. The challenges are delivering reliably across channels, handling failures gracefully, and not overwhelming downstream services.

Requirements

Scoping the notification problem.

  • Channels: push notification (iOS/Android), email, SMS
  • 10M notifications/day across all channels
  • At-least-once delivery — retries on failure
  • User preferences: users can opt out of channels or notification types
  • Not latency-critical: a few seconds of delay is acceptable
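
To put the 10M/day figure in perspective, a quick back-of-envelope calculation converts it to a per-second rate. The peak multiplier below is an assumption for illustration only; the requirements do not state one.

```python
# Back-of-envelope throughput for the 10M notifications/day requirement.
DAILY_NOTIFICATIONS = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Assumption (not from the requirements): peak traffic is 10x the average.
PEAK_FACTOR = 10

average_per_second = DAILY_NOTIFICATIONS / SECONDS_PER_DAY
peak_per_second = average_per_second * PEAK_FACTOR

print(f"average: {average_per_second:.0f}/s")      # average: 116/s
print(f"assumed peak: {peak_per_second:.0f}/s")    # assumed peak: 1157/s
```

The average is modest, so the queue is there mainly to absorb bursts and provider slowdowns rather than to handle the steady-state load.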

Core architecture

How the system is structured.

  • Notification service API — accepts notification requests from upstream services
  • Message queue (SQS/Kafka) — decouples ingestion from delivery, absorbs spikes
  • Worker pool — pulls from queue, determines channel, routes to provider
  • Channel providers: APNs (iOS), FCM (Android), SendGrid (email), Twilio (SMS)
  • Notification log DB — persists every notification with delivery status
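
The components above can be sketched end to end in a few lines. This is a toy, single-process illustration: `queue.Queue` stands in for SQS/Kafka, a dict stands in for the notification log DB, and the provider clients are stubs. All names (`submit`, `drain_queue`, `make_stub_provider`, the channel keys) are hypothetical and not any provider's real API.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class Notification:
    user_id: str
    channel: str  # "push_ios", "push_android", "email", or "sms"
    body: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


# In-memory stand-ins for the message queue and the notification log DB.
message_queue = queue.Queue()
notification_log = {}  # notification id -> delivery status


def make_stub_provider(name):
    """Return a fake provider client that records what it was asked to send."""
    def send(notification):
        send.sent.append(notification.id)
        return True  # a real client would report success or failure
    send.sent = []
    send.provider_name = name
    return send


# Channel -> provider routing table (stubs named after the providers listed above).
PROVIDERS = {
    "push_ios": make_stub_provider("APNs"),
    "push_android": make_stub_provider("FCM"),
    "email": make_stub_provider("SendGrid"),
    "sms": make_stub_provider("Twilio"),
}


def submit(user_id, channel, body):
    """API layer: validate, log as queued, enqueue, and return immediately."""
    if channel not in PROVIDERS:
        raise ValueError(f"unknown channel: {channel}")
    notification = Notification(user_id, channel, body)
    notification_log[notification.id] = "queued"
    message_queue.put(notification)
    return notification.id


def drain_queue():
    """Worker: pull each message, route it by channel, record the outcome."""
    while True:
        try:
            notification = message_queue.get_nowait()
        except queue.Empty:
            return
        ok = PROVIDERS[notification.channel](notification)
        notification_log[notification.id] = "delivered" if ok else "failed"


email_id = submit("user-1", "email", "Your order shipped")
sms_id = submit("user-2", "sms", "Your code is 123456")
drain_queue()
print(notification_log[email_id], notification_log[sms_id])
```

The point of the shape is that `submit` never calls a provider: the caller gets an id back as soon as the message is queued, and delivery happens later in the worker.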

Reliability patterns

How to handle failures without losing notifications.

  • Dead letter queue — failed notifications go here for inspection and retry
  • Exponential backoff — retry with increasing delays: 1s, 2s, 4s, 8s...
  • Idempotency key — prevent duplicate delivery if message is processed twice
  • Provider fallback — if primary email provider fails, route to backup
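
One way these four patterns might fit together in a single delivery function is sketched below. It is a simplified illustration, not a production implementation: the message shape, `ProviderError`, and the in-memory `processed_keys` and `dead_letter_queue` are assumptions, and a real system would keep the idempotency keys in a shared store with an atomic check-and-set.

```python
import time


class ProviderError(Exception):
    """Raised by a provider call that failed and may be retried."""


processed_keys = set()    # idempotency keys already delivered
dead_letter_queue = []    # messages that exhausted every retry


def backoff_delays(max_attempts, base_seconds=1.0):
    """Delays between attempts: 1s, 2s, 4s, 8s, ... (one fewer than attempts)."""
    return [base_seconds * 2 ** i for i in range(max_attempts - 1)]


def deliver(message, providers, max_attempts=4, sleep=time.sleep):
    """Deliver a message with retries, provider fallback, and a DLQ.

    message: dict with "idempotency_key" and "body" (hypothetical shape).
    providers: ordered list of callables, primary first, then backups.
    Returns "duplicate", "delivered", or "dead_lettered".
    """
    key = message["idempotency_key"]
    if key in processed_keys:
        return "duplicate"  # already delivered; skip the second copy

    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        for send in providers:
            try:
                send(message)
            except ProviderError:
                continue  # this provider failed; fall back to the next one
            processed_keys.add(key)
            return "delivered"
        if attempt < len(delays):
            sleep(delays[attempt])  # all providers failed; back off, then retry

    dead_letter_queue.append(message)  # keep it for inspection and manual retry
    return "dead_lettered"
```

The `sleep` parameter is injectable so the retry schedule can be tested without waiting. In a queue-backed system the delay would more likely be implemented by re-enqueuing the message with a visibility or delivery delay than by blocking a worker.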

User preferences and throttling

Don't spam users.

  • Preferences table: user_id + channel + notification_type + opted_in
  • Check preferences before enqueuing or before sending; checking at send time also honors opt-outs made while a message waits in the queue
  • Rate limiting: max N notifications per user per day per channel
  • Quiet hours: respect user's local time zone
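
The checks above could be combined into one gate that runs before a notification is sent. The sketch below uses in-memory dicts in place of the preferences table and rate-limit counters. The default of "opted in when no row exists", the limit of 3 per day, and the 22:00 to 08:00 quiet window are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime, time, timedelta, timezone

# Hypothetical in-memory preferences table, keyed like the columns above:
# (user_id, channel, notification_type) -> opted_in
preferences = {
    ("user-1", "email", "marketing"): False,
    ("user-1", "push", "marketing"): True,
}
DEFAULT_OPTED_IN = True  # assumption: no row means the user has not opted out
DAILY_LIMIT = 3          # assumed value for "N" per user per day per channel
sent_today = Counter()   # (user_id, channel, local date) -> count


def in_quiet_hours(now_utc, user_tz, start=time(22, 0), end=time(8, 0)):
    """True if the user's local time is in [start, end); the window may wrap midnight."""
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end


def should_send(user_id, channel, notification_type, now_utc, user_tz):
    """Return (allowed, reason) without changing any counters."""
    key = (user_id, channel, notification_type)
    if not preferences.get(key, DEFAULT_OPTED_IN):
        return False, "opted_out"
    if in_quiet_hours(now_utc, user_tz):
        return False, "quiet_hours"
    local_date = now_utc.astimezone(user_tz).date()
    if sent_today[(user_id, channel, local_date)] >= DAILY_LIMIT:
        return False, "rate_limited"
    return True, "ok"


def record_sent(user_id, channel, now_utc, user_tz):
    """Count one delivered notification against the user's daily limit."""
    local_date = now_utc.astimezone(user_tz).date()
    sent_today[(user_id, channel, local_date)] += 1


# A fixed UTC-5 offset keeps the example dependency-free; in practice a named
# zone such as zoneinfo.ZoneInfo("America/New_York") would handle DST.
user_tz = timezone(timedelta(hours=-5))
noon_local = datetime(2024, 1, 15, 17, 0, tzinfo=timezone.utc)  # 12:00 local
print(should_send("user-1", "push", "marketing", noon_local, user_tz))
```

Both the quiet-hours check and the daily counter use the user's local date and time, so a "day" for rate limiting lines up with the day the user actually experiences.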

Interview tips

  • Lead with queue-based design — synchronous delivery won't scale
  • Address retry logic and dead letter queues explicitly
  • Mention idempotency — interviewers test this in delivery-critical systems
  • User preferences and opt-outs are a product requirement, not an afterthought

Follow-up questions to expect

  • How do you guarantee exactly-once delivery?
  • How do you handle provider outages (APNs down for 30 minutes)?
  • How do you prioritize urgent notifications over marketing ones?

TL;DR

  • Queue-based architecture decouples ingestion from delivery
  • At-least-once delivery + idempotency key = reliable without duplicates
  • Dead letter queue for visibility into failed deliveries
  • Check user preferences before sending — never spam
  • Separate workers per channel for independent scaling