Guide to Developing a Scalable Alerting System
Introduction
This article explores the development of a Rule-Based alerting and notification system at Middleware. The system, known as the Playbook, aims to notify customers when certain metrics cross predefined thresholds.
The Idea💡
One of my first tasks after joining Middleware as a full-time software engineer was to build something known as the Playbook.
The Playbook serves as a Rule-Based alerting system, allowing Engineering managers to receive notifications when specific metrics exceed set thresholds. For instance:
Send an email if developers spend more than 50% of their time on bug fixes in the past week.
Send a Slack message if the team’s average PR Rework Time exceeds 6 hours over the last month.
Application Flow
Setting Up Rules and Cadence: Users define alerting rules for their teams and select the time range for rule checks.
Breach Processing: The system reads rules from the database, validates metrics against these rules, and generates notifications for breaches.
Dispatching Notifications: Notifications are sent to users based on their set time zones.
Assumptions
Entities like Users and Teams exist.
Services can provide metric values for Users or Teams within specified time ranges.
A notification dispatcher can send notifications via Slack/Email.
One notification is sent per breach.
Low Level Design
The system employs various models, including Playbook, Playbook Rules, Playbook Rule Breaches, and Notifications. These models handle rule creation, breach generation, and notification dispatching.
Functionalities:
Playbook Core: Create a playbook as an aggregation of rules. Each rule has its own configuration data and an alert cadence.
Breach Processor: Identify breaches based on the rules set by the user and the rule cadence.
Notification Processor: Create Notifications based on the breaches.
Notification Dispatcher: Sends Notifications via different channels.
Models:
Playbook and Playbook Rules:
A Playbook is set by the manager for a team.
Each Playbook has a set of rules, one per metric, each with a set threshold.
class Playbook:
    team_id: uuid
    created_by: uuid
    created_at: datetime
    updated_at: datetime
    updated_by: uuid
    rules: set[PlaybookRule]  # the set hashes based on rule type
class PlaybookRule:
    rule_type: PlaybookRuleType
    rule_data: dict
    alert_cadence: AlertCadence
    users_to_notify: set[uuid]
    is_active: bool
class PlaybookRuleType(Enum):
    CYCLE_TIME = "CYCLE_TIME"
    INCIDENT_COUNT = "INCIDENT_COUNT"
Alert cadence refers to the frequency at which a user would like to receive these notifications.
Daily Cadence: Breaches are calculated daily, and notifications are sent every day according to the user's time zone.
Weekly Cadence: Breaches are calculated based on weekly data, and notifications are sent every Monday.
Two Weeks Cadence: Breaches are calculated over the past two weeks’ data, and notifications are sent every second Monday.
Monthly Cadence: Breaches are calculated over the monthly average, and notifications are sent on the 1st of each month.
class AlertCadence(Enum):
    DAILY = "DAILY"
    WEEKLY = "WEEKLY"
    TWO_WEEKS = "TWO_WEEKS"
    MONTHLY = "MONTHLY"
Playbook Breaches:
Triggering Breaches: Whenever a metric exceeds or falls below the set threshold, a PlaybookBreach is generated.
Linkage: Each PlaybookBreach is associated with a playbook and a rule type, providing context for the breach.
Rule Data Inclusion: To accommodate potential rule changes later, each breach includes the rule data as it was at the time of generation. This ensures historical accuracy and consistency despite future rule modifications.
class PlaybookRuleBreach:
    playbook_id: uuid
    rule_type: PlaybookRuleType
    rule_data: dict
    team_id: uuid
    alert_cadence: AlertCadence
    metric_value: float
Notifications:
Breach Notification: Upon breach creation, a notification can be generated and sent to the user.
Preventing Duplicates: To avoid duplicate notifications, each notification is assigned an idempotency key, ensuring uniqueness in the database.
Notification Model Flexibility: The notification model is designed to be versatile, accommodating other services beyond the Playbook. As such, each notification can be categorized by type to facilitate organization and handling.
class Notification:
    receiver_id: uuid
    idempotency_key: str
    notification_type: NotificationTypes
    due_at: datetime
    queued_at: datetime
    sent_at: datetime
High Level Design
In this section, we will define how to make this system robust:
Ensure breaches are generated reliably at set intervals despite system failures.
Implement measures to prevent the generation of duplicate breaches.
Establish safeguards to avoid sending duplicate notifications.
Develop a retry system for notifications in case of bugs or system failures.
Generating and Processing Breaches
CRONS
We must check and process our playbook rules based on the cadence set by the user.
For this purpose, we can simplify the system to use a CRON job that runs daily:
Process rules with a daily cadence each time the CRON job executes.
Check if the current date is the 1st to process rules with a monthly alert cadence.
For rules with a weekly alert cadence, process them on Mondays.
Handle rules with a two-week alert cadence by processing them on the first or third Monday of the month.
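As a rough sketch, the cadence selection inside that daily CRON could look like the following; the helper name and the week-of-month arithmetic are my own illustration, not the original implementation:
from datetime import date

def cadences_to_process(today: date) -> set[AlertCadence]:
    # Daily rules are checked on every run of the CRON.
    cadences = {AlertCadence.DAILY}
    if today.day == 1:
        # Monthly rules run on the 1st of each month.
        cadences.add(AlertCadence.MONTHLY)
    if today.weekday() == 0:  # Monday
        cadences.add(AlertCadence.WEEKLY)
        # (today.day - 1) // 7 is the zero-based week-of-month index,
        # so 0 and 2 correspond to the first and third Monday.
        if (today.day - 1) // 7 in (0, 2):
            cadences.add(AlertCadence.TWO_WEEKS)
    return cadences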
Jobs and Workers
Processing all this data inside a single CRON process can be a challenge in case one of the rules fails due to incorrect data, an unhandled case in code, or deleted entities.
Example: If you have processed 10 rules and there is an unhandled case on the 11th rule, the breaches and notifications already generated are wasted and never stored in the DB. Similarly, once the CRON throws an error, it must be manually re-triggered, or else we have to wait for its next scheduled run.
To avoid this, we use the producer-consumer model.
Treat each Playbook Rule as a single job and enqueue these jobs into a PlaybookRuleQueue.
Utilise multiple workers that listen to the PlaybookRuleQueue and process one job at a time to generate a PlaybookRuleBreach and Notification.
If a job fails, the queue receives a 500 status code; we can set up alerts on this, and the failed job can be retried by the queue.
Using a non-FIFO queue ensures that failed jobs do not block the queue while fixes are being deployed.
Regarding technology choices:
The system described uses Amazon SQS and Lambdas for its internal systems.
While AWS provides built-in functionality for this setup, similar systems can be built using other service providers or custom solutions.
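To make this concrete, here is a minimal sketch of the producer and an SQS-triggered Lambda consumer using boto3; the queue URL, message shape, and the process_playbook_rule helper are assumptions for illustration, not the actual Middleware code:
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/playbook-rule-queue"  # placeholder

def enqueue_playbook_rule_jobs(playbooks):
    # Producer: every active rule becomes one job on the PlaybookRuleQueue.
    for playbook in playbooks:
        for rule in playbook.rules:
            if not rule.is_active:
                continue
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({
                    "team_id": str(playbook.team_id),
                    "rule_type": rule.rule_type.value,
                    "alert_cadence": rule.alert_cadence.value,
                }),
            )

def handler(event, context):
    # Consumer: a Lambda triggered by SQS processes one rule job per record.
    for record in event["Records"]:
        job = json.loads(record["body"])
        # process_playbook_rule (hypothetical) generates PlaybookRuleBreach and
        # Notification rows; if it raises, SQS redelivers the message so the job is retried.
        process_playbook_rule(job)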
Queuing Notifications
Once the notifications are in the database, we can run an hourly CRON that checks for any due notifications in the DB and queues them as jobs.
These notification jobs are handled by the notification dispatcher, which decides which channel to use to notify the user and runs any additional logic to build the notification message for that notification type.
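A condensed sketch of that hourly CRON, with hypothetical db and notification_queue interfaces standing in for the real persistence and queueing layers:
from datetime import datetime, timezone

def queue_due_notifications(db, notification_queue):
    now = datetime.now(timezone.utc)
    # Pick up notifications that are due but have neither been queued nor sent.
    due = db.fetch_notifications(due_before=now, queued_at=None, sent_at=None)
    for notification in due:
        notification_queue.enqueue({
            "idempotency_key": notification.idempotency_key,
            "receiver_id": str(notification.receiver_id),
            "notification_type": notification.notification_type.value,
        })
        db.mark_queued(notification.idempotency_key, queued_at=now)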
Points of Failure
As data travels through distributed systems, there are chances of failure. We handled some cases in our design, but each system has its own shortcomings.
Handled Cases
Processing jobs from the PlaybookRuleQueue:
Failed saving operations for breaches are retried by the Queue.
When saving notifications fails and a retry occurs, ensure no duplicate breaches are generated for the same job. Breaches remain idempotent based on playbook_id, rule_type, and the rule-checking interval.
Processing and Saving Notifications
When re-queuing playbook rule jobs, we prevent duplicate notifications from being saved in the DB by using an idempotency key based on breach data.
Using the idempotency key, we make sure only one notification can be created from one breach.
In case dispatching a notification fails, it will be retried by the queue.
In case an already-sent notification is re-queued, the worker checks the notification's sent_at field to make sure it is not resent.
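One way to build such a key and to guard against re-sends is sketched below; the exact key composition and the channel interface are assumptions for illustration:
import hashlib

def notification_idempotency_key(breach, receiver_id, interval_start) -> str:
    # One breach + one receiver + one checking interval => at most one notification.
    raw = (
        f"{breach.playbook_id}:{breach.rule_type.value}:"
        f"{breach.alert_cadence.value}:{interval_start.isoformat()}:{receiver_id}"
    )
    return hashlib.sha256(raw.encode()).hexdigest()

def dispatch(notification, channel):
    # Skip notifications that were already sent before this job was re-queued.
    if notification.sent_at is not None:
        return
    channel.send(notification)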
Shortcomings
Rule Changes
In case a user updates a rule after the notification has been generated and is waiting to be sent, we still send the notification.
This was an edge case that we did not handle internally, simply because of time constraints.
It could be handled by cross-checking the rule data inside the breach that generated the notification against the rule data in the associated playbook, then deleting the notification and re-queueing a PlaybookRule job on a mismatch.
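A sketch of that cross-check (helper and field names are hypothetical):
def rule_still_matches(breach, playbook) -> bool:
    # Find the current rule of the same type and compare its data
    # with the snapshot stored on the breach.
    current_rule = next(
        (rule for rule in playbook.rules if rule.rule_type == breach.rule_type),
        None,
    )
    return current_rule is not None and current_rule.rule_data == breach.rule_data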
Conclusion
The Rule-Based Notification System developed at Middleware has proven robust over the past year at our current scale.
While it may not be the most scalable notifications system out there, I hope this article can get you thinking in the right direction :)