In this blog post we try to clear up some confusion by outlining the key differences between three commonly confused Alertmanager configuration options: group_interval, group_wait, and repeat_interval.
Before digging into these 3 Alertmanager configuration options, let's recap on some Prometheus alerting basics.
Prometheus itself has two global clocks: scrape_interval and evaluation_interval.
The scrape_interval is the time between each Prometheus scrape (i.e., when Prometheus pulls data from exporters and other targets), and the evaluation_interval is the time between each evaluation of Prometheus' alerting rules.
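As a rough illustration, both clocks are set in the global section of prometheus.yml. The 15s and 30s values below are assumptions chosen for the example, not recommendations from this post:

```yaml
# prometheus.yml (fragment); the values are illustrative assumptions
global:
  scrape_interval: 15s      # how often Prometheus scrapes its targets
  evaluation_interval: 30s  # how often alerting rules are evaluated
```

With settings like these, new samples can arrive every 15s, but an alert's state can only change when the rules are evaluated, every 30s.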
When a rule is evaluated, its state can be altered to be either inactive, pending, or firing.
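A minimal sketch of how these states come about, assuming a hypothetical rule called InstanceDown in a rules file loaded by Prometheus. The rule name, group name, expression, and 5m duration are all illustrative:

```yaml
# rules.yml (fragment); names and durations are hypothetical
groups:
  - name: example-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0        # returns no series: the alert is inactive
        for: 5m              # returns series for less than 5m: pending; 5m or longer: firing
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
```

Because the state is only recalculated at each evaluation, the move from pending to firing happens on the first evaluation after the for duration has elapsed, not at the exact 5m mark.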
Following evaluation, this state is sent to the connected Alertmanager to potentially start/stop the sending of alert notifications.
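For reference, the connection to the Alertmanager is declared in the alerting section of prometheus.yml. The sketch below assumes an Alertmanager reachable at localhost:9093, which is an assumption made for the example:

```yaml
# prometheus.yml (fragment); the target address is an assumption
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```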
This is where group_by comes into play.
In order to avoid continuously sending notifications for similar alerts (like the same process failing on multiple instances, nodes, and data centres), the Alertmanager may be configured to group these related alerts into one alert:
group_by: ['alertname', 'job']
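To make the grouping concrete, here is a sketch of how three hypothetical alerts would be bucketed under the group_by setting above. This is not a configuration file, only an illustration written as YAML data; the alert, job, and instance names are invented:

```yaml
# Illustration only: hypothetical alerts bucketed by group_by: ['alertname', 'job']
group_1:   # alertname=InstanceDown, job=node
  - {alertname: InstanceDown, job: node, instance: 'node-a:9100'}
  - {alertname: InstanceDown, job: node, instance: 'node-b:9100'}
group_2:   # alertname=InstanceDown, job=api
  - {alertname: InstanceDown, job: api, instance: 'api-1:8080'}
```

The two node alerts differ only in their instance label, which is not part of group_by, so they share one notification. The api alert has a different job value and therefore forms its own group.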
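Once a group has been notified, the Alertmanager does not send a separate notification for every new alert in that group. Instead it waits for the group_interval since the last notification was sent to the group, and then sends all firing alerts (and any resolved alerts) to the receiver.
group_wait sets how long to initially wait before sending a notification for a particular group of alerts.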
This allows the Alertmanager to wait for an inhibiting alert to arrive or to collect more initial alerts for the same group. It essentially buffers alerts from Prometheus sent to the Alertmanager that are grouped by the same labels:
group_by: ['alertname', 'job']
group_wait: 45s # Usually set between ~0s and a few minutes.
While this reduces noisy alerts and saves the people receiving them some headache, it may introduce longer delays in receiving said alert notifications.
Another issue we must consider is that we'll receive the same grouped alert notification again next time the rules are evaluated.
This is where we use group_interval.
group_interval dictates how long to wait before sending notifications about new alerts that are added to a group of alerts that have been alerted on before:
group_by: ['instance', 'job']
group_wait: 45s
group_interval: 10m # Usually ~5 mins or more.
So where does repeat_interval fit into all of this?
Simply put, repeat_interval is used to determine the wait time before a firing alert that has already been successfully sent to the receiver is sent again.
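Putting the three options side by side, a route might look like the sketch below. The 45s and 10m values are the ones used earlier in this post. The 4h repeat_interval and the receiver name team-pager are assumptions made for the example:

```yaml
# alertmanager.yml (fragment); receiver name and repeat_interval are assumptions
route:
  receiver: team-pager
  group_by: ['alertname', 'job']
  group_wait: 45s        # wait before the first notification for a new group
  group_interval: 10m    # wait before notifying about changes to that group
  repeat_interval: 4h    # wait before re-sending an unchanged, still-firing notification

receivers:
  - name: team-pager     # notification integrations omitted
```

Under these settings, and assuming the first alert of a group arrives at t=0, the first notification goes out at roughly t=45s. A second alert joining the same group at t=2m is held back and included in the next notification, roughly 10m after the first. If nothing in the group changes after that, the notification is repeated only once about 4h have passed since the last successful send. As far as we understand, the Alertmanager checks this on its group_interval cycle, so the actual repeat may land slightly later than repeat_interval.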
To summarise:
group_wait: How long to wait to buffer alerts of the same group before sending the initial notification.
group_interval: How long to wait before sending a notification about an alert that has been added to a group which has already been notified.
repeat_interval: How long to wait before re-sending a given alert that has already been sent.