National Reviewer

Alert Aggregation Logic

How it works

The following is a brief description of the alert aggregation method. Examples are in the slides ahead:

  • A fault count is maintained for each element in the system (probe, group, system). There is a separate counter for each combination of service and fault type (BS on CNN is one counter, while SS on Fox is a different counter)

  • When any encoder detects a fault:

    • We increment the ref-count of its probe (for that service and fault type)

    • We recalculate the group and system counts by counting their non-zero descendants

  • A ‘set alert’ is sent when the count goes up to:

    • System or group: 2

    • Probe: 1

  • A ‘clear alert’ is sent when the count goes down to:

    • System/Group: 0

    • Probe: 0

  • We keep track of the data sent in each SNMP SET. The matching CLEAR contains almost the same data: ALL SNMP attributes are the same except ALERT_TYPE, which is 1 for a SET and 0 for a CLEAR.
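The counting rule in the list above can be sketched in a few lines of Python. This is only an illustration, not the system's actual code: the probe-to-group mapping `GROUP_OF` and the function names are hypothetical, and the sketch assumes that a group counts its non-zero probes while the system counts its non-zero groups.

```python
# Hypothetical topology: which probe belongs to which group.
GROUP_OF = {"P1": "G1", "P2": "G1", "P3": "G2", "P4": "G2"}


def group_counts(probe_counts):
    """Count, per group, the probes whose ref-count is non-zero."""
    counts = {group: 0 for group in GROUP_OF.values()}
    for probe, ref_count in probe_counts.items():
        if ref_count > 0:
            counts[GROUP_OF[probe]] += 1
    return counts


def system_count(probe_counts):
    """Count the groups that have at least one non-zero probe (assumed rule)."""
    return sum(1 for n in group_counts(probe_counts).values() if n > 0)


# One such set of counters would exist per (service, fault type) combination.
probe_counts = {"P1": 1, "P2": 2, "P4": 1}
print(group_counts(probe_counts), system_count(probe_counts))
```

Note that a probe with ref-count 2 still contributes only 1 to its group, since the group counts non-zero descendants rather than summing their counts.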

Observer system hierarchy levels example

image

Probe Fault

The system detects a specific fault type on a specific service. The first to spot it is encoder E4: an alert (SNMP + email) is sent as a ‘Probe’ level alert, identified as P2 (1).

image

Group Fault

E2 spots the same problem on the same service and its count increases to E2 (1):

An alert is sent at the ‘Group’ level on behalf of G1 (1), whose count increases because it now has two non-zero descendants, P1 (1) and P2 (1).

image

System Fault

E10 detects the same fault. The alert is sent at the System level, because Probe and Group alerts were already sent for this fault/service combination.

No more alerts will be sent from now on, no matter which other encoder detects the fault on the same service.

However, the web front-end will still show ALL detected alerts, with fault clips etc.

image

Scanning continues and faults begin to resolve:

Resolve Service Faults not affecting Probe, Group or System Alerts

E10 on P4 is resolved: its count goes to P4 (0) and the System count decreases to (1). The system is still in system-wide alert, so no alert clear is sent yet.

image

E4 is resolved to E4 (0) and P2 (2) goes to P2 (1). Still no alert clear is sent, as the group and system still have non-zero descendants.

image

Clear Probe Fault

E3 is resolved: P2 goes to P2 (0) and a ‘Clear’ is sent at the Probe level for P2.

The SNMP CLEAR will contain ‘E4’ in the encoder field, because E4 was the encoder that generated the alert.
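One way to get this behaviour is to store the payload of every SET and replay it for the CLEAR with only the alert type flipped. The sketch below assumes this approach. Apart from ALERT_TYPE, all field names, the `sent_sets` store and both functions are hypothetical, and nothing is actually sent over SNMP.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AlertPayload:
    alert_type: int  # 1 = SET, 0 = CLEAR
    level: str       # hypothetical field: "probe", "group" or "system"
    element: str     # hypothetical field: e.g. "P2"
    encoder: str     # encoder that triggered the SET
    service: str
    fault_type: str


sent_sets = {}  # (level, element, service, fault_type) -> payload of the SET


def send_set(level, element, encoder, service, fault_type):
    """Build a SET payload and remember it for the later CLEAR."""
    payload = AlertPayload(1, level, element, encoder, service, fault_type)
    sent_sets[(level, element, service, fault_type)] = payload
    return payload


def send_clear(level, element, service, fault_type):
    """Replay the stored SET payload with only alert_type changed to 0."""
    original = sent_sets.pop((level, element, service, fault_type))
    return replace(original, alert_type=0)


set_msg = send_set("probe", "P2", "E4", "CNN", "BS")
clear_msg = send_clear("probe", "P2", "CNN", "BS")
print(clear_msg.encoder)  # still names E4, even though E3 resolved last
```

Because the CLEAR is built from the stored SET rather than from the encoder that resolved last, the encoder field keeps naming the encoder that raised the alert.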

image

Clear Group and System Faults

E2 is resolved: a ‘Clear’ is sent at the Group level for G1 (0) and at the System level for System (0). The SNMP CLEAR details for ‘System’ will impersonate E10, the encoder that originally reported the ‘System’ alert.

The SNMP CLEAR for ‘G1’ will contain E2, which happened to be the one that sent the group alert.

image
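The whole walkthrough can be replayed with a small state machine. This is a sketch under several assumptions, not the actual implementation. The topology is inferred from the slides (E2 on P1, E3 and E4 on P2, both probes in G1, E10 on P4 in a second group named G2 here). E3's detection is implied by P2 (2) but not described. A new alert is assumed to be suppressed when the same level or a higher one is already in alert. Ref-counts are modelled as sets of reporting encoders. The names `Aggregator`, `report` and `resolve` are hypothetical.

```python
# Hypothetical topology inferred from the walkthrough; the real one is in the slide images.
PROBE_OF = {"E2": "P1", "E3": "P2", "E4": "P2", "E10": "P4"}
GROUP_OF = {"P1": "G1", "P2": "G1", "P4": "G2"}
SET_THRESHOLD = {"system": 2, "group": 2, "probe": 1}  # from the rules list


class Aggregator:
    """Alert state for a single (service, fault type) combination."""

    def __init__(self):
        self.faulty = {probe: set() for probe in GROUP_OF}  # probe -> reporting encoders
        self.active = {}  # (level, element) -> encoder named in the SET

    def _counts(self, probe):
        """Return (level, element, count) for the probe's ancestry, top-down."""
        group = GROUP_OF[probe]
        nonzero_probes = [p for p, encoders in self.faulty.items() if encoders]
        group_count = sum(1 for p in nonzero_probes if GROUP_OF[p] == group)
        system_count = len({GROUP_OF[p] for p in nonzero_probes})
        return [("system", "System", system_count),
                ("group", group, group_count),
                ("probe", probe, len(self.faulty[probe]))]

    def report(self, encoder):
        probe = PROBE_OF[encoder]
        before = self._counts(probe)
        self.faulty[probe].add(encoder)
        after = self._counts(probe)
        for (level, element, old), (_, _, new) in zip(before, after):
            if (level, element) in self.active:
                return []  # assumed: an alert at this level or above suppresses new ones
            if old < SET_THRESHOLD[level] <= new:
                self.active[(level, element)] = encoder
                return [("SET", level, element, encoder)]
        return []

    def resolve(self, encoder):
        probe = PROBE_OF[encoder]
        self.faulty[probe].discard(encoder)
        events = []
        for level, element, count in reversed(self._counts(probe)):  # bottom-up
            if count == 0 and (level, element) in self.active:
                events.append(("CLEAR", level, element, self.active.pop((level, element))))
        return events


agg = Aggregator()
for step in ("E4", "E2", "E3", "E10"):
    print("report", step, agg.report(step))
for step in ("E10", "E4", "E3", "E2"):
    print("resolve", step, agg.resolve(step))
```

Under these assumptions the replay produces a SET for probe P2, then group G1, then the system. Resolving E10 and E4 produces nothing, resolving E3 clears P2 naming E4, and resolving E2 clears G1 naming E2 and the system naming E10.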