Monitoring with Event Correlation

Posted on November 22, 2012

Maybe this is obvious and/or well known to the experts in the field, but I did not know how to do event correlation properly, and that was one of the reasons why I did not do it at all in NetWatcher.

Now I know.

What is event correlation? Imagine that you are monitoring the responsiveness of the web server, disk space, load average and “pingability” of a machine. If this machine is disconnected from the network or crashes, then without event correlation you will get four alarms, one for each of the monitored attributes. You don’t want that. You just want to know that the machine is down (“unpingable”); the rest is unhelpful noise. To avoid superfluous notifications, you need to arrange the monitored attributes into a dependency tree and, when an attribute becomes “failed”, suppress notifications about the failures of the attributes that depend on it. Quite simple, and, yes, the term “event correlation” is misleading, but never mind that.
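To make the dependency tree concrete, here is a minimal sketch in Python. The attribute names and the DEPENDS_ON table are made up for illustration, and it assumes that we already know the final status of every attribute, which is exactly what a real monitor does not know at reporting time.

```python
# Hypothetical attributes: each one maps to the attribute it depends on
# (None marks the root of the tree).
DEPENDS_ON = {
    "ping": None,
    "web": "ping",
    "disk": "ping",
    "load": "ping",
}

def should_notify(attr, failed):
    """Report a failure of `attr` only if none of its upstream dependencies has failed."""
    parent = DEPENDS_ON[attr]
    while parent is not None:
        if parent in failed:
            return False  # an upstream failure already explains this one
        parent = DEPENDS_ON[parent]
    return True

# The machine is unplugged: all four attributes fail, but only one alarm is useful.
failed = {"ping", "web", "disk", "load"}
alarms = sorted(a for a in failed if should_notify(a, failed))
print(alarms)
```

With the whole picture available, only the “ping” alarm survives. The hard part is that probe results arrive one at a time.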

My monitoring tool reports a status change immediately when the probe completes, and different attributes are probed independently and in parallel. I could check whether any of the upstream dependencies of an attribute are in the “failed” state before reporting, but it is quite likely that after a failure a dependent attribute will be probed before its dependency, so its failure would be reported anyway.

And here, at last, is the solution to this problem:

When we notice a status change of an attribute that has dependencies, we queue the report without sending it. When a probe of an attribute that has dependants returns “success”, and the previous status was “success” as well, we send the queued reports of its dependants; for any other combination of current and previous statuses, we discard the queued reports. Two successes in a row mean that the dependency was up both before and after its dependants changed status, so their reports are genuine. Status changes of an attribute that has no dependencies are reported right away, without queueing.
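The rule above could be sketched roughly like this for a single level of dependencies. This is not the NetWatcher code; the Correlator class, its method names and the report tuples are hypothetical, and an attribute that has never been probed is assumed to be in the “success” state.

```python
class Correlator:
    """Single-level sketch: reports of dependants wait for the next probe of their dependency."""

    def __init__(self, depends_on):
        self.depends_on = depends_on  # attribute -> its dependency, or None
        self.status = {}              # last known status per attribute (True = success)
        self.queued = {}              # dependency -> reports held back for it
        self.sent = []                # stands in for actually sending notifications

    def probe_result(self, attr, ok):
        prev = self.status.get(attr, True)  # assumption: unknown means "success"
        self.status[attr] = ok

        # Decide the fate of the reports queued by our dependants.
        held = self.queued.pop(attr, [])
        if ok and prev:
            self.sent.extend(held)  # success after success: the reports are genuine
        # any other combination: the held reports are silently discarded

        # Handle our own status change, if there is one.
        if ok != prev:
            report = (attr, "ok" if ok else "failed")
            parent = self.depends_on.get(attr)
            if parent is None:
                self.sent.append(report)  # no dependency: report right away
            else:
                self.queued.setdefault(parent, []).append(report)

deps = {"ping": None, "web": "ping", "disk": "ping"}

# The machine goes down, and "web" happens to be probed before "ping".
down = Correlator(deps)
down.probe_result("web", False)   # queued, not sent
down.probe_result("ping", False)  # discards the "web" report, sends its own
down.probe_result("disk", False)  # queued
down.probe_result("ping", False)  # failure after failure: discards "disk"
print(down.sent)

# Only the web server breaks while the machine stays reachable.
web_only = Correlator(deps)
web_only.probe_result("web", False)
web_only.probe_result("ping", True)  # success after success: releases "web"
print(web_only.sent)
```

In the first scenario only the “ping” failure gets out; in the second, the “web” failure is delayed until the next successful ping and then sent.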

When there is more than one level of dependencies, the scheme becomes only a little more complicated: when an attribute needs to release its queued reports and has an upstream dependency of its own, it does not send them but re-queues them on that upstream dependency.
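Here is one way the multi-level variant might look, again only as a sketch. The make_correlator helper and the three-level switch / ping / web chain are invented for the example, and unprobed attributes are assumed to be in the “success” state. Releasing a report and queueing your own status change turn out to be the same operation: hand it to your upstream dependency, or send it if you have none.

```python
from collections import defaultdict

def make_correlator(depends_on):
    """Return (probe_result, sent) for a dependency tree of any depth."""
    status = {}               # last known status per attribute (True = success)
    held = defaultdict(list)  # dependency -> reports waiting for its next probe
    sent = []                 # stands in for actually sending notifications

    def release(attr, reports):
        parent = depends_on.get(attr)
        if parent is None:
            sent.extend(reports)          # top of the tree: really send
        else:
            held[parent].extend(reports)  # otherwise re-queue one level up

    def probe_result(attr, ok):
        prev = status.get(attr, True)     # assumption: unknown means "success"
        status[attr] = ok
        waiting = held.pop(attr, [])
        if ok and prev:
            release(attr, waiting)        # pass the dependants' reports upwards
        # any other combination: the waiting reports are dropped
        if ok != prev:
            release(attr, [(attr, "ok" if ok else "failed")])

    return probe_result, sent

deps = {"switch": None, "ping": "switch", "web": "ping"}

# The switch dies; probes complete bottom-up, the worst possible order.
probe, sent_switch_down = make_correlator(deps)
probe("web", False)
probe("ping", False)
probe("switch", False)
print(sent_switch_down)

# Only the web server fails; its report climbs the tree one probe at a time.
probe, sent_web_down = make_correlator(deps)
probe("web", False)
probe("ping", True)
probe("switch", True)
print(sent_web_down)
```

Note that a genuine report from the bottom of the tree is only sent after every level above it has been probed successfully, so deeper trees mean longer delays.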

And that’s it.