We recently had a PR come in to add Nagios checks to ZooKeeper. We’ve always been opinionated about Ganglia/rsyslog as the metrics/log solution for the big data bundles, but I’m curious how people feel about removing those components from the bundles and documenting alternatives instead.
Our core big data bundles (like hadoop-processing) were meant to be a foundation for modeling custom solutions. If people prefer Nagios/Beats/etc., they would have to manually edit bundle.yaml and swap in their own monitoring stack.
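Roughly, the manual swap looks like this (a minimal sketch; charm names and relation endpoints are illustrative, not lifted from the actual hadoop-processing bundle):

    # bundle.yaml excerpt: drop the ganglia/rsyslog services and their
    # relations, then add the monitoring stack you prefer, e.g. nagios/nrpe
    services:
      # ganglia: {charm: cs:ganglia}     # removed
      # rsyslog: {charm: cs:rsyslog}     # removed
      nagios:
        charm: cs:nagios
        num_units: 1
      nrpe:
        charm: cs:nrpe                   # subordinate; no units of its own
    relations:
      - ["namenode:juju-info", "nrpe:general-info"]   # attach nrpe to each service to monitor
      - ["nrpe:monitors", "nagios:monitors"]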
Since things like the nrpe, *beats, telegraf, and ganglia-node charms can all provide system metrics out of the box, maybe we’re better off making the core bundles monitoring-neutral and updating the READMEs to document the various options. Thoughts?
I don’t really have an opinion on the opinionated stuff… but a while ago I was trying something similar with databases, and it crossed my mind to do the same with monitoring: build a generic interface. A JDBC-style interface is pretty straightforward, and monitoring could work the same way, where the interface provides enough information for units to register on the monitoring service, and if they want to implement some app-specific monitoring on top of that, they could.
This way you could hook big data services up to your monitoring stack of choice and still get sane output without having to configure each one by hand.
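To sketch what I mean, a unit implementing such an interface might publish relation data along these lines (the key names are entirely hypothetical, not an existing interface schema):

    # hypothetical data published over a generic "monitoring" relation:
    # enough for any monitoring stack to register the unit automatically
    hostname: namenode-0
    private-address: 10.0.4.12
    metrics-port: 9104        # where the unit exposes its metrics
    checks:                   # optional app-specific checks to register
      - name: hdfs_namenode_up
        command: check_http -H localhost -p 50070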
I think it is important to make a distinction between performance metrics and incident monitoring. Ganglia is focused on capacity and demand management, with some indications for troubleshooting when there are application performance issues. It is a great tool for drilling down into your environment to see what’s been affected at any given time, or for pulling metrics related to your business cycle to know when to purchase new hardware or whether you’re over-built. While Ganglia can add alerting for thresholds or hosts down, it’s not really well suited to service- and process-level alerting.
Nagios, on the other hand, is most useful for detecting and alerting on services that are down (systemd checks), non-responsive (API checks for timeouts or bad admin status), or in less-than-ideal condition (think CRM resource failures). Checks that might make more sense for Nagios on ZooKeeper: ensuring that all services installed by the charm are running, checking any sort of admin status API for indications of issues, or perhaps alerting when transaction times exceed some number of seconds configurable by a charm variable.
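That last one could be as simple as a charm config option that the charm renders into an NRPE check. Something like this (the option name and default are hypothetical, not part of the existing zookeeper charm):

    # hypothetical addition to the zookeeper charm's config.yaml
    options:
      nagios_latency_warn_secs:
        type: int
        default: 5
        description: >
          Warn via Nagios when ZooKeeper's average request latency exceeds
          this many seconds (e.g. derived from zk_avg_latency as reported
          by the 'mntr' four-letter command).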
The discussion of Ganglia vs. Nagios aside, you bring up an interesting point with regard to preferred monitoring stacks for our example bundles. BootStack has in the past used an overlay bundle which had the services we wanted to monitor stubbed out as {service_name: {charm: cs:charm_name}}, and then landed the nagios/nrpe/prometheus/telegraf/etc. pieces we wanted on top. Perhaps you could provide a couple of different stubbed-out “hadoop-processing-monitoring” bundles offering those options (a sketch follows at the end of this post).

I don’t think it’s wrong to have an opinionated bundle, but documentation on what’s critical for the product to function vs. what’s monitoring/metrics/log aggregation that can be modified per user needs would definitely be useful, so people who want to customize their bundle can do so without breaking the product, even without knowledge of all the working parts.
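Something in this shape, say. Charm names and relation endpoints here are from memory and purely illustrative (this flavour layers telegraf/prometheus; a nagios/nrpe variant would look like the earlier sketch), and the stubs must match the service names in the base bundle:

    # sketch of a "hadoop-processing-monitoring" overlay bundle
    services:
      namenode: {charm: cs:hadoop-namenode}                  # stub: already in base bundle
      resourcemanager: {charm: cs:hadoop-resource-manager}   # stub: already in base bundle
      prometheus:
        charm: cs:prometheus2
        num_units: 1
      telegraf:
        charm: cs:telegraf    # subordinate; rides along on each stubbed service
    relations:
      - ["namenode:juju-info", "telegraf:juju-info"]
      - ["resourcemanager:juju-info", "telegraf:juju-info"]
      - ["telegraf:prometheus-client", "prometheus:target"]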