Monitoring Overview

In order to keep things sane, we've tried to stick to a single monitoring paradigm. Our methodology has also been designed to minimize the workload across the machines involved.

Overall, our monitoring procedure can be summed up by the image below. But there's a lot going on in that graphic, so jump past it, and we'll take it one piece at a time.

Monitoring Host

First up is the monitoring host. This is the machine that keeps tabs on everyone else. In addition to the necessary language support (generally Perl), there are three key services running: CRON for automation, a database to store the data, and a web server of some flavor.

We use CRON to schedule periodic data collections from the hosts being monitored. CRON comes bundled with most UNIX-based operating systems.
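As a sketch, a crontab on the monitoring host might look like the following. The script names and the schedule are placeholders for illustration, not our actual ones:

```
# m    h  dom mon dow  command
*/15   *  *   *   *    /usr/local/monitor/bin/collect_remote.pl
0      *  *   *   *    /usr/local/monitor/bin/update_graphs.pl
```

The first entry would pull data from the monitored hosts every 15 minutes; the second would regenerate graphs hourly.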

The type of database service running on the host depends on the monitoring package being run. For the most part, RRDTool is the required database, but there is the occasional need for a random-access database, such as MySQL.

The web server is not strictly necessary. The only reason to include it is to allow the graphs to be easily accessible from another machine via a web interface. We tend to favor Apache.

Those are the three major components of the monitoring host. Other necessary pieces involve communication with the hosts and other devices being monitored. More on that later.

Local Data Collection

On each of the systems being monitored, we schedule CRON jobs to collect and store data locally. The monitoring host then collects that data later.
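A minimal local collector could be as simple as the script below, which appends a timestamped load-average sample to a spool file for the monitoring host to fetch later. The spool path and file name are hypothetical:

```shell
#!/bin/sh
# Hypothetical local collector: append one timestamped sample of the
# 1-minute load average to a spool file.
SPOOL=/tmp/monitor_spool            # stand-in path for illustration
mkdir -p "$SPOOL"

# Record epoch time and the 1-minute load average on one line.
echo "$(date +%s) $(uptime | sed 's/.*load average[s]*: //' | cut -d, -f1)" \
    >> "$SPOOL/loadavg.log"
```

Run from CRON every 15 minutes, this builds up a local log that a single remote command can sweep up in one pass.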

We break data collection into two parts like this for a couple of reasons. First, it reduces the stress on the main monitoring host. When you're only monitoring a handful of machines this isn't a big deal, but it becomes more important with every machine you add. Second, it cuts down on network traffic. We can't all have our systems on an InfiniBand network with nanosecond latency, so cutting down on the amount of communication that has to cross the network leaves more bandwidth for other day-to-day operations.

Remote Data Collection

CRON jobs scheduled on the central monitoring host retrieve the data stored on the remote hosts. The collection method depends on the type of device the data is coming from. In the case of TempMon, that method is a wget command. For PDUMon (not yet released), the mechanism is SNMP. For pretty much every other monitor, data is gathered through a command issued over rsh or ssh.
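For illustration, the three collection flavors might look like this. The hostnames, community string, OID, and paths are all placeholders:

```shell
# TempMon: the device serves its readings over HTTP.
wget -q -O /var/monitor/tempmon/rack1.dat http://tempmon-rack1/data

# PDUMon: readings come back via SNMP (placeholder OID shown).
snmpget -v2c -c public pdu-rack1 .1.3.6.1.4.1.318.1.1.12.1.16.0

# Everything else: sweep up the locally spooled data over ssh.
ssh monitor@node01 'cat /var/monitor/spool/loadavg.log' \
    >> /var/monitor/data/node01.loadavg
```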

In order to completely automate the monitoring process, the central monitoring host has to be able to harvest the data from the remote devices without requiring a password from a user. Otherwise someone would get to sit at a keyboard all day (and night) doing nothing more than typing in a password every 15 minutes.

For methods that don't require a password (wget and SNMP versions 1 and 2), this doesn't matter: the data is collected, no passwords are asked for, and things continue on. For more secure methods (SNMPv3, rsh, and ssh), things are a little trickier. For SNMPv3, the answer is to hand the password over to the monitor at installation. For everything else, the method is rsh or ssh.

For the remote commands rsh and ssh, a generic user and group is required on all systems that will be monitored, and that user has to be configured on every machine to allow password-less access. The account can have a disabled password, but it must be able to reach all machines without the use of a password. This can be done via rsh and a .rhosts file, ssh and a .shosts file, or (our favorite) an ssh key exchange for the generic user.
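Under that setup, the key exchange for a hypothetical generic user named monitor boils down to a few commands run on the monitoring host as that user:

```shell
# Generate a key pair with an empty passphrase so CRON jobs
# can use it unattended.
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa

# Install the public key on each monitored machine
# (repeat per host; node01 is a placeholder).
ssh-copy-id monitor@node01

# After the exchange, commands run without a password prompt:
ssh monitor@node01 uptime
```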

Exchanging ssh keys for a specific user isn't exactly the easiest or most straightforward setup, but we feel it is the more secure method. And since we use it and like it so much, we've taken the time to figure it out and automate the process.

Data Storage

Once the data has been pulled from the remote machines and devices back to the monitoring host, it is stored in a database. The majority of our monitors use RRDTool databases. RRDTool is a lightweight, round-robin database tool designed for efficiently storing time-series data and generating graphs from it.
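As a sketch of how one of these round-robin databases comes together (the file name, step, heartbeat, and retention are illustrative, not our actual settings):

```shell
# Create a database for one GAUGE value sampled every 15 minutes (900 s),
# keeping about 30 days of 15-minute averages (2880 rows).
rrdtool create load.rrd --step 900 \
    DS:load:GAUGE:1800:0:U \
    RRA:AVERAGE:0.5:1:2880

# Each collection cycle feeds in the newest sample ("N" = now):
rrdtool update load.rrd N:0.15

# And a graph is rendered on demand:
rrdtool graph load.png --start -1d \
    DEF:l=load.rrd:load:AVERAGE LINE1:l#0000ff:load
```

The round-robin design is what keeps it lightweight: the database never grows, because old samples are consolidated and overwritten in place.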

While the RRDTool databases are great for a lot of our monitors, we do have occasions when a monitor requires a random-access database. For those we use MySQL. It's free and easy to use, which makes it an attractive solution.

Graphing and Viewing

Graphing the data and viewing those graphs is left to a web browser, a web server, and a handful of server-side scripts written in Perl.

Most of the pages needed to generate and display graphs are fairly static and are generated once the monitor has been installed and configured for the environment it's monitoring. Other monitors require more dynamic pages and make use of CGI scripting with Perl.
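As a toy stand-in for one of those dynamic pages (our real CGI scripts are Perl, but the same shape works in shell; the graph path is hypothetical), a CGI script just prints a header and the HTML that embeds a graph:

```shell
#!/bin/sh
# Minimal CGI sketch: emit the required Content-type header, a blank
# line, then an HTML page embedding a pre-rendered graph image.
render_page() {
    echo "Content-type: text/html"
    echo ""
    echo "<html><head><title>Load</title></head><body>"
    echo "<h1>System Load</h1>"
    echo "<img src=\"/graphs/load.png\" alt=\"load graph\">"
    echo "</body></html>"
}
render_page
```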

To view the monitor displays, simply point the web browser of your choice at the web server running on the monitoring host (we use Apache). The information and graphs are right there, and the links will take you to different slices of information.

There you have it. That's the long and short (mostly long) of our monitoring paradigm. For questions, comments, or concerns, contact us; we'll clear up anything we can.