Frequently Asked Questions

General

  1. What does SARGE have to offer over other System Activity Reporters?

    The advantages of using SARGE are fairly significant:

    Low Client Overhead - Local data collection is performed by the native vendor tool (sar), so overhead on the client is low.

    Low Data Collection Overhead - The new code has been rewritten with an emphasis on efficiency and stability. Only recent data is collected, which reduces bandwidth and CPU time on both the client and the server.

    Low Web Overhead - The original version of SARGE used Berkeley DB and Perl for data storage, processing, and graph generation. Graphs had to be generated every 15 minutes for web page display, whether or not anyone was viewing them. This could bring a Sun Ultra 60 to its knees trying to keep track of only 15 hosts.

    New versions of SARGE have been converted to use RRDTool, a lightweight (i.e., FAST) database with tools designed for efficiently displaying web graphs. Monitoring over 100 systems is now easily accomplished, with room for growth.

    Historical Graphs - SARGE stores data for 1 year and can display any time slice during that period. Long-term views of a few months are valuable for determining usage trends and for capacity planning. Shorter views of a week or a few days can give coarse efficiency diagnostics for model code or data access methods. Daily views can be useful for problem detection and have even led to the detection of a break-in through abnormal use.

Installation

  1. What system requirements are needed to run SARGE?

    This depends on the number of systems being monitored and on what other tasks the central system is running. A dedicated Intel PIII 1.0 GHz with 512 MB RAM was able to monitor 35 systems without problems.

  3. What is required to run SARGE?

    Perl Version 5.8.x or greater.

    Perl Modules:

    Cwd (Cwd.pm)
    File::Basename (File/Basename.pm)
    FileHandle (FileHandle.pm)
    Getopt::Long (Getopt/Long.pm)
    Net::Ping (Net/Ping.pm)
    POSIX (<arch>/POSIX.pm)
    Time::Local (Time/Local.pm)

    These are all core modules that have been included with every version of Perl 5, so you shouldn't need to install them yourself.

    Web server (Apache)

    RRDTool (http://www.rrdtool.org), installed with its Perl module.

    SAR - The vendor's version of 'sar' installed on each system to be monitored.

  5. How much space should I allocate for data collection?

    Roughly 3 MB per host per year for data storage, and another 1 MB per host for graph display. Systems with a large number of disk drives may require an additional 3 MB of space.
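
    As a rough planning aid, the figures above can be turned into a small calculation. This is only a sketch: the function and parameter names are made up for illustration, and the numbers are the approximate ones quoted in this answer.

```python
# Rough disk-space estimate for a SARGE server, using the per-host figures
# quoted in this FAQ. Function and parameter names are illustrative only.

DATA_MB_PER_HOST_YEAR = 3   # data storage per host per year (from the FAQ)
GRAPH_MB_PER_HOST = 1       # graph display space per host (from the FAQ)
EXTRA_MB_MANY_DISKS = 3     # extra allowance for a host with many disks (from the FAQ)


def estimate_space_mb(hosts, hosts_with_many_disks=0):
    """Return an approximate space requirement in MB for one year of data."""
    if hosts < 0 or hosts_with_many_disks < 0:
        raise ValueError("host counts must not be negative")
    if hosts_with_many_disks > hosts:
        raise ValueError("hosts_with_many_disks cannot exceed hosts")
    base = hosts * (DATA_MB_PER_HOST_YEAR + GRAPH_MB_PER_HOST)
    return base + hosts_with_many_disks * EXTRA_MB_MANY_DISKS


if __name__ == "__main__":
    # 100 monitored hosts, 10 of which have a large number of disks
    print(estimate_space_mb(100, 10), "MB")
```

    Treat the result as a lower bound and leave headroom, since the per-host figures are only rough.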

  7. What Operating Systems are supported?

    SARGE v0.10.0 is a complete code rewrite. Support for several operating systems was removed to make development easier; most of it will be restored in future versions. The following table shows operating system support status as of July 2008.

    Operating System     Support
    AIX 5.3              Planned
    AIX 5.4              Planned
    Linux SysStat 5      Supported
    Linux SysStat 7      Supported
    Mac OSX Tiger        Planned
    Mac OSX Leopard      Planned
    Irix 6.4             Not Planned
    Irix 6.5             Planned
    Solaris 7            Planned
    Solaris 8            Supported
    Solaris 9            Supported
    Solaris 10           Supported

  9. Can I put a system into 2 different groups?

    Well, yes, but the system will have data collected from it twice.

Operation

  1. If a machine is off the network, will the data still be collected and graphed?

    Maybe...it depends on what's "down".

    If the remote system is down, then no sar data is being collected, so there is nothing to graph.

    If the remote system is unreachable due to a network problem, SARGE will attempt to collect the previous 3 hours of data automatically. Anything beyond that will be lost.

    If the SARGE data collection server is down for a significant length of time (a few days), historical data can be collected using the "loadall" script. (Be sure to turn off the crontab call to SARGE while doing this.)

  3. Can you control how much data is collected and how long it is kept?

    Not easily.

    RRDTool requires data to be collected at specific intervals, or else the data is considered invalid. RRDTool also requires that storage be pre-allocated for the predicted size. These constraints are part of what keeps RRDTool's processing overhead small.

    Five minute data collection intervals and storage of one year of data per host were chosen as the SARGE limits based on input from colleagues and years of running monitoring tools.
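
    To give a feel for what pre-allocation means at these limits, the sketch below counts how many slots one data series would need if every five-minute sample were kept at full resolution for the whole year. That is an assumption made for illustration; how SARGE actually lays out and consolidates its RRDTool archives is not described here.

```python
# Number of five-minute slots in one year of history.
# Assumes every sample is kept at full resolution (an upper bound, not
# necessarily how SARGE configures its RRDTool databases).

SECONDS_PER_DAY = 24 * 60 * 60
STEP_SECONDS = 5 * 60     # five-minute collection interval (from the FAQ)
RETENTION_DAYS = 365      # one year of history (from the FAQ)


def rows_needed(step_seconds=STEP_SECONDS, retention_days=RETENTION_DAYS):
    """Return how many fixed-size slots cover the retention period."""
    if step_seconds <= 0:
        raise ValueError("step_seconds must be positive")
    return retention_days * SECONDS_PER_DAY // step_seconds


if __name__ == "__main__":
    print(rows_needed(), "slots per data series")
```

    Because the slot count is fixed when a database is created, changing either the interval or the retention period means recreating the databases.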

    The intrepid programmer can dive into the SARGE code to change these, if desired. This is not recommended. Or supported.

  5. Does SARGE run continually in the background on the server? How about on the clients?

    No. The central server collects data from monitored systems every 15 minutes. Each remote system collects data locally every 5 minutes.

  7. Does SARGE use up a lot of system resources on the server? What about the clients?

    Resource use on the clients (i.e., the systems being monitored) is negligible (< 1% CPU utilization).

    Resource use on the central server depends on the number of clients being monitored. Our operational server spikes to approximately 30% CPU usage for a minute every 15 minutes during data collection periods.

  9. What about firewalls and the operation of SARGE?

    Data collection from clients will work as long as correct ports (port 514 for rsh, port 22 for ssh) are open and the 'sarge' user can perform data collection without a password.

    Graph display is via a web interface (port 80 for http).
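
    A quick way to see whether a firewall is in the way is to try a plain TCP connection to each port. The sketch below does only that; the host name is a placeholder, and the helper is not part of SARGE.

```python
import socket

# Ports named in this FAQ: ssh (22) and rsh (514) for data collection,
# http (80) for graph display.
COLLECTION_PORTS = {"ssh": 22, "rsh": 514}
WEB_PORT = 80


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


if __name__ == "__main__":
    client = "client.example.com"   # hypothetical monitored host
    for name, port in COLLECTION_PORTS.items():
        state = "reachable" if port_open(client, port) else "blocked or closed"
        print(f"{name} (port {port}): {state}")
```

    A reachable port only shows that the firewall allows the connection. It does not show that the 'sarge' user can log in without a password, which still has to be tested separately.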

  11. Why can't I load historical data?

    Is there data to be loaded? SAR data is pruned by crontab entries (see sa2) every few weeks to avoid filling disks.

    Is there already data in the RRDTool database? RRDTool is a "forward only" database, not a random access database. This means that only data with newer time stamps may be inserted into the database. Attempts to load old data will raise RRDTool errors.
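
    The forward-only rule can be pictured as a simple filter: a sample is accepted only if its timestamp is later than everything already stored. The sketch below illustrates the rule in plain Python. It does not call RRDTool, and the function name is made up for this example.

```python
def newer_samples(samples, last_update):
    """Keep only the samples a forward-only store would accept.

    samples     -- iterable of (epoch_seconds, value) pairs, in any order
    last_update -- epoch seconds of the newest sample already stored
    """
    accepted = []
    newest = last_update
    # Process in time order so each accepted sample moves the cutoff forward.
    for ts, value in sorted(samples, key=lambda sample: sample[0]):
        if ts > newest:
            accepted.append((ts, value))
            newest = ts
    return accepted


if __name__ == "__main__":
    stored_up_to = 200
    incoming = [(300, 1.0), (100, 2.0), (600, 3.0), (600, 4.0)]
    # The sample at 100 is older than the stored data, and the second
    # sample at 600 repeats a timestamp, so both are dropped.
    print(newer_samples(incoming, stored_up_to))
```

    This is why historical data has to be loaded before newer data for the same host is entered, not afterwards.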

Data Collection

  1. No data is being collected on the client. What's wrong?

    Make sure that the vendor "sar" package is loaded. "sar -A" is a good, quick test.
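
    The same check can be scripted, for example to run it across many clients. The sketch below is a generic helper, not part of SARGE: it reports whether a command is on the PATH and exits successfully, and treats a non-zero exit from "sar -A" as a sign that collection is not set up.

```python
import shutil
import subprocess


def tool_runs(command, args=(), timeout=30):
    """Return True if `command` is found on PATH and exits with status 0."""
    path = shutil.which(command)
    if path is None:
        return False            # not installed, or not on PATH
    try:
        result = subprocess.run(
            [path, *args],
            stdout=subprocess.DEVNULL,   # the report itself is not needed here
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if tool_runs("sar", ["-A"]):
        print("sar is installed and returned data")
    else:
        print("sar is missing, or has no data to report")
```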

  3. Data is being collected... except disk data. Huh?

    Some versions of sar (sysstat) on Linux systems do not collect disk data by default. This can be fixed by adding the "-d" option to the data collection line in /usr/lib/sa1, or (re-)running the "enable_sar" script included with the SARGE software.

    If it's not a Linux system, I have no idea.

Display

  1. Data appears to be collecting, why aren't the graphs displaying?

    Run "webgen" and see what happens.

    If that doesn't work, it's possible that there is a problem with the sar data being entered into RRDTool. RRDTool is a forward only database and some versions of sar have been known to enter data with timestamps in the future. We've tried to correct for this, but haven't been able to do so in all cases.

  3. Why does it display "Too Many Disks" in my disk io and disk wait graphs?

    The maximum number of disks that can be displayed is 26. Any more than that would clutter the graphs to the point of uselessness.

  5. What is "steal" in the CPU graphs?

    This applies only to more recent versions of sar on Linux systems.

    From the man page:

    %steal:

    Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.

  7. The default screen shows "-24h" in Start and "now" in the End boxes, why doesn't it show a date?

    The default display period is 'the last 24 hours.' "-24h" and "now" are how RRDTool specifies that period.
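
    To show what these values mean, the sketch below turns "now" and simple relative offsets into epoch seconds. It handles only a small subset (hours, days, weeks) chosen for illustration; RRDTool itself accepts a much richer time syntax, and this helper is not part of SARGE or RRDTool.

```python
import re
import time

# Seconds per unit for the small subset handled in this sketch.
_UNITS = {"h": 3600, "d": 86400, "w": 604800}


def resolve(spec, now=None):
    """Turn 'now' or a relative offset such as '-24h' into epoch seconds."""
    now = int(time.time()) if now is None else int(now)
    if spec == "now":
        return now
    match = re.fullmatch(r"-(\d+)([hdw])", spec)
    if match is None:
        raise ValueError(f"unsupported time spec: {spec!r}")
    count, unit = match.groups()
    return now - int(count) * _UNITS[unit]


if __name__ == "__main__":
    end = resolve("now")
    start = resolve("-24h", now=end)
    print("start:", time.ctime(start))
    print("end:  ", time.ctime(end))
```

    Entering a different offset in the Start box, such as "-1w", moves the start of the displayed period back accordingly.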

  9. How do I make the graphs bigger/smaller?

    Re-run "webgen" with the "-p <percent>" option. The default size is 75%.

    Ex - Smaller: ./webgen -p 50

    Ex - Larger: ./webgen -p 150

  11. Can I change the limits of the graphs?

    Again, not easily. RRDTool needs to have limits chosen at the time the databases are created. The limits chosen seem reasonable and are based on years of experience with SAR data and performance monitoring.

  13. Why are disk graphs not displaying?

    Is disk data being collected? See above.

    A bug in the code? Well, maybe. You could run "private" manually with debug options "-d 789A" turned on.

    Re-run "webgen".

    There's a "chicken and egg" problem with generating graphs for systems. SARGE cannot pre-allocate RRDTool databases for a system because it does not know how many disks the system has, or their names, until data is actually collected. Once data has been collected, webgen can be re-run; it will then detect the disk RRDTool databases and generate graphs for them.
