Gnmond Documentation

  1. Introduction
    1. Network Topology
    2. Data Flow
  2. Getting Started
    1. Requirements
    2. RPM Installation
    3. Manual Installation
    4. Configure Gnmond for a Cluster
    5. Troubleshooting
    6. Configure Nagios
    7. Further Steps
  3. Gnmond in Detail
    1. Control Flow
    2. Logging
    3. Records
  4. Configure Gnmond
    1. General Configuration
    2. Add Cluster
    3. Group Management
    4. Metric Storage
    5. The Analyze Function
    6. The Records
    7. Lost Clusters
    8. Default Checks
    9. Logging
    10. Adding Health Plugins
  5. Extend Gnmond
    1. Gnmond's Plug-in Design
    2. Input Plugins
    3. Output Plugins
  6. Gnmond Plugins
    1. Input
    2. Output
  7. Conclusions
  8. Pydocs
  9. Examples
    1. Health Plugins
    2. Output Plugins
    3. Input Plugins
    4. Nagios Configuration



Introduction

Gnmond is a monitoring tool for computer clusters that collects data originating from individual hosts, analyzes it, and provides a concise view of each cluster's state to other monitoring systems.

Gnmond was originally designed to collect data from the Ganglia Gmond daemons, aggregate and analyze the data on a per-cluster basis, and supply the cluster state summary to Nagios; hence the name of the tool, "Ganglia Nagios MONitoring Daemon".

Thanks to the plug-in implementation of Gnmond, its functionality can be easily extended beyond the original design and its data processing logic is highly customizable.


Network Topology

Gnmond arranges nodes in the network into clusters (also called "communities" in Ganglia). Gnmond is able to monitor several of those communities. If used with Gmond, Gnmond connects to one node in the cluster to get the collected data from all the nodes.
Gmond is responsible for collecting the data and spreading it out to all nodes in the cluster.
Gnmond analyzes and aggregates the data and provides it to other tools, like Nagios.



Data Flow

Data has to be collected by another tool (like Gmond). It is collected in the form of key/value pairs called metrics (for example free_memory, load, free_swap etc.). Each computer or cluster can have its own metrics (depending on its function). Gnmond checks those metrics regularly, by default every minute.

The collected data is then analyzed and aggregated. The analyze function is highly customizable through so-called health plug-ins. A health plug-in defines some records (for example one record per monitored cluster, or one record for monitoring memory and one for load). A record consists of a name connected with some values (at least a status and a short description string). The analysis is repeated regularly.

The states of the records are then stored, and Gnmond waits for a client to ask for those values (for example Nagios, or yourself using telnet). If the client is allowed to see the data, the record states are returned to it.



Getting Started

This section briefly explains how to install Gnmond on your system and configure it to monitor your local PC or a simple cluster. It assumes that you have some general knowledge of Ganglia (which is used to collect the metrics).
First you have to install ganglia-gmond on all systems that should be monitored. If you want to check whether Ganglia works as expected, install ganglia-gmetad on your local system and configure it to check Gmond on your PC or cluster. If you install ganglia-web you can look at the values collected by Ganglia through a web interface.
It is important for Gnmond that the clusters are set up properly in Ganglia. We will therefore start with one cluster, consisting of your local PC or a bunch of nodes. If you use multicast UDP in Ganglia, every node should be able to communicate with all other nodes, otherwise Gnmond might not work as expected. Once you have set up Ganglia, remember the name you have given to the cluster, since you will need it in Gnmond.
If you want to monitor only your local PC, you should name this cluster "unspecified" (which is the Ganglia default value). In this case, Gnmond will work out of the box.


Requirements

Gnmond needs Python, including PyXML, to run. It has been tested with Python 2.4.3 and 2.3 but is expected to work with other versions too.
To test Gnmond with Ganglia on your own system you need to have Gmond (a part of Ganglia) installed. Gnmond has been tested with Gmond 3.0.6 and 3.1.2.



RPM Installation

Download and install the Gnmond RPM. The installer configures Gnmond as a start-up service that checks a Gmond instance on localhost. If you have a running Gmond on your PC and the cluster Gmond is monitoring is called "unspecified" (the default), you can test whether your installation succeeded by accessing Gnmond over telnet (note that at the moment only localhost is allowed to access Gnmond):
telnet localhost 46666
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
unspecified: 0 Everything looks fine
gnmond_healthPlugins: 0 All plugins are Running
gnmond_clusters: 0 All clusters are available
Connection closed by foreign host.
The line unspecified: 0 Everything looks fine is the status message for your computer.

If you did not call your local cluster unspecified, you have to replace the cluster name in the file /usr/local/Gnmond/Plugins/Health/localhost.py. This can be done with
cd /usr/local/Gnmond/Plugins/Health
sed -i 's/unspecified/your_name/g' localhost.py
Then restart Gnmond with
service Gnmond restart



Manual Installation

Instead of using the RPM you can install Gnmond manually. Download the Gnmond source tarball and unpack it. Now run
./Gnmond.py
This will start Gnmond in the background. You should be able to connect with telnet.
If you want to run Gnmond as a service, you have to add the file sbin/Gnmond to your start-up scripts (on Scientific Linux copy it to /etc/init.d/) and then run
chkconfig --add Gnmond
Furthermore you have to make a link from /usr/bin/Gnmond to Gnmond.py. This can be done with
ln -s PATH_TO_GNMOND/Gnmond.py /usr/bin/Gnmond
Now you should be able to start Gnmond with
service Gnmond start



Configure Gnmond for a Cluster

To configure Gnmond for a cluster, open the file /usr/local/Gnmond/Plugins/Health/localhost.py. There, change clusterName = "unspecified" to clusterName = "Name_of_your_cluster_in_Ganglia".
Furthermore you have to define the nodes of this cluster. Add all nodes to the list computeNodes (in the file localhost.py). If you have some login or file server nodes, you can add them to loginNodes or fileNodes instead.
Once you have added all nodes in the cluster to one of those lists, restart Gnmond with
service Gnmond restart
and try to connect with telnet
telnet localhost 46666
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Name_of_your_cluster_in_Ganglia: 0 Everything looks fine
gnmond_healthPlugins: 0 All plugins are Running
gnmond_clusters: 0 All clusters are available
Connection closed by foreign host.
You should now see a health status for your cluster.


Troubleshooting

If you cannot connect to Gnmond over telnet, the most likely cause is that Gnmond has detected an invalid health plug-in localhost.py. To see the exact output, run
Gnmond --debug --nodaemon --file=localhost
Now Gnmond should tell you why it fails to start. Most likely there are syntax errors in localhost.py, or the cluster you defined could not be reached (for example because of an invalid cluster name or hostname, or a firewall blocking access to this node).
If Gnmond is running but you cannot connect to it, check whether TCP port 46666 is already taken by some other program. Check the debug output of Gnmond for something like error: (98, 'Address already in use'). If this is the case, either stop the other program from using port 46666 or change the port in Gnmond's Telnet plug-in under /usr/local/Gnmond/Plugins/Output/Telnet.py.
If Gnmond is running and listening on port 46666, check whether you are allowed to connect to Gnmond. Check the debug output for something like "Receiving TCP connection from unknown host ADDRESS". This would mean you are trying to connect to Gnmond using address ADDRESS instead of "localhost". Change the line addAllowedServer("localhost") in the file /usr/local/Gnmond/Plugins/Health/localhost.py to addAllowedServer("ADDRESS") to enable access from this address.
If Gnmond fails to get values from your cluster, try to access them manually with
telnet name_of_one_node 8649
If you do not see any values here, you are not able to connect to Gmond.
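The two manual telnet checks above can also be scripted. The following is a minimal sketch using only the Python standard library; can_connect is a hypothetical helper name and not part of Gnmond, and it only tells you whether a TCP connection can be opened, not whether the peer sends valid data.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0
    except socket.error:
        # e.g. the host name could not be resolved
        return False
    finally:
        sock.close()
```

For example, can_connect("localhost", 46666) tests the Telnet plug-in and can_connect("name_of_one_node", 8649) tests Gmond on one node.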


Configure Nagios

To use the Gnmond records in Nagios, you have to add the plug-in Nagios/check_gnmond to your Nagios installation. check_gnmond takes as arguments the name of the host where Gnmond is running and the name of the record that should be checked. See also
Nagios/check_gnmond --help
For details about Nagios configuration see the Nagios documentation.
For a sample configuration see Nagios Configuration Sample.


Further Steps

As next steps you could try to add more clusters or start writing your own analyze function.


Gnmond in Detail

Gnmond has a core part and some plug-ins. The core consists of:



Control Flow

Gnmond first tries to find as many valid health plug-ins as possible. Then it initializes them and runs a first fetch-and-analyze round. Once the first values are computed, the output plug-ins are started (every output plug-in in its own thread).

The output plug-ins wait for incoming connections from allowed hosts. The main thread checks from time to time whether the plug-ins are still alive; if not, it restarts them.

After the output plug-ins have been started, the main thread starts the HealthPluginManager thread. This thread consists of an infinite loop that performs one iteration every minute.

An iteration consists of three stages: first, check the clusters for new metrics and fetch them if any are available. Then determine which health plug-ins have to be executed in this iteration. Then execute them.
After the execution of the analyze functions the loop collects all records, provides them to the output plug-ins, and waits for the next iteration.

If the HealthPluginManager has to execute some code from input or health plug-ins, it executes this code in a new thread (called a HealthPluginThread). The HealthPluginManager waits for this thread to return before it continues. However, if the thread takes too long, the HealthPluginManager marks it as failed and tries to kill it (this might not work if the thread is waiting for I/O or is hanging in some external C code, so take care of those cases if you write a plug-in). The HealthPluginManager does not wait until the thread is killed, but continues.
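The waiting-with-a-time-limit behaviour described above can be approximated with the standard threading module. This is only a sketch and not Gnmond's actual code: run_with_timeout, max_executing_time and slow_analyze are hypothetical names, and unlike Gnmond the sketch does not try to kill the worker but simply stops waiting for it.

```python
import threading
import time


def run_with_timeout(func, args=(), max_executing_time=1.0):
    """Run func(*args) in a worker thread and wait at most max_executing_time seconds.

    Returns (finished, result). finished is False if the worker was still
    running when the time limit expired. If func raises, the result is None.
    """
    outcome = {}

    def worker():
        outcome["result"] = func(*args)

    thread = threading.Thread(target=worker)
    thread.daemon = True  # a hanging worker must not block interpreter exit
    thread.start()
    thread.join(max_executing_time)  # wait, but only up to the limit
    if thread.is_alive():
        return False, None  # treat the plug-in call as failed and move on
    return True, outcome.get("result")


def slow_analyze():
    """Hypothetical analyze() that takes longer than allowed."""
    time.sleep(0.5)
```

Calling run_with_timeout(slow_analyze, max_executing_time=0.05) returns without waiting for the sleep to finish, which mirrors the manager continuing while the failed thread is still around.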

If you want to stop Gnmond, you have to send a terminate signal to the main thread. The signal is caught, and the main thread tries to terminate the output plug-ins nicely. If this is not possible, it kills them. The HealthPluginManager thread and all its children are killed immediately (so take care of this in your plug-ins: they might be killed at any time!).

The different threads communicate through the global variables records, allowedHosts and plugin.



Logging

Gnmond provides the GnmondLogger class to do logging. GnmondLogger logs to syslog if Gnmond is running as a daemon, and to stdout if not. There are several instances of GnmondLogger: one for the core part (also used by the output plug-ins) and one for every health plug-in (also used when an input plug-in checks a cluster that is defined in that health file).
Every logger has a log level (for the core logger this is WARNING, but it is set to DEBUG if you use the --debug flag; for health plug-ins it can be set with the setLogging() function). Every message that should be logged has an associated log level. The logger only prints messages whose log level value is not higher than the logger's own log level. There are 4 allowed log levels:

Value  Name      Description
7      DEBUG     Debug information
6      INFO      Normal information, or additional information about errors
4      WARNING   An error occurred; Gnmond will try to work around it
2      CRITICAL  A critical error; Gnmond will exit


Important: Even if you set the log level to DEBUG, the messages might not appear in syslog, since syslog by default only logs messages with a priority of at least WARNING. To see messages with a lower priority (INFO or DEBUG), you have to edit /etc/syslog.conf.
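The filtering rule can be written down in a few lines. This sketch is not taken from GnmondLogger; should_log is a hypothetical helper, and it assumes that a message is printed when its numeric level is not higher than the level of the logger. The numeric values are those of the table above, which coincide with the standard syslog priorities.

```python
# Log level values as listed in the table above
DEBUG, INFO, WARNING, CRITICAL = 7, 6, 4, 2


def should_log(message_level, logger_level=WARNING):
    """Return True if a message of message_level passes a logger set to logger_level.

    Smaller numbers mean more severe messages, so a logger set to WARNING (4)
    lets WARNING and CRITICAL through and drops INFO and DEBUG.
    """
    return message_level <= logger_level
```

With the default level, should_log(INFO) is False, while should_log(INFO, DEBUG) is True.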


Records

The output of Gnmond is a set of records with the corresponding states. A record state consists of a status value, a short description, a long description and some perf data. The status value of a record has to be one of the following values:
Value  Name             Description
0      NAGIOS_OK        Everything looks good
1      NAGIOS_WARNING   Something is not good. May need attention, but not immediately
2      NAGIOS_CRITICAL  Something is going badly wrong. Needs attention immediately
3      NAGIOS_UNKNOWN   Gnmond cannot compute a state


The short description should give an idea of what is going wrong. It must be a single line and should be less than 70 characters long. The long description can be as long as needed. Perf data can be used to give additional data about the problem. This data could be used, for example, by pnp4nagios. If you want a Nagios plug-in to use the perf data, you have to ensure that it is in the correct format. See the Nagios documentation for more information about perf data.
Records have to be defined in the init() function of a health plug-in. You have to define some default values for status, long, short and perf.
The record state can be set in the analyze() function of a health plug-in. After the execution of all health plug-ins the collected states are stored in the global variable HealthPluginManager.records, which can be accessed by the output plug-ins. Before the next execution of the analyze() function, the records are reset to their defaults, so you have to set a non-default status in every execution of analyze(). Note that the reset is done before the function clusterFailure() is called (if it is called). If you set the same record twice in one run of analyze() and clusterFailure(), Gnmond reports the state with the higher alert level (unknown is stronger than critical, warning and OK).
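The "higher alert level wins" rule for a record that is set twice can be illustrated as follows. This is a sketch under the assumption that the alert level is simply the numeric status value; merge_states is a hypothetical helper, not a Gnmond function, and how Gnmond treats two states of equal level is not specified here (the sketch keeps the first one).

```python
# Status values as listed in the table above
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3


def merge_states(current, new):
    """Return whichever (status, short) pair has the higher alert level.

    current is the state already set in this run, new is the state passed
    to a second setRecord() call for the same record.
    """
    if new[0] > current[0]:
        return new
    return current
```

For instance, if clusterFailure() sets a record to warning and analyze() later sets it to critical, the merged state is the critical one; a later OK would not lower it again.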


Configure Gnmond

The configuration of Gnmond is done with health plug-ins. A health plug-in is a Python module defining the three functions init(), analyze() and clusterFailure(). You have to put your health plug-ins in the directory Plugins/Health. Gnmond adds all Python files in this directory to its health plug-ins on every restart (to restart Gnmond use service Gnmond restart).
The init() function is executed once at start-up of Gnmond. It is used to set the general configuration. init() receives no arguments and is expected to return nothing.
The analyze() function is executed periodically (by default every minute). As argument it receives an object of type Metrics that represents all metrics of all clusters defined in this health file. analyze() can analyze those metrics and then set the records you have defined in your init() function. analyze() is not allowed to run longer than maxExecutingTime (by default 1 second). It is expected to return nothing.
The clusterFailure() function is called if Gnmond is unable to get new metrics from one of the clusters defined in the init() function. You can use this, for example, to set all your records to unknown. clusterFailure() gets an object of type Cluster as argument, is expected to run no longer than maxExecutingTime and returns nothing.
To write a health plug-in you need functions defined by the HealthLogicFramework. To use them you have to import them with from HealthLogicFramework import *.

See also the example section.


General Configuration

General configuration is done in the init() function. You can use the following functions:
Function                    Description
addAllowedServer(SERVER)    Adds the host SERVER to the list of allowed servers. Only those servers are allowed to connect to the Gnmond output plug-ins. You should, for example, add your Nagios server, or your local PC to connect with telnet. The list of allowed servers is shared by all plug-ins.
setExecutingInterval(TIME)  Sets the execution interval. This health plug-in will be executed every TIME minutes. Default is 1. This value is used only by this health plug-in.
setMaxExecutingTime(TIME)   Sets the maximal execution time. After TIME seconds, the analyze() and clusterFailure() functions are assumed to have crashed. Default is 1. This value is used only by this health plug-in.



Add Cluster

To monitor a cluster you have to add it to the list of monitored clusters. To add a cluster you have to call addCluster() in the init() function of your health plug-in. This cluster can then be used by this plug-in (but by no other). addCluster() takes up to four arguments:
Name          Description
name          The name of your cluster (has to be a string). If you want to use Gmond as input, the name has to be exactly the same as in Ganglia, otherwise Gnmond is not able to find the cluster.
initialHosts  A list of some nodes in the cluster. They are used to get the metrics for this cluster. Gnmond queries only one node to get all metrics for the cluster, so it is sufficient to give only a few nodes; Gnmond will later get the full list of nodes in this cluster by itself.
refreshTime   Sets the checking interval. Every refreshTime minutes Gnmond will try to get new metrics from this cluster. Default is 1. Note that it normally makes no sense to set this to something different from the executing interval of the health plug-in.
checkWith     Chooses the input plug-in that will be used to get new metrics. Default is Gmond. If you want to set another plug-in, you have to import this plug-in first. See also the health examples.



Group Management

In Gnmond you can define groups of nodes. Such groups can be used to define rules for all nodes in the group. A group has to be created in the init() function with addGroup(). addGroup() takes two arguments: the name of the group and a list of its nodes.
Once you have defined a group, you can use it in your analyze() function. With getGroup(NAME) you get a Metrics object representing all nodes in this group.
See also the health examples.


Metric Storage

Sometimes it is desirable to have access to older metrics. This can be done with metric storage. You can set up a storage for a specific metric for a single node, a group of nodes or a cluster with the store() function.
This has to be done in the init() function.
To get access to stored values in the analyze() function you can use fetch() or fetchAll().

See also the health examples.


The Analyze Function

The analyze() function defines the logic to compute the record states. As an argument it receives an object of type Metrics with the metrics of all nodes in all clusters defined in the init() function of this health plug-in. You can either use this data directly or through the HealthLogicFramework. The analyze() function should set the record states with the setRecord() function. If they are not set, Gnmond resets them to their defaults (the defaults are configured in the init() function). The analyze() function is expected to return nothing and should not run longer than the maximal execution time configured in init().


The Records

Records are stored for every health plug-in independently. A record is created with the addRecord() function. The addRecord() function takes as arguments:
Name             Description
name             The name of the record. The name should not contain spaces or special characters and should not be too long (an optimal name is between 4 and 15 characters long). The name is not allowed to begin with gnmond_.
status           A default status value. Has to be 0, 1, 2 or 3.
short            A default short status message. It must not contain a newline character and should be less than 70 characters long.
long (optional)  A default long status message.
perf (optional)  Default perf data.
A record is set in the analyze() or clusterFailure() function with the setRecord() function. The arguments are the same as for addRecord().
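The constraints on the record arguments can be checked before calling addRecord(). The following sketch is not part of Gnmond; check_record is a hypothetical helper, and it assumes that "no special characters" means only letters, digits and underscores are acceptable in a name.

```python
import re


def check_record(name, status, short):
    """Return a list of rule violations; an empty list means the arguments look fine."""
    problems = []
    # Assumption: only letters, digits and underscores count as safe characters
    if not re.match(r"^[A-Za-z0-9_]+$", name):
        problems.append("name contains spaces or special characters")
    if name.startswith("gnmond_"):
        problems.append("name must not begin with gnmond_")
    if status not in (0, 1, 2, 3):
        problems.append("status must be 0, 1, 2 or 3")
    if "\n" in short:
        problems.append("short message must be a single line")
    if len(short) >= 70:
        problems.append("short message should be less than 70 characters long")
    return problems
```

check_record("test", 0, "Everything fine") returns an empty list, while a name such as "gnmond_x" is rejected because that prefix is reserved.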


Lost Clusters

Sometimes Gnmond is unable to connect to a cluster. This can be caused by a network problem, a misconfiguration or a non-responding cluster. In this case, Gnmond tries to call the clusterFailure() function of the health plug-ins defining this cluster. The clusterFailure() function receives as argument an object of type HealthPluginManager.Cluster. You can use the clusterFailure() function to set records (for example to CRITICAL or UNKNOWN). Note that clusterFailure() may be called several times, once for each failed cluster.
After calling clusterFailure(), Gnmond calls the analyze() function, but the metrics given to it may be incomplete or out of date.


Default Checks

To easily define simple health plug-ins you can use Default Checks.
They provide an easy interface to start with, but for more individual results you have to define your own checks.
See Pydocs or examples for details.


Logging

To do some logging you should use GnmondLogger. This can be done with the functions log() and setLogging().
setLogging() takes a log level as argument.
log() takes as arguments a string (the log message) and an integer (the log level).
See Logging for details about logging.


Adding Health Plugins

If you want to monitor more than one cluster, it might be desirable to have more than one health plug-in to keep things organized. This is very easy, since every Python module in the directory Gnmond/Plugins/Health/ is treated as a health plug-in. Health plug-ins do not share clusters and records, so you cannot access metrics of cluster A, defined in health plug-in A.py, from plug-in B.py. You can also define an independent logLevel, maxExecutingTime and executionInterval for each of your health plug-ins. Only the list of allowed servers is shared between the plug-ins. By default Gnmond uses all health plug-ins it finds. If you want to run only one plug-in, run Gnmond with the option --file=PLUGINNAME (PLUGINNAME without the file extension).


Extend Gnmond

Gnmond can easily be extended with plug-ins. Besides health plug-ins there are input and output plug-ins.
Input plug-ins are responsible for collecting metrics from clusters; output plug-ins provide record states to other systems.


Gnmond's Plug-in Design

On every start-up Gnmond searches for plug-ins in the directories Plugins/Health (only health plug-ins) and Plugins/Output (only output plug-ins). If it finds invalid health plug-ins, Gnmond reports this and terminates. If it finds invalid output plug-ins, Gnmond reports this but does not terminate. Input plug-ins are only loaded if they are used by a cluster (to see how to configure the input plug-in of a cluster, go to Add Cluster).


Input Plugins

An input plug-in is used to collect metrics from a cluster. Every cluster defines its input plug-in (see Add Cluster; by default Gmond). A cluster can have exactly one input plug-in; if you want to collect metrics from different sources, you have to write a wrapper plug-in.
If an input plug-in has invalid syntax, this will be noticed by the health plug-in (to use your own input plug-in, you have to import it in your health plug-in). If your input plug-in does not return correctly, the cluster is reported as failed and the clusterFailure() function is called.
An input plug-in should define a function and an attribute:
Name                  Description
getMetrics()          Gets metrics from a source and stores them in cluster.values and cluster.nodes. Gets cluster as argument.
maximalExecutionTime  The maximal execution time (integer) in seconds. After this time getMetrics() is stopped and reported as dead. maximalExecutionTime is optional. Default is 3.

The getMetrics() function

The getMetrics() function gets as argument an object of type Cluster, and should set cluster.values and cluster.nodes (at least in the first run).
cluster.values has to be set to a two-dimensional dictionary of all metrics for all nodes with the structure

{"node1": {"metric1": value11, "metric2": value12, ...}, "node2": {"metric1": value21, ...}, ...}

Furthermore it has to set cluster.nodes to a list of the names of all nodes.
Attention: Gnmond does not check whether you have reset those variables; you have to do this on your own.

To get those values it can use the list cluster.nextNodesToCheck. This list is initialized with the nodes given to the addCluster() function of the health plug-in. The input plug-in then has to manage this list on its own.

For an example see Input plug-in example
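To make the contract of getMetrics() concrete, here is a small self-contained sketch. It is not one of the shipped plug-ins: FakeCluster is a stand-in that carries only the three attributes named above, and read_metrics_from is a hypothetical data source that returns fixed values instead of doing network I/O.

```python
class FakeCluster(object):
    """Stand-in for Gnmond's Cluster type with only the attributes used here."""

    def __init__(self, initial_hosts):
        self.nextNodesToCheck = list(initial_hosts)
        self.values = {}
        self.nodes = []


# Optional attribute; 3 seconds is the documented default
maximalExecutionTime = 3


def read_metrics_from(node):
    """Hypothetical source: pretend that node reported metrics for itself."""
    return {node: {"load_one": 0.5, "mem_free": 1024}}


def getMetrics(cluster):
    """Fill cluster.values and cluster.nodes from the first node in the check list."""
    if not cluster.nextNodesToCheck:
        # Any exception marks the cluster as failed
        raise RuntimeError("no node left to check")
    node = cluster.nextNodesToCheck[0]
    values = read_metrics_from(node)
    # Overwrite both attributes explicitly; Gnmond does not reset them
    cluster.values = values
    cluster.nodes = sorted(values.keys())
```

A real plug-in would replace read_metrics_from with code that contacts the node and would also decide which node to try next when one does not answer.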

Exceptions

If something goes wrong in your input plug-in, feel free to raise any exception you like. The exception is caught by Gnmond and logged. Gnmond then marks this cluster as failed and calls the clusterFailure() function.
After 5 minutes Gnmond will retry to get metrics from this cluster.


Output Plugins

Output plug-ins provide information to other systems. You can either use the XML or Telnet plug-in and write a parser for the other system that imports this data, or you can write an output plug-in for Gnmond that provides data in the desired format. An output plug-in is a Python module in Plugins/Output. It has to provide a thread object that will be managed by Gnmond. Such an output thread should have access to the global variables records and serverList.
An output plug-in should define a function getThread() that returns a threading.Thread object (the output thread). getThread() receives as arguments:
Name        Description
serverList  A list of servers that are allowed to access data provided by Gnmond. You should only allow connections from these servers. Servers are added to this list by the health plug-ins.
logger      A GnmondLogger object, to perform some logging.
The thread returned by getThread() should not be started yet; it will be started by Gnmond. It is assumed to run forever. If an output thread crashes or terminates, Gnmond will restart it.
Furthermore, an output plug-in can define the function getName(), which receives no arguments and returns the name of the plug-in.

For an example see Output Example
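The shape of such a module can be sketched as follows. This is a skeleton under assumptions, not a working plug-in: IdleOutputThread and its is_allowed() method are hypothetical, the thread only idles instead of serving connections, and comparing host names by simple membership in serverList is an assumption about how the list is meant to be used.

```python
import threading


class IdleOutputThread(threading.Thread):
    """Skeleton output thread that keeps the arguments Gnmond passes in."""

    def __init__(self, serverList, logger):
        threading.Thread.__init__(self)
        self.daemon = True
        self.serverList = serverList
        self.logger = logger
        self.stopped = threading.Event()

    def is_allowed(self, host):
        # Only hosts added by the health plug-ins may receive data
        return host in self.serverList

    def run(self):
        # A real plug-in would accept connections here and answer allowed hosts
        while not self.stopped.is_set():
            self.stopped.wait(1.0)


def getThread(serverList, logger):
    """Return the output thread without starting it; Gnmond starts it later."""
    return IdleOutputThread(serverList, logger)


def getName():
    """Optional: report the name of this plug-in."""
    return "Idle"
```

Keeping a reference to serverList rather than copying it means that servers added later by health plug-ins would still be visible to the thread, provided the same list object is updated in place.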


Gnmond Plugins

Input

Gmond

The Gmond plug-in reads values from Ganglia. It connects to a random host in the cluster and gets an XML file with all metrics from Gmond. This file is then parsed and used. The plug-in reports all metrics collected with Ganglia and computes some mixed metrics by itself (for example load_percent, defined as load_one / num_cpu * 100). Gmond is the default input plug-in and uses TCP port 8649 to communicate with Gmond.
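The parsing step can be illustrated with the standard library XML parser. This is a sketch and not the code of the Gmond plug-in (which, according to the requirements, relies on PyXML): parse_gmond_xml is a hypothetical function, SAMPLE is a hand-written and heavily abbreviated document in the style of Gmond's output, and all values are kept as strings.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated sample; real Gmond output carries many more attributes
SAMPLE = """<GANGLIA_XML VERSION="3.1.2" SOURCE="gmond">
<CLUSTER NAME="unspecified">
<HOST NAME="pc0.psi.ch" IP="10.0.0.1">
<METRIC NAME="load_one" VAL="2.0" TYPE="float"/>
<METRIC NAME="mem_free" VAL="1024" TYPE="uint32"/>
</HOST>
</CLUSTER>
</GANGLIA_XML>"""


def parse_gmond_xml(xml_text, cluster_name):
    """Return {host: {metric: value}} for the named cluster."""
    root = ET.fromstring(xml_text)
    values = {}
    for cluster in root.iter("CLUSTER"):
        if cluster.get("NAME") != cluster_name:
            continue  # the name must match the Ganglia cluster name exactly
        for host in cluster.iter("HOST"):
            metrics = {}
            for metric in host.iter("METRIC"):
                metrics[metric.get("NAME")] = metric.get("VAL")
            values[host.get("NAME")] = metrics
    return values
```

The result has the same two-dimensional shape that an input plug-in is expected to store in cluster.values. A cluster name that does not occur in the document yields an empty dictionary.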


Collectl

The Collectl plug-in collects data from collectl running on every node of the cluster. The plug-in connects to every node and gets the values through the lexpr functionality of collectl.
The plug-in assumes that collectl has been started like

collectl -sbCDFiJmNsYZ --export lexpr -A server:46667

Thus the plug-in connects to TCP port 46667 on each node (since 46666 is taken by the Telnet plug-in).


Output

Nagios

The Nagios plug-in waits for connections from 'check_gnmond' on UDP port 46666 and reports the status of a single record. It should be used to handle Nagios connections. To see how to configure Nagios, see Configure Nagios.


Telnet

The Telnet plug-in waits for connections on TCP port 46666. It reports all available records with their corresponding states. You can connect to the Telnet plug-in with 'telnet gnmondhost 46666' (gnmondhost should be the name of the host where Gnmond is running). Gnmond reports one or more lines for each record. The first line consists of the name, the status and the short description of the record, separated by tabs. Additional lines are used for the long description and are indented by 4 spaces.
The connecting host has to be in the list of allowed servers.
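A client that reads this output could parse it roughly as shown below. This is a sketch, not part of Gnmond: parse_telnet_output is a hypothetical function, and the pattern is deliberately tolerant so that it accepts both tab-separated fields and the "name: status text" form shown in the telnet sessions earlier in this document.

```python
import re

# name, optional colon, whitespace, status digit 0-3, whitespace, short description
RECORD_LINE = re.compile(r"^([^\s:]+):?\s+([0-3])\s+(.*)$")


def parse_telnet_output(text):
    """Return {record name: {"status": int, "short": str, "long": [str, ...]}}."""
    records = {}
    current = None
    for line in text.splitlines():
        if line.startswith("    ") and current is not None:
            # Lines indented by 4 spaces belong to the long description
            records[current]["long"].append(line[4:])
            continue
        match = RECORD_LINE.match(line)
        if match:
            current = match.group(1)
            records[current] = {
                "status": int(match.group(2)),
                "short": match.group(3),
                "long": [],
            }
    return records
```

Lines that do not look like a record, such as the connection messages printed by the telnet client itself, are ignored.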


XML

The XML plug-in works similarly to the Telnet plug-in, but sends the data in an XML format. The XML plug-in listens on TCP port 46668.


Conclusions

TO DO


Pydocs

Gnmond
HealthPluginManager
GnmondLogger
HealthLogicFramework
Record
DefaultChecks
Health.Plugins.Gmond
Plugins.Output.Nagios
Plugins.Output.Telnet


Examples

Health Plugins

An easy health plug-in example:
from HealthLogicFramework import * #import framework, has to be done!

def init():
        """the init() function
        will be called once on start-up"""

        """with setLogging you can set the logging settings
        GnmondLogger.DEBUG means log everything.
        You might want to change this to GnmondLogger.WARNING
        """

        setLogging(GnmondLogger.DEBUG)

        """with setExecutionInterval you can set the execution interval
        of the analyze function
        analyze is called every executionInterval minutes
        If set to zero analyze is called every minute
        """
        setExecutionInterval(2)

        """with setMaxExecutionTime you can set the max. execution time of analyze()
        in sec. If analyze takes longer than this to finish,
        it will be stopped and reported as invalid.
        This should prevent a hanging plug-in from crashing the whole system
        """
        setMaxExecutionTime(1)


        """with add allowed server you can add a server to allowedServers list
        Only hosts in this list will be allowed to connect to gnmond, all other
        connections will be refused
        """

        addAllowedServer("localhost")

        """ with addCluster you can add a cluster that should be monitored.
        You should give this cluster a name and a list of initial hosts.
        If you use gmond input plug-in those names have to be exactly 
        the same as in ganglia!
        """
        addCluster("test",["pc0.psi.ch","pc1.psi.ch"])

        """with add record you can add a record
        You should give the record a name and a default status/message.
        """

        addRecord("test",NAGIOS_OK,"Everything fine")

def analyze(values):
        """with values.getMax("load_percent") you'll get
        the maximum of the metric load_percent
        of all nodes in clusters defined in the init() function.
        Note that load_percent is defined as the ganglia metrics load_one / num_cpu * 100
        """
        maxLoadInCluster = values.getMax("load_percent")
        if maxLoadInCluster > 200:

                """if one node has a load above 200, 
                you want to set record test to CRITICAL.
                This can be done with setRecord
                """
                setRecord("test",NAGIOS_CRITICAL,"Some nodes are overloaded")

        """If you don't set a record during analyze,
        it will be reset to default values defined in addRecord.
        """

def clusterFailure(cluster):
        """clusterFailure will be called every time a cluster cannot be reached.
        You might want to set your records to critical or unknown...
        This record could be overwritten by the analyze() function, 
        but only if analyze would "worsen" it.
        (e.g. clusterFailure sets it to warning, but analyze to critical,
        the record would result as critical)
        """
        setRecord("test",NAGIOS_CRITICAL,"Could not reach cluster")
An example with groups:
from HealthLogicFramework import *

def init():
        addCluster("test",["pc0.psi.ch","pc1.psi.ch"])
        addGroup("testg",["pc0.psi.ch","pc4.psi.ch"])
        addRecord("test",NAGIOS_OK,"Everything fine")
        addRecord("testg",NAGIOS_OK,"Everything fine")

def analyze(values):
        maxLoadInCluster = values.getMax("load_percent")
        if maxLoadInCluster > 200:
                setRecord("test",NAGIOS_CRITICAL,"Some nodes are overloaded")

        """with getGroup() you'll get a metric object
        for the group defined in the init() function.
        This object is of the same type as values and allows the same calls.
        """
        maxLoadInGroup = getGroup("testg").getMax("load_percent")
        if maxLoadInGroup > 100:
                setRecord("testg",NAGIOS_WARNING,"Some nodes in group are overloaded")

def clusterFailure(cluster):
        setRecord("test",NAGIOS_CRITICAL,"Could not reach cluster")
        setRecord("testg",NAGIOS_CRITICAL,"Could not reach cluster")
An example with storage:
from HealthLogicFramework import *

def init():
        addCluster("test",["pc0.psi.ch","pc1.psi.ch"])
        addRecord("test_last",NAGIOS_OK,"Everything fine")
        addRecord("test_period",NAGIOS_OK,"Everything fine")

        """with store() you can mark a metric to be stored. This can be called on 
        every metric object you get with getNode, getCluster, getGroup.
        You have to set the size of the storage
        """
        getCluster("test").store("load_percent",15)

def analyze(value):

        c = getCluster("test")

        """with fetch() you can access a stored value.
        If called without an age argument, or with 0, it returns the current
        value; with an int as argument it returns the metric of that age.
        """
        if c.fetch(1,"load_percent") > 200:
                setRecord("test_last",NAGIOS_WARNING,
                          "Some nodes were overloaded last time")

        below = False
        for metric in c.fetchAll("load_percent"):
                if metric < 200:
                        below = True
                        break
        if not below:
                setRecord("test_period",NAGIOS_WARNING,
                          "Load was above 200% for the whole history")

def clusterFailure(cluster):
        setRecord("test_last",NAGIOS_CRITICAL,"Could not reach cluster")
        setRecord("test_period",NAGIOS_CRITICAL,"Could not reach cluster")
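
To make the idea of a fixed-size metric storage more concrete, here is a minimal, self-contained sketch. It is not Gnmond's code: the class MetricHistory and its method names are hypothetical, and it only mimics the behaviour described above (a storage of a given size, access by age, access to the whole history).

```python
from collections import deque

class MetricHistory:
    """Hypothetical stand-in for a stored metric; not Gnmond's real class."""

    def __init__(self, size):
        # A deque with maxlen drops the oldest entry once `size` is reached.
        self._values = deque(maxlen=size)

    def push(self, value):
        # The newest value goes to the left, so the index equals the age.
        self._values.appendleft(value)

    def fetch(self, age=0):
        # age 0 is the current value, age 1 the one before, and so on.
        return self._values[age]

    def fetch_all(self):
        # Whole history, newest first.
        return list(self._values)

history = MetricHistory(3)
for load in [50.0, 250.0, 300.0, 220.0]:
    history.push(load)

# Only the three newest values are kept; 50.0 has been dropped.
always_above = all(value >= 200 for value in history.fetch_all())
print(history.fetch(0), history.fetch(1), always_above)
```

The size passed to the constructor plays the role of the second argument of store() in the example above.
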
An example with DefaultTest:
from HealthLogicFramework import *
from DefaultChecks import * # To use the default checks you have to import this

cluster_name = "test"

#length of history that the default tests will check
length_of_history = 30

#Number of nodes that can fail the test until critical
tolerated = 3

#nodes in cluster. For the default checks the nodes have to be known in advance
nodes = ["pc0.psi.ch","pc1.psi.ch","pc2.psi.ch","pc3.psi.ch","pc4.psi.ch"]

def init():
        addCluster(cluster_name,nodes)
        #Sets up all default checks as described in DefaultChecks
        setUpAll(cluster_name,length_of_history)

def analyze(value):
        checkAll(cluster_name,tolerated) #Perform all checks

def clusterFailure(cluster):
        pass #Since no records are defined, you have to do nothing...
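
How a "tolerated" count can turn per-node results into one cluster status is sketched below. This is an illustration only: the internals of DefaultChecks are not shown in this document, so the function status_for_failures, the load threshold, and the choice of WARNING for a tolerated number of failures are all assumptions.

```python
# Standard Nagios return codes (assumed to match Gnmond's constants).
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL = 0, 1, 2

def status_for_failures(loads, threshold, tolerated):
    """Hypothetical check: map per-node loads to a single cluster status.

    loads is a dict {node_name: load_percent}. A node fails if its load is
    above threshold. More than `tolerated` failing nodes is CRITICAL; a
    smaller, non-zero number is reported as WARNING (an assumption).
    """
    failing = sorted(node for node, load in loads.items() if load > threshold)
    if not failing:
        return NAGIOS_OK, failing
    if len(failing) > tolerated:
        return NAGIOS_CRITICAL, failing
    return NAGIOS_WARNING, failing

loads = {"pc0.psi.ch": 250.0, "pc1.psi.ch": 80.0, "pc2.psi.ch": 310.0}
print(status_for_failures(loads, threshold=200.0, tolerated=3))
print(status_for_failures(loads, threshold=200.0, tolerated=1))
```
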
An example with the single check from DefaultChecks:
from HealthLogicFramework import *
from DefaultChecks import *


#Unlike in example 4, you have to sort the nodes into 3 groups
login_nodes=["testlogin.psi.ch"]
file_nodes=["testfiles.psi.ch","testfiles2.psi.ch"]
compute_nodes =["pc0.psi.ch","pc1.psi.ch","pc2.psi.ch","pc3.psi.ch","pc4.psi.ch"]

clusterName = "test"
tolerated = 5
history=15

def init():
        addCluster(clusterName,compute_nodes)

        """set up the single check as described in DefaultChecks
        """
        setUpSingleCheck(getCluster(clusterName),login_nodes,
                         file_nodes,compute_nodes)

def analyze(value):
        singleCheck(getCluster(clusterName)) #perform single check

def clusterFailure(cluster):
        """singleCheck defines the record 'clusterName'. You might want to set this
        record to CRITICAL or UNKNOWN if the cluster cannot be reached...
        """
        setRecord(clusterName,NAGIOS_CRITICAL,clusterName + " could not be reached")
An example with a non-default input plug-in:
from HealthLogicFramework import *
import Plugins.Input.YOURPLUGIN as Input #Import your plug-in

def init():
        #Pass your plug-in to addCluster
        addCluster("test",["pc0.psi.ch","pc1.psi.ch"], Input)
        addRecord("test",NAGIOS_OK,"Everything fine")

def analyze(values):
        #Assumes your input plug-in returns a metric named load_percent
        maxLoadInCluster = values.getMax("load_percent")
        if maxLoadInCluster > 200:
                setRecord("test",NAGIOS_CRITICAL,"Some nodes are overloaded")

def clusterFailure(cluster):
        setRecord("test",NAGIOS_CRITICAL,"Could not reach cluster")



Output Plugins

"""
An example of an Output plug-in
"""

import threading
import copy

#The global variable records contains all records computed by the health plug-ins
from HealthPluginManager import records

def getThread(serverList,logger):
        #Do not change this
        return OutputThread(serverList,logger)

def getName():
        #Change NAME to the name of your plug-in
        return "NAME"

class OutputThread (threading.Thread):
    def __init__(self,serverList,logger):
        threading.Thread.__init__(self)
        self.serverList = copy.copy(serverList)
        self.logger = logger
        #Add stuff that should be set up before thread will be started here

    def run(self):
        while True:
            """Add your code here.
            You can access the global records to read the records.
            You can access self.serverList to get the list of servers
            which are allowed to connect to gnmond.
            You can access self.logger to do some logging
            (don't use print: gnmond normally runs as a daemon,
            so those statements will be ignored).
            To get more information about GnmondLogger see class GnmondLogger.
            """
            pass

    def terminate(self):
        """This code will be executed if gnmond terminates.
        For example you could give back some resources like sockets etc.
        """
        pass
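
The template above loops forever and leaves it open how run() should notice that terminate() was called. One common pattern, shown here as a sketch and not as Gnmond's own mechanism, is a threading.Event that both ends the loop and serves as an interruptible sleep. The class PollingOutputThread, the interval parameter and the plain dict used in place of the global records are assumptions made for this example.

```python
import threading
import time

class PollingOutputThread(threading.Thread):
    """Hypothetical output thread that snapshots the records periodically."""

    def __init__(self, records, interval=1.0):
        threading.Thread.__init__(self)
        self.records = records          # stands in for the global records
        self.interval = interval        # seconds between two snapshots
        self.snapshots = []
        self._stop_event = threading.Event()

    def run(self):
        # Loop until terminate() sets the event.
        while not self._stop_event.is_set():
            self.snapshots.append(dict(self.records))
            # wait() sleeps, but returns early once the event is set.
            self._stop_event.wait(self.interval)

    def terminate(self):
        # Ask run() to leave its loop; resources could be released here.
        self._stop_event.set()

records = {"test": (0, "Everything fine")}
thread = PollingOutputThread(records, interval=0.01)
thread.start()
time.sleep(0.1)
thread.terminate()
thread.join()
print(len(thread.snapshots) > 0)
```

Using wait() instead of time.sleep() inside run() keeps the shutdown delay short even with long intervals.
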



Input Plugins

"""
An example of an Input plug-in.
"""

maximalExecutionTime = 1 #The maximal execution time of getMetrics()

def getMetrics(cluster):
    """This function is executed every time new values are needed.
    The argument is an object of type HealthPluginManager.Cluster.
    It gives you access to everything you need, for example
    (for more details see HealthPluginManager.Cluster):
        cluster.logger:   A GnmondLogger, used for logging output
        cluster.name:     Name of the cluster
        cluster.nodes:    A list of nodes in this cluster
                          IMPORTANT: This list might not be complete,
                          especially in the first call.
                          The input plug-in should update this list!
        cluster.nextNodesToCheck:
                          A list of nodes that should be checked next.
                          IMPORTANT: You have to manage this list yourself!
        cluster.values:   Current metrics of this cluster.
                          Your plug-in should set new metrics in this variable.
                          values is a nested dict with the structure
                          {"node1":{"metric1":v1,"metric2":v2,...},"node2":{...},...}
                          The metrics themselves should be floats.
    """

    """Set the nodes in this cluster. IMPORTANT: this should not be done statically.
    Setting the nodes only really matters for the first call of the plug-in:
    you do not have to define all nodes in the health plug-in's init() function,
    but the analyze() function might need a list of nodes.
    """
    cluster.nodes = ["pc0.psi.ch","pc1.psi.ch","pc2.psi.ch","pc3.psi.ch","pc4.psi.ch"]

    """Set the values for each node.
    """
    cluster.values = dict()
    for node in cluster.nodes:
        cluster.values[node] = dict()
        cluster.values[node]["metric1"] = 0.0
        cluster.values[node]["metric2"] = 0.0
        cluster.values[node]["metric3"] = 0.0
        cluster.values[node]["metric4"] = 0.0
        cluster.values[node]["metric5"] = 0.0
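
The nested dict that an input plug-in has to fill can be tried out without Gnmond. The sketch below builds such a structure, derives load_percent as load_one / num_cpu * 100 (the definition given in the health plug-in example), and computes a cluster-wide maximum similar in spirit to getMax(). The helper names add_load_percent and max_metric are hypothetical and not part of Gnmond.

```python
def add_load_percent(values):
    """Add the derived metric load_percent to every node's metric dict."""
    for metrics in values.values():
        metrics["load_percent"] = metrics["load_one"] / metrics["num_cpu"] * 100.0

def max_metric(values, name):
    """Return the largest value of metric `name` over all nodes that report it."""
    return max(metrics[name] for metrics in values.values() if name in metrics)

# Same shape as cluster.values: {"node": {"metric": float, ...}, ...}
values = {
    "pc0.psi.ch": {"load_one": 1.0, "num_cpu": 4.0},
    "pc1.psi.ch": {"load_one": 6.0, "num_cpu": 2.0},
}
add_load_percent(values)
print(max_metric(values, "load_percent"))
```

Keeping all metrics as floats, as the template asks, also avoids integer division surprises on older Python versions.
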



Nagios Configuration

This is just a small configuration example. It is not complete and might not work with your Nagios configuration. Please see the Nagios documentation for a detailed introduction to Nagios configuration.
First copy the executable Nagios/check_gnmond into the plug-in directory of your Nagios installation (by default /usr/local/nagios/libexec). Then add this plug-in to your configuration, which is normally done in a file called checkcommands.cfg. Add the following entry there:
define command {
    command_name    check_gnmond
    command_line    $USER5$/check_gnmond $HOSTADDRESS$ $ARG1$
}
You might want to change $USER5$ to something different, depending on your Nagios configuration and where you have installed check_gnmond. Next, add the gnmond server to your configuration, normally in the file hosts.cfg. Add a section like:
define host{
        use                     generic-host
        host_name               GnmondServer
        alias                   SERVERNAME
        address                 129.129.194.94
        check_command           check-host-alive
        max_check_attempts      10
        contact_groups          linux-admins
        notification_interval   480
        notification_period     24x7
        notification_options    d,u
        }
Now you can define services for this host, either in services.cfg or, preferably, in a separate file. This file could look like:
define service{
        use                             generic-service
        host_name                       GnmondServer
        service_description             clusters
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              1
        normal_check_interval           1
        retry_check_interval            1
        contact_groups                  linux-admins
        notification_interval           240
        notification_period             24x7
        notification_options            w,u,c
        check_command                   check_gnmond!gnmond_clusters
       }
define service{
        use                             generic-service
        host_name                       GnmondServer
        service_description             plugins
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              1
        normal_check_interval           1
        retry_check_interval            1
        contact_groups                  linux-admins
        notification_interval           240
        notification_period             24x7
        notification_options            w,u,c
        check_command                   check_gnmond!gnmond_healthPlugins
       }
Now add, for every record defined in one of your health plug-ins, a section like the following (replace RECORD_NAME with the name of the record):
define service{
        use                             generic-service
        host_name                       GnmondServer
        service_description             RECORD_NAME
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              1
        normal_check_interval           1
        retry_check_interval            1
        contact_groups                  linux-admins
        notification_interval           240
        notification_period             24x7
        notification_options            w,u,c
        check_command                   check_gnmond!RECORD_NAME
       }
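
Writing one such section per record by hand gets repetitive when there are many records. The following sketch generates the sections from a list of record names. It is a convenience script only, not part of Gnmond; the function render_services is hypothetical, and the field values are copied from the example above and have to be adapted to your Nagios setup.

```python
# Template copied from the service section above; literal braces are doubled
# because str.format() treats single braces as placeholders.
SERVICE_TEMPLATE = """define service{{
        use                             generic-service
        host_name                       {host}
        service_description             {record}
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              1
        normal_check_interval           1
        retry_check_interval            1
        contact_groups                  linux-admins
        notification_interval           240
        notification_period             24x7
        notification_options            w,u,c
        check_command                   check_gnmond!{record}
       }}
"""

def render_services(host, record_names):
    """Return one service section per record name as a single string."""
    return "".join(SERVICE_TEMPLATE.format(host=host, record=name)
                   for name in record_names)

# The record names must match the names used with addRecord.
print(render_services("GnmondServer", ["test", "testg"]))
```

The output can be redirected into the separate services file mentioned above.
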
Now restart Nagios. If you did not make a mistake, and with a little luck, it should work.
