Gnmond arranges nodes in the network into clusters (also called "communities" in Ganglia). Gnmond can monitor several of these communities. When used with Gmond, Gnmond connects to one node in each cluster to obtain the data collected from all nodes. Gmond is responsible for collecting the data and distributing it to all nodes in the cluster. Gnmond analyzes and aggregates the data and provides it to other tools, such as Nagios.
The data has to be collected by another tool (such as Gmond). It is collected in the form of key/value pairs called metrics (for example free_memory, load, free_swap, etc.). Each computer or cluster can have its own metrics, depending on its function. Gnmond checks those metrics regularly, by default every minute. The collected data is then analyzed and aggregated. The analysis is highly customizable through so-called health plug-ins. A health plug-in defines a number of records (for example one record per monitored cluster, or one record for memory and one for load). A record consists of a name together with some values (at least a status and a short description string). The analysis is repeated regularly. The states of the records are then stored, and Gnmond waits for someone to request those values (for example Nagios, or you yourself via telnet). If the requester is allowed to see the data, the record state is returned.
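As an illustration of the metrics-to-record idea described above (this is not the actual Gnmond API; the `analyze_metrics` helper and the threshold are hypothetical), a record can be thought of as a status plus a description derived from the key/value metrics:

```python
# Hypothetical sketch: turning key/value metrics into a record.
# The metric names match those mentioned above; the helper function
# and the 1024 kB threshold are invented for illustration.
NAGIOS_OK, NAGIOS_WARNING = 0, 1

def analyze_metrics(metrics):
    """Turn raw key/value metrics into a (status, description) record."""
    if metrics["free_memory"] < 1024:  # hypothetical limit in kB
        return (NAGIOS_WARNING, "Memory is running low")
    return (NAGIOS_OK, "Everything looks fine")
```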
```
telnet localhost 46666
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
unspecified: 0 Everything looks fine
gnmond_healthPlugins: 0 All plugins are Running
gnmond_clusters: 0 All clusters are available
Connection closed by foreign host.
```
```
cd Plugins/Health
sed -i 's/unspecified/your_name/g' localhost.py
```
```
service Gnmond restart
```
|
```
./Gnmond.py
```
```
chkconfig --add Gnmond
ln -s PATH_TO_GNMOND/Gnmond.py /usr/bin/Gnmond
service Gnmond start
```
```
telnet localhost 46666
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Name_of_your_cluster_in_Ganglia: 0 Everything looks fine
gnmond_healthPlugins: 0 All plugins are Running
gnmond_clusters: 0 All clusters are available
Connection closed by foreign host.
```
```
Gnmond --debug --nodaemon --file=localhost
```
```
telnet name_of_one_node 8649
```
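The same check can be done programmatically: gmond answers a plain TCP connection on port 8649 with its XML metric dump. A minimal sketch (the host name is a placeholder; this is illustrative code, not part of Gnmond):

```python
# Minimal sketch: connect to a gmond node and read its XML dump, which
# is what the telnet command above shows interactively.
import socket

def fetch_gmond_xml(host, port=8649, timeout=5):
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```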
```
Nagios/check_gnmond --help
```
Gnmond has a core part and some plug-ins. The core consists of:
|
Gnmond first tries to find as many valid health plug-ins as possible. It then initializes them and performs a first fetch-and-analyze round. Once the first values have been computed, the output plug-ins are started (every output plug-in in its own thread). The output plug-ins wait for incoming connections from allowed hosts. The main thread checks from time to time whether the plug-ins are still alive; if not, it restarts them.

After the output plug-ins have been started, the main thread starts the HealthPluginManager thread. This thread consists of an infinite loop that performs one iteration every minute. An iteration consists of three stages: first, check the clusters for new metrics and fetch them if available; then determine which health plug-ins have to be executed in this iteration; then execute them. After the analysis functions have run, the loop collects all records, provides them to the output plug-ins, and waits for the next iteration.

If the HealthPluginManager has to execute code from input or health plug-ins, it runs this code in a new thread (called a HealthPluginThread) and waits for that thread to return before continuing. However, if the thread takes too long, the HealthPluginManager marks it as failed and tries to kill it (this might not work if the thread is waiting on I/O or hanging in external C code, so take care of those cases when writing a plug-in). The HealthPluginManager does not wait until the thread is actually killed, but continues.

If you want to stop Gnmond, you have to send a terminate signal to the main thread. The signal is caught, and the main thread tries to shut down the output plug-ins gracefully; if that is not possible, it kills them. The HealthPluginManager thread and all its children are killed immediately (so take care of this in your plug-ins: they might be killed at any time!).

The different threads communicate via the global variables records, allowedHosts and plugin.
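The timeout behavior described above can be sketched as follows (an illustrative join-with-timeout pattern, not the actual HealthPluginManager code; the function name is invented):

```python
# Sketch of the watchdog described above: plug-in code runs in a helper
# thread; if it has not returned after max_execution_time seconds it is
# marked as failed and the manager continues without waiting for the
# thread to die.
import threading

def run_with_timeout(func, max_execution_time):
    result = {"ok": False}

    def wrapper():
        func()
        result["ok"] = True

    t = threading.Thread(target=wrapper, daemon=True)
    t.start()
    t.join(max_execution_time)
    if t.is_alive():
        return False  # marked as failed; the thread may still be running
    return result["ok"]
```

Note that, as the text warns, a daemon thread stuck in blocking I/O or external C code cannot actually be interrupted this way; the manager simply abandons it.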
Value | Name | Description |
---|---|---|
7 | DEBUG | Debug information |
6 | INFO | A normal information, or additional information to errors |
4 | WARNING | An error occurred; Gnmond will try to work around it |
2 | CRITICAL | A critical error. Gnmond will exit |
Value | Name | Description |
---|---|---|
0 | NAGIOS_OK | Everything looks good |
1 | NAGIOS_WARNING | Something is not good. May need attention, but not immediately |
2 | NAGIOS_CRITICAL | Something went badly wrong; needs attention immediately |
3 | NAGIOS_UNKNOWN | Gnmond cannot compute a state |
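These status values match the standard Nagios plug-in exit codes, so a consumer such as check_gnmond can pass them through unchanged. A hypothetical illustration (the `report` helper is invented, not part of Gnmond):

```python
# The record status doubles as a Nagios plug-in exit code: print the
# short message on stdout and exit with the status value.
import sys

NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3

def report(status, message):
    print(message)
    sys.exit(status)
```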
Function | Description |
---|---|
addAllowedServer(SERVER) | Adds the host SERVER to the list of allowed servers. Only servers on this list are allowed to connect to one of Gnmond's output plug-ins. For example, you should add your Nagios server, or your local PC if you want to connect via telnet. The list of allowed servers is shared by all plug-ins. |
setExecutingInterval(TIME) | Sets the execution interval. The health plug-in will be executed every TIME minutes. Default is 1. This value is used only by this health plug-in. |
setMaxExecutingTime(TIME) | Sets the maximal execution time. After TIME seconds, the analyze() and clusterFailure() functions are assumed to have crashed. Default is 1. This value is used only by this health plug-in. |
Name | Description |
---|---|
name | The name of your cluster (has to be a string). If you want to use Gmond as input, the name has to be exactly the same as in Ganglia; otherwise Gnmond cannot find the cluster. |
initialHosts | A list of some nodes in the cluster, used to get the metrics for this cluster. Gmond checks only one node to obtain all metrics for the cluster, so it is sufficient to give only a few nodes; Gnmond will fetch a complete list of nodes in this cluster later by itself. |
refreshTime | Sets the checking interval. Every TIME minutes Gnmond will try to get new metrics from this cluster. Default is 1. Note that it normally makes no sense to set this to something different from the execution interval of the health plug-in. |
checkWith | Chooses the input plug-in that is used to get new metrics. Default is Gmond. If you want to use another plug-in, you have to import it first. See also the health examples. |
Name | Description |
---|---|
name | The name of the record. It must not contain spaces or special characters and should not be too long (an optimal name is between 4 and 15 characters). The name is not allowed to begin with gnmond_. |
status | A default status value. Has to be 0, 1, 2 or 3 |
short | A default short status message; must not contain a newline character and should be less than 70 characters long |
long | (optional) A default long status message |
perf | (optional) Default perf data |
Name | Description |
---|---|
getMetrics() | Gets metrics from a source and stores them in cluster.values and cluster.nodes. Receives the cluster as argument |
maximalExecutionTime | The maximal execution time (integer) in seconds. After this time getMetrics() is stopped and reported as dead. maximalExecutionTime is optional; default is 3 |
Name | Description |
---|---|
serverList | A list of servers that are allowed to access data provided by Gnmond. You should only allow connections from these servers. Servers are added to this list by the health plug-ins |
logger | A GnmondLogger object, to perform some logging |
```python
from HealthLogicFramework import * #import the framework, has to be done!

def init():
    """the init() function will be called once on start-up"""
    """with setLogging you can set the logging settings.
    GnmondLogger.DEBUG means log everything. You might want to change
    this to GnmondLogger.WARNING
    """
    setLogging(GnmondLogger.DEBUG)
    """with setExecutionInterval you can set the execution interval of
    the analyze function. analyze is called every executionInterval
    minutes. If set to zero analyze is called every minute
    """
    setExecutionInterval(2)
    """with setMaxExecutionTime you can set the max. execution time of
    analyze() in seconds. If analyze takes longer than this to finish,
    it will be stopped and reported as invalid. This should prevent a
    hanging plug-in from crashing the whole system
    """
    setMaxExecutionTime(1)
    """with addAllowedServer you can add a server to the allowedServers
    list. Only hosts in this list will be allowed to connect to gnmond,
    all other connections will be refused
    """
    addAllowedServer("localhost")
    """with addCluster you can add a cluster that should be monitored.
    You should give this cluster a name and a list of initial hosts.
    If you use the gmond input plug-in those names have to be exactly
    the same as in ganglia!
    """
    addCluster("test",["pc0.psi.ch","pc1.psi.ch"])
    """with addRecord you can add a record. You should give the record
    a name and a default status/message.
    """
    addRecord("test",NAGIOS_OK,"Everything fine")

def analyze(value):
    """with values.getMax("load_percent") you'll get the maximum of the
    metric load_percent over all nodes in the clusters defined in the
    init() function. Note that load_percent is defined as the ganglia
    metrics load_one / num_cpu * 100
    """
    maxLoadInCluster = values.getMax("load_percent")
    if maxLoadInCluster > 200:
        """if one node has a load above 200, you want to set record
        test to CRITICAL. This can be done with setRecord
        """
        setRecord("test",NAGIOS_CRITICAL,"Some nodes are overloaded")
    """If you don't set a record during analyze, it will be reset to
    the default values defined in addRecord.
    """

def clusterFailure(cluster):
    """clusterFailure will be called every time a cluster cannot be
    reached. You might want to set your records to critical or
    unknown... This record could be overwritten by the analyze()
    function, but only if analyze would "worsen" it (e.g. if
    clusterFailure sets it to warning, but analyze to critical, the
    record would end up critical)
    """
    setRecord("test",NAGIOS_CRITICAL,"Could not reach cluster")
```
```python
from HealthLogicFramework import *

def init():
    addCluster("test",["pc0.psi.ch","pc1.psi.ch"])
    addGroup("testg",["pc0.psi.ch","pc4.psi.ch"])
    addRecord("test",NAGIOS_OK,"Everything fine")
    addRecord("test2",NAGIOS_OK,"Everything fine")

def analyze(value):
    maxLoadInCluster = values.getMax("load_percent")
    if maxLoadInCluster > 200:
        setRecord("test",NAGIOS_CRITICAL,"Some nodes are overloaded")
    """with getGroup() you'll get a metric object for the group defined
    in the init() function. This object is of the same type as value
    and allows the same calls
    """
    maxLoadInGroup = getGroup("testg").getMax("load_percent")
    if maxLoadInGroup > 100:
        setRecord("test2",NAGIOS_WARNING,"Some nodes in group are overloaded")

def clusterFailure(cluster):
    setRecord("test",NAGIOS_CRITICAL,"Could not reach cluster")
    setRecord("test2",NAGIOS_CRITICAL,"Could not reach cluster")
```
```python
from HealthLogicFramework import *

def init():
    addCluster("test",["pc0.psi.ch","pc1.psi.ch"])
    addRecord("test_last",NAGIOS_OK,"Everything fine")
    addRecord("test_period",NAGIOS_OK,"Everything fine")
    """with store() you can mark a metric to be stored. This can be
    called on every metric object you get with getNode, getCluster or
    getGroup. You have to set the size of the storage
    """
    getCluster("test").store("load_percent",15)

def analyze(value):
    c = getCluster("test")
    """with fetch() you can access a stored value. If called without
    argument or with 0 it will return the actual value; with an int as
    argument it will return the metric with this age
    """
    if c.fetch(1,"load_percent") > 200:
        setRecord("test_last",NAGIOS_WARNING,"Some nodes were overloaded last time")
    below = False
    for metric in c.fetchAll("load_percent"):
        if metric < 200:
            below = True
            break
    if not below:
        setRecord("test_period",NAGIOS_WARNING,"For the whole history load was above 200%")

def clusterFailure(cluster):
    setRecord("test_last",NAGIOS_CRITICAL,"Could not reach cluster")
    setRecord("test_period",NAGIOS_CRITICAL,"Could not reach cluster")
```
```python
from HealthLogicFramework import *
from DefaultChecks import * # To use the default checks you have to import this

cluster_name = "test"
#length of history that the default tests will check
length_of_history = 30
#number of nodes that can fail a test before the record becomes critical
tolerated = 3
#nodes in the cluster. For the default checks the nodes have to be known in advance
nodes = ["pc0.psi.ch","pc1.psi.ch","pc2.psi.ch","pc3.psi.ch","pc4.psi.ch"]

def init():
    addCluster(cluster_name,nodes)
    #Sets up all default checks as described in DefaultChecks
    setUpAll(cluster_name,length_of_history)

def analyze(value):
    checkAll(cluster_name,tolerated) #Perform all checks

def clusterFailure(cluster):
    pass #Since no own records are defined, you have to do nothing...
```
```python
from HealthLogicFramework import *
from DefaultChecks import *

#Unlike example 4 you have to sort the nodes into 3 groups
login_nodes = ["testlogin.psi.ch"]
file_nodes = ["testfiles.psi.ch","testfiles2.psi.ch"]
compute_nodes = ["pc0.psi.ch","pc1.psi.ch","pc2.psi.ch","pc3.psi.ch","pc4.psi.ch"]
clusterName = "test"
tolerated = 5
history = 15

def init():
    addCluster(clusterName,compute_nodes)
    """set up the single check as described in DefaultChecks
    """
    setUpSingleCheck(getCluster(clusterName),login_nodes,file_nodes,compute_nodes)

def analyze(value):
    singleCheck(getCluster(clusterName)) #perform the single check

def clusterFailure(cluster):
    """singleCheck defines the record 'clusterName'. You might want to
    set this record to CRITICAL or UNKNOWN if the cluster cannot be
    reached...
    """
    setRecord(clusterName,NAGIOS_CRITICAL,clusterName + " could not be reached")
```
```python
from HealthLogicFramework import *
import Plugins.Input.YOURPLUGIN as Input #Import your plug-in

def init():
    #Give your plug-in to addCluster
    addCluster("test",["pc0.psi.ch","pc1.psi.ch"],Input)
    addRecord("test",NAGIOS_OK,"Everything fine")

def analyze(value):
    #Assumes your input plug-in returns a metric named load_percent
    maxLoadInCluster = values.getMax("load_percent")
    if maxLoadInCluster > 200:
        setRecord("test",NAGIOS_CRITICAL,"Some nodes are overloaded")

def clusterFailure(cluster):
    setRecord("test",NAGIOS_CRITICAL,"Could not reach cluster")
```
""" An example for a Output plug-in """ import threading import copy #The global variable records consists all records computed by health plugins from HealthPluginManager import records def getThread(serverList,logger): #Do not change this return OutputThread(serverList,logger) def getName(): #Change NAME to the name of your plug-in return "NAME" class OutputThread (threading.Thread): def __init__(self,serverList,logger,): threading.Thread.__init__(self) self.serverList = copy.copy(serverList) self.logger = logger #Add stuff that should be set up before thread will be started here def run(self): while True: """Add your code here You can access global records to get access to records You can access self.serverList to get access to a list of servers, who are allowed to connect to gnmond You can access self.logger to do some logging (don't use print, since gnmond will normally run as daemon, therefore those statements will be ignored). Get get more information about GnmondLogger see class GnmondLogger. """ pass def terminate(self): """This code will be executed if gnmond terminates. For example you could give back some resources like sockets etc. """ pass |
""" An example for a Input plug-in. """ maximalExecutionTime = 1 #The maximal execution time of getMetrics() def getMetrics(cluster): """This function will be executed every time you want to get new values. The argument is an object is of type HealthPluginManager.Cluster. With this element you have access to all needed stuff for example (for more details see HealthPluginManager.Cluster) cluster.logger: A GnmondLogger, used for logging output cluster.name: Name of the cluster cluster.nodes: A list of nodes in this cluster IMPORTANT: This list might not be complete, especially in the first call. The Input Plug-in shout update this list! cluster.nextNodesToCheck: A List of nodes that should be checked next. IMPORTANT: You have to manage this list by yourself! cluster.values: Actual metrics of this cluster. Your plug-in should set new metrics in this variable values is a double dict with structure {"node1":{"metric1":v1,"metric2":v2,...},"node2":{...},...} The metrics itself should be floats. """ """set the nodes in this cluster. IMPORTANT this should not be done statical. Set the nodes is actually only important for that first call of the plug-in, since you do not have to define all nodes in the health plugins init() function, but the analyze() function might has to have a list of nodes. """ cluster.nodes = ["pc0.psi.ch","pc1.psi.ch","pc2.psi.ch","pc3.psi.ch","pc4.psi.ch"] """set the values for each node """ cluster.values = dict() for node in cluster.nodes: cluster[node] = dict() cluster[node]["metric1"] = 0 cluster[node]["metric2"] = 0 cluster[node]["metric3"] = 0 cluster[node]["metric4"] = 0 cluster[node]["metric5"] = 0 |
```
define command {
    command_name check_gnmond
    command_line $USER5$/check_gnmond $HOSTADDRESS$ $ARG1$
}
```
```
define host{
    use                   generic-host
    host_name             GnmondServer
    alias                 SERVERNAME
    address               129.129.194.94
    check_command         check-host-alive
    max_check_attempts    10
    contact_groups        linux-admins
    notification_interval 480
    notification_period   24x7
    notification_options  d,u
}
```
```
define service{
    use                   generic-service
    host_name             GnmondServer
    service_description   clusters
    is_volatile           0
    check_period          24x7
    max_check_attempts    1
    normal_check_interval 1
    retry_check_interval  1
    contact_groups        linux-admins
    notification_interval 240
    notification_period   24x7
    notification_options  w,u,c
    check_command         check_gnmond!gnmond_clusters
}

define service{
    use                   generic-service
    host_name             GnmondServer
    service_description   plugins
    is_volatile           0
    check_period          24x7
    max_check_attempts    1
    normal_check_interval 1
    retry_check_interval  1
    contact_groups        linux-admins
    notification_interval 240
    notification_period   24x7
    notification_options  w,u,c
    check_command         check_gnmond!gnmond_healthPlugins
}
```
```
define service{
    use                   generic-service
    host_name             GnmondServer
    service_description   RECORD_NAME
    is_volatile           0
    check_period          24x7
    max_check_attempts    1
    normal_check_interval 1
    retry_check_interval  1
    contact_groups        linux-admins
    notification_interval 240
    notification_period   24x7
    notification_options  w,u,c
    check_command         check_gnmond!RECORD_NAME
}
```