Technobabble
Exploiting the Python object system for server monitoring
I have tried a lot of server/network monitoring systems over the years; Nagios, Zabbix, and Zenoss are some of them. However, as a programmer, I never really felt at home with any of them. Then I got this crazy idea of exploiting the Python object system for server monitoring. In this blog post I will try to explain my idea and provide a proof-of-concept implementation, and you will probably see how great an idea this is. ;-)
The system is designed to have a very small footprint (currently less than 500 lines of Python) and is built with inheritance in mind.
In my system every server is a class. Servers can inherit from other servers, and from multiple server "types" at once. A class can also be defined as a stub, meaning it needs to be subclassed.
The checks to be performed are just methods on the server class. To distinguish checks from other methods, their names must start with check_. If a check method raises a MonitoringError exception, it is handled by the notification system. All settings, such as the port number your httpd runs on, are class variables, and each one is defined with a default value. This makes configuration easy.
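A minimal sketch of how such check methods could be discovered and run. The bare MonitoringError class, the ExampleServer class, and the run_checks helper are assumptions made for illustration, not the actual implementation:

```python
class MonitoringError(Exception):
    """Raised by a check method when the check fails."""


class ExampleServer:
    hostname = "server001.example.com"
    http_ports = (80,)  # class variable with a default value

    def check_ports_configured(self):
        if not self.http_ports:
            raise MonitoringError("no HTTP ports configured for '%s'" % self.hostname)

    def check_always_fails(self):
        raise MonitoringError("simulated failure on '%s'" % self.hostname)

    def helper(self):
        # Not prefixed with check_, so it is never treated as a check.
        return "ignored"


def run_checks(server):
    """Run every check_ method; map check name to an error message or None."""
    results = {}
    for name in sorted(dir(server)):
        if not name.startswith("check_"):
            continue
        method = getattr(server, name)
        if not callable(method):
            continue
        try:
            method()
            results[name] = None
        except MonitoringError as exc:
            results[name] = str(exc)
    return results


results = run_checks(ExampleServer())
```

Here results holds one entry per check_ method, and helper is skipped because of its name.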
Whenever a check fails (or returns to a successful state), a notification is sent. This is done by invoking all methods on the server class whose names start with notify_.
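One way the notification step could look, as a rough sketch. It assumes that notifications are sent only when a check changes state, and that each notify_ method takes a check name, a success flag, and a message; both the signature and the notify_on_change helper are hypothetical:

```python
class NotifyingServer:
    hostname = "server001.example.com"

    def __init__(self):
        self.sent = []  # records notifications instead of e-mailing anyone

    def notify_log(self, check_name, ok, message):
        state = "RECOVERED" if ok else "FAILED"
        self.sent.append("%s %s on %s: %s" % (state, check_name, self.hostname, message))


def notify_on_change(server, last_state, check_name, ok, message=""):
    """Call every notify_ method, but only when the check's state flips."""
    previous = last_state.get(check_name, True)  # assume checks start out OK
    last_state[check_name] = ok
    if previous == ok:
        return False
    for name in sorted(dir(server)):
        if name.startswith("notify_"):
            getattr(server, name)(check_name, ok, message)
    return True


server = NotifyingServer()
state = {}
notify_on_change(server, state, "check_http", False, "connection refused")  # fails: notify
notify_on_change(server, state, "check_http", False, "connection refused")  # still failing: silent
notify_on_change(server, state, "check_http", True, "responding again")     # recovered: notify
```

Adding another channel would then only require another notify_ method on the class (or on one of its base classes).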
To put things into perspective, here is an example of an HTTP/HTTPS server:
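A minimal sketch of what such a host file might look like. The two base classes are stand-ins so the snippet runs on its own, and the https_ports name and the example address are assumptions; hostname, address, admins, and http_ports are the settings used elsewhere in this post:

```python
# Stand-ins for the bundled classes (hypothetical; the real ones would be
# imported from monitoring.hosts.bundled.web and carry the check_ methods).
class WebServer:
    http_ports = (80,)    # default, can be overridden per server


class SSLWebServer:
    https_ports = (443,)  # assumed name, mirroring http_ports


class MyServer(WebServer, SSLWebServer):
    hostname = "www.example.com"
    address = "192.0.2.10"  # documentation-only address
    admins = ("someone@example.com",)
```

In a real host file the two stand-ins would be replaced by the import from monitoring.hosts.bundled.web.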
This file should then be saved in a directory (along with an __init__.py file - this is Python... ;-)). The monitoring client is then invoked as: monitoring-client.py /path/to/hosts/directory. All subclasses of the base server class are then discovered and entered into a scheduler.
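The discovery step could be sketched roughly like this, assuming a hypothetical Server base class and assuming that the stub flag is read from each class's own namespace (otherwise a concrete server would inherit stub = True from its template):

```python
class Server:
    """Hypothetical base class for all hosts."""
    stub = True


class WebServer(Server):
    stub = True


class SSLWebServer(Server):
    stub = True


class MyWebServerType(WebServer, SSLWebServer):
    stub = True  # template only, never monitored itself


class WebServer001(MyWebServerType):
    hostname = "server001.example.com"


class WebServer002(MyWebServerType):
    hostname = "server002.example.com"


def discover(base, _seen=None):
    """Return all non-stub subclasses of base, each one exactly once."""
    seen = set() if _seen is None else _seen
    found = []
    for cls in base.__subclasses__():
        if cls in seen:
            continue  # already reached through another base class
        seen.add(cls)
        # vars(cls) only contains attributes defined on cls itself.
        if not vars(cls).get("stub", False):
            found.append(cls)
        found.extend(discover(cls, seen))
    return found


hosts = discover(Server)
```

The seen set matters with multiple inheritance: MyWebServerType is reachable through both WebServer and SSLWebServer, and its servers should only be scheduled once.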
Server abstractions
The experienced Python programmer will probably have realized by now that if you have a lot of web servers, you can make a web server abstraction:
from monitoring.hosts.bundled.web import WebServer, SSLWebServer

class MyWebServerType(WebServer, SSLWebServer):
    admins = ("someone@example.com",)
    stub = True

class WebServer001(MyWebServerType):
    hostname = "server001.example.com"
    address = "195.274.31.12"

class WebServer002(MyWebServerType):
    hostname = "server002.example.com"
    address = "195.274.31.13"
Obviously, this makes more sense when you have many more generic settings per server, but hopefully you get the point. If you also wanted to check a plain HTTP server running on port 81, you would just add it to the generic class:
from monitoring.hosts.bundled.web import WebServer, SSLWebServer

class MyWebServerType(WebServer, SSLWebServer):
    admins = ("someone@example.com",)
    stub = True
    http_ports = (80, 81)
Class relationships can also span multiple files. For easy importing, the directory with your hosts is automagically exposed to Python as monitoring.hosts.
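One way a directory could be exposed under a package name, as a sketch only. The expose_hosts_directory helper is hypothetical; it relies on the fact that Python's import system looks up submodules through the parent module's __path__:

```python
import importlib
import sys
import types


def expose_hosts_directory(directory, package="monitoring.hosts"):
    """Make the .py files in directory importable as submodules of package."""
    parent_name, _, child_name = package.rpartition(".")
    if parent_name:
        try:
            parent = importlib.import_module(parent_name)
        except ImportError:
            # No real parent package available: register an empty one.
            parent = types.ModuleType(parent_name)
            parent.__path__ = []
            sys.modules[parent_name] = parent
    pkg = types.ModuleType(package)
    pkg.__path__ = [directory]  # submodules are searched for in this directory
    sys.modules[package] = pkg
    if parent_name:
        setattr(parent, child_name, pkg)
    return pkg
```

After calling expose_hosts_directory("/path/to/hosts/directory"), a file webservers.py in that directory could be imported as monitoring.hosts.webservers.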
Remote checks
I have also made a simple mechanism for carrying out remote checks, that is, checks that are run locally on the remote servers. This could, for example, be checking the available disk space or checking that a given process is running.
I use an XML-RPC server with HMAC-SHA-1 for authentication. However, this still needs some work and is not very well tested at this point. If you are into security systems, please do take a look and comment. One thing that comes to mind is that I should probably also validate the "calling" IP address.
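To illustrate the general idea, here is a small sketch of signing and verifying a command with HMAC-SHA-1 using the standard library. The shared secret, the sign and verify helpers, and the choice to sign only the command string are assumptions, not the actual wire format:

```python
import hashlib
import hmac

SHARED_SECRET = b"change-me"  # hypothetical key known to both client and server


def sign(command, secret=SHARED_SECRET):
    """Return the hex HMAC-SHA-1 digest of a command string."""
    return hmac.new(secret, command.encode("utf-8"), hashlib.sha1).hexdigest()


def verify(command, signature, secret=SHARED_SECRET):
    """Compare in constant time so the check does not leak timing information."""
    return hmac.compare_digest(sign(command, secret), signature)


signature = sign("pgrep -f someproc")
accepted = verify("pgrep -f someproc", signature)
rejected = verify("some other command", signature)
```

A signature over the command alone can be replayed by anyone who captures it, so a real protocol would likely also sign a timestamp or nonce.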
Adding remote checks to our server checks is as simple as adding another base class for our template:
from monitoring.hosts.bundled.web import WebServer, SSLWebServer
from monitoring.hosts.bundled.remote import RemoteChecks

class MyWebServerType(WebServer, SSLWebServer, RemoteChecks):
    admins = ("someone@example.com",)
    stub = True
    http_ports = (80, 81)
This will automagically check disk usage, system load, RAM usage and swap usage. The limits can be configured using class variables (if the default ones don't cut it for you):
from monitoring.hosts.bundled.web import WebServer, SSLWebServer
from monitoring.hosts.bundled.remote import RemoteChecks

class MyWebServerType(WebServer, SSLWebServer, RemoteChecks):
    admins = ("someone@example.com",)
    stub = True
    http_ports = (80, 81)
    disk_limit = 50  # warn when more than 50% of the disk is full
This also exposes a method called remote_call on the server. So if we wanted to check for a specific running process, we could just define a method:
from monitoring.hosts.bundled.web import WebServer, SSLWebServer
from monitoring.hosts.bundled.remote import RemoteChecks
# Note: MonitoringError must be imported as well; its module path is not given here.

class MyWebServerType(WebServer, SSLWebServer, RemoteChecks):
    admins = ("someone@example.com",)
    stub = True
    http_ports = (80, 81)
    disk_limit = 50  # warn when more than 50% of the disk is full

    def check_someproc_running(self):
        """ Someproc running """
        # pgrep prints the matching PIDs; empty output means no such process
        procs = self.remote_call("pgrep -f someproc").strip()
        if not procs:
            raise MonitoringError("someproc not running on '%s'" % self.hostname)
Remember that any method prefixed with check_ will be treated as a service check, and any MonitoringError exception will go to the notification system. Also, when defining custom checks, remember to include a docstring, as it will be used to identify the problem in the notifications.
More
I haven't made a release yet, but will try the system out in production in the next week or so. If it seems to work, I might put out a release for you to try. Until then you can check out the code in my trac browser or get a snapshot to play with.