OCP Daemon

From NagiosCommunity

OCP Daemon HOWTO

This is a small guide on how to use a persistent daemon for processing OCHP and OCSP (Obsessive Compulsive Host/Service Processor) commands.

What is it for?

Given the way Nagios operates, running a command every time a host/service check result comes in can greatly reduce the speed at which Nagios can do its work. On huge Nagios setups the checks can end up lagging behind without fully using the server resources.

There is a way to make Nagios write OCHP/OCSP data into a named pipe instead of running a command every time, and on the other end of the pipe a daemon takes care of sending the data to the master Nagios server.

Overview

The overall process goes like this:

  1. Nagios writes host and service check results onto named pipes in send_nsca format using the perfdata files.
  2. The daemon, running in the background, polls the pipes and buffers data.
  3. Every few seconds the daemon forks and flushes all its data through send_nsca.
  4. The data is sent to the NSCA server on the master Nagios server.
  5. The master Nagios server receives its data through the command pipe.
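
The batching idea behind steps 2 to 4 can be sketched in plain shell. This is only an illustration, not the daemon's implementation: the scratch FIFO, the sample check results, and the use of wc -l as a stand-in for send_nsca are all assumptions made for the demo.

```shell
#!/bin/sh
# Stand-in for steps 2-4: collect whatever arrives on a FIFO and hand it
# to ONE consumer process per batch. "wc -l" plays the role of send_nsca.
dir=$(mktemp -d)              # hypothetical scratch directory
fifo="$dir/results.fifo"
mkfifo "$fifo"

# Step 1 stand-in: three made-up check results from a background writer.
{
    printf 'web01\t0\tPING OK\n'
    printf 'web02\t1\tPING WARNING\n'
    printf 'db01\tMySQL\t0\tUptime: 42\n'
} > "$fifo" &

# Steps 2-3 stand-in: buffer everything, then flush it in a single process.
batch=$(cat "$fifo")
wait
sent=$(printf '%s\n' "$batch" | wc -l | tr -d ' ')
echo "flushed $sent results in one batch"
rm -rf "$dir"
```

Three results cost one consumer process here instead of three, which is the saving the daemon aims for on a much larger scale.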

The benefits

  • Nagios check result processing is much faster as it only needs to write to a file to send OCHP/OCSP data.
  • Simple Perl daemon using Libevent (Event::Lib) for ultra fast operation.
  • Choice of instant/nearly instant processing or batched processing.
  • Much less processing overhead as the only forks required are to send batch updates to send_nsca.

The drawbacks

  • Can't use the perfdata files for what they are meant for without modifying the daemon. The proper fix would be a dedicated pipe in Nagios for OCHP/OCSP purposes, but I might implement a workaround in the daemon if there are requests for it (use the Talk page or email me...). It would take the form of "duplicate" file/fifo outputs to which everything received from the Nagios pipes would also be written.

Requirements

To successfully go through this HOWTO, you will need the following:

  • Fully configured Nagios master (passive) and slave (active) servers set up for OCHP/OCSP.
  • NSCA/send_nsca properly working between master/slave Nagios servers.
  • Libevent
  • Perl 5.8.6+ with Event::Lib
  • Some UNIX/Linux skills

Nagios Setup

The master Nagios box should already be set up for receiving all its check results from the command pipe. You shouldn't have to change anything on the master host.

On the slave Nagios box, you will first want to disable OCHP/OCSP as we're going to use a replacement processor using perfdata files.

obsess_over_hosts=0
obsess_over_services=0

Then you will require the following additional config:

# Enable Performance data processing.
process_performance_data=1

# Files to which Nagios will write data. In this setup
# they will be named pipes.
host_perfdata_file=/path/to/host-perfdata.fifo
service_perfdata_file=/path/to/service-perfdata.fifo

# This is exactly what will be sent to send_nsca. Do not change it.
host_perfdata_file_template=$HOSTNAME$\t$HOSTSTATEID$\t$HOSTOUTPUT$|$HOSTPERFDATA$
service_perfdata_file_template=$HOSTNAME$\t$SERVICEDESC$\t$SERVICESTATEID$\t$SERVICEOUTPUT$|$SERVICEPERFDATA$

# The write mode should be w, although append should have no effect on a named pipe.
host_perfdata_file_mode=w
service_perfdata_file_mode=w

# We don't want to process any command, so set this to 0
host_perfdata_file_processing_interval=0
service_perfdata_file_processing_interval=0
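
As a rough illustration of what the two templates above produce, the following sketch builds one host line and one service line and counts their tab-separated fields. The host name, service description, states, and plugin output are made-up sample values standing in for the expanded Nagios macros.

```shell
#!/bin/sh
# Hypothetical sample values standing in for the expanded Nagios macros.
host_line=$(printf '%s\t%s\t%s|%s' 'web01' '0' 'PING OK - rta 0.5ms' 'rta=0.5ms')
svc_line=$(printf '%s\t%s\t%s\t%s|%s' 'web01' 'HTTP' '2' 'Connection refused' '')

# Count tab-separated fields: send_nsca expects 3 for a host check
# (host, return code, output) and 4 for a service check
# (host, service description, return code, output).
host_fields=$(printf '%s\n' "$host_line" | awk -F '\t' '{ print NF }')
svc_fields=$(printf '%s\n' "$svc_line" | awk -F '\t' '{ print NF }')
echo "host fields: $host_fields, service fields: $svc_fields"
```

The output and the performance data share the last field, separated by the "|" character, which is why the field counts stay at 3 and 4.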

Note: Nagios blocks on start/restart when configured to write to named pipes, so set up the daemon before you restart Nagios with this config. If Nagios does get stuck, just cat both perfdata files once (host first, then service) and Nagios will finish loading. If no reader is attached after Nagios has started, Nagios will continue to function properly but anything written to the pipes will be lost.
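
The blocking behaviour can be reproduced outside Nagios. In this minimal sketch the scratch FIFO and the sample line are assumptions for the demo; the background printf stands in for Nagios opening its perfdata file, and a single cat releases it.

```shell
#!/bin/sh
dir=$(mktemp -d)              # hypothetical scratch directory
fifo="$dir/demo.fifo"
mkfifo "$fifo"

# Writer: opening a FIFO for writing blocks until a reader opens it,
# which is the situation Nagios is in at start-up.
printf 'web01\t0\tPING OK|rta=0.5ms\n' > "$fifo" &

# Reader: one cat unblocks the writer and drains what it wrote.
line=$(cat "$fifo")
wait
printf '%s\n' "$line"
rm -rf "$dir"
```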

Note: Don't forget to enable the "process_perf_data" parameter in each host and service object definition; without it Nagios does not process performance data for that host or service.

Filesystem setup

Before you start Nagios with the new performance data settings you must create named pipes according to the paths you used for "host_perfdata_file" and "service_perfdata_file" in your Nagios config.
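
A minimal sketch of creating the two pipes with mkfifo follows. It uses a scratch directory (an assumption, so the example can run anywhere) rather than your real perfdata paths; substitute the directory from your Nagios config.

```shell
#!/bin/sh
# Hypothetical scratch directory; use the directory of your real
# host_perfdata_file and service_perfdata_file paths instead.
fifo_dir=$(mktemp -d)

for name in host-perfdata.fifo service-perfdata.fifo; do
    # -m sets the permission bits when the pipe is created.
    mkfifo -m 660 "$fifo_dir/$name"
done

# -p is true only for named pipes, the same check the daemon performs.
count=0
for name in host-perfdata.fifo service-perfdata.fifo; do
    if [ -p "$fifo_dir/$name" ]; then
        count=$((count + 1))
    fi
done
echo "named pipes created: $count"
rm -rf "$fifo_dir"
```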

You should make them owned by root and readable and writable by the nagios group (this can be skipped if you run OCP_daemon as root):

chgrp nagios /path/to/host-perfdata.fifo
chmod 660 /path/to/host-perfdata.fifo
chgrp nagios /path/to/service-perfdata.fifo
chmod 660 /path/to/service-perfdata.fifo

Daemon Setup

The daemon code is at the end of this article. I recommend running it under Daemontools.

First, install OCP_daemon under some path (e.g. /usr/sbin/OCP_daemon).

Critical errors (there shouldn't be any if everything is properly set up) will be printed to stderr (in the Daemontools example below they are discarded). If you're having problems you can run the daemon manually. You should get an error message saying what's wrong.

# /usr/sbin/OCP_daemon -f /path/to/host-perfdata.fifo,/path/to/service-perfdata.fifo -n <path_to_send_nsca> -H <receiving_host> -c <nsca_config> -r 1

Run the daemon without arguments or with -h to print a descriptive usage screen. You can have the daemon send data as fast as possible (-r 0) or at given intervals. If you opt for intervals, you can also have it flush the data early when the queue reaches a certain size (-m); the timer is reset when that happens.

Warning: While using -r 0 should work well under light load, under heavy load testing (constantly feeding the pipe) system performance was highly degraded compared to -r 1, the default. For that reason I highly recommend not using -r 0.

Using Daemontools

Assuming Daemontools is installed, the following commands will set up the daemon:

# mkdir /etc/OCP_daemon
# cat <<EOF >/etc/OCP_daemon/run
#!/bin/sh
exec >/dev/null
exec 2>&1
exec setuidgid nagios /usr/sbin/OCP_daemon -f /path/to/host-perfdata.fifo,/path/to/service-perfdata.fifo -n <path_to_send_nsca> -H <receiving_host> -c <nsca_config> -r 1
EOF
# chmod +x /etc/OCP_daemon/run
# ln -s /etc/OCP_daemon /service/

Without Daemontools

You can start the daemon directly from your init scripts (e.g. rc.local). The daemon does not detach itself, so put it in the background:

/usr/sbin/OCP_daemon -f /path/to/host-perfdata.fifo,/path/to/service-perfdata.fifo -n <path_to_send_nsca> -H <receiving_host> -c <nsca_config> -r 1 &

Comments?

Please send comments/bugs/questions to Thomas Guyot-Sionnest

OCP_daemon code

Since I wasn't able to upload it, here's the code in cleartext. Copy it in full and paste it into your favorite editor.

#!/usr/bin/perl
# OCP_daemon - Obsessive Compulsive Host/Service Processor daemon for Nagios
#
# Copyright (C) 2007 Thomas Guyot-Sionnest <tguyot@gmail.com>
# Original code Copyright (C) 2006, 2007 Mark Steele
#       http://www.control-alt-del.org/code
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
#
use Event::Lib;
use Getopt::Std;
use POSIX;
use strict;
use warnings;
use vars qw($PROGNAME $VERSION $READ_SIZE $MAX_LINE_LENGTH $CHILD_TIMEOUT %args);

#####################################################################
#
$PROGNAME = 'OCP_daemon';
$VERSION = '1.0rc4';
#
# Try to get that much data each read. Normally a named pipe
# can't hold more than 4096 bytes.
$READ_SIZE = 4096;
#
# A line longer than this will be discarded.
$MAX_LINE_LENGTH = 8192;
#
# How long to wait for send_nsca. If you're sending huge batch
# updates on a very slow network you'll likely want to increase this.
$CHILD_TIMEOUT = 60;
#
#####################################################################

# Ignore HUPs in case we've been lazily started from the shell
$SIG{HUP} = 'IGNORE';

getopts("f:n:H:p:t:c:r:m:h", \%args);

# Print usage if missing options or -h
if (!$args{'f'} || !$args{'H'} || $args{'h'}) {
  if (!$args{'h'}) {
    print "You must specify at least one pipe to read\n" unless ($args{'f'});
    print "You must specify the host to send data to\n" unless ($args{'H'});
  }
  usage();
}

# Process options
my @fifos = split (/,/, $args{'f'});
my $reaper_delay = defined ($args{'r'}) ? $args{'r'} : 1; # "|| 1" would turn -r 0 into 1
my $max_queue = $args{'m'} || 0;

# Construct send_nsca command
my $nsca = $args{'n'} || '/usr/local/nagios/bin/send_nsca';
$nsca .= " -H $args{'H'}";
$nsca .= " -p $args{'p'}" if $args{'p'};
$nsca .= " -to $args{'t'}" if $args{'t'};
$nsca .= " -c $args{'c'}" if $args{'c'};

# Sanity checks
if ($reaper_delay !~ /^\d+$/) {
  print "reaper_delay must be an integer greater than or equal to 0!\n\n";
  usage();
}

if ($max_queue !~ /^\d+$/) {
  print "max_queue must be an integer greater than or equal to 0!\n\n";
  usage();
}

$max_queue = 0 unless ($reaper_delay);

# send_nsca test run
system ("$nsca </dev/null >/dev/null 2>/dev/null");
if ($? != 0) {
  print "Failed to run '$nsca', bailing out!\n";
  exit 1;
}

# Now the fun stuff :)

$0 = $PROGNAME;

# Set up a zombie reaper
my $signal = signal_new(SIGCHLD, \&reap_chld);
$signal->add;

my @queue;

## VERY IMPORTANT: You have to open the pipe in O_RDWR, POSIX has rules about 
##                 using polling calls on pipes, and can't do any on O_RDONLY
##
foreach my $fifo (@fifos) {
  die "$fifo is not a pipe!" unless (-p $fifo);
  sysopen(my $FIFO, $fifo, O_RDWR | O_NONBLOCK) || die "couldn't open $fifo: $!";
  my $reader = event_new(\*$FIFO, EV_READ, \&reader);
  $reader->add;
}

my $timer;
if ($reaper_delay) {
  $timer = timer_new(\&reaper);
  $timer->add($reaper_delay);
}

event_mainloop();

sub reap_chld {
  while (waitpid(-1, WNOHANG) > 0) {
  }
}

sub reaper {
  my $event = shift;

  if (@queue) {
    my $fork;
    if (($fork = fork) == 0) {
      # We're a child, make sure we don't stay around too long...
      alarm($CHILD_TIMEOUT);
      $0 = "$0 child";

      open(NSCA, "|$nsca >/dev/null 2>/dev/null") or die "Failed to spawn send_nsca: $!";
      print NSCA @queue;
      close(NSCA);
      exit;

    } elsif (!defined ($fork)) {
      # Fork failed, no free resources?
      die "Fork failed, no free resources?"
    } else {
      # We're the parent, empty the queue
      undef @queue;
    }
  }
  # Reschedule ourself if we're using the timer.
  $event->add($reaper_delay) if ($event);
}


sub reader {
  my $event = shift;
  my $fh = $event->fh;
  my $self = shift;
  my $data;

  if (scalar($event->args()) > 3) { ## Recursively called ourselves with data passed to function
    $data = $_[3];
  }

  my $ret = sysread ($fh, my $buf, $READ_SIZE);

  if (defined ($ret) && $ret == 0) { ## Shouldn't happen
    #print scalar localtime, " ACK: Got EOF?\n";
    die;
  } elsif (!defined ($ret)) { ## Shouldn't happen
    #print scalar localtime, " ACK: Error condition? $!\n";
    die;
  } elsif (!length ($buf)) { ## Shouldn't happen (length test so a lone "0" isn't treated as empty)
    #print scalar localtime, " ACK: Not EOF, not error, but nothing in buffer\n";
    die;
  }

  # 
  # Be safe here...
  $data .= $buf;
  while (my $marker = index ($data, "\n") + 1) {
    push (@queue, substr ($data, 0, $marker));
    $data = substr ($data, $marker);
      
    if ($max_queue && $max_queue <= @queue) {
      $timer->remove; # Reaper will re-add itself
      reaper($timer);
    }
  }

  # Process queue now if there's no timer
  reaper(0) unless ($reaper_delay);

  if (length ($data) && length ($data) < $MAX_LINE_LENGTH) {   ## Incomplete line
    #print "DATA LEFT AFTER PARSING: ------------\n$data\n-------------\n";
    $event->args($event->fh, EV_READ, $self, $data);
    $event->add;
    return;
  }

  $event->args($event->fh, EV_READ, $self);
  $event->add;
}

sub usage {
  print "$PROGNAME v.$VERSION - Obsessive Compulsive Host/Service Processor daemon\n";
  print "Usage:\n";
  print "  $PROGNAME -f <fifo>[,<fifo2>[,<fifoN>...]] -H <nsca_host> [ -n <nsca_bin> ]\n";
  print "  [ -p <nsca_port> ] [ -t <nsca_timeout> ] [ -c <nsca_config> ]\n";
  print "  [ -r <reaper_delay> ] [ -m <max_queue> ]\n\n";

  print "Options:\n";
  print "  -f <fifo>\tComma-separated list of fifo files to read from\n";
  print "\t\tThese files must be all named pipes (fifo)\n\n";

  print "  -n <nsca_bin>\tsend_nsca command path\n";
  print "\t\tDefaults to /usr/local/nagios/bin/send_nsca\n\n";

  print "  -H,-p,-t,-c\tSee corresponding send_nsca command\n\n";

  print "  -r <seconds>\tHow long to wait between nsca flushes\n";
  print "\t\t0 = as data arrive. Default: 1 second\n";
  print "\t\tWARNING: Setting this to 0 can be very resource-consuming!\n\n";

  print "  -m <slots>\tMax queue size if reaper_delay is greater than 0\n";
  print "\t\tA flush will be forced if the queue reaches this size\n\n";

  exit 1;
}

Todo

  • Allow $max_queue without using the reaper timeout (Minor change but I have to test it...)
  • Add a dup pipe option to allow chaining daemons (ex. using a modified NPDaemon on top of it).

Changelog

  • 1.0rc4 (2007-03-07)
    • Ignore HUPs in case it was started from the shell
  • 1.0rc3 (2007-03-02)
    • First version with multiple pipe support (Posted on Nagios-devel)