The EDDIE-Tool User's Manual


Introduction

The EDDIE-Tool (commonly just called EDDIE) is an agent for system, network and security monitoring. It is highly customizable and easily extendable. It has been designed to be as platform-independent as possible, with platform-specific code limited to a small group of modules, making it easily portable to new platforms. It is fully written in Python and the configuration has a Python "look-and-feel" to it, although no Python or coding skills are necessary to configure it.

This user's manual is specific to EDDIE-Tool versions 0.29 and above, as some significant changes were made to improve the configuration. Those changes, and the user's manual for earlier versions, are documented separately.


Installation

Downloading

You need to download the following:

Installing

Follow the QUICKSTART document (also located in the eddie/doc/ directory) or continue with the steps below.


Command-Line Options


Configuration

Config files

Global Configurables

The global configurables are usually in eddie.cf and are listed below:

eddie.cf is well documented, so read through the file and modify the settings to suit your environment.

Configuration Format

The EDDIE configuration follows the standard Python code format. Just as methods or child objects of a Python object are indicated by indenting them beneath the parent object definition, sub-objects or parameters of a directive object are indicated by indenting them beneath the directive definition. For example, a part of the configuration may look like:

    group testing:
        PING testping:
            host="10.0.0.1"
            numpings=10
            rule="not alive"
            action=email("chris", "%(host)s failed ping")

    FILE file1:
        file='/tmp/file1.tmp'
        scanperiod='2m'
        rule='not exists'
        action=ticker("%(file)s does not exist", timeout=1)
        act2ok=ticker("%(file)s now exists", timeout=1)
         
A config group called "testing" is defined, then the PING directive "testping" is configured inside this group because it is indented. Similarly, all testping's arguments are indented as they belong to the PING directive configuration. The second directive, FILE called "file1", is at the same indentation level as the group definition (i.e., not indented) and is therefore a global directive. Thus, all hosts using this example config would execute the FILE directive, but only those hosts in the "testing" group would execute the PING directive.
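The scoping rule described above can be sketched in a few lines of Python. This is only an illustration of how indentation determines group membership, not EDDIE's actual parser; the function name directive_scopes is hypothetical, and the sketch handles a single level of groups only.

```python
def directive_scopes(config_text):
    """Map each directive ID to its enclosing group (None means global).

    Hypothetical helper for illustration; it only understands one level of
    'group' nesting and treats '#' as starting a comment.
    """
    scopes = {}
    group, group_indent = None, None
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()
        if not line.strip() or not line.endswith(":"):
            continue                      # skip blank lines and argument lines
        indent = len(line) - len(line.lstrip())
        words = line.strip()[:-1].split()
        if group is not None and indent <= group_indent:
            group = None                  # a dedent closes the current group
        if words[0] == "group":
            group, group_indent = words[1], indent
        elif len(words) == 2:
            scopes["%s.%s" % tuple(words)] = group
    return scopes


example = """
group testing:
    PING testping:
        host="10.0.0.1"
        rule="not alive"

FILE file1:
    file='/tmp/file1.tmp'
    rule='not exists'
"""
print(directive_scopes(example))
# {'PING.testping': 'testing', 'FILE.file1': None}
```

Here PING.testping belongs to the "testing" group because it is indented beneath it, while FILE.file1 is global.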

If you are used to Python coding this will be second nature to you. If you are not, it will not be hard to pick up.

The above example also introduces the format of directive definitions. Directives are the rules which do "something". More often than not, they will perform system or network checks of some sort. But they are very flexible and could be configured to do more than simple checks.

In any case, the format of directive definitions is:
             DIRECTIVE name:
                 argument1=value1
                 [argument2=value2
                 ...]
         
where "DIRECTIVE" is the directive name, like PROC or FS, and "name" is the user-defined, unique name of this directive object. The arguments customize the directive appropriately. Some arguments are directive-specific while others are common to all directives.

Example:
             PROC test:
                name='syslogd'
                rule='not exists'
                scanperiod='30s'
                action=email("alert@my.domain","syslogd is not running")
         
This is an example definition of a PROC directive, called 'test'. It contains the PROC-specific argument, 'name'. 'rule', 'scanperiod' and 'action' are arguments which are common to all directives. Some arguments are optional while others are required, and errors will be raised if they are missing. In this example 'name' and 'rule' are required. 'scanperiod' and 'action' are optional.

Simple Configuration

An EDDIE configuration can be kept simple to get basic monitoring started quickly, or made as complicated as required to perform advanced operations. A simple example rules file to monitor basic services on a host is shown below. This rules file, named simple.rules, would be placed in the same directory as eddie.cf, and eddie.cf would contain the entry

        INCLUDE 'simple.rules'

The file simple.rules contains:
        # Process checks
        PROC syslogd:
            name='syslogd'
            rule='not exists'
            action=email('root', '%(name)s is not running on %(h)s')
        PROC inetd:
            name='inetd'
            rule='not exists'
            action=email('root', '%(name)s is not running on %(h)s')
        PROC sshd:
            name='sshd'
            rule='not exists'
            action=email('root', '%(name)s is not running on %(h)s')

        # Filesystem checks
        FS root:
            fs='/'
            rule='pctused > 90'
            action=email('root', '%(mountpt)s over 90%% on %(h)s')
        FS varlog:
            fs='/var/log'
            rule='pctused > 90'
            action=email('root', '%(mountpt)s over 90%% on %(h)s')

        # Service Port checks
        SP smtp_port:
            port='smtp'
            protocol='tcp'
            bindaddr='0.0.0.0'
            rule='not exists'
            action=email('root', '%(protocol)s/%(port)s on %(h)s is not listening')
        SP http_port:
            port='http'
            protocol='tcp'
            bindaddr='0.0.0.0'
            rule='not exists'
            action=email('root', '%(protocol)s/%(port)s on %(h)s is not listening')

        # System statistics checks
        SYS loadaverage:
            rule="loadavg1 > 3.00"
            scanperiod='1m'
            action=email('root', '%(h)s load-average > 3.00')
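The message strings in these actions use Python's dictionary-based string formatting: "%(name)s" is replaced by the value of the variable "name", and "%%" produces a literal percent sign. A minimal sketch, with invented values standing in for what a directive would supply:

```python
# Values are made up for illustration; in EDDIE they would come from the directive.
variables = {"mountpt": "/var/log", "h": "host1", "pctused": 93.456}

subject = "%(mountpt)s over 90%% on %(h)s" % variables
print(subject)   # /var/log over 90% on host1

# Conversion types other than 's' work too, e.g. fixed-point floats.
detail = "%(mountpt)s is %(pctused)0.1f%% full" % variables
print(detail)    # /var/log is 93.5% full
```

A variable name missing from the dictionary raises a KeyError in plain Python, so it is worth checking message strings against the variables each directive provides.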

Directives

The directives are the configuration commands which tell EDDIE what to do. They are of the form:
    DIRECTIVE name:
        arg1=value1
        arg2=value2
        argn=valuen 
Where "DIRECTIVE" is the name of the directive itself (see Built-in Directives); "name" is a user-defined name of the directive definition (the directive ID is usually constructed as "DIRECTIVE.name", e.g., "FS.root", and will appear in the logs, console, etc); "args" are arguments to define what the directive should do and how it should do it. Some arguments are common to all directives and others are specific to that type of directive.

Common Directive Arguments

Rule Format

The format of rules is very simple. For those familiar with Python, rules are simply Python expressions which are evaluated on-the-fly with the variables set by the directive at the time. For those unfamiliar with Python, the expressions are almost English-like using operators such as: not, and, or; and mathematical operators such as: ==, !=, >, <, >=, <=. Use these operators to evaluate the variables you are interested in. The whole expression should evaluate to 1 (i.e., true) if the actions set by the action argument should be executed. If it evaluates to 0 (i.e., false) only the actions set by the actelse argument are executed.
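As a rough illustration of the idea (not EDDIE's actual implementation), a rule can be thought of as a Python expression evaluated against the variables a directive has collected. The function name evaluate_rule and the sample values below are assumptions made for this sketch.

```python
def evaluate_rule(rule, variables):
    """Evaluate a rule expression against a dict of directive variables.

    Hypothetical helper for illustration; EDDIE's internals may differ.
    Restricting builtins keeps the example tidy but is not a security sandbox.
    """
    allowed = {"int": int, "float": float, "str": str, "len": len}
    return bool(eval(rule, {"__builtins__": allowed}, dict(variables)))


# Sample values a filesystem check might provide (made up for this example).
fs_vars = {"pctused": 93.5, "avail": 51200, "mountpt": "/var"}

print(evaluate_rule("pctused > 90", fs_vars))                        # True
print(evaluate_rule("pctused > 90 and avail > 100*1024", fs_vars))   # False
print(evaluate_rule("not exists", {"exists": 1}))                    # False
print(evaluate_rule("float(out) > 6.0", {"out": "7.25\n"}))          # True
```

When the result is true the action actions would run; when false, the actelse actions.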

As rule expressions are evaluated in a Python environment, links to related Python documentation are provided below.

  • Boolean Operations
  • Comparisons
  • String Methods

    Built-in Directives

    The built-in directives are grouped roughly into categories and are as follows:

    System Monitoring:

    Network Monitoring:

    Security Monitoring:

    Note that there may be many more directives depending on the version of EDDIE or any new or optional directives which may have been added to the distribution. See the EDDIE-Tool developer's guide for more information on creating new directives.

    Directive Details

    COM
    COM is a generic directive used to perform custom checks that other directives are not available for. It simply executes the given command in a sub-shell, and captures the stdout/stderr and return value for testing by the directive rule.

    Security note: if EDDIE is run as root, the config files should not be world-writable, because directives like COM can execute any command on the system.

    COM-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            # Check load average (the hard way, without using SYS)
            COM loadavg:
                cmd="uptime | cut -d, -f4 | awk '{print $3}'"
                rule="float(out) > 6.0"
                action=email("alert", "Load on %(h)s is > 6.0")
    
            # Check number of netscapes running
            COM count_ns:
                cmd="ps -ef | grep netscape | wc -l"
                rule="int(out) > 3"
                action=email("alert", "There are %(out)s netscapes running on %(h)s")
    
            # A variation on checking load average, using 'outfield' variables
            COM loadavg:
                cmd="uptime | cut -d, -f4"
                rule="float(outfield3) > 6.0"
                action=ticker("Load on %(h)s is %(outfield3)s", timeout=1)
        

    FILE
    This is a directive for performing checks on files or changes to files. Rules can be written based on any changes to the file metadata, like modification date, size, ownership, permissions, etc. It can also pick up changes to the file itself, which can be useful as a security check.

    FILE-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            # Alert when /etc/passwd changes
            FILE passwd_change:
                file='/etc/passwd'
                rule='mtime != lastmtime'
                action=email('alert','%(file)s has been modified.')
    
            # Alert when 'ps' changes
            FILE ps_change:
                file='/bin/ps'
                rule='md5 != lastmd5'
                action=email('alert','%(file)s has changed.')
    
            # Alert if file not owned by root
            FILE file_root:
                file='/usr/local/bin/testfile'
                rule='uid != 0'
                action=email('alert','%(file)s uid is %(uid)s.')
    
            ## Simple test that cron is working
            ## crontab should have an entry like:
            ##   0,15,30,45 * * * * /bin/touch /var/run/eddie/cron.test
            FILE cron_test:
                file='/var/run/eddie/cron.test'
                rule='exists and mtime < (now-15*60)'  # file modified over 15 minutes ago
                action=email("alert", "Cron test failed.", "%(file)s mtime=%(mtime)s now=%(now)s")
    
            # Make sure this file isn't a symlink
            FILE check_file:
                file='/etc/passwd'
                rule='issymlink'
                action=email('alert','%(file)s should not be a symlink !!')
    
            # Alert if a file disappears
            FILE file_missing:
                file='/etc/passwd'
                rule='missing'
                action=email('alert','%(file)s has disappeared')
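To make the kind of information behind these rules concrete, the following Python sketch gathers similar metadata and an MD5 digest for a file and compares two snapshots. It is an assumption-laden illustration, not EDDIE's code: the helper name file_snapshot is hypothetical, and the dictionary keys merely mirror the variable names used in the examples above.

```python
import hashlib
import os
import tempfile


def file_snapshot(path):
    """Return metadata and an MD5 digest for path (hypothetical helper)."""
    if not os.path.exists(path):
        return {"exists": False}
    st = os.stat(path)
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return {
        "exists": True,
        "mtime": st.st_mtime,
        "size": st.st_size,
        "uid": st.st_uid,
        "md5": digest,
        "issymlink": os.path.islink(path),
    }


# Demonstrate on a temporary file: a content change shows up in the digest.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("first version\n")
last = file_snapshot(tmp.name)
with open(tmp.name, "a") as f:
    f.write("tampered\n")
now = file_snapshot(tmp.name)
print(now["md5"] != last["md5"])   # True
os.remove(tmp.name)
```

Comparing a digest rather than only mtime catches changes made by tools that reset the modification time afterwards.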
        

    FS
    The FS directive is used to perform checks on local filesystems. Alerting when the filesystem is full would be the most common use for this directive.

    FS-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            # alert if / over 95% full
            FS root:
                fs='/'
                rule='pctused > 95'
                action=email("alert", "%(mountpt)s is over 95%% full on %(h)s")
    
            # alert if /var has less than 100MB available
            FS var:
                fs='/var'
                rule='avail < 100*1024'
                action=email("alert", "%(mountpt)s has only %(avail)dkB free on %(h)s")
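For readers curious where figures like pctused and avail come from, the sketch below derives comparable values with Python's shutil.disk_usage. This is an assumption about one way to compute them, not a description of EDDIE's collector; the helper name fs_stats is hypothetical, sizes are expressed in kB to match the examples above, and the percentage may differ slightly from what df reports.

```python
import shutil


def fs_stats(mountpt):
    """Return usage figures for a mounted filesystem (hypothetical helper)."""
    usage = shutil.disk_usage(mountpt)
    return {
        "mountpt": mountpt,
        "size": usage.total // 1024,    # kB
        "used": usage.used // 1024,     # kB
        "avail": usage.free // 1024,    # kB
        "pctused": 100.0 * usage.used / usage.total if usage.total else 0.0,
    }


stats = fs_stats("/")
if stats["pctused"] > 95:
    print("%(mountpt)s is over 95%% full" % stats)
else:
    print("%(mountpt)s is %(pctused)0.1f%% full" % stats)
```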
        

    HTTP
    This is a directive for performing remote HTTP and HTTPS tests against web servers.

    The elapsed connection time is recorded, and all related connection variables are made available, such as response code, headers and returned message body, as well as error information if the connection failed.

    SSL-support must be compiled into Python for HTTPS connections.

    The POST method is not yet supported.

    HTTP-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            # Check our web site is up.
            HTTP website:
                url='http://www.my.domain.name/index.html'
                rule='failed'
                action=email('alert', '%(url)s failed', 'exception: %(exception)s\nerrno: %(errno)s\nerrstr: %(errstr)s')
    
            # Check a certain page hasn't disappeared.
            HTTP mypage:
                url='http://www.my.domain.name/~fred/fred.html'
                rule='failed or not ok'
                action=email('fred', '%(url)s failed')
    
            # Store web site response time in RRD db.
            HTTP web_time:
                url='http://our.website.com/'
                rule='not failed'
                scanperiod='5m'
                action=elvinrrd('http-%(h)s_%(hostname)s', 'time=%(time)f')
                actelse=email('alert', 'Connection failed to %(url)s')
        

    IF
    The IF directive provides a mechanism for testing network interfaces. Interfaces listed in "netstat -i" are available for testing. The test can be simply whether the interface exists on a host or not, or it can be a more complex rule based on various statistics about that interface.

    IF-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            # alert if eth0 interface has disappeared
            IF ethexists:
                    name='eth0'
                    rule='not exists'
                    action=email('alert', 'interface %(name)s has disappeared on %(h)s')
    
            # alert if input packet errors are greater than 10% (Solaris)
            IF ierrs:
                    name='hme0'
                    rule="100.0*ierrs/ipkts > 10.0"
                    action=email('alert', 'input packet error > 10%% on %(name)s')
        

    LOGSCAN
    The LOGSCAN directive provides a facility to watch files for important messages. Every line in the file can be matched, or for a busy system, selective lines can be picked out using a regular expression pattern. Commonly the resulting lines are emailed to an admin, but any standard EDDIE action could also be performed with the results.

    This directive works by initially finding the end of the file on its first 'scan' and storing this location. On the second and subsequent scans, the directive will scan all the new lines of the file that have been added since the previous scan, and finish by storing the new location of the end-of-file. If, however, the file has been truncated (i.e., perhaps a log rotation has occurred) the directive will scan all lines from the start of the truncated file.

    Note: it is possible that some lines may be missed in between scanning the file and the file being truncated (or log rotation) if the scanperiod is not short enough. It is recommended that the scanperiod be short if the file is updated frequently (i.e., for a busy logfile).
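The scanning behaviour described above can be sketched as follows. This is a simplified illustration rather than EDDIE's implementation: the class name LogScanner is hypothetical, and using re.search (instead of matching from the start of each line) is a choice made for the sketch.

```python
import os
import re
import tempfile


class LogScanner:
    """Remember the end-of-file offset between scans (illustrative only)."""

    def __init__(self, path, regex=".*"):
        self.path = path
        self.pattern = re.compile(regex)
        self.offset = None              # unknown until the first scan

    def scan(self):
        size = os.path.getsize(self.path)
        if self.offset is None:
            # First scan: only record where the file currently ends.
            self.offset = size
            return []
        if size < self.offset:
            # File shrank (e.g. log rotation): start again from the top.
            self.offset = 0
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
            self.offset = f.tell()
        lines = data.decode("utf-8", "replace").splitlines()
        return [line for line in lines if self.pattern.search(line)]


with tempfile.NamedTemporaryFile("w", delete=False) as log:
    log.write("old line\n")
scanner = LogScanner(log.name, regex="error")
scanner.scan()                          # first scan only records end-of-file
with open(log.name, "a") as f:
    f.write("error: disk full\ninfo: all good\n")
print(scanner.scan())                   # ['error: disk full']
with open(log.name, "w") as f:          # simulate rotation by truncating
    f.write("error: new file\n")
print(scanner.scan())                   # ['error: new file']
os.remove(log.name)
```

The sketch also shows the limitation noted above: anything written after one scan and removed by a truncation before the next scan is never seen.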

    LOGSCAN-specific arguments:


    Rule Variables: Action Variables: Directive Examples:
            # Email all entries from /var/log/messages to alert every 12 hours.
            LOGSCAN messages:
                file='/var/log/messages'
                regex='.*'
                scanperiod='12h'
                action=email("alert", "%(h)s:%(file)s", "-- Logscan matched %(matchedcount)d lines: --\n%(lines)s")
    
            # Email lines from /var/log/httpd/error_log, and ignore "notice" messages
            LOGSCAN httpd_error:
                file='/var/log/httpd/error_log'
                regex='.*\[notice\].*'
                negate=true
                action=email("alert", "%(h)s:%(file)s", "-- Logscan matched %(matchedcount)d lines: --\n%(lines)s")
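One point worth keeping in mind when writing patterns: in a regular expression, square brackets denote a character class, so a pattern intended to match the literal text "[notice]" needs the brackets escaped. The short Python check below illustrates the difference; the sample log lines are invented.

```python
import re

lines = [
    "[Mon Jan 01 00:00:01] [notice] Apache configured, resuming normal operations",
    "[Mon Jan 01 00:05:12] [error] File does not exist: /var/www/favicon.ico",
]

literal = re.compile(r".*\[notice\].*")    # matches the literal text "[notice]"
char_class = re.compile(r".*[notice].*")   # matches any one of the letters n,o,t,i,c,e

print([bool(literal.match(line)) for line in lines])      # [True, False]
print([bool(char_class.match(line)) for line in lines])   # [True, True]

# Negating the literal match keeps only the lines that are not notices.
kept = [line for line in lines if not literal.match(line)]
print(len(kept))                                          # 1
```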
        

    METASTAT
    This directive, part of the Solaris directives module, allows simple checks to be performed on Solaris Disksuite devices. Currently it only checks whether any metadevices require maintenance, but will be expanded in the future.

    METASTAT-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
        # Check if any metadevice requires maintenance
        METASTAT maintenance:
            rule='need_maintenance'
            action=email('alert', 'A metadevice on %(h)s requires maintenance')
        

    NET
    The NET directive provides an interface to the kernel network statistics usually provided by a call to 'netstat -s'. Simple or complex rules can be written using these statistics.

    Linux Note: network stats counters are now collected from '/proc/net/snmp'. Try a 'cat /proc/net/snmp' to see what counters are available.

    NET-specific arguments:

    Rule Variables: Action Variables: Directive Examples:
            # alert if any UDP input errors (Solaris)
            NET udpinerr:
                    rule="udpInErrors > 0"
                    action=email('alert', '%(h)s has had %(udpInErrors)s UDP input errors')
        

    PID
    The PID directive is used to perform simple checks using the pid files which some programs generate. The most basic check is whether the pid file exists or not, which can often indicate whether the program is running or not; the second most basic check makes sure the pid found in the pid file also belongs to a process in the process table.

    PID-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            # alert if the sshd pid file doesn't exist
            PID sshdpid1:
                pidfile='/var/run/sshd.pid'
                rule='not exists'
                action=email("alert", "sshd pid file not found on %(h)s")
    
            # alert if the sshd pid doesn't match the process table
            PID sshdpid2:
                pidfile='/var/run/sshd.pid'
                rule='exists and not running'
                action=email("alert", "sshd pid not a valid process on %(h)s")
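The two checks can be sketched in Python as below. This is an illustration for POSIX systems only (on Windows, os.kill behaves differently and must not be used this way), and it is not EDDIE's implementation; the helper name check_pidfile is hypothetical, while the keys "exists" and "running" mirror the rule variables used above.

```python
import os
import tempfile


def check_pidfile(pidfile):
    """Return a dict with 'exists', 'running' and 'pid' (hypothetical helper)."""
    result = {"exists": False, "running": False, "pid": None}
    if not os.path.exists(pidfile):
        return result
    result["exists"] = True
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return result                   # unreadable or malformed pid file
    if pid <= 0:
        return result                   # never signal process groups
    result["pid"] = pid
    try:
        os.kill(pid, 0)                 # signal 0: existence check only
        result["running"] = True
    except ProcessLookupError:
        pass                            # no such process
    except PermissionError:
        result["running"] = True        # exists, but owned by another user
    return result


# Demonstrate with a pid file containing this interpreter's own pid.
with tempfile.NamedTemporaryFile("w", suffix=".pid", delete=False) as f:
    f.write("%d\n" % os.getpid())
status = check_pidfile(f.name)
print(status["exists"], status["running"])   # True True
os.remove(f.name)
print(check_pidfile(f.name)["exists"])       # False
```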
        

    PING
    This directive provides a facility for checking the availability of hosts on a network. It allows ICMP ping checks to be performed and rules and actions can be written based on whether the remote host is alive, packet loss and round trip times.

    PING-specific arguments:

    Rule Variables: Action Variables: Directive Examples:
            # Alert if host not responding
            PING foo:
                host="foo.domain.name"
                rule="not alive"
                action=email('alert', 'host foo is not responding to pings')
    
            # Alert via ticker if there is any packet-loss.
            PING badpings:
                host='10.0.0.5'
                numpings=20
                rule='pktloss > 0.0'
                scanperiod='1m'
                action=ticker("%(host)s packetloss=%(pktloss)0.1f%% avgrtt=%(avgtriptime)f sec")
        

    POP3TIMING
    The POP3TIMING directive is used to measure the performance of a POP3 server. EDDIE connects to the given POP3 server/port and logs in as the given user, then performs some standard commands before closing the connection. Each step of the connection is timed and the times are stored in variables to be used by the action(s).

    Besides timing information, this directive can also be used to perform basic checks on a POP3 server. A variable is set if the connection fails, so simple rules can be written to test this.

    POP3TIMING-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            POP3TIMING pop3test:
                server='pop3.domain.com'
                user='fred'
                password='foo'
                rule='connected'
                action=email('mary', 'host=%(server)s, username=%(username)s, connecttime=%(connecttime)f, authtime=%(authtime)f, listtime=%(listtime)f, retrtime=%(retrtime)f')
                actelse=email('alert', 'POP3 connection to %(server)s failed')
        

    PORT
    The PORT directive tests remote TCP-based services. The simplest test is to determine whether the service is accepting remote connections on a given TCP port.

    The test can be made more complex by defining send and expect strings. The send string will be sent to the remote host after connecting, and any reply will be matched against the expect string (a regular expression). The check fails if the result does not match.

    PORT-specific arguments:


    Rule Variables: Directive Examples:
            # check that 10.0.0.5 is accepting connections on port 80
            PORT webcheck:
                    host='10.0.0.5'
                    port=80
                    rule='not alive'
                    action=email('alert', 'port 80 not responding on 10.0.0.5')
    
            # check that a host is accepting connections on port 25
            PORT smtpcheck:
                    host='ahost.domain.name'
                    port=25
                    expect='220 '
                    rule='not alive or not matched'
                    action=email('alert', 'port 25 problem on ahost.domain.name')
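The connect, send and expect sequence described for PORT can be sketched with Python's socket module. This is a simplified illustration under stated assumptions, not EDDIE's code: the helper name check_port and its return shape are hypothetical, only the first 4096 bytes of the reply are examined, and the expect string is applied with re.search.

```python
import re
import socket


def check_port(host, port, send=None, expect=None, timeout=5.0):
    """Return (alive, matched) for a TCP service (hypothetical helper)."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False, False             # connection refused, timed out, etc.
    with sock:
        try:
            if send is not None:
                sock.sendall(send.encode())
            if expect is None:
                return True, True       # a plain connection test
            reply = sock.recv(4096).decode("utf-8", "replace")
        except OSError:
            return True, False          # connected, but no usable reply
    return True, re.search(expect, reply) is not None


# Example (needs a reachable SMTP server, so it is left commented out):
# alive, matched = check_port("ahost.domain.name", 25, expect="220 ")
```

A rule such as 'not alive or not matched' then corresponds to either element of the returned pair being false.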
        

    PROC
    The PROC directive is used to perform process checks. In the simplest case it is used to check if a process is not running when it should be (or running when it should not be). More complex rules can also be written, using most of the process statistics such as memory-usage, owner, percentage cpu used, running time, etc.

    PROC-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            # alert if cron is not running
            PROC cron:
                name='cron'
                rule='not exists'
                action=email("alert", "cron is not running on %(h)s")
    
            # syslog has a memory leak - alert if using over 50MB
            PROC syslogmem:
                name='syslogd'
                rule='vsz > 50*1024'
                action=email("alert", "%(name)s is using %(vsz)d kBytes")
        

    RADIUS
    The RADIUS directive provides a facility for performing radius authentication checks.

    RADIUS-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            RADIUS radtest:
                server='radius.domain.name:1812'
                secret='s3cr3t'
                user='bob@domain.name'
                password='b0bm@t3'
                rule='not passed'
                action='email("alert", "radius FAILED to %(host)s:%(port)d")'
        

    SP
    The SP directive is used to perform checks on listening service ports. These can be either TCP or UDP ports. The simplest use is to check if nothing is currently listening on the given port, protocol and bind address combination.

    SP-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            # alert if nothing listening on http port
            SP http:
                port='http'
                protocol='tcp'
                bindaddr='0.0.0.0'
                rule='not exists'
                action=email('alert', 'http port not bound to on %(h)s')
    
            # alert if nothing listening on tcp port 22 on 10.0.0.5
            SP sshport:
                port=22
                protocol='tcp'
                bindaddr='10.0.0.5'
                rule='not exists'
                action=email('alert', '%(protocol)s port %(bindaddr)s:%(port)s not listening')
        

    DBI
    The DBI directive is used to perform database queries (typically SQL), and check the results.

    DBI-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            # test that our postgresql server is alive and responding to requests properly
            DBI postgresql_check:
                dbtype='pg'
                host='localhost'
                database='monitoring'
                user='monitoring'
                password='sshhh'
                query='select * from monitoring'
                rule='not connected or results != 1 or result1 != 42'
                action=email(ALERT_EMAIL, 'PostgreSQL DB %(database)s failed test', 'Query: %(query)s\nConnected: %(connected)s\nError: %(errmsg)s')
            
            # alert if too many connections to the Postgres database
            DBI db_connections:
                dbtype='pg'
                host='localhost'
                database='mydb'
                user='pgsql'
                password='sekrit'
                query='select count(1) from pg_stat_activity'
                rule='connected and results > 0 and result1 > 40'
                action=email('alert', 'Database %(database)s on %(h)s: too many connections (currently %(result1)s)')
                console='%(database)s on %(host)s : connections = %(result1)d'
        

    SMTP
    This directive makes a connection to an SMTP server and returns the elapsed response time.

    SMTP-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
            SMTP smtp_test:
                server='mail.mydomain.com'
                rule='connected'
                action=email('alert', "SMTP connection to %(server)s:%(port)s took %(connecttime)s secs")
        

    SNMP
    This directive provides an SNMP client to retrieve data from remote hosts and devices via the SNMP protocol. Multiple values can be retrieved in one call. Standard EDDIE rules can then perform tests on the retrieved data, or the data could be stored in RRD files using the elvinrrd action (for instance).

    SNMP-specific Arguments:


    Rule Variables: Action Variables: Directive Examples:
        # Fetch a counter from a device
        SNMP foo:
            host='alt1.domain.name'
            oid='1.3.6.1.4.1.1872.2.1.1.6.0'
            community='private'
            rule='response > 0'
            maxretry=10
            action=email('alert', 'Head for the lifeboats: %(snmpresponse)s')
    
        SNMP router_traffic:
            scanperiod='5m'
            host='10.0.0.1'
            oid='1.3.6.1.2.1.2.2.1.10.2, 1.3.6.1.2.1.2.2.1.16.2'
            community='special'
            rule='not failed'
            maxretry=10
            action=elvinrrd("net-router_BRI01", "ibytes=%(response1)s", "obytes=%(response2)s")
        

    STORE
    The STORE directive is still being developed and tested. It will be documented at a later date.

    SYS
    The SYS directive provides an interface to the kernel's system statistics. Simple or complex rules can be written using these statistics.

    SYS-specific arguments:

    Rule Variables: Action Variables: Directive Examples:
            # alert if 1 minute load average > 2
            SYS loadavg1:
                    rule="loadavg1 > 2.0"
                    action=email('alert', '%(h)s has a loadavg1 of %(loadavg1)0.2f')
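For orientation, the load-average figures that such a rule tests can be read in Python with os.getloadavg() on Unix-like systems. Whether EDDIE collects them this way is not stated in this manual; the variable names loadavg5 and loadavg15 are assumptions by analogy with loadavg1, and the host name is invented.

```python
import os

# os.getloadavg() is available on Unix-like systems only.
load1, load5, load15 = os.getloadavg()
sys_vars = {"h": "host1", "loadavg1": load1, "loadavg5": load5, "loadavg15": load15}

rule = "loadavg1 > 2.0"
# Evaluate the rule against the collected variables, as a directive would.
if eval(rule, {"__builtins__": {}}, sys_vars):
    print("%(h)s has a loadavg1 of %(loadavg1)0.2f" % sys_vars)
else:
    print("%(h)s load is normal (%(loadavg1)0.2f)" % sys_vars)
```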
        

    DISK
    The DISK directive provides an interface to the kernel's disk I/O statistics. Simple or complex rules can be written using these statistics. This requires the data collector diskdevice:DiskStatistics (which is only available on Solaris and Win32 at time of writing).
    [Eddie 0.35+]

    Directive-specific arguments:

    Rule Variables: Action Variables: Directive Examples:
            # /dev/md/dsk/d20 == /var : send read/write counters to RRD
            DISK md20_thruput:
                device='md20'
                scanperiod='5m'
                rule='True'        # always perform action
                action='elvinrrd("disk-%(h)s_%(device)s", "rbytes=%(nread)s", "wbytes=%(nwritten)s")'
        

    TAPE
    The TAPE directive provides an interface to the kernel's tape I/O statistics. Simple or complex rules can be written using these statistics. This has almost exactly the same functionality as the DISK directive. This requires the data collector diskdevice:TapeStatistics (which is only available on Solaris at time of writing).
    [Eddie 0.35+]

    Directive-specific arguments:

    Rule Variables: Action Variables: Directive Examples:
            # st65 == TAPE : send tape read/write counters to RRD
            TAPE st65_thruput:
                device='st65'
                scanperiod='5m'
                rule='True'        # always perform action
                action=elvinrrd("tape-%(h)s_%(device)s", "rbytes=%(nread)s", "wbytes=%(nwritten)s")
        

    Actions

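    Actions are performed when a directive's rule evaluates to true (actions given by the actelse argument are performed when it evaluates to false).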
    Actions currently include:

    Action Details

    log
    log performs message logging to a file, to the tty where eddie was executed, or to syslog. The destination depends on via, the second parameter. If via looks like "XXX.YYY", syslog-type logging is assumed. If via begins with a "/", logging to a file is assumed. If via is the string "tty", the message goes to the tty where eddie was executed. Multiple vias may be specified by separating them with a ";", as in "FACILITY.LEVEL;/path/to/file1.txt;/path/to/file2.log".

    Format:

    Action Examples:
            # generate a syslog notification using the LOG_DAEMON facility and LOG_ALERT level
            action=log("There is a problem on %(h)s", "DAEMON.ALERT")
    
            # append a message to a log file
            action=log("There is a problem on %(h)s", "/var/log/eddie_disk.log")
    
            # display a message on the tty that eddie was started on, and append to eddie.log
            action=log("There is a problem on %(h)s", "tty;/var/log/eddie.log")
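The way a via string is interpreted, as described for log above, can be sketched as a small classification function. The name parse_via and the returned tuple layout are hypothetical; this illustrates the stated rules and is not EDDIE's own code.

```python
def parse_via(via):
    """Split a via string into (kind, target) pairs (hypothetical helper)."""
    destinations = []
    for part in via.split(";"):
        part = part.strip()
        if not part:
            continue
        if part == "tty":
            destinations.append(("tty", None))
        elif part.startswith("/"):
            # Checked before the "." test, since file names usually contain dots.
            destinations.append(("file", part))
        elif "." in part:
            facility, level = part.split(".", 1)
            destinations.append(("syslog", (facility, level)))
        else:
            raise ValueError("unrecognised via: %r" % part)
    return destinations


print(parse_via("DAEMON.ALERT;/var/log/eddie.log;tty"))
# [('syslog', ('DAEMON', 'ALERT')), ('file', '/var/log/eddie.log'), ('tty', None)]
```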
        

    email
    email performs message emailing.

    How it goes about sending the email depends on your SENDMAIL and SMTP_SERVERS Eddie config options.

    Format:

    Action Examples:
            # generate an email alert
            action=email("me@mydomain.com,them@myotherdomain.com", "There is a problem on %(h)s", "Problem age: %(problemage)s")
        

    system
    system allows execution of operating system commands.

    Format:

    Action Examples:
            # run command to rotate the web log file
            action=system("rotate /var/log/web_log")
        

    restart
    restart runs the command "/etc/init.d/(name) start". It is usually used to restart a dead daemon.

    Format:

    Action Examples:
            # restart the httpd server
            action=restart("httpd")
        

    nice
    Change the "nice" value of a running process, either up or down. Note that in order to lower the nice value (i.e., raise the priority) of a process, eddie has to be running as super-user.

    The process acted upon is the current pid in the dictionary, so this action only works for PROC and PID directives.

    Format:

    Action Examples:
            # change the execution of the process to take a little less time
            action=nice("+", 5)
    
            # de-prioritize the process
            action=nice(20)
        

    eddielog
    This action allows for logging messages to the log file that eddie is configured to use. Depending on the ADMINLEVEL setting, the message may also (eventually, depending on ADMIN_NOTIFY setting) get emailed to the ADMIN.

    Format:

    Action Examples:
            # generate an informational message to the eddie log file
            action=eddielog("Disk issue on %(h)s: used level is %(pctused)s%%")
    
            # generate a high-priority message to the eddie log file
            # (and probably to the ADMIN as well, eventually)
            action=eddielog("Disk issue on %(h)s: used level is %(pctused)s%%", 9)
        

    ticker
    Send a ticker-type message to an Elvin listener.

    Format:

    Action Examples:
            # send a ticker-type message
            action=ticker("%(file)s does not exist", timeout=1)
        

    page
    Send a page to the specified recipients. Currently implemented as an email.

    Format:

    Action Examples:
            # send a page to the ADMIN_PAGER alias
            action=pager(ADMIN_PAGER, "Host %(server)s is inaccessible")
    
            # send a page to a Sprint phone
            action=pager("734657XXXX@messaging.sprintpcs.com", "Host %(server)s is inaccessible")
        

    elvindb
    Send information to a database listener via Elvin. The data to insert into the database can be specified in the data argument as 'col1=data1, col2=data2, col3=data3'; if data is not specified, the values sent previously are used.

    Format:

    Action Examples:
            # send data to table "MYTABLE" via elvindb
            action=elvindb("MYTABLE", "host=%(h)s,load1=%(load1)s,load5=%(load5)s")
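
    The layout of the data argument can be illustrated by splitting such a string into column/value pairs. This is only a sketch of the string format: the helper name parse_data_string is hypothetical, the values are made up, and how the database listener actually parses the string is not described in this manual.

```python
def parse_data_string(data):
    """Split 'col1=data1, col2=data2' into a dict of column -> value.

    Hypothetical helper, shown only to illustrate the string layout.
    Assumes values contain no commas.
    """
    columns = {}
    for pair in data.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        name, sep, value = pair.partition("=")
        if not sep:
            raise ValueError("missing '=' in %r" % pair)
        columns[name.strip()] = value.strip()
    return columns

# The data string from the example above, after variable substitution
# with made-up values.
data = "host=%(h)s,load1=%(load1)s,load5=%(load5)s" % {
    "h": "web01", "load1": 0.42, "load5": 0.31}
print(parse_data_string(data))
```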
        

    elvinrrd
    Send information to a RRDtool database listener via Elvin.

    Format:

    Action Examples:
            # send the one-minute load average every minute for this host
            SYS loadavg1_rrd:
                rule='True'        # always true
                scanperiod='1m'
                action=elvinrrd('loadavg1-%(h)s', 'loadavg1=%(loadavg1)f')
        

    netsaint
    Send information to a NetSaint listener via Elvin.

    Format:

    Action Examples:
            # send the free memory size to the NetSaint consumer
            action=netsaint("EddieMem", "Free memory on %(h)s: %(memfree)s", 1)
        

    Notification and Message objects

    Notification objects define levels of actions to be performed. Usually, the higher the level, the more serious the actions will be. Later versions of EDDIE will use notification objects for advanced features like problem escalation.
    Message objects define messages to be used in actions like email or paging. They are grouped together to provide a common way to call them from notification objects.


    Other Features

    Console

    EDDIE features a Console facility which provides live information about the active directives via a TCP connection. The TCP port used is set by the CONSOLE_PORT setting in eddie.cf and defaults to port 33343. Set this to 0 to disable this feature.

    By default, every directive is shown in the Console output in the format "<ID> - <state>". The text shown after the ID can be changed with the console directive argument, and a directive can be hidden from the Console entirely by setting this argument to None.

    Substitution variables available to the console argument string are:

    Directive examples:

        # check root filesystem usage
        FS rootfs:    fs='/'
                      rule="pctused > 95"
                      action=email("root", "%(mountpt)s at %(pctused)s%%")
                      console='%(state)s %(pctused)s%%'
    
        # email me load average every 5mins
        SYS loadavg5: rule="True"
                      action=email('chris', '%(h)s loadavg5: %(sysloadavg5).02f')
                      scanperiod='5m'
                      console="loadavg5=%(sysloadavg5).02f"
    
        # store root filesystem data in RRD (don't show on Console)
        FS root_rrd:  fs='/'
                      rule="True"
                      scanperiod='5m'
                      action=elvinrrd("fs-%(h)s_root", "used=%(fsused)s", "size=%(fssize)s")
                      console=None
    

    Console example:

        $ telnet localhost 33343
        Trying 127.0.0.1...
        Connected to localhost.
        Escape character is '^]'.
        Eddie Console Gateway
        FS.rootfs - ok 33%
        SYS.loadavg5 - loadavg5=0.14
        Connection closed by foreign host.
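
    The same information can be collected from a script instead of telnet. The sketch below assumes the behaviour seen in the session above: the server writes its report and then closes the connection. The function names read_console and parse_console are hypothetical and not part of EDDIE.

```python
import socket

def read_console(host="localhost", port=33343, timeout=5.0):
    """Connect to the EDDIE Console port and return everything it sends.

    Assumes, as in the telnet session above, that the server writes its
    report and then closes the connection.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:  # empty read: the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")

def parse_console(text):
    """Map directive ID -> displayed text for each '<ID> - <text>' line."""
    states = {}
    for line in text.splitlines():
        ident, sep, shown = line.partition(" - ")
        if sep:  # skips the banner line, which has no ' - '
            states[ident.strip()] = shown.strip()
    return states
```

    With an EDDIE agent running locally, parse_console(read_console()) would return a dictionary keyed by directive ID, for example "FS.rootfs" mapped to "ok 33%" for the session shown above.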
    


    System-specific information (TODO: not finished)

    Solaris:

      System stats from '/usr/bin/uptime':
          uptime          - time since last boot (string)
          users           - number of logged on users (int)
          loadavg1        - 1 minute load average (float)
          loadavg5        - 5 minute load average (float)
          loadavg15       - 15 minute load average (float)
    
      System counters from '/usr/bin/vmstat -s' (see vmstat(1M)):
          ctr_swap_ins                            - (long)
          ctr_swap_outs                           - (long)
          ctr_pages_swapped_in                    - (long)
          ctr_pages_swapped_out                   - (long)
          ctr_total_address_trans_faults_taken    - (long)
          ctr_page_ins                            - (long)
          ctr_page_outs                           - (long)
          ctr_pages_paged_in                      - (long)
          ctr_pages_paged_out                     - (long)
          ctr_total_reclaims                      - (long)
          ctr_reclaims_from_free_list             - (long)
          ctr_micro_hat_faults                    - (long)
          ctr_minor_as_faults                     - (long)
          ctr_major_faults                        - (long)
          ctr_copyonwrite_faults                  - (long)
          ctr_zero_fill_page_faults               - (long)
          ctr_pages_examined_by_the_clock_daemon  - (long)
          ctr_revolutions_of_the_clock_hand       - (long)
          ctr_pages_freed_by_the_clock_daemon     - (long)
          ctr_forks                               - (long)
          ctr_vforks                              - (long)
          ctr_execs                               - (long)
          ctr_cpu_context_switches                - (long)
          ctr_device_interrupts                   - (long)
          ctr_traps                               - (long)
          ctr_system_calls                        - (long)
          ctr_total_name_lookups                  - (long)
          ctr_toolong                             - (long)
          ctr_user_cpu                            - (long)
          ctr_system_cpu                          - (long)
          ctr_idle_cpu                            - (long)
          ctr_wait_cpu                            - (long)
    
      Process/memory stats from '/usr/bin/vmstat' (see vmstat(1M)):
          procs_running   - number of processes running (int)
          procs_blocked   - number of processes blocked (int)
          procs_waiting   - number of processes waiting (int)
          mem_swapfree    - amount of free swap (kB) (int)
          mem_free        - amount of free RAM (kB) (int)
              
    Linux:
      loadavg1              - 1min load average (float)
      loadavg5              - 5min load average (float)
      loadavg15             - 15min load average (float)
      ctr_uptime            - uptime in seconds (float)
      ctr_uptimeidle        - idle uptime in seconds (float)
      ctr_cpu_user          - total cpu in user space (int)
      ctr_cpu_nice          - total cpu in user nice space (int)
      ctr_cpu_system        - total cpu in system space (int)
      ctr_cpu_idle          - total cpu in idle thread (int)
      ctr_cpu%d_user        - per cpu in user space (e.g., cpu0, cpu1, etc) (int)
      ctr_cpu%d_nice        - per cpu in user nice space (e.g., cpu0, cpu1, etc) (int)
      ctr_cpu%d_system      - per cpu in system space (e.g., cpu0, cpu1, etc) (int)
      ctr_cpu%d_idle        - per cpu in idle thread (e.g., cpu0, cpu1, etc) (int)
      ctr_pages_in          - pages read in (int)
      ctr_pages_out         - pages written out (int)
      ctr_pages_swapin      - swap pages read in (int)
      ctr_pages_swapout     - swap pages written out (int)
      ctr_interrupts        - number of interrupts received (int)
      ctr_contextswitches   - number of context switches (int)
      ctr_processes         - number of processes created (forks) since boot (int)
      boottime              - time of boot (epoch) (int)
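
    This manual does not say where the Linux values are read from. The counter names resemble the fields of /proc/stat, so, assuming that source, the sketch below shows how such names could be derived. The function parse_proc_stat and the mapping it uses are hypothetical; EDDIE's own collector may differ.

```python
def parse_proc_stat(text):
    """Extract EDDIE-style variable names from /proc/stat content.

    Hypothetical mapping, assumed from the variable names listed above;
    EDDIE's own collector may differ.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, values = fields[0], fields[1:]
        if key.startswith("cpu") and len(values) >= 4:
            # "cpu" is the total; "cpu0", "cpu1", ... are per-CPU lines.
            for name, value in zip(("user", "nice", "system", "idle"), values):
                stats["ctr_%s_%s" % (key, name)] = int(value)
        elif key == "intr" and values:
            stats["ctr_interrupts"] = int(values[0])   # first field is the total
        elif key == "ctxt" and values:
            stats["ctr_contextswitches"] = int(values[0])
        elif key == "processes" and values:
            stats["ctr_processes"] = int(values[0])    # forks since boot
        elif key == "btime" and values:
            stats["boottime"] = int(values[0])         # seconds since the epoch
    return stats

# Made-up sample in the /proc/stat layout.
sample = """cpu  100 5 50 1000 0 0 0
cpu0 100 5 50 1000 0 0 0
intr 12345 1 2 3
ctxt 6789
btime 1100000000
processes 4321
"""
print(parse_proc_stat(sample)["ctr_cpu0_idle"])  # 1000
```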
              
    HP-UX:
      System stats from '/usr/bin/uptime':
          uptime          - (string)
          users           - (int)
          loadavg1        - (float)
          loadavg5        - (float)
          loadavg15       - (float)
    
      System counters from '/usr/bin/vmstat -s' (see vmstat(1)):
          ctr_swap_ins                                    - (long)
          ctr_swap_outs                                   - (long)
          ctr_pages_swapped_in                            - (long)
          ctr_pages_swapped_out                           - (long)
          ctr_total_address_trans_faults_taken            - (long)
          ctr_page_ins                                    - (long)
          ctr_page_outs                                   - (long)
          ctr_pages_paged_in                              - (long)
          ctr_pages_paged_out                             - (long)
          ctr_reclaims_from_free_list                     - (long)
          ctr_total_page_reclaims                         - (long)
          ctr_intransit_blocking_page_faults              - (long)
          ctr_zero_fill_pages_created                     - (long)
          ctr_zero_fill_page_faults                       - (long)
          ctr_executable_fill_pages_created               - (long)
          ctr_executable_fill_page_faults                 - (long)
          ctr_swap_text_pages_found_in_free_list          - (long)
          ctr_inode_text_pages_found_in_free_list         - (long)
          ctr_revolutions_of_the_clock_hand               - (long)
          ctr_pages_scanned_for_page_out                  - (long)
          ctr_pages_freed_by_the_clock_daemon             - (long)
          ctr_cpu_context_switches                        - (long)
          ctr_device_interrupts                           - (long)
          ctr_traps                                       - (long)
          ctr_system_calls                                - (long)
          ctr_Page_Select_Size_Successes_for_Page_size_4K - (long)
          ctr_Page_Select_Size_Successes_for_Page_size_16K - (long)
          ctr_Page_Select_Size_Successes_for_Page_size_64K - (long)
          ctr_Page_Select_Size_Successes_for_Page_size_256K - (long)
          ctr_Page_Select_Size_Failures_for_Page_size_16K - (long)
          ctr_Page_Select_Size_Failures_for_Page_size_64K - (long)
          ctr_Page_Select_Size_Failures_for_Page_size_256K - (long)
          ctr_Page_Allocate_Successes_for_Page_size_4K    - (long)
          ctr_Page_Allocate_Successes_for_Page_size_16K   - (long)
          ctr_Page_Allocate_Successes_for_Page_size_64K   - (long)
          ctr_Page_Allocate_Successes_for_Page_size_256K  - (long)
          ctr_Page_Allocate_Successes_for_Page_size_64M   - (long)
          ctr_Page_Demotions_for_Page_size_16K            - (long)
              

    Appendix A

    Time Definition

    The format for specifying time is either:



    © Chris Miles 2002-2005

    $Id: manual.html 910 2007-12-10 12:48:51Z chris $