Using rrdtool to generate server load & bandwidth graphs

I’ve been using MRTG and routers2.cgi for years to graph the various aspects of a server that warrant monitoring. I’ve long known that they used something called rrdtool to do… well, something, but never had a need or desire to figure out exactly what that was.

But, having just moved my site to a new server, I was curious how the server would handle the load. Rather than setting up some behemoth like Nagios or Zabbix, which are full monitoring/alerting suites, I just wanted graphing. As I said, in the past I’ve used MRTG or routers2.cgi for this but both of them were overkill for me in this case. Since both of them used rrdtool, I figured that was a good place to look.

The two metrics I want to record are server load and in/out bandwidth. The first step is to create the RRDs (round robin databases). This was done via these commands:

# rrdtool create /mrtg/load.rrd --start N DS:load1:GAUGE:600:0:100 DS:load5:GAUGE:600:0:100 DS:load15:GAUGE:600:0:100 RRA:AVERAGE:0.5:2:800

# rrdtool create /mrtg/eth1.rrd --start N DS:in:COUNTER:600:0:10000000000 DS:out:COUNTER:600:0:10000000000 RRA:AVERAGE:0.5:2:800

A good explanation of what these various fields mean is here. In short, each “DS:” section defines a “column” (for fellow RDBMS users) in the database. The first one has 3 “columns,” named load1, load5, load15, each of which will contain GAUGE data. The second one contains two COUNTER fields, representing the bytes in/out for interface eth1.

To actually get the data I poll snmpd via this bash script:

#!/bin/bash

rrdupdate /mrtg/load.rrd N:
`/usr/bin/snmpget -v 2c -c public -Oqv localhost laLoad.1`:
`/usr/bin/snmpget -v 2c -c public -Oqv localhost laLoad.2`:
`/usr/bin/snmpget -v 2c -c public -Oqv localhost laLoad.3`

rrdupdate /mrtg/eth1.rrd N:
`/usr/bin/snmpget -v 2c -c public -Oqv localhost ifInOctets.3`:
`/usr/bin/snmpget -v 2c -c public -Oqv localhost ifOutOctets.3`

I have that run every 5 minutes via cron. Then to generate the actual graph, I run this script via cron:

#!/bin/bash

rrdtool graph /var/www/html/graphs/load.png 
        -N 
        -E 
        --start now-30hours 
        --title "Load Averages" 
        --width 300
         --x-grid MINUTE:60:HOUR:2:HOUR:4:0:%H
        --height 200 
        -u 1.0 
        --lower-limit 0
        --vertical-label "Load Avg" 
        --full-size-mode 
-a PNG --title="Load Avg" 
'DEF:load1=/mrtg/load.rrd:load1:AVERAGE' 
'VDEF:load1last=load1,LAST' 
'DEF:load5=/mrtg/load.rrd:load5:AVERAGE' 
'DEF:load15=/mrtg/load.rrd:load15:AVERAGE' 
'AREA:load15#33CC33:15 Min Load Avg ' 
'LINE1:load1#0000ff:1 Min Load Avg ' 
'GPRINT:load1:AVERAGE:"Load1 Avg:%3.2lf"' 
'GPRINT:load1last:Drawn at %Y-%m-%d, %H:%M:strftime' 
#'LINE1:load5#ff00ff:5 Min Load Avg ' 

 
rrdtool graph /var/www/html/graphs/eth1.png 
        -N 
        -E 
        --start now-30hours 
        --title "eth1 traffic" 
        --width 300
         --x-grid MINUTE:60:HOUR:2:HOUR:4:0:%H
        --height 200 
        -u 1000000 
        --lower-limit 0
        --vertical-label "bps" 
        --full-size-mode 
-a PNG --title="eth1 traffic" 
'DEF:eth1in=/mrtg/eth1.rrd:in:AVERAGE' 
'CDEF:eth1inbits=eth1in,8,*' 
'VDEF:eth1last=eth1in,LAST' 
'DEF:eth1out=/mrtg/eth1.rrd:out:AVERAGE' 
'CDEF:eth1outbits=eth1out,8,*' 
'AREA:eth1inbits#33CC33:eth1 in ' 
'LINE1:eth1outbits#0000ff:eth1 out' 
'GPRINT:eth1last:Drawn at %Y-%m-%d, %H:%M:strftime' 

The final graphs look decent, though not very fancy, but I’ll play around with it a bit more:

eth1 graph
eth1 graph
load graph
load graph

TP-Link TL-WR841ND v7 802.11n router, wireless dies after a few days

I mentioned in a previous post that I got the TP-Link TL-WR841ND 802.11n wifi router and it solved the speed problems I was noticing with wifi connections since going from FiOS to Cablevision. This seems to be the case still, however I’ve now had another problem with the TP router. Basically, wireless becomes unusable and the web UI becomes inaccessible. The SSID still shows up but I can’t get an IP address. When accessing it from the wired LAN via a browser, the connection times out – apparently whatever’s going on inside the router is also crashing its internal webserver.

Power-cycling the router resolved the issue both times it occurred (most recently tonight), but twice is two times too many. Tonight I downloaded and installed DD-WRT v24-sp2 and configured it. It only took a few minutes – I was pretty impressed with dd-wrt – though I was surprised not to see SNMP monitoring included. Not sure if I missed it in the UI but I assumed it would be under “Services,” and I didn’t see it there. I tried snmpwalk against the router and it returned nothing, so it’s not on by default.

Anyway, hopefully dd-wrt will give me better luck than the native TP-Link firmware. It seems to be a good router hardware wise, but crashing every few days negates that.

Update: April 19, 2011: I’ve had DD-WRT running for a few weeks now on the TP-Link router and it’s been great. No reboots required. For some reason DD-WRT doesn’t seem to have SNMP available, at least not through the web UI, but other than that it’s far better than the default TP-Link software.

http://rcm.amazon.com/e/cm?lt1=_blank&bc1=000000&IS2=1&bg1=FFFFFF&fc1=000000&lc1=0000FF&t=evanhoffmasho-20&o=1&p=8&l=as4&m=amazon&f=ifr&ref=ss_til&asins=B0034CN0AS

Using Zabbix for SNMP monitoring disk usage percent for Windows hosts

A few years ago we moved from Nagios to Zabbix for our server monitoring needs. I wasn’t a big fan of Nagios, finding it a pain to manage with its myriad configuration files. It’s probably gotten better since I last toyed with it but since we moved to Zabbix I haven’t had much reason to look at Nagios again.
Continue reading “Using Zabbix for SNMP monitoring disk usage percent for Windows hosts”