The not-so-secret secret to Postgres performance

I manage a bunch of Postgres DBs and one of the things I almost always forget to do when setting up a new one is set the readahead up from the default of 256. I created this script and run it out of /etc/rc.local and sometimes cron it too. The 3 commands at the top are only really relevant on systems with “huge” memory – probably over 64 GB. We ran into some memory problems with CentOS 6 on a box with 128 GB ram which we ended up working around by reinstalling CentOS 5, but the /sys/kernel/mm/redhat_transparent_hugepage/ options below should fix them in 6.x (though we haven’t actually tried it on that DB, we haven’t seen any problems in other large DBs).


echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag
echo never >/sys/kernel/mm/redhat_transparent_hugepage/defrag

echo 1 > /proc/sys/vm/dirty_background_ratio

BLOCK_DEVICES=`perl -ne 'chomp; my @a=split(/[s]+/); next unless $a[4]; next if ($a[4] =~ /sd[a-z]+[d]+/); print "$a[4]n";' /proc/partitions `

logger -t tune_blockdevs "Block devices matching: $BLOCK_DEVICES"

for DEV in $BLOCK_DEVICES;  do
        logger -t tune_blockdevs "Setting IO params for $DEV"
### Uncomment the below line if using SSD
#        echo "noop" > /sys/block/$DEV/queue/scheduler
        /sbin/blockdev --setra 65536 /dev/$DEV

Problems with PostgreSQL 9.0 on CentOS 6.3 with a 2.6.32-series kernel

Note: this is cross-posted from the Yodle Tech Blog.

We recently migrated our Postgres database to a significantly more powerful machine. While our existing server was doing okay load-wise, we were approaching 90% full on its Postgres volume. We’d also recently acquired an awesome new PureStorage SSD-based SAN and were excited to move our database onto it. For comparison, here are the the specs for the two systems:

Old server New server
Nickname DB1 DB2
CPU 2x 6-core Xeon X5690 CPUs at 3.47GHz 2x 8-core E5-2690 CPUs at 2.90 GHz
Memory 96 GB 1333 MHz 128 GB 1600 MHz
  • Dell PERC H800 SAS HBA
  • Dell MD1220 drive enclosure
  • 24x 15k RPM SAS drives configured in RAID 10
  • QLogic QLE2562 dual-port 8 Gbit FC HBA
  • PureStorage SSD-based SAN
OS CentOS 5.6 x86_64 CentOS 6.3 x86_64
Kernel 2.6.18-238.9.1.el5 2.6.32-279.2.1.el6.x86_64

As you can probably guess, moving from spinning disk onto SSD was the centerpiece of the upgrade. We’d tested the PureStorage system pretty extensively and used it in some secondary databases and seen some amazing improvements. We were excited to see how it would handle our production workload.

Basic Configuration

DB1 was built in early 2011 and was running CentOS 5.6. We’re generally pretty big fans of CentOS, so when building DB2 we went with CentOS 6.3, which was the latest version. We followed our standard installation procedure, which essentially means doing a minimal installation and then installing just what we needed, and of course updating all packages to the latest versions.

We provisioned a big LUN for the Postgres data and 500 GB each for Postgres “temp” and logging. While a separate partition/volume for pg_log/pg_xlog is a common practice, we also created a separate volume for temp as a safeguard against runaway queries that generate lots of temp files due to hash-joins or sorting gigantic result sets. In the past we’ve run into situations where the temporary files consume all available space, and if the sort space is on the same partition as your data you run the risk of crashing the database if Postgres attempts to allocate space and there’s none to be had. Since all the SAN volumes are thin provisioned, there’s no penalty for creating separate volumes, and we get the additional benefit of being able to specify different mount options for each volume. (It’s worth noting that PostgreSQL 9.2 allows you to specify a maximum size for temp space, but since we’re still on 9.0 that option isn’t available to us.) We then created XFS filesystems on each volume. We didn’t bother partitioning the volumes, just created the filesystems on the raw devices. This way, in the event we need to expand a volume, we don’t have to worry about growing the partition, just the LUN (in the SAN) and the filesystem.

The initial cutover

We configured streaming replication between DB1 and DB2 to get DB2 in sync with DB1. On the night of Sunday, August 19th, we cut over from DB1 to DB2 as the primary. All things considered, everything went well and our database was only inaccessible to our servers for a few minutes. (Incidentally, while the database was “down” all our client sites were fine since they are completely decoupled from the database.) After verifying everything was up we ran some test queries and confirmed the performance was awesome, gave ourselves some virtual high-fives and went to bed.

The next day, Monday, started out well. Everything that touched the DB was much faster. Pages that took 3-5 seconds to load were nearly instant. Jobs that took hours to run finished in under 15 minutes. It was as if someone had hit the turbo button on our whole system. Everything was wonderful. Or so we thought.

The problems begin

Around midday Monday we started seeing lots of errors being thrown in our applications with the cause of “too many clients” to the database. Our connection limit has been set at 250 forever, so it seemed strange to encounter problems with the connection limit out of the blue. We had recently done a code release so our first line of thought was that there was a bug in the code. We use Hibernate, and it’s pretty easy to blame Hibernate. The problem was intermittent — every 20-30 minutes we’d see a bunch of SQLExceptions due to “too many connections”, which would last 1-2 minutes, and then it would clear up and everything was fine again. We made it through the rest of the day, and in the evening the problem seemed to have cleared up. We thought maybe it was a transient problem that had resolved itself but we saw a bunch of the same exceptions in some jobs that ran overnight. Our system is pretty resilient though, and most connection attempts automatically retried until they succeeded. The problem was a high priority but since it wasn’t affecting end users we were able to limp along. Tuesday was much the same as Monday and we continued investigating. Wednesday the real trouble started.

The real trouble starts

Our nightly pg_dump, which would normally finish by 8:30 AM, was only halfway complete by 8 AM on Wednesday. We had the same too-many-clients exceptions throughout the night. But around midday we started getting reports from users that our UI was really slow: some pages that had been loading nearly instantly were taking 30-60 seconds to load, or more. This quickly turned the problem into a crisis. We saw the same “too many connections” errors in the frontend logs as we had in the backend. The basic behavior was the same – we’d get a wave of “too many connections” for a couple minutes, then it would clear up for a while, and then the cycle would repeat.

We finally had to admit that maybe the problem wasn’t in the code, but in the database. To mitigate the error, we lowered the active connection limits in our connection pools, which improved the “too many connections” error for users, but led page loads to hang while the connection pool waited for a slot to open. This wasn’t much of an improvement for end users though – rather than failing quickly, their page loads were hanging indefinitely.

A minor breakthrough – and a workaround

Thursday morning we were able to get much closer to the actual problem by running vmstat on the database itself. vmstat shows the amount of memory cached vs free along with IO in/out. As people who’ve run Postgres databases before are probably aware, it’s common to see almost 100% of available memory “used” in top, even when the database doesn’t seem to be doing very much. This is due to caching.

What we saw in vmstat, however, was that during the “problem” periods the “cached” was decreasing while the “free” was increasing while the io in/out dropped to nearly zero. After some thought, we realized that the reason for the “too many connections” errors was that queries that normally finished in a few milliseconds were instead taking 30+ seconds due to the IO blocking we were seeing in vmstat. With queries running much longer than usual, they quickly ran over each other and we hit the limit of 250.

So the root behavior we were seeing that seemed to be causing the problem was:

  • Periodically, something was aggressively attempting to reclaim cached memory.
  • While memory was being reclaimed, IO was almost completely blocked.
  • While IO was blocked, database queries naturally took longer than usual, leading to rapid consumption of available database connections.

Now that we seemed to have a root cause (or more concrete symptom) we were able to formulate some relevant search queries. Simply Googling for linux memory reclaim, one of the results on the first page was from the PostgreSQL performance list and seemed to describe very similar behavior on machines with a lot of cores. DB2 has 16 physical cores, and we had logical cores enabled, for a total of 32. Maybe that’s “a lot.” We found some other similar threads and even this blog post written on the very same day we started Googling (how’s that for coincidence?). We also found references to Linux & NUMA memory-reclaim problems for MySQL and MongoDB. Most posts seemed to indicate there’s a problem in the 2.6.32 kernel line with optimizations related to NUMA memory. If experiencing problems, the suggested remedy was to set /proc/sys/vm/zone_reclaim_mode to 0 if using NUMA. Well, /proc/sys/vm/zone_reclaim_mode was already zero, and we weren’t using NUMA anyway – we were using interleaved memory. But this was enough to plant the seed of doubt in our minds for the 2.6.32 kernel. DB1 had been running for well over a year without issue on a 2.6.18 kernel.

On Thursday, thankfully, we discovered that “manually” clearing the cache caused the IO blockage to clear up immediately, at least until the system had time to fill it back up again and the memory-reclaim process kicked in again. The cache is dropped by executing echo 1 > /proc/sys/vm/drop_caches. Every time queries began to block, dropping the cache immediately cleared it up. Since we now had a stopgap workaround, we just cronned dropping of the cache for every 30 minutes. This let us get through Friday and the weekend without having to babysit the database too much for failed connections, etc.

On Friday we decided the best course of action was to put a QLogic fibre channel HBA in DB1 (with its presumably known-good 2.6.18-series kernel), provision some LUNs on it, set it up as a streaming replica and fail back over to it. This way we’d be on a good server but still have the IO benefits of the SSD. We ordered the card from Newegg Friday and received it Monday.

Tactical Retreat

I installed the card Monday (8/27), configured DB1 as a replica, tested it Tuesday, and Wednesday evening we cut back over to DB1 as our primary. With fingers crossed, we went into Thursday and thankfully everything was fine. While overall performance was a little slower than it had been on DB2 (before it had gotten really bad), it was still blazingly fast compared to how it had been on spinning disk. But most importantly there was no evidence of memory-reclaim problems.

We set DB2 up as a hotstandby and performed some testing on it to try and recreate the problem. Unfortunately we didn’t have a lot of success (possibly due to hostandbys being read-only replicas of the primary) but based on the grumbling about NUMA and memory-reclaim in the 2.6.32 kernel line, we felt the best course of action was to go to an older, proven kernel. Since DB1 seemed to be fine with its 2.6.18-238.9.1.el5 kernel, we opted to do a clean install of CentOS 5.8, running the latest in the CentOS 2.6.18 line: 2.6.18-308.13.1.el5.

DB2 reborn

We installed CentOS 5.8, remounted the existing LUNs, resynced it with the primary and did some more testing. We didn’t notice any problems, but since we weren’t really able to reproduce the problem reliably in the first place that wasn’t much consolation. On Thursday, 9/6, we cut back over to DB2. As I write this, DB2 has been in service for several weeks without any sign of the memory reclamation issue, or, more importantly, the IO/query blocking.


While we’re really happy to have (apparently) resolved the problem, we found the resolution somewhat unsatisfying. We had tested DB2 pretty extensively internally prior to putting it into production the first time, literally running millions of real-world queries against it and never encountered anything like the blocking we saw once it took on the production workload. Prior to making it our primary DB we had used it for several days as a hot-standby read-only reporting server and still never saw an issue. After we cut back from DB2 to DB1 we did yet more testing and still weren’t able to reproduce the query-blocking behavior. While this definitely felt like a problem related to the Linux kernel we didn’t have any hard evidence, and going back to a 2.6.18-series kernel was to some degree a leap of faith. Thankfully, so far it’s worked out for us.

Though we didn’t ultimately figure out what it was in the kernel that caused the problem, we did learn a lot from the experience. Most importantly, we ironed out our database cutover procedure to the point where it’s not really a major event to cutover. Since clearly at some point we’ll want to upgrade from CentOS 5.8, we know that we can always fail back to another database with minimal interruption if a problem like this arises in the future.

Rescan SATA bus (aka hot-adding a SATA disk on a Linux guest in VMware without rebooting)

Linux supports hot-adding disks but whenever I add a new vdisk in VMware the new disk doesn’t show up unless I reboot, which defeats the purpose of hot-add. This command forces a rescan of the bus:

echo "- - -" > /sys/class/scsi_host/host0/scan

dmesg shows the new disk has been found:

  Vendor: VMware    Model: Virtual disk      Rev: 1.0 
  Type:   Direct-Access                      ANSI SCSI revision: 02
 target0:0:2: Beginning Domain Validation
 target0:0:2: Domain Validation skipping write tests
 target0:0:2: Ending Domain Validation
 target0:0:2: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
SCSI device sdd: 1048576000 512-byte hdwr sectors (536871 MB)
sdd: Write Protect is off
sdd: Mode Sense: 03 00 00 00
sdd: cache data unavailable
sdd: assuming drive cache: write through
SCSI device sdd: 1048576000 512-byte hdwr sectors (536871 MB)
sdd: Write Protect is off
sdd: Mode Sense: 03 00 00 00
sdd: cache data unavailable
sdd: assuming drive cache: write through
 sdd: unknown partition table
sd 0:0:2:0: Attached scsi disk sdd
sd 0:0:2:0: Attached scsi generic sg3 type 0

Now, why there’s no “rescan_sata” command is something I can’t fathom, but that’s Linux for you.

64-bit Firefox 6 on CentOS 6 x86_64 (or Fedora) with Flash 11

So I’ve gone back to Linux from Mac, due to the SSD issues I had with my Macbook Pro basically making VMware unusable. A Win7 VM would grind the guest and host to a halt on the 7200 RPM SATA OEM drive, and the SSD wouldn’t work, so I put the SSD back in my HP and installed CentOS 6 x86_64. Not really ideal, but at least it works.

CentOS 6 ships with Firefox 3.6.9, which is really old by now. Fine for a server, but I wanted FF6. I grabbed the Firefox 6.0.1 bz2 from VoxCast, unzipped it and copied it to /usr/lib64/firefox-6. Then I ran yum remove firefox to remove 3.6.9 and avoid any issues. I tried the instructions from my older post on this subject but for whatever reason it didn’t work – I’m guessing because I’m not using the distro’s Firefox RPM.

I followed these instructions to get the Flash 11 64-bit plugin installed. It still wasn’t working though.

I ended up creating a symlink:

ln -s /usr/lib64/mozilla/plugins/ /home/evan/.mozilla/plugins/

When I started up Firefox after creating the symlink, Flash worked.

Additionally, to set Firefox 6 as the default browser, run gnome-default-applications-properties, select Custom, and paste /usr/lib64/firefox-6/firefox %s in the Command: field.

Update: I upgraded my home desktop PC from FC12 to FC14 last night and used the above procedure to install Firefox 6.0.2 on it with Flash 11. So, in case anyone was wondering, the above works for Fedora as well.

MongoDB logrotate script

MongoDB has log rotation functionality built in, but since I run CentOS I like to have everything managed through logrotate. The 10gen RPM I installed doesn’t have a logrotate script, so I wrote one. Create file /etc/logrotate.d/mongod:

/var/log/mongo/mongod.log {
        rotate 30

        /bin/kill -SIGUSR1 `cat /var/lib/mongo/mongod.lock 2> /dev/null` 2> /dev/null || true

logrotate runs at 4 AM daily by default (via the script in /etc/cron.daily/logrotate). The file above rotates the mongod.log file daily, retaining 30 days of files, appending the date to each one after rotation (rather than the “.1” or “.2” suffix) and then gzipping it.

Setting up InspIRCd as an internal IRC server for companywide chat

I got this idea (from this blog post) that it would be cool and great if we had an IRC server to facilitate communication within our team.  I started checking out some different IRC servers to see which ones supported the main features I wanted (LDAP auth & SSL).  I started out with ratbox but I didn’t really like it.  I found InspIRCd and after some config tweaking it’s working well.  Here’s the info from modules.conf for authentication against Active Directory (so people login with their AD usernames):
<module name="">
<ldapauth baserdn="OU=Users,DC=did"
killreason="LDAP auth failed"
Here’s the configure line (since I had to build from source):
./configure --prefix=/usr/local/inspircd --enable-gnutls --uid 101 --enable-extras=m_ldapauth.cpp --enable-extras=m_ldapoper.cpp

I still have some more customization to do to make ChanServ work, but since this is internal I don’t know if that even matters.  Also I can’t seem to get SSL working – I tried compiling with –enable-openssl but that failed, so I went with –enable-gnutls, which worked, but I can’t connect via ircs:// in my client.  If it looks like people are interested in this I may fix it up, but I have a feeling most people won’t be as enthused about it as I am.

HOWTO: Install Flash plugin on Firefox on Fedora / CentOS / RedHat Linux

Updated 8/31/2011 See my newer post on this subject.

I know I’ve been through this before because I’ve setup Fedora before and it’s working. But I just installed FC12 on my desktop at home (since my Seagate drive bricked itself a few days ago and I had to reinstall anyway) and while I finally got the Adobe Flash plugin working, it didn’t have any audio, which made YouTube kind of suck. Anyway, this of course was a solved problem, I just forgot where the fix was located… but duh, it’s on the Fedora website. A simple

yum install flash-plugin nspluginwrapper.x86_64 
    nspluginwrapper.i686 alsa-plugins-pulseaudio.i686 

and everything appears to be fine. Thank you Fedora!

I put Fedora Core 11 on my work laptop

My work laptop, which I got around the end of 2006, was starting to run like crap. Pretty sure it started around the time I put antivirus on it, which reinforces my theory that antivirus is a virus itself. Anyway, being bored with Windows XP and having already played with Windows 7 at home, I decided to install Linux. The last time I tried Linux desktop was around RedHat 7.3, which was a long time ago… pre-Fedora even. I’m pretty happy with CentOS on the server side and was set to try CentOS 5.3 on my laptop when someone suggested I go with FC due to more bleeding-edge driver support. Even though the hardware is pretty old at this point I figured it was worth a shot.

Generally everything works pretty well. Everything pretty much works out of the box – I had dual screens up at native resolution, it even recognized my phone as a camera when I plugged it in to charge it. Best of all, I was able to resize the NTFS partition (rather than blowing it away) so I can still boot back to XP when the need arises.

There are 2 problems I’ve had so far though:

  • the computer now hovers between 76 and 85 degrees Celsius, whereas under Win XP it generally peaked around 68 C. I tried underclocking the CPU down to 250 MHz but the problem persisted.
  • The Java plugin for Firefox … wtf? How can this not work? I downloaded Sun’s JDK and symlinked the libjavaplugin_whatever into /usr/lib64/mozilla/whatever/ and that didn’t work. Then I tried “yum install java-plugin” or something like that and that installed some OpenJava-ish plugin which looked like it was going to work, but when I logged into the Raritan KVM and clicked “Connect” it wouldn’t work. I booted up Win XP in a VM and Firefox with Java works fine. Really annoying and dumb.

I’ll probably try Ubuntu tomorrow as several people suggested it runs cooler than Fedora. If that fails then I guess I’ll go to Win7 after all. I also want an SSD for my work laptop now… my Windows VM seems to spent eternity spinning my crappy 5400 rpm drive. Plox.