About three years ago I replaced our primary database. For years we’d been plagued by awful database performance and we were never able to diagnose the problem. The original server was a real beast at the time: 8 Opterons (single core), 32 gigs of RAM, and a fibre channel RAID connected via a QLogic HBA. This was back in 2005, so those specs probably don’t sound that impressive today, but this was a crazy configuration (with a crazy price tag to match). On paper it looked like this server should be basically invincible, but the performance was awful, slowing down every process within the company. We contacted a few different companies (including CommandPrompt, which employs several of the core Pg devs) to see if they could assist us in diagnosing the problem, but tuning only helped to a point. There was just something wrong with the box, maybe having to do with the FC HBA itself (which nobody knew much about).
When the database became my responsibility I decided it was time to bite the bullet and just replace the bad machine rather than trying to debug it while it was still in service (which was always a fun exercise). After some research I ended up going with a dual-Xeon 5160 (dual-core Woodcrest) @ 3.0 GHz with 32 GB memory. For storage I decided to get away from fibre channel, since that wasn’t working out too well for us. For the database volume, I went with a HP MSA70 SAS array with 25x 146GB 10K RPM drives. I played around with RAID 5 and RAID 6 but ended up going with RAID 10 since space wasn’t as much of a concern as raw speed. So we have 24 spindles in RAID 10 + 1 hot spare.
When we put the new DB into service in September 2007 it was like someone hit the turbo button on all of our processes. Despite halving the number of CPU cores and keeping the same amount of RAM, base performance improved by roughly 5x, with several bottlenecks apparently eliminated completely. Some bad queries went from minutes to seconds. The improvement was so drastic that we were finally able to do nightly pg_dump backups of the database (performance on the 8-CPU DB had been so bad that a pg_dump would take 4-5 days, so we could only do them weekly at best). We were also able to do weekly VACUUMs and REINDEXes, which helped keep performance well above what we were used to while keeping disk bloat down (a REINDEX once a month seems to shave 20-30 GB off our consumed disk space).
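For what it’s worth, the routine itself is nothing fancy. Here’s a minimal sketch of the kind of script cron kicks off, with a made-up database name and backup path rather than our real ones:

```python
#!/usr/bin/env python
# Sketch of the nightly backup / weekly maintenance described above.
# "maindb" and /backups are placeholders, not our actual setup.
import datetime
import subprocess

DBNAME = "maindb"        # hypothetical database name
BACKUP_DIR = "/backups"  # hypothetical backup location

def nightly_backup():
    stamp = datetime.date.today().isoformat()
    # pg_dump in custom format, so selective restores are possible later
    subprocess.check_call([
        "pg_dump", "-Fc",
        "-f", "%s/%s-%s.dump" % (BACKUP_DIR, DBNAME, stamp),
        DBNAME,
    ])

def weekly_maintenance():
    # VACUUM ANALYZE everything, then rebuild the indexes to fight bloat
    subprocess.check_call(["vacuumdb", "--analyze", DBNAME])
    subprocess.check_call(["reindexdb", DBNAME])

if __name__ == "__main__":
    nightly_backup()
    if datetime.date.today().weekday() == 6:  # Sundays only
        weekly_maintenance()
```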
So now it’s 2010. The database is still working fine; we’ve had some disk failures over the years, but each time the hot spare took over automatically and we returned the bad disk to HP for a free replacement under warranty (the array holds 25 disks, but I purchased 26 so we could always drop a new one in and never be down a disk for more than a few hours). But now the entire thing is out of warranty, and technology has improved. We’ve moved almost all of our systems to our Compellent SAN, but this core system is still running on its own RAID.
Moving the DB onto the SAN would have several advantages:
- Leveraging the faster storage within the SAN – our tier 1 in the SAN currently has 30x 15kRPM drives (plus two hot spares), as opposed to 24x 10kRPM. If we add SSD to the SAN, the DB would be able to take advantage of this “for free.”
- Snapshots/Replays. Currently our only real method of recovery is from the previous night’s full backup, and restoring takes > 24 hours. We haven’t needed to do this in my memory, but it would be nice to have the ability if needed.
- Disaster Recovery. When we ultimately implement SAN-SAN replication, the DB data on the SAN will be replicated “for free” to the remote site.
The main risk I’ve been thinking about has been the possible bottlenecking of iSCSI for getting data in and out of the SAN. We have gigabit Ethernet, and the SAN traffic is segregated from regular traffic, but gig-E is still gig-E. After reviewing some iostat numbers, though, I don’t think it will be an issue.
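To put a rough number on it: 1 Gbit/s works out to 125 MB/s raw (1000 ÷ 8), and realistically more like 100-115 MB/s of usable throughput once Ethernet, TCP/IP, and iSCSI overhead are subtracted. That’s the ceiling the sustained throughput in iostat has to stay under, and ours does.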
The new hardware for the database itself is pretty incredible, even compared to what we’ve been purchasing over the past couple of years for our ESX hosts: an HP DL360 G7 with dual Xeon X5660 (6-core (!) Westmere) CPUs @ 2.8 GHz, 96 GB of PC3-10600 SDRAM, and a QLogic QLE-4062 iSCSI HBA. This is actually about 1/3 the cost of the previous database server (since there’s no RAID array this time) despite having triple the memory and triple the CPU cores. I was originally planning to get the machine with 144 GB, but my sales rep reminded me that if you populate more than 6 DIMMs per CPU, the memory gets downclocked. I opted for 96 GB of faster memory rather than 144 GB of slower memory.
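To spell out the arithmetic (assuming 8 GB DIMMs, which is how these configs are typically built): each Westmere CPU has three memory channels, so 96 GB is 2 CPUs × 6 DIMMs × 8 GB, or two DIMMs per channel, which keeps the PC3-10600 modules at their rated 1333 MHz. 144 GB would mean 9 DIMMs per CPU, i.e. three per channel, and that third DIMM per channel forces the whole bank down to a lower clock (800 MHz, if I’m reading the spec sheet right).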
So that’s a short history (and future) of our primary database. But what prompted me to start writing this post is the question posed in the title: do I still need to configure swap space? With the storage on the SAN, swapping would incur latency that I imagine would be well above that of a local disk. The server will have a local disk (15k SAS) which I intended to use for swap, but now I’m wondering if there’s even a point to having swap at all with 96 GB of physical memory.
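If I do end up keeping a small swap area on that local disk, the setup itself is trivial. A minimal sketch (the device name and the 2 GB size are placeholders, and it obviously has to run as root):

```python
#!/usr/bin/env python
# Sketch of setting up a small swap area on the local 15k SAS disk.
# /dev/sdb1 and the size are placeholders, not the real layout.
import subprocess

SWAP_DEV = "/dev/sdb1"  # hypothetical ~2 GB partition on the local disk

# Format the partition as swap and turn it on
subprocess.check_call(["mkswap", SWAP_DEV])
subprocess.check_call(["swapon", SWAP_DEV])

# With 96 GB of RAM the kernel should rarely need swap, so bias it
# strongly toward keeping pages in memory.
with open("/proc/sys/vm/swappiness", "w") as f:
    f.write("10\n")

# Make both settings survive a reboot (a real script would check for
# existing entries before appending).
with open("/etc/fstab", "a") as f:
    f.write("%s  none  swap  sw  0  0\n" % SWAP_DEV)
with open("/etc/sysctl.conf", "a") as f:
    f.write("vm.swappiness = 10\n")
```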
Evan, a server with that much RAM wouldn’t normally require a pagefile, but without one, in the extreme case where the system runs out of memory the server could become unresponsive or even crash. The pagefile is used for swapping, but also for memory dumps. The best practice we recommend is to store it on a local server disk if you’re going to use it. You can store it on the SAN if you must, but store it on its own volume and don’t worry about replicating the pagefile for disaster recovery or backups. Hope this helps!
Thanks Justin. The server does have a local disk which I added just for swap; I’m hesitant to use it because it’s just a single disk, and I didn’t want to have to worry about local RAID (RAID cards are probably my least favorite piece of hardware, and getting rid of them has been one of my favorite things about moving to the SAN). But I suppose I’ll probably chicken out and create a 1 or 2 GB swap partition on the local disk.
I learned the hard way not to replicate swap space: when we tested Compellent replication, it saturated a 100 Mbit pipe (and still couldn’t move all the data) because we were doing hourly snapshots of all the ESX datastores, which included the VMs’ page files.