Compellent “future proof”? Not so much.

So, I’ve written about Compellent a few times from a price perspective, mostly on the disk side. I was recently contacted by our vendor with quotes for two new Compellent controllers. “What’s this all about?” I asked. “Why don’t we have a call with Compellent to discuss?” he replied. I rolled my eyes a little but figured it was worth hearing them out, since our Compellent SAN is at the heart of our infrastructure.

We currently have two controllers set up in failover mode. The first was bought in 2008 and the other in 2010 to add redundancy. Earlier this year we upgraded to the latest software version in preparation for moving our production DB onto the SAN, to give us a nice window before we had to perform another upgrade (which would now risk DB downtime… I like failover, but I don’t trust it enough to keep a DB up during a failover), so I was kind of skeptical about any sort of upgrade to begin with.

On the call, the Compellent reps explained that they’ve dropped Fibre Channel connectivity between the controller and the disk enclosure, and the purpose of the upgrade is to give us SAS. In addition, they no longer sell SATA (!). I asked why we couldn’t simply add SAS cards to our existing controllers and was told that our current controllers are PCI-X, so they can only support up to 3Gb/s SAS, while the new controllers have PCIe and support 6Gb/s. And they want to ensure that we have the best possible performance. Pretty sure someone said the new controllers “have the future built in” to them.

One of the features we really liked about Compellent from the beginning was the fact that it was basically a software solution on top of commodity hardware. They stressed this point repeatedly. “When new technology comes out, we can just add a new card into your existing controller.” I think the example at the time was 10-gig Ethernet, but it seems like the same logic would apply to SAS. I understand that PCI-X doesn’t support 6Gb/s SAS, but it’s a tough pill to swallow that if we want to expand our SAN at all now, on top of whatever the expansion itself costs, we’re going to need to plunk down some serious money to upgrade the controllers, which really seems like a net zero for us. We’re not going to ditch our existing FC enclosures, so we’re going to be limited to 4Gb/s anyway. If they’re only selling SAS, well, that sucks for us, but ok. But why can’t we just throw a $500 PCI-X 3Gb/s card in to expand? So what if we’re not running at peak performance? I doubt that would be our performance bottleneck anyway. Plus, swapping out controllers is a huge operation for us.

I know at some point we’re going to have to bite the bullet and do this upgrade, but it just irks me. On the bright side, I guess, we don’t have to do a “forklift upgrade,” and the disks/enclosures will all still work. But we have a long way to grow before we need to expand, so fortunately I can put this off for a while.

Using WAL archiving & Compellent snapshots for PostgreSQL backups

I seem to have what may be an irrational dislike for differential backups in general. No matter what system I’m backing up, I feel far more confident doing complete backups than differential ones. The significant exception to this is rsync, though even rsync leaves you with a complete copy of every file that has changed – and I usually add the -W flag so it transfers whole files rather than deltas when something has been modified. I guess I just like knowing there’s a single file that contains the entire database rather than having to restore a huge file and then replay a bunch of differential files in sequence.

This has worked for a long time, even with our PostgreSQL DB, which I’ve been backing up with good ol’ pg_dump, but it’s gotten to the point that the DB backup takes over 12 hours now, during which performance is seriously degraded. Since the recent migration to the new server, overall performance seems significantly better, but performance during the nightly pg_dump is much worse. Rather than trying to troubleshoot it, I think it’s time I bit the bullet and moved to WAL archiving to enable differential backups, and stopped doing a full backup every night.

The PostgreSQL docs are pretty great at explaining how to do this. Basically:

  1. Configure WAL archiving to copy the WAL files to another server as soon as they’re complete (a sample config is sketched just after this list).
  2. Issue the pg_start_backup() command.
  3. Perform a filesystem-level backup of the DB (usually /var/lib/pgsql on RedHat/CentOS distros).
  4. Issue the pg_stop_backup() command.
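
For step 1, the relevant knobs live in postgresql.conf. The host and path below are placeholders for wherever you ship your WAL files; on 8.2, setting archive_command is enough to turn archiving on, while newer releases also want archive_mode = on.

# postgresql.conf – placeholder destination; %p is the WAL file's path, %f its name
archive_command = 'rsync -a %p backuphost:/backups/wal/%f'
archive_timeout = 300   # force a WAL segment switch at least every 5 minutes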

That’s basically it. The backup from step 3 plus the archived WAL files will allow you to recover to any point in time after the pg_stop_backup() call completes. So if you complete your backup at 2 AM on Sunday and your DB crashes on Wednesday at 6 PM, you can restore to any point between 2 AM Sunday and whatever your most recently archived WAL file is (presumably within a few minutes of the crash at 6 PM Wednesday). If you decided you wanted to restore to 12:01 AM on Wednesday, you can do that using the recovery_target_time setting. My aversion to differential backups aside, this is pretty awesome.
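
For completeness, a point-in-time restore like that is driven by a recovery.conf placed in the data directory before starting Postgres. Roughly, with the path and timestamp as placeholders:

# recovery.conf – pulls archived WAL files back and stops at the chosen time
restore_command = 'cp /backups/wal/%f "%p"'
recovery_target_time = '2011-06-08 00:01:00'   # e.g. 12:01 AM on a Wednesday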

As I said, the Postgres docs do a good job of covering this procedure completely. The only thing I have to add is how I did this using a SAN-level snapshot as my “filesystem backup.” I’d never done any scripting with Compellent before and it turned out to be pretty easy. From the Knowledge Center, under Software, download the latest version of the Command Utility (I downloaded 5.4.1). The utility itself is just a JAR, so you’ll need Java installed to run it.

I considered doing a real filesystem backup of the DB (love that rsync) but the problem was that the DB is currently 1.2 TB and gobbling that much space on the NAS wasn’t very appealing. Compellent replays (snapshots) are just deltas, and I can easily store a week’s worth of backups for much less than (7 * 1.2 =) 8.4 TB.

I wrote a crappy bash script to do everything.
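
In outline it boils down to this; the hostnames, credentials, volume name, and the exact CompCU replay syntax below are placeholders – the real replay command is in the Command Utility documentation:

#!/bin/bash
# Nightly Postgres backup via a Compellent replay (snapshot).
# Host names, credentials, the volume name, and the CompCU replay syntax are
# placeholders; check the Command Utility docs for the exact command.

PSQL="psql -U postgres"
COMPCU="java -jar /opt/compellent/CompCU.jar -host san.example.com -user Admin -password xxxx"

# Put Postgres into backup mode (forces a checkpoint and marks the backup start)
$PSQL -c "SELECT pg_start_backup('nightly');"
sleep 3

# Snapshot (replay) the Postgres data volume on the SAN, expiring after 7 days
$COMPCU -c "replay create -volume 'pgdata' -expire 10080"
sleep 3

# Take Postgres out of backup mode; it archives the final WAL segment
$PSQL -c "SELECT pg_stop_backup();"

# Prune archived WAL files older than 8 days on the archive host
ssh backuphost "find /backups/wal -type f -mtime +8 -delete"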

I retain the WAL files for 8 days and the snapshots for 7 days, but I may adjust this since the WAL files themselves consume a lot of space – about 30-40 GB per day – though that’s still less than the gzipped pg_dump I had been doing, which was about 85 GB per day.

I’ve cronned the script to run at 2 AM and so far it appears to work. Compellent replays are created almost instantly, so the backup script completes in about 10 seconds, 6 of which are sleeps that probably aren’t even necessary. Considering that the pg_dump method took 12+ hours to complete, 10 seconds is immeasurably better. Well, I guess it is measurable, you just need to divide 12 hours by 10 seconds.
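
The cron entry itself is nothing fancy – something like this, with the script path being wherever you saved it:

0 2 * * * /usr/local/bin/pg_san_backup.sh >> /var/log/pg_san_backup.log 2>&1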

I’m pretty happy with this so far. The improved performance far outweighs any irrational philosophical objection I may have had to differential backups. Buuuut I’m still going to do pg_dumps on Saturdays.

Migrating a PostgreSQL DB to a new machine without doing a dump & restore

Long ago, before I stepped into my role as de facto DBA in my current job, dump & restore of our Postgres database (dumping the entire DB to a text file, formatting the data disk, and restoring the data) was a pretty regular event. This was needed because we didn’t regularly vacuum the DB, which in turn was because vacuums took forever (which ultimately turned out to be due to crappy underlying hardware). We started doing monthly global vacuums (they took so long that monthly was about all we could handle) and discovered we no longer needed to do dump & restores. When we finally upgraded to Postgres 8.2, with its built-in autovacuum, things got even better, since we didn’t have to manage vacuuming via cron jobs.

In the years between then and now our Postgres DB has ballooned to such a size that a D&R isn’t something we could feasibly do in a scheduled maintenance window. The last time I did one was when we bade farewell to the decrepit database in September, 2007. At that point our database consumed 730 GB on disk. Looking through my email archives, we began the dump at 6:30 PM on a Friday night and it completed at 3:48 AM Saturday. The restore started around 9 AM and ran until around 1 PM (I assume it went so much faster than the dump due to the new DB being significantly better hardware-wise). Building indices took until 9:27 PM Saturday. We then ran a global “ANALYZE” to generate DB stats; the analyze ran from 10 PM until 1 AM Sunday. We then had most of Sunday to process the backlog of data that accumulated since Friday afternoon when we took the database offline.

So, with a 730 GB DB, the entire procedure took 9 hours (dump) + 4 hours (restore) + 8 hours (index rebuild) + 3 hours (analyze), so about 24 hours.

However, as I said, in the years since then our database has grown as we house more and more data, currently at about 1220 GB. It might have been possible to do the migration via dump & restore in the scheduled window, but I wasn’t looking forward to it. Instead, I decided to try a different option: copying the data files directly from the old server to the new one. If this worked it would eliminate almost all the overhead and the move would be complete in however long it took to copy the data from one host to the other.

Reasons

We had a couple reasons for doing this upgrade. Performance wasn’t really one of them though; we were pretty confident that the performance of the DB was as good as we were likely to get with platter disks, and our SAN doesn’t currently have SSDs. The old DB had dual Xeon 5160 2-core CPUs @ 3.0 GHz, 32 GB memory, and a RAID 5 OS volume. The database resided on an HP MSA70 with 24x 10krpm 146 GB SAS drives (+1 hot spare) in RAID 10 for a 1.6 TB logical volume. At the time I debated RAID 6 vs RAID 10, but in the end I opted for the performance of RAID 10 over the capacity of RAID 6 and it worked out well.

But one of the reasons we decided to upgrade was that the drives in the DB were starting to die and the warranty had expired on them, and each disk cost about $300 to replace. That was a pretty big liability and I expected the disks to begin dying more frequently.

Another reason for upgrading was the benefit of having the data on the SAN, especially snapshotting. We’d been doing daily backups for a while, but dumping the DB while it’s in use takes forever and degrades performance the whole time it’s running. Snapshotting isn’t a perfect solution, but at least it’s an option.

Another reason for wanting to move the DB to the SAN was for DR purposes; if we set up SAN-to-SAN replication to another site, with the DB on the SAN, we get that backed up for free.

And probably the biggest reason: we were up to 1.25 TB used out of 1.6, pushing 80% full. We’d probably be good for another few months, but for me 80% is pretty full.

Prerequisites

In order for this to work, the version of Postgres on both machines has to be the same major version (minor releases within it can differ). In my case, the source DB (server A) was running 8.2.5 (the newest at the time the box was built), so I built 8.2.18 (source RPM) on the target (server B). Server B is pretty beefy: HP DL360 G7 with dual Xeon X5660 6-core CPUs @ 2.8 GHz, 96 GB PC3-10600 ECC memory, and a QLogic QLE-4062 iSCSI HBA connected to a 4 TB volume on our Compellent SAN (tier 1, RAID 10 across 32x 15krpm FC disks). Both machines of course need to be of the same architecture, in my case x86_64. It might work across Intel/AMD, but I’m not sure; fortunately I didn’t have to worry about that.
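
A quick sanity check that the versions and architectures line up (the output shown is roughly what to expect, not copied verbatim):

# run on each server; the version string also shows the architecture
psql -c "SELECT version();"
#  A: PostgreSQL 8.2.5 on x86_64-redhat-linux-gnu, compiled by GCC ...
#  B: PostgreSQL 8.2.18 on x86_64-redhat-linux-gnu, compiled by GCC ...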

Moving the data

When we did dump & restores, we dumped directly to a commonly-mounted NAS, which worked well, since we wouldn’t start the restore until the dump was complete, and we didn’t want to consume disk on the target with a gigantic dump file (in addition to spinning the disks with reading the dumpfile while attempting to write to them; the contention causes everything to go much slower). There wasn’t really any need to use a NAS as an intermediary in this case though, it would just double the amount of time needed to get the data from A to B.

I created an NFS export on B:

/data/pgsql             10.0.0.35(rw,no_root_squash)

And mounted it on A with these options in /etc/fstab:

10.0.0.36:/data/pgsql  /mnt/gannon/nfs nfs     rw,rsize=32768,wsize=32768,nfsvers=3,noatime,udp        0     0

I tried TCP vs UDP mounts and found UDP was faster.
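
With the export and the fstab entry in place, activating them is just a couple of commands:

exportfs -ra            # on B: (re)export everything in /etc/exports
mount /mnt/gannon/nfs   # on A: mount using the fstab entry above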

I then copied the data over with my favorite Unix tool, rsync:

time rsync -atp --delete --progress /var/lib/pgsql/data /mnt/gannon/nfs/ > /home/evan/rsync.log

Dry run

In January I did a dry run of the procedure. I had tried copying data over without stopping postgres on A, so as not to cause a service interruption, but it didn’t work; data was changing too rapidly. By the time the rsync completed, the files it had copied over earliest had already been modified again. I ended up scheduling some downtime for a weekend and did the copy. With the above NFS settings I was able to transfer data at around 50 MB/s over our gigabit switches; the whole thing took 3-4 hours. When it came up, everything seemed to be fine. I was pretty happy, because 3-4 hours is a whole lot better than 24+.

Day of Reckoning

I finally did the real migration this past weekend. I started it at 8 PM and (after an rsync snafu) completed it around 4 AM. The snafu was caused by my use of the “--delay-updates” flag, which I later learned copies modified files to a .~tmp~/ directory and, when all of them are copied, moves them into place in an attempt to make the operation more “atomic.” I didn’t realize this was what was happening, and I got a little freaked out when I saw the disk usage for the target growing 100 GB larger than the source. I cancelled the rsync and ran it again, stupidly dropping the --delay-updates flag, which combined with the --delete flag caused it to delete all the stuff in .~tmp~ that it had already copied over. It deleted like 300 GB of stuff before I freaked out and cancelled it again. I cursed myself a few times, then manually moved the contents of .~tmp~ up to the parent to salvage what had already been transferred, and ran the rsync once again to move the remaining data. So it probably would have been done much sooner had I not cancelled it and then deleted a bunch of my progress.

You may notice that the rsync flags above don’t include -z. With huge binary files being transferred over a fast LAN I don’t think there’s much reason to use -z; in fact, when I added it the throughput plummeted.

After copying the data, I moved the virtual IP of the DB to the new machine, moved the cron jobs over, started postgres on B and everything worked. I finished all of my cleanup around 6 AM Saturday, though like I said, I would have been done much sooner had I not deleted a bunch of my progress. Even still, this is a lot better than the dump & restore scenario, and has the added benefit of being reversible. I’m planning to upgrade postgres on A to 8.2.18 and leave it as a standby server; if a problem arises with B, the data can be moved back relatively quickly.
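
For what it’s worth, if the virtual IP is nothing more than an address alias on the NIC (as opposed to something managed by a clustering tool), moving it is basically this, with the address and interface as placeholders:

# on A: drop the address
ip addr del 10.0.0.40/24 dev eth0
# on B: bring it up and send a few gratuitous ARPs so switches learn its new home
ip addr add 10.0.0.40/24 dev eth0
arping -U -c 3 -I eth0 10.0.0.40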

Conclusion

Well, I don’t have any great insight to put here, but so far this has worked out. My next DB project is upgrading from 8.2 to either 8.4 or the 9.x series, but that’s going to require a lot more planning, since client drivers will likely need to be updated and I’m not sure whether queries will need to be altered.

The end (I hope).

Do I still need swap space?

About three years ago I replaced our primary database. For years we’d been plagued by awful performance in the database and we were never able to diagnose the problem. The original server was a real beast at the time: 8 Opterons (single core), 32 gigs of RAM, and a Fibre Channel RAID connected via a QLogic HBA. This was back in 2005, so those specs probably don’t sound that impressive today, but this was a crazy configuration (with a crazy price tag to match). On paper it looked like this server should be basically invincible, but the performance was awful, slowing down every process within the company. We contacted a few different companies (including CommandPrompt, which employs several of the core Pg devs) to see if they could assist us in diagnosing the problems, but tuning only helped to a point. There was just something wrong with the box, maybe having to do with the FC HBA itself (which nobody knew much about).

Compellent Doesn’t Suck

I noticed a bunch of people landing on this site by searching for “compellent sucks.” I just want to avoid any confusion: Compellent doesn’t suck. Now that the pain of spending the money to expand our Compellent SAN is in the past, I am back to being in love with the product. The only gripe I’ve really ever had with Compellent is the price, and as Ben Franklin said:

The bitterness of poor quality remains long after the sweetness of low price is forgotten.

The VMware datastore LUN 2TB (well, 1.99999 TB) size limit

I started migrating our physical machines to VMs using VMware a few years ago and the first problem I ran into is still the most annoying one: the size limit for LUNs is, per VMware’s docs, (2 TB – 512 B). That’s 512 bytes shy of 2 TB, so basically 1.99999 TB, or 2047.999 GB. So when I create a new LUN for a datastore in the SAN, the max size is 2047 GB. Now, as the VMware KB article states, this is a limitation of SCSI, not VMware per se, but that doesn’t make it any less annoying. When I first set up ESX, I created a 5 TB LUN for the datastore. It showed up in vCenter as 1 TB. After some Googling I learned of the 2 TB limit — the usable space is basically (size of LUN) % 2 TB, where % is the modulo operator (which is how 5 TB shows up as 1 TB) — and found something suggesting using extents to expand the datastore across LUNs. I did that, but I later learned that there seems to be a consensus that extents should be avoided.

There are other things I learned along the way – that you want to limit the number of vmdks per datastore anyway, for example, due to the risk of SCSI reservation errors and IO contention, but these are all things that it feels like we shouldn’t have to worry about. I can see having separate LUNs/datastores for different logical groupings of disks, allowing you to have different snapshot schedules for each datastore, or allowing you to put an entire datastore in Tier 1 or Tier 3 (to use Compellent parlance) based on its value to you. But having to segregate stuff for technical reasons seems like a problem that should already be solved.

And maybe it is… I’ve never tried NFS datastores, but if I created an 8 TB LUN, mapped it to a physical box (skirting the 2 TB limit imposed by VMware), exported the volume from that host over NFS, and used that as the datastore, I guess I’d be able to do all the things I want. Hmm. I’ll have to think about that. I guess I’d still keep the ESX hosts’ local LUNs on iSCSI so they could boot from SAN, though I suppose when we move to ESXi that won’t be much of an issue anyway.

Hmm… Well, I started writing this as a rant but I think I just morphed it into a new research project.

The bright side of Compellent

Since I was bemoaning Compellent’s pricing recently I figured it would be unfair of me not to highlight the upside. Their tagline is (or was when we purchased it) “The only SAN so sophisticated it’s simple.” While I can’t say whether they’re the ONLY one, the idea is definitely true. This is the first SAN I’ve ever used, and aside from the learning curve for iSCSI itself (targets, spinup delay, etc.) it’s totally simple and intuitive. Create LUNs, map them to servers. Don’t worry about things like RAID levels or hot disks. We’re into our second year with Compellent and it’s definitely lived up to its promise of simplicity.

I don’t know how much management the average SAN requires but our sales rep recently asked me how much time per week we spend managing the SAN. I crinkled my brow, because I don’t really spend any time managing the SAN. I’ve logged in to the web interface a lot more over the past few weeks than I normally do because the SAN filled up quickly due to our experimentation with Hadoop, and I wanted to make sure we didn’t get to 100% before I was able to order more disks. But aside from that incident I think the only times I’ve logged in to the management console have been to add a LUN or map a datastore to a new ESX host.

I was reminded about this simplicity when we finally added the disks last week. We went to the datacenter Wednesday to move some servers around in the racks to ensure there would be enough power in the SAN rack for the new enclosure (16x 2TB disks). We also applied a firmware update to the SAN (required so it would recognize the new 2TB drives). We have redundant controllers, so we were told there shouldn’t be any downtime. I don’t tend to trust those types of statements – if someone says something will be down for an hour I budget for 4. If it’s 8 hours I budget for a day. If it’s zero I just think they’re lying and it’s going to explode and kill people.

So all things considered I was rather impressed. We have dual controllers, so the update was installed on one controller first, and that controller rebooted. When it rebooted, the iSCSI traffic did actually fail over properly to the secondary controller. This wasn’t completely flawless – the console on some of the machines showed some iSCSI errors, but the machines seemed to be working fine (I rebooted them just to be safe). A couple of the VMs (whose data/swap drives are all on the SAN) barfed and had to be rebooted – I think our Jabber server was the only casualty, but that was back up in under a minute. When the second controller updated itself, its traffic failed back over to the first one. When it was all done (took about 30 mins total) there was a warning about the ports being unbalanced, which was rectified by clicking the “rebalance ports” button. So all in all, I’d say there was “pretty much” no downtime. After the update, we racked the new enclosure and called it a day.

This week a tech from Compellent came onsite to do the actual install for the enclosure (hooking up the fibre loops and installing the new license). This was really zero downtime. I got some alerts that one of the loops was down, but it didn’t affect anything. Pop the disks in, wire it up, install license, and we’ve got another 32 TB usable space. It’s been over a day and the data is in the process of moving from our tier 1 (32x 15krpm FC disks) down to tier 3 (SATA). All in all it was a pretty painless procedure. Sure, it would have been easier had we not had to do the firmware update, but I guess when a new type of drive is introduced that’s to be expected.

So in conclusion, I guess this just reinforces my theory that the only bad thing about Compellent is the price. And if that’s the worst thing someone can say about your product, that’s probably a pretty good place to be.

The SAN Scam

It’s time to buy some more disks for the SAN we have at work. The SAN is made by Compellent and we’ve had it for a year and it’s been great. One of the selling points was the ability to add disks however we wanted – one at a time is possible, which apparently isn’t the case with other SAN products. The one we looked at from LeftHand expanded by purchasing entire nodes, so the incremental cost was pretty high. Compellent seemed to have a higher initial cost but cheaper incrementally.

Well, that wasn’t really the case, as I’ve come to discover. The way they license features on the SAN requires “expansion licenses” for each set of 8 disks you add on. As it happens, I would like to add 8 SATA disks to our SAN, bumping us into a license expansion. The net result of this is that purchasing these disks costs over $16,000.

If that sounds like a lot of money, well, it is. I expected some markup for enterprise-class hardware, but this is ridiculous. A quick search on Newegg shows that hard drives are readily available at about $0.09 – $0.10 per gigabyte, and even Seagate drives are only around $0.14 per gig. At the price I was quoted for the Compellent drives, the cost comes to over $2.00 per gig! The markup is over 1500%, and that’s not even factoring in the discount they likely get for buying disks in bulk – I doubt they pay retail. They claim this is due to the disks being “certified” but I don’t imagine they’re opening up each disk and checking its platters. They probably just make sure the firmware is correct and then ship it out. Their quote also includes 1 year of support on the disks, with 4-hour on-site replacement, but still, as someone who’s basically “cheap,” this just pisses me off.

Now, in Compellent’s defense, their product is amazing, and I would wholeheartedly recommend it to anyone with the need for it and the means to get it, but it is very pricey, more so than I was led to believe. The fact that I rarely have to think about the SAN probably means it’s money well spent, but as I said, I’m a cheap bastard, so this bothers me.