Setting user Postgres passwords via MD5

Say you want to create a Postgres account for a user but you don’t want them to have to reset it after logging in, and you don’t want to do the “come type in a password on my computer” routine. Well, here’s one way around this.

  1. Have the user generate an MD5 of their password on their local computer. Postgres uses the username as the salt for the hash, so the command to generate the md5 on a Mac would be (assuming a username of ‘ehoffman’ and a password of ‘abcdefg’):
    [evan@Evans-MacBook-Pro ~] $ echo -n abcdefgehoffman | md5
    95eebfcce27162773a3828689df9d79e
    

    The “-n” is important – without it, the newline gets included in the hash. After they generate their MD5, have them send it to you (along with their username).

  2. Create the user’s account in the database (or ALTER ROLE if it already exists):
    CREATE ROLE ehoffman LOGIN INHERIT ENCRYPTED PASSWORD 'md595eebfcce27162773a3828689df9d79e';
    

    Syntax for an existing account:

    ALTER ROLE ehoffman ENCRYPTED PASSWORD 'md595eebfcce27162773a3828689df9d79e';
    

That’s it. This has the added benefit of the password never being logged in the DB logs or the .psql_history. The main downside is the possibility of user error.

The not-so-secret secret to Postgres performance

I manage a bunch of Postgres DBs and one of the things I almost always forget to do when setting up a new one is set the readahead up from the default of 256. I created this script and run it out of /etc/rc.local and sometimes cron it too. The 3 commands at the top are only really relevant on systems with “huge” memory – probably over 64 GB. We ran into some memory problems with CentOS 6 on a box with 128 GB ram which we ended up working around by reinstalling CentOS 5, but the /sys/kernel/mm/redhat_transparent_hugepage/ options below should fix them in 6.x (though we haven’t actually tried it on that DB, we haven’t seen any problems in other large DBs).

#!/bin/bash

echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag
echo never >/sys/kernel/mm/redhat_transparent_hugepage/defrag

echo 1 > /proc/sys/vm/dirty_background_ratio

BLOCK_DEVICES=`perl -ne 'chomp; my @a=split(/[s]+/); next unless $a[4]; next if ($a[4] =~ /sd[a-z]+[d]+/); print "$a[4]n";' /proc/partitions `

logger -t tune_blockdevs "Block devices matching: $BLOCK_DEVICES"

for DEV in $BLOCK_DEVICES;  do
        logger -t tune_blockdevs "Setting IO params for $DEV"
### Uncomment the below line if using SSD
#        echo "noop" > /sys/block/$DEV/queue/scheduler
        /sbin/blockdev --setra 65536 /dev/$DEV
done

Goodbye, pg_dump

I’ve been a Postgres user and administrator for a while. Over the years, my views on backups have evolved.

Originally, like most people, I started out with good old pg_dump. With a reasonably small database (under 50 GB) dumping to a flat text file is a fine option. I’d generally do something like pg_dump -Upostgres dbname | gzip > dbname.sql.gz to compress it on the fly and save space. For years this seemed perfect: dumping the entire database in a single transaction into a single file that can be restored anywhere.

But as my databases started growing larger and larger, the time it took to do a pg_dump grew as well. At a previous job, the database grew to nearly 2TB and the pg_dump took nearly 18 hours. We’d by that point already changed the pg_dump schedule from daily to weekly and then to three times a month and then finally to semi-monthly. Not only was it slow, but since it operated in a single transaction it wreaked havoc with normal database operation for queries that needed locks on tables locked by the dump.

When we moved the database from a physical RAID to a volume on our SAN, that gave us the opportunity to use LUN snapshotting rather than pg_dump (I just remembered I already wrote about that here). This let us move to a monthly pg_dump and more frequent snapshot-level backups that took up very little space. This was ideal on Compellent since the snapshots would auto-expire after however long you specified.

When I started at Yodle we were doing nightly pg_dumps and pretty soon we ran into the same problems I’d seen at Didit with the dump itself interfering with normal DB operation – the dump would start at midnight and run until 7-8 AM when I started, and after a few months it would still be running at noon. We discussed moving to wal archiving and making a basebackup to NFS but that would require a pretty massive amount of space, and as anybody who uses “enterprise storage” knows, that’s not something you want to do. We discussed building a whitebox file server for backups but nobody was really in love with that option – we’re trying to reduce the reliance on physical machines as much as possible. We talked about pushing it all to S3 but that seemed rather difficult.

When I attended NYC PgDay earlier this year, there was lots of discussion about WAL-E. I hadn’t ever head of WAL-E so I looked it up and was impressed. Basically, WAL-E handles archiving of wal to S3, but first compresses and pgp-encrypts it. It also handles pushing the basebackup to S3, also compressed and pgp-encrypted. This was just what we were looking for. We set it up and, amazingly, it worked perfectly. After a few weeks (and confirming we can restore from the wal-e backups) we moved our pg_dump to weekly, on the weekend when it doesn’t interfere with any user processes. We do a wal-e basebackup every 3-4 days or so and retain 3 of them. We retain all the wal so we can restore the DB to any point within the last ~10 days if needed. The best part is it’s faster than pg_dump, and since the basebackup doesn’t operate in a transaction (it’s a filesystem-level backup rather than an application-level backup) it doesn’t mess with user queries. There’s of course elevated IO during this time but our SAN has more than enough bandwidth.

We setup some basic monitoring of S3 (check the age of the most recent WAL and log it in Zabbix) just to ensure the backups are actually happening, and we’re at the point where we’re discussing moving pg_dump to monthly, or simply not doing it at all. Overall, wal-e has been a huge win for us, enabling better, faster backups that don’t interfere with the DB itself, and, while not free, aren’t ridiculously expensive. And since it’s in its own S3 bucket, you can tweak the bucket settings (e.g. enable RRS) to save money, and Amazon tells you exactly how much your backups cost you.

Problems with PostgreSQL 9.0 on CentOS 6.3 with a 2.6.32-series kernel

Note: this is cross-posted from the Yodle Tech Blog.

We recently migrated our Postgres database to a significantly more powerful machine. While our existing server was doing okay load-wise, we were approaching 90% full on its Postgres volume. We’d also recently acquired an awesome new PureStorage SSD-based SAN and were excited to move our database onto it. For comparison, here are the the specs for the two systems:

Old server New server
Nickname DB1 DB2
CPU 2x 6-core Xeon X5690 CPUs at 3.47GHz 2x 8-core E5-2690 CPUs at 2.90 GHz
Memory 96 GB 1333 MHz 128 GB 1600 MHz
Storage
  • Dell PERC H800 SAS HBA
  • Dell MD1220 drive enclosure
  • 24x 15k RPM SAS drives configured in RAID 10
  • QLogic QLE2562 dual-port 8 Gbit FC HBA
  • PureStorage SSD-based SAN
OS CentOS 5.6 x86_64 CentOS 6.3 x86_64
Kernel 2.6.18-238.9.1.el5 2.6.32-279.2.1.el6.x86_64

As you can probably guess, moving from spinning disk onto SSD was the centerpiece of the upgrade. We’d tested the PureStorage system pretty extensively and used it in some secondary databases and seen some amazing improvements. We were excited to see how it would handle our production workload.

Basic Configuration

DB1 was built in early 2011 and was running CentOS 5.6. We’re generally pretty big fans of CentOS, so when building DB2 we went with CentOS 6.3, which was the latest version. We followed our standard installation procedure, which essentially means doing a minimal installation and then installing just what we needed, and of course updating all packages to the latest versions.

We provisioned a big LUN for the Postgres data and 500 GB each for Postgres “temp” and logging. While a separate partition/volume for pg_log/pg_xlog is a common practice, we also created a separate volume for temp as a safeguard against runaway queries that generate lots of temp files due to hash-joins or sorting gigantic result sets. In the past we’ve run into situations where the temporary files consume all available space, and if the sort space is on the same partition as your data you run the risk of crashing the database if Postgres attempts to allocate space and there’s none to be had. Since all the SAN volumes are thin provisioned, there’s no penalty for creating separate volumes, and we get the additional benefit of being able to specify different mount options for each volume. (It’s worth noting that PostgreSQL 9.2 allows you to specify a maximum size for temp space, but since we’re still on 9.0 that option isn’t available to us.) We then created XFS filesystems on each volume. We didn’t bother partitioning the volumes, just created the filesystems on the raw devices. This way, in the event we need to expand a volume, we don’t have to worry about growing the partition, just the LUN (in the SAN) and the filesystem.

The initial cutover

We configured streaming replication between DB1 and DB2 to get DB2 in sync with DB1. On the night of Sunday, August 19th, we cut over from DB1 to DB2 as the primary. All things considered, everything went well and our database was only inaccessible to our servers for a few minutes. (Incidentally, while the database was “down” all our client sites were fine since they are completely decoupled from the database.) After verifying everything was up we ran some test queries and confirmed the performance was awesome, gave ourselves some virtual high-fives and went to bed.

The next day, Monday, started out well. Everything that touched the DB was much faster. Pages that took 3-5 seconds to load were nearly instant. Jobs that took hours to run finished in under 15 minutes. It was as if someone had hit the turbo button on our whole system. Everything was wonderful. Or so we thought.

The problems begin

Around midday Monday we started seeing lots of errors being thrown in our applications with the cause of “too many clients” to the database. Our connection limit has been set at 250 forever, so it seemed strange to encounter problems with the connection limit out of the blue. We had recently done a code release so our first line of thought was that there was a bug in the code. We use Hibernate, and it’s pretty easy to blame Hibernate. The problem was intermittent — every 20-30 minutes we’d see a bunch of SQLExceptions due to “too many connections”, which would last 1-2 minutes, and then it would clear up and everything was fine again. We made it through the rest of the day, and in the evening the problem seemed to have cleared up. We thought maybe it was a transient problem that had resolved itself but we saw a bunch of the same exceptions in some jobs that ran overnight. Our system is pretty resilient though, and most connection attempts automatically retried until they succeeded. The problem was a high priority but since it wasn’t affecting end users we were able to limp along. Tuesday was much the same as Monday and we continued investigating. Wednesday the real trouble started.

The real trouble starts

Our nightly pg_dump, which would normally finish by 8:30 AM, was only halfway complete by 8 AM on Wednesday. We had the same too-many-clients exceptions throughout the night. But around midday we started getting reports from users that our UI was really slow: some pages that had been loading nearly instantly were taking 30-60 seconds to load, or more. This quickly turned the problem into a crisis. We saw the same “too many connections” errors in the frontend logs as we had in the backend. The basic behavior was the same – we’d get a wave of “too many connections” for a couple minutes, then it would clear up for a while, and then the cycle would repeat.

We finally had to admit that maybe the problem wasn’t in the code, but in the database. To mitigate the error, we lowered the active connection limits in our connection pools, which improved the “too many connections” error for users, but led page loads to hang while the connection pool waited for a slot to open. This wasn’t much of an improvement for end users though – rather than failing quickly, their page loads were hanging indefinitely.

A minor breakthrough – and a workaround

Thursday morning we were able to get much closer to the actual problem by running vmstat on the database itself. vmstat shows the amount of memory cached vs free along with IO in/out. As people who’ve run Postgres databases before are probably aware, it’s common to see almost 100% of available memory “used” in top, even when the database doesn’t seem to be doing very much. This is due to caching.

What we saw in vmstat, however, was that during the “problem” periods the “cached” was decreasing while the “free” was increasing while the io in/out dropped to nearly zero. After some thought, we realized that the reason for the “too many connections” errors was that queries that normally finished in a few milliseconds were instead taking 30+ seconds due to the IO blocking we were seeing in vmstat. With queries running much longer than usual, they quickly ran over each other and we hit the limit of 250.

So the root behavior we were seeing that seemed to be causing the problem was:

  • Periodically, something was aggressively attempting to reclaim cached memory.
  • While memory was being reclaimed, IO was almost completely blocked.
  • While IO was blocked, database queries naturally took longer than usual, leading to rapid consumption of available database connections.

Now that we seemed to have a root cause (or more concrete symptom) we were able to formulate some relevant search queries. Simply Googling for linux memory reclaim, one of the results on the first page was from the PostgreSQL performance list and seemed to describe very similar behavior on machines with a lot of cores. DB2 has 16 physical cores, and we had logical cores enabled, for a total of 32. Maybe that’s “a lot.” We found some other similar threads and even this blog post written on the very same day we started Googling (how’s that for coincidence?). We also found references to Linux & NUMA memory-reclaim problems for MySQL and MongoDB. Most posts seemed to indicate there’s a problem in the 2.6.32 kernel line with optimizations related to NUMA memory. If experiencing problems, the suggested remedy was to set /proc/sys/vm/zone_reclaim_mode to 0 if using NUMA. Well, /proc/sys/vm/zone_reclaim_mode was already zero, and we weren’t using NUMA anyway – we were using interleaved memory. But this was enough to plant the seed of doubt in our minds for the 2.6.32 kernel. DB1 had been running for well over a year without issue on a 2.6.18 kernel.

On Thursday, thankfully, we discovered that “manually” clearing the cache caused the IO blockage to clear up immediately, at least until the system had time to fill it back up again and the memory-reclaim process kicked in again. The cache is dropped by executing echo 1 > /proc/sys/vm/drop_caches. Every time queries began to block, dropping the cache immediately cleared it up. Since we now had a stopgap workaround, we just cronned dropping of the cache for every 30 minutes. This let us get through Friday and the weekend without having to babysit the database too much for failed connections, etc.

On Friday we decided the best course of action was to put a QLogic fibre channel HBA in DB1 (with its presumably known-good 2.6.18-series kernel), provision some LUNs on it, set it up as a streaming replica and fail back over to it. This way we’d be on a good server but still have the IO benefits of the SSD. We ordered the card from Newegg Friday and received it Monday.

Tactical Retreat

I installed the card Monday (8/27), configured DB1 as a replica, tested it Tuesday, and Wednesday evening we cut back over to DB1 as our primary. With fingers crossed, we went into Thursday and thankfully everything was fine. While overall performance was a little slower than it had been on DB2 (before it had gotten really bad), it was still blazingly fast compared to how it had been on spinning disk. But most importantly there was no evidence of memory-reclaim problems.

We set DB2 up as a hotstandby and performed some testing on it to try and recreate the problem. Unfortunately we didn’t have a lot of success (possibly due to hostandbys being read-only replicas of the primary) but based on the grumbling about NUMA and memory-reclaim in the 2.6.32 kernel line, we felt the best course of action was to go to an older, proven kernel. Since DB1 seemed to be fine with its 2.6.18-238.9.1.el5 kernel, we opted to do a clean install of CentOS 5.8, running the latest in the CentOS 2.6.18 line: 2.6.18-308.13.1.el5.

DB2 reborn

We installed CentOS 5.8, remounted the existing LUNs, resynced it with the primary and did some more testing. We didn’t notice any problems, but since we weren’t really able to reproduce the problem reliably in the first place that wasn’t much consolation. On Thursday, 9/6, we cut back over to DB2. As I write this, DB2 has been in service for several weeks without any sign of the memory reclamation issue, or, more importantly, the IO/query blocking.

Conclusions

While we’re really happy to have (apparently) resolved the problem, we found the resolution somewhat unsatisfying. We had tested DB2 pretty extensively internally prior to putting it into production the first time, literally running millions of real-world queries against it and never encountered anything like the blocking we saw once it took on the production workload. Prior to making it our primary DB we had used it for several days as a hot-standby read-only reporting server and still never saw an issue. After we cut back from DB2 to DB1 we did yet more testing and still weren’t able to reproduce the query-blocking behavior. While this definitely felt like a problem related to the Linux kernel we didn’t have any hard evidence, and going back to a 2.6.18-series kernel was to some degree a leap of faith. Thankfully, so far it’s worked out for us.

Though we didn’t ultimately figure out what it was in the kernel that caused the problem, we did learn a lot from the experience. Most importantly, we ironed out our database cutover procedure to the point where it’s not really a major event to cutover. Since clearly at some point we’ll want to upgrade from CentOS 5.8, we know that we can always fail back to another database with minimal interruption if a problem like this arises in the future.

Graphing SSH dictionary attacks with HighCharts

After my 10-year-old basement Linux server died this week from a power outage, I took the sad step of giving up on it. It’s died before and I’ve patched it back together with a new power supply here or an addon PCI SATA card there, but I finally decided to throw in the towel since I had a newer old computer that had been idle for several years. The one that died was an Athlon K7 750 MHz with 512 MB ram. The new one is an Athlon 2 GHz (3200+) with 1 gig. For my uses, specs don’t really matter that much, but it’s nice to have more power for free.

I put CentOS 6 on it and configured Samba and copied all the data off the old machine and was back up and running within a few hours. Since I forward ports through my FiOS router to this box I did my standard lockdown procedure, including adding myself to the AllowUsers in sshd_config. Afterwards I took a look in /var/log/secure and saw the typical flood of dictionary attacks trying to get in as root or bob or tfeldman or jweisz. I have iptables configured to rate-limit SSH connections to 2 per 5 seconds per IP so the box doesn’t get DoSed out of existence, but some stuff does make it through to sshd.

Looking through /var/log/secure, I got to thinking it would be interesting if there was some way to visualize the attacks in a handy graph. Then I remembered, oh, wait, I can do that.

I wrote a perl script to parse out the attacks from /var/log/secure and insert them into a Postgres DB. This turned out to be pretty easy. Then I thought it would be more interesting to tie the IP of each attack to its originating country. I’ve used MaxMind’s GeoIP DB pretty extensively before, but I was looking something free. That’s when I remembered that MaxMind has a free GeoIP DB: GeoLiteCity. I grabbed it and yum-installed the Perl lib and added the geo data to the attack DB. Rather than worry about normalizing the schema I just shoved the info into the same table. Life is easier this way, and it’s just a for-fun project.

So I got that all working and parsed it against the existing /var/log/secures via

[root@lunix2011 ~]# zcat /var/log/secure-20111117.gz | perl parse-secure.pl 

I wrote ssh.php to see what’s in the table:

ssh.php list of hacking attempts
ssh.php list of hacking attempts

So now that the data was all in place, time to move on to the graphs, which is what I really wanted to do. Last time I wanted to graph data programmatically I used JPGraph, which does everything in PHP and is super versatile. But I wanted something… cooler. Maybe something interactive. A little Googling turned up Highcharts which is absolutely awesome, and does everything in JavaScript. I basically modified some of their example charts and pumped my data into them and got the charts below.

Pie chart of attacks grouped by country for the past 30 days:

Pie chart by country
Pie chart by country

Bar graph of attacks per day:

Bar graph of daily attacks
Bar graph of daily attacks

So, that’s that. Code is in github if anyone wants to play around with it. I’ve cronned parse-secure.pl to run every 5 minutes so the data gets updated automatically.

Using WAL archiving & Compellent snapshots for PostgreSQL backups

I seem to have what may be an irrational dislike for differential backups in general. No matter what system I’m backing up, I feel far more confident in doing complete backups than differential ones. The significant exception to this is rsync, but even rsync does “full” backups of the files that have changed. Even so, I usually add the -W flag to rsync so it moves the entire file if it’s been modified. I guess I just like knowing there’s a single file that contains the entire database rather than having to restore a huge file and then replay a bunch of differential files in sequence.

This has worked for a long time, even with our PostgreSQL DB, which I’ve been backing up with good ol’ pg_dump, but it’s gotten to the point that the DB backup takes over 12 hours now, during which performance is seriously degraded. With the recent migration to the new server, overall performance seems significantly better overall, but performance during the nightly pg_dump is much worse. Rather than trying to troubleshoot it, I think it’s time I bit the bullet and move to WAL archiving to enable differential backups, and stop doing a full backup every night.

The PostgreSQL docs are pretty great at explaining how to do this. Basically:

  1. Configure WAL archiving to copy the WAL files to another server as soon as they’re complete.
  2. Issue the pg_start_backup() command.
  3. Perform a filesystem-level backup of the DB (usually /var/lib/pgsql on RedHat/CentOS distros).
  4. Issue the pg_stop_backup() command.

That’s basically it. The backup from step 2 plus the archived WAL files will allow you to recover to any point in time after issuing the pg_stop_backup() command. So if you complete your backup at 2 AM on Sunday and your DB crashes on Wednesday at 6 PM, you can restore to any point between 2 AM Sunday and whatever your most recently archived WAL file is (presumably within a few minutes of the crash at 6 PM Wednesday). If you decided you wanted to restore to 12:01 AM on Wednesday, you can do that using the recovery_target_time setting. My aversion to differential backups aside, this is pretty awesome.

As I said, the Postgres docs do a good job of covering this procedure completely. The only thing I have to add is how I did this using a SAN-level snapshot as my “filesystem backup.” I’d never done any scripting with Compellent before and it turned out to be pretty easy. From the Knowledge Center, under Software download the latest version of Command Utility (I downloaded 5.4.1). The utility itself is just a JAR, you’ll need Java installed to run it.

I considered doing a real filesystem backup of the DB (love that rsync) but the problem was that the DB is currently 1.2 TB and gobbling that much space on the NAS wasn’t very appealing. Compellent replays (snapshots) are just deltas, and I can easily store a week worth of backups for much less than (7 * 1.2 = ) 8.4 TB.

I wrote a crappy bash script to do everything, included below:

I retain the WAL files for 8 days and the snapshots for 7 days, but I may adjust this since the WAL files themselves consume a lot of space – about 30-40 GB per day. Though this is still less than the gzipped pg_dump I had been doing, which was about 85 GB per day.

I’ve cronned the script to run at 2 AM and so far it appears to work. Compellent replays are created almost instantly, so the backup script completed in about 10 seconds, which includes 6 seconds of sleep, which probably aren’t necessary. Considering that the pg_dump method took 12+ hours to complete, 10 seconds is immeasurably better. Well, I guess it is measurable, you just need to divide 12 hours by 10 seconds.

I’m pretty happy with this so far. The improved performance far outweighs any irrational philosophical objection I may have had to differential backups. Buuuut I’m still going to do pg_dumps on Saturdays.

Migrating a PostgreSQL DB to a new machine without doing a dump & restore

Long ago, before I stepped into my role as de facto DBA in my current job, dump & restore of our Postgres database (dumping the entire DB to a text file, formatting the data disk, and restoring the data) was a pretty regular event. This was needed because we didn’t regularly vacuum the DB, which in turn was due to vacuums taking forever (which in the end ended up being due to crappy underlying hardware). We started doing monthly global vacuums (they took so long that monthly was about all we could handle) and we discovered we no longer needed to do dump & restores. When we finally upgraded to Postgres 8.2, which introduced autovacuum, things got even better, since we didn’t have to manage vacuuming via cron jobs.

In the years between then and now our Postgres DB has ballooned to such a size that a D&R isn’t something we could feasibly do in a scheduled maintenance window. The last time I did one was when we bade farewell to the decrepit database in September, 2007. At that point our database consumed 730 GB on disk. Looking through my email archives, we began the dump at 6:30 PM on a Friday night and it completed at 3:48 AM Saturday. The restore started around 9 AM and ran until around 1 PM (I assume it went so much faster than the dump due to the new DB being significantly better hardware-wise). Building indices took until 9:27 PM Saturday. We then ran a global “ANALYZE” to generate DB stats; the analyze ran from 10 PM until 1 AM Sunday. We then had most of Sunday to process the backlog of data that accumulated since Friday afternoon when we took the database offline.

So, with a 730 GB DB, the entire procedure took 9 hours (dump) + 4 hours (restore) + 8 hours (index rebuild) + 3 hours (analyze), so about 24 hours.

However, as I said, in the years since then our database has grown as we house more and more data, currently at about 1220 GB. It might have been possible to do the migration via dump & restore in the scheduled window, but I wasn’t looking forward to it. Instead, I decided to try a different option: copying the data files directly from the old server to the new one. If this worked it would eliminate almost all the overhead and the move would be complete in however long it took to copy the data from one host to the other.

Reasons

We had a couple reasons for doing this upgrade. Performance wasn’t really one of them though; we were pretty confident that the performance of the DB was as good as we were likely to get with platter disks, and our SAN doesn’t currently have SSDs. The old DB has dual Xeon 5160 2-core CPUs @3.0ghz, 32 GB memory, and a RAID 5 OS volume. The database resided on an HP MSA70, with 24x 10krpm 146 GB SAS drives (+1 hot spare) in RAID 10 for a 1.6 TB logical volume. At the time I debated RAID 6 vs RAID 10 but in the end I opted for the performance of RAID 10 vs the capacity of RAID 6 and it worked out well.

But one of the reasons we decided to upgrade was that the drives in the DB were starting to die and the warranty had expired on them, and each disk cost about $300 to replace. That was a pretty big liability and I expected the disks to begin dying more frequently.

Another reason for upgrading was the benefits of having the data on the SAN, especially snapshotting. We’d been doing daily backups for a while but doing a dump of the DB while it’s in use makes it take forever, causing degraded performance while it’s running. Snapshot isn’t a perfect solution, but at least it’s an option.

Another reason for wanting to move the DB to the SAN was for DR purposes; if we setup SAN-SAN replication to another site, with the DB on the SAN, we get that backed up for free.

And probably the biggest reason, we were up to 1.25 TB used out of 1.6, over 80% full. We’d probably be good for another few months, but for me 80% is pretty full.

Prerequisites

In order for this to work, the version of Postgres on both machines has to be of the same minor version. In my case, the source DB (server A) was running 8.2.5 (the newest at the time the box was built), so I built 8.2.18 (source RPM) on the target (server B). Server B is pretty beefy: HP DL360 G7 with dual Xeon X5660 6-core CPUs @2.8 GHz, 96 GB PC3-10600 ECC mem, QLogic QLE-4062 iSCSI HBA connected to a 4TB volume on our Compellent SAN (tier 1, RAID 10 across 32x 15krpm FC disks). Both machines of course need to be of the same architecture, in my case x86_64. It might work across Intel/AMD, but I’m not sure about that; fortunately I didn’t have to worry about that.

Moving the data

When we did dump & restores, we dumped directly to a commonly-mounted NAS, which worked well, since we wouldn’t start the restore until the dump was complete, and we didn’t want to consume disk on the target with a gigantic dump file (in addition to spinning the disks with reading the dumpfile while attempting to write to them; the contention causes everything to go much slower). There wasn’t really any need to use a NAS as an intermediary in this case though, it would just double the amount of time needed to get the data from A to B.

I created an NFS export on B:

/data/pgsql             10.0.0.35(rw,no_root_squash)

And mounted it on A with these options in /etc/fstab:

10.0.0.36:/data/pgsql  /mnt/gannon/nfs nfs     rw,rsize=32768,wsize=32768,nfsvers=3,noatime,udp        0     0

I tried TCP vs UDP mounts and found UDP was faster.

I then copied the data over with my favorite Unix tool, rsync:

time rsync -atp --delete --progress /var/lib/pgsql/data /mnt/gannon/nfs/ > /home/evan/rsync.log

Dry run

In January I did a dry run of the procedure. I had tried copying data over without stopping postgres on A, so as not to cause a service interruption, but it didn’t work; data was changing too rapidly. By the time the rsync completed, the files it had copied over earliest had already been modified again. I ended up scheduling some downtime for a weekend and did the copy. With the above NFS settings I was able to transfer data at around 50 MB/s over our gigabit switches, the whole thing took 3-4 hours. When it came up, everything seemed to be fine. I was pretty happy, because 3-4 hours is a whole lot better than 24+.

Day of Reckoning

I finally did the real migration this past weekend. I started it at 8 PM and (after an rsync snafu) completed it around 4 AM. The snafu was caused by my use of the “--delay-updates” flag, which I later learned copies modified files to a /.~tmp~/ directory, and when all of them are copied, moves them into place in an attempt to make it a more “atomic” operation. I didn’t realize this was what was happening, and I got a little freaked out when I saw the disk usage for the target growing 100 GB larger than the source. I cancelled the rsync and ran it again, stupidly dropping the –delay-updates flag, which with the –delete flag caused it to delete all the stuff in .~tmp~ that it had already copied over. It deleted like 300 GB of stuff before I freaked out and cancelled it again. I cursed myself a few times, then manually moved the contents of .~tmp~ up to the parent to salvage what had already been transferred, and ran the rsync once again to move the remaining data. So it probably would have been done much sooner had I not cancelled it and then deleted a bunch of my progress.

You may notice that the rsync flags above don’t include -z. With huge binary files being transferred over a fast LAN I don’t think there’s much reason to use -z, in fact when I added -z the throughput plummeted.

After copying the data, I moved the virtual IP of the DB to the new machine, moved the cron jobs over, started postgres on B and everything worked. I finished all of my cleanup around 6 AM Saturday, though like I said, I would have been done much sooner had I not deleted a bunch of my progress. Even still, this is a lot better than the dump & restore scenario, and has the added benefit of being reversible. I’m planning to upgrade postgres on A to 8.2.18 and leave it as a standby server; if a problem arises with B, the data can be moved back relatively quickly.

Conclusion

Well, I don’t have any great insight to put here, but so far this has worked out. My next DB project is upgrading from 8.2 to either 8.4 or the 9.x series, but that’s going to require a lot more planning since client drivers will likely need to be updated, and I’m not sure if queries might need to be altered.

The end (I hope).

PostgreSQL query to determine the largest tables in the database

This is just a handy query I use from time to time to see which tables are growing madly, and how much of the growth is index bloat.

select
schemaname,
tablename,
pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) as size_p,
pg_total_relation_size(schemaname || '.' || tablename) as siz,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) as total_size_p,
pg_total_relation_size(schemaname || '.' || tablename) - pg_relation_size(schemaname || '.' || tablename) as index_size,
(100*(pg_total_relation_size(schemaname || '.' || tablename) - pg_relation_size(schemaname || '.' || tablename)))/case when pg_total_relation_size(schemaname || '.' || tablename) = 0 then 1 else pg_total_relation_size(schemaname || '.' || tablename) end || '%' as index_pct
from pg_tables
order by siz desc limit 20;

This returns schema, table, size on disk (in human-readable and byte format – for sorting), and total size on disk including indices, and the percentage of the total size that comprises the indexes.

Sample result from our OpenFire Jabber server:

 schemaname |      tablename       |   size_p   |    siz    | total_size_p | index_size | index_pct
------------+----------------------+------------+-----------+--------------+------------+-----------
 public     | ofconversation       | 71 MB      | 159236096 | 152 MB       |   84623360 | 53%
 public     | ofconparticipant     | 38 MB      |  95395840 | 91 MB        |   55394304 | 58%
 public     | ofpresence           | 10184 kB   |  10452992 | 10208 kB     |      24576 | 0%
 public     | ofpubsubitem         | 5336 kB    |   7733248 | 7552 kB      |    2269184 | 29%
 public     | ofid                 | 4008 kB    |   4120576 | 4024 kB      |      16384 | 0%
 public     | ofpubsubsubscription | 912 kB     |   1458176 | 1424 kB      |     524288 | 35%
 public     | ofpubsubaffiliation  | 728 kB     |   1196032 | 1168 kB      |     450560 | 37%
 public     | ofoffline            | 832 kB     |    942080 | 920 kB       |      90112 | 9%
 pg_catalog | pg_depend            | 320 kB     |    794624 | 776 kB       |     466944 | 58%
 pg_catalog | pg_attribute         | 384 kB     |    761856 | 744 kB       |     368640 | 48%
 pg_catalog | pg_proc              | 376 kB     |    753664 | 736 kB       |     368640 | 48%
 pg_catalog | pg_rewrite           | 72 kB      |    270336 | 264 kB       |     196608 | 72%
 pg_catalog | pg_description       | 136 kB     |    245760 | 240 kB       |     106496 | 43%
 pg_catalog | pg_operator          | 112 kB     |    237568 | 232 kB       |     122880 | 51%
 pg_catalog | pg_class             | 88 kB      |    212992 | 208 kB       |     122880 | 57%
 pg_catalog | pg_type              | 64 kB      |    155648 | 152 kB       |      90112 | 57%
 pg_catalog | pg_statistic         | 72 kB      |     98304 | 96 kB        |      24576 | 25%
 pg_catalog | pg_amop              | 24 kB      |     90112 | 88 kB        |      65536 | 72%
 pg_catalog | pg_conversion        | 16 kB      |     90112 | 88 kB        |      73728 | 81%
 pg_catalog | pg_constraint        | 8192 bytes |     81920 | 80 kB        |      73728 | 90%
(20 rows)

Time: 8.917 ms

Do I still need swap space?

About three years ago I replaced our primary database. For years we’d been plagued by awful performance in the database and we were never able to diagnose the problem. The original server was a real beast at the time: 8 Opterons (single core), 32 gigs ram, and a fibre channel RAID connected via a QLogic HBA. This was back in 2005, so those specs don’t probably sound that impressive today, but this was a crazy configuration (with a crazy price tag to match). On paper it looked like this server should be basically invincible but the performance was awful, slowing down every process within the company. We contacted a few different companies (including CommandPrompt, which employs several of the core Pg devs) to see if they could assist us in diagnosing the problems but tuning only helped to a point. There was just something wrong with the box, maybe having to do with the FC HBA itself (which nobody knew much about).
Continue reading “Do I still need swap space?”