Moving an Exchange 2003 server to another location with minimal risk and disruption?

So our Exchange server is located in our office building. This made sense at the time because that’s where the users are. Over time though, this has proved problematic for a few reasons. Primarily, our office is certainly not a datacenter and doesn’t offer the amenities of one – clean, reliable power, and redundant cooling. In an average year we lose power probably 10-15 times, often for an hour or more. The rest of our production environment is hosted in a top-tier datacenter, so after a while I started to wonder why our Exchange server wasn’t there, and started making plans to move it there. Oh, and did I mention I’m not an Exchange admin in any sense of the term? I just inherited the Exchange server about 2 months ago.

The first step was to set up a VPN between the office and the datacenter so that users in the office would be able to connect seamlessly to the Exchange server once it was moved. This turned out to be relatively easy. The next step was basically to move the Exchange server. This originally seemed like it would be an easy thing to do – having a long history with PostgreSQL I figured I could do essentially a “dump and restore” – run some command that would back up the contents of the mail database to a file and then restore it to a new machine. Well, I quickly learned that wasn’t possible, at least not given the factors involved.

Microsoft suggests two ways of moving an Exchange server to new hardware: 1) replacing a machine in-place with another one that takes its name and doing a restore, and 2) setting up the new server “next to” the old one and moving mailboxes over one at a time. I ruled out the first method because it seemed like a total crapshoot with no easy “rollback” mechanism. Plus I had no idea how long it would take to do a restore of our Exchange server – total mailbox size at the time was over 300 GB, and it took about 28 hours just to do the backup, so it seemed like it could easily take over 72 hours, meaning even if we started it Friday at 6 PM, it wouldn’t complete by Monday morning, and people would come in to work to find they had no email. No good.
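The weekend-window reasoning above is easy to sanity-check with a bit of date arithmetic. This is only a rough sketch: the specific dates and the Monday 9 AM start of the workday are assumptions for illustration, and the 72-hour figure is the guess from the paragraph above, not a measured restore time.

```python
from datetime import datetime, timedelta


def restore_finishes_in_time(start, deadline, estimated_hours):
    """Return (fits, finish_time) for a job that starts at `start`."""
    finish = start + timedelta(hours=estimated_hours)
    return finish <= deadline, finish


# Hypothetical dates: any Friday 6 PM and the following Monday would do.
friday_6pm = datetime(2010, 1, 8, 18, 0)
monday_9am = datetime(2010, 1, 11, 9, 0)  # assumed start of the workday

# Size of the maintenance window in hours.
window_hours = (monday_9am - friday_6pm).total_seconds() / 3600

# 72 hours is the rough restore estimate from the text.
fits, finish = restore_finishes_in_time(friday_6pm, monday_9am, 72)

print(f"window: {window_hours:.0f} h")
print(f"72 h restore fits: {fits}, would finish {finish:%A %H:%M}")
```

Under those assumptions the window is shorter than the estimate, which is the same conclusion reached above.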

This left the second option – setting up another server and moving mailboxes one at a time. This seemed pretty simple, except for the fact that people frequently use Outlook Web Access (webmail) to check their mail when out of the office, and ActiveSync to get mail on their phones. We tested the 2-server setup a while back, and while mail gets routed properly and users in the office are able to connect to both Exchange servers without problems, accessing mail from outside the office fails. Say A is the old server (which people use for webmail) and B is the new server: if you log in to webmail (server A) but your mailbox is homed on server B, webmail will issue you a 302 redirect to http://B. If that’s not a valid URL outside your office (as is the case with us) it won’t work. If we could move everybody’s mailboxes from A to B overnight, and then make webmail point to B rather than A, that would solve the problem, but again, we had no way to know how long that would take, and I didn’t want to risk making anyone’s mail unavailable.
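To make the redirect problem concrete, here is a minimal sketch of a check on the Location value from that 302. The function name and the heuristics (single-label host, ".local" suffix, private IP address) are assumptions for illustration; they are a rough guess at "won’t resolve from outside", not a real reachability test.

```python
import ipaddress
from urllib.parse import urlparse


def redirect_target_looks_internal(location):
    """Heuristically guess whether a redirect target is only valid on the LAN."""
    host = urlparse(location).hostname or ""
    try:
        # Literal IP address: private ranges are not reachable from outside.
        return ipaddress.ip_address(host).is_private
    except ValueError:
        pass  # not an IP literal, fall through to the name checks
    # A single-label name such as "B" (or an empty host) has no public DNS entry;
    # ".local" is a common internal-only suffix.
    return "." not in host or host.endswith(".local")


print(redirect_target_looks_internal("http://B"))
print(redirect_target_looks_internal("https://mail.example.com/exchange"))
```

A redirect to a bare server name like http://B is flagged, while a fully qualified public name (mail.example.com is a placeholder) is not.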

The plan I then came up with was to set up an Exchange frontend server in our office in front of our existing Exchange server. The frontend server would handle all the OWA/ActiveSync stuff and abstract that away from the backend server (where the mailboxes live). I could then set up a new Exchange server in our office in a VM, migrate mailboxes over to it one at a time, and when it was done, copy the VM Exchange server to an external USB drive and drive it to the datacenter (about 25 miles away) and import the VM to our VMware production cluster, fiddle with its IP address and voila – the Exchange server would be moved.

But then I had a better idea: set up the frontend server and the new backend server in the datacenter in the VMware cluster from the get-go. Then when people accessed webmail they’d be hitting a server in the datacenter, which would connect to the Exchange server in the office transparently and relay them their mail. I could then move each mailbox from A to B with B being in the datacenter and the move taking place over the VPN.

Well, this is what I ended up doing, and there have been some wrinkles in the process, but so far it’s generally been working as I expected. I moved my mailbox to the new server today, and the move itself went fine – it took about 90 minutes to move my 1.5 GB mailbox. It wasn’t quite a seamless process, though: the mailbox was moved, but I couldn’t send mail to or receive mail from the other server or the Internet in general. I managed to fix outbound SMTP pretty quickly (we relay mail through a smarthost in the datacenter), but inbound wasn’t working because the old server and the new one couldn’t communicate for some reason, and all mail was being delivered to the old server.

Among the things I tried while attempting to solve this were creating a new routing group for the servers in the datacenter (since we only had one Exchange server before, we only had one routing group), and then setting up a Routing Group Connector between the two. This seemed like it should have resolved it, but it didn’t. From server A, I could “telnet B 25” and the connection would succeed, but if I issued a HELO I got “500 5.3.3 Unrecognized command”. The same thing happened if I tried B -> A.

After hours of checking settings, I came across a post on Experts Exchange that suggested the problem may be with the firewall (Cisco ASA) inspecting SMTP traffic. This was something that had flitted around in my head for a couple of seconds but I didn’t actually check it. In the end though, that’s what it was – the ASA in the datacenter was mangling the SMTP packets somehow and preventing the two servers from communicating. Once I issued the “no inspect esmtp” line on the ASA, the whole day’s worth of mail came flooding through to my inbox (now on server B).
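The manual “telnet B 25” plus HELO test can also be scripted. The following is a minimal sketch using Python’s standard smtplib; the function names and the verdict labels are made up for illustration, and it only reports how the far end answers HELO and EHLO, it does not prove that a firewall is the cause.

```python
import smtplib


def interpret(code):
    """Map an SMTP reply code to a rough verdict (labels are illustrative)."""
    if 200 <= code < 300:
        return "ok"
    if 400 <= code < 500:
        return "temporary failure"
    if code >= 500:
        return "rejected"
    return "unexpected"


def probe_smtp(host, port=25, timeout=10):
    """Connect to host:port and report how HELO and EHLO are answered."""
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        helo_code, helo_msg = smtp.helo()
        ehlo_code, ehlo_msg = smtp.ehlo()
    return {
        "helo": (helo_code, interpret(helo_code), helo_msg.decode(errors="replace")),
        "ehlo": (ehlo_code, interpret(ehlo_code), ehlo_msg.decode(errors="replace")),
    }


# The reply seen in the post, "500 5.3.3 Unrecognized command", has code 500.
print(interpret(500))
print(interpret(250))
```

Running something like probe_smtp("B") from server A (and the reverse from B) before and after a firewall change would show whether HELO/EHLO start being accepted; "B" here stands in for whatever hostname the server actually has.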

For some reason, however, mail was still not going B->A. I spent a while trying to figure out why – looking in logs, doing “telnet A 25” and everything seemed fine. The mail queue kept showing queued messages though and an error like “remote server didn’t respond to the connection.” What ended up solving it, though, was deleting the Routing Group Connector associated with the datacenter routing group and re-adding it. For whatever reason, that cleared it right up.

So as of right now, we have Office and Datacenter, with Office having Exchange server A, and Datacenter having Exchange servers B and C – B being the new backend and C being the new frontend. DNS has been updated so webmail points to C, and C connects to A or B to get the user’s mail for OWA/ActiveSync. It works, it’s fast, I’m mostly happy.

I should probably mention that we discussed moving to Google Apps in the midst of this project. I was about 70% in favor of it, but in the end it seemed too expensive. We’ve already paid for our Exchange licenses and a hardware message archiver. Google’s price for Google Apps is $83/person per year if you include their 10-year archival option. If you don’t already have infrastructure in place, that might be cheap, but when you’re comparing it to “$0” (and yes, I realize projects like the one I mentioned above aren’t free), it is a lot when you have ~100 users. In addition, most people at my company weren’t comfortable with the privacy/legal implications of having Google host our mail in the cloud – not to mention lots of people are Outlook addicts. They offered 25 GB storage per user, which was pretty compelling, and I personally love the Gmail interface, but in the end we opted to stick with Exchange for the time being.
