Introduction to Database Availability Groups – Full of WIN!
Basic Overview
Now that Exchange 2010 has been released to beta, it’s time to talk about all the fun things that we’ve been working on and working with. To start off, I want to point everyone to the official Exchange 2010 site.
Now that I’ve pointed you at the bits, let’s get into some details about Database Availability Groups, or “The DAG” as it’s called! At its core, it’s a pretty simple concept. The DAG uses Windows Failover Clustering to allow automatic failover, and uses continuous replication to keep copies of a Mailbox database on servers other than the one actually hosting the “active” copy. This is VERY simplistic, but I want to gloss over the details for a moment and build up to them later. What this means is that we can now host copies of a Mailbox database on several servers (up to 16 servers can be in one DAG), and thanks to the magic of continuous replication, the log files are shipped to each of them so that we have multiple, concurrent copies of the database. In the event of a failure, Exchange 2010 “promotes” one of the copies of the database to “active” status, and the Mailbox role on that server then takes up the task of serving the mailboxes in that database. Each database maintains its own status, so one server can host copies of multiple databases and have only some of those copies active at any one time. This can be confusing, so let’s draw a diagram (ooo, pictures!):
In this diagram, we have three servers, and three copies of each database, one on each server. The “active” database copy is the one with the star. The flow of data from the “active” copy to the “passive” copies is concurrent.
Hopefully, it’s clear that a copy of each Mailbox database is hosted on two other servers in this scenario. There are several reasons for this, so let’s walk through some cases. In the first one, let’s say that we lose MBDB01. This is just a simple failover: the next preferred server steps up and starts hosting the mailboxes (and for those of you wondering, YES, you can set the preferred failover order; if you want it to go 1, 3, 2 instead of 1, 2, 3, you can set that). That is a pretty simple case, so why else would you want so many copies? In the second case, we can use this type of architecture to deliberately fail a server over, apply patches, and avoid nasty maintenance downtime, while still being protected if one of the other servers fails during that time. Good ol’ double redundancy. The third reason for maintaining at least three copies is that it ensures there are always enough servers in the DAG, up and running, to allow a quorum for the underlying cluster.
Even if all of the mailboxes end up hosted on one server, your users are still able to access their e-mail, without long, expensive restores or complicated reconfiguration of your DNS or network!
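To make the failover cases above a bit more concrete, here is a toy model in Python. It is only a sketch of the idea, not how Exchange actually implements it: the server names (MBX1 to MBX3), database names (DB1 to DB3) and the helper functions are all made up for illustration, and I am assuming that the only thing that matters is the preferred failover order. The real product looks at more than that (the health of each copy, for one), which this ignores.

```python
# Toy model of "promote the next preferred copy" in a three-server DAG.
# All names here are hypothetical; this is not an Exchange API.

# For each database, the servers holding a copy, most preferred first.
DAG_LAYOUT = {
    "DB1": ["MBX1", "MBX2", "MBX3"],
    "DB2": ["MBX2", "MBX3", "MBX1"],
    "DB3": ["MBX3", "MBX1", "MBX2"],
}


def choose_active(preference_order, failed_servers):
    """Return the first server in preference order that has not failed."""
    for server in preference_order:
        if server not in failed_servers:
            return server
    return None  # every copy is gone; time to reach for the backups


def active_map(layout, failed_servers=frozenset()):
    """Map each database to the server that would host its active copy."""
    return {
        db: choose_active(order, failed_servers)
        for db, order in layout.items()
    }


if __name__ == "__main__":
    print("All healthy:  ", active_map(DAG_LAYOUT))
    print("MBX1 down:    ", active_map(DAG_LAYOUT, {"MBX1"}))
    print("MBX1+MBX2 down:", active_map(DAG_LAYOUT, {"MBX1", "MBX2"}))
```

With one server down, its database simply moves to the next server in its list. With two servers down, every database lands on the last server standing, which is the "all of the mailboxes on one server" situation. Changing a list from 1, 2, 3 to 1, 3, 2 is the toy equivalent of setting your own preferred failover order.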
How it actually works
Earlier on, I mentioned that the DAG uses Windows Failover Clustering and continuous replication to build the copies of the database. What is actually happening is (to me at least) much more interesting. The Windows Failover Clustering service is installed just for the purposes of the automatic failover. The way the databases are treated and handled is much like the Exchange 2007 CCR feature, with a few of the SCR features thrown in for good measure. One of the big differences between the DAG and CCR is that you can configure the number of database copies, which allows you to make full use of the Clustering components. One of the reasons I used the three-server example above is that this is what Microsoft has recommended for the cluster to properly make quorum decisions. You can get by with only two copies, but at least three is the recommended minimum.
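The quorum point is easier to see with a little arithmetic. The sketch below assumes the simplest possible voting model: every server in the DAG gets exactly one vote, and the cluster stays up only while more than half of the votes are present. That is my simplification for illustration (it leaves out any extra witness vote the cluster might be configured with), and the function names are hypothetical.

```python
# Simple majority arithmetic for a cluster where each server has one vote.
# Assumption: plain node majority, no additional witness vote.

def votes_needed(members):
    """Smallest number of votes that is still a strict majority."""
    return members // 2 + 1


def failures_tolerated(members):
    """How many servers can be lost while a majority remains."""
    return members - votes_needed(members)


if __name__ == "__main__":
    for members in (2, 3, 16):
        print(
            f"{members} servers: need {votes_needed(members)} votes, "
            f"can lose {failures_tolerated(members)}"
        )
```

Under that model, two servers cannot lose either one and still have a majority, while three servers can lose one. That is the intuition behind treating three as the comfortable minimum.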
One of the great features of using a DAG is that it is completely managed from Exchange. What this means is that when you are configuring the clustering you don’t have to be a clustering wizard or HA guru to set it up correctly. Exchange 2010 takes care of all the configuration for you, and as my co-worker Devin says, this is a HUGE win.
What people are saying and doing
All this talk about clustering and data redundancy brings up an interesting conversation that is currently floating around: with a sufficiently robust DAG structure, do you still need on-site backups? This has opened up a whole can of worms, and I can say that I feel confident that a properly designed DAG scheme can easily replace many of the functions of standard backups. There are still areas where I would feel more comfortable with a reliable set of backups (database corruption or total site failure), but the DAG can mitigate some of the risks.
That being said, the way that we are currently using our DAG is a little bit different from the scenario I laid out above. To get even more complicated, I have plans to modify our structure to take advantage of Network Load Balancing and turn our current structure into one that is aimed at a high amount of availability! Here’s the planned structure:
In this particular case, the plan is to basically mirror the servers, using NLB to serve up one logical endpoint for the CAS, HT and UM roles (with the Hub Transport role, we have to be careful to exclude the HT to HT traffic from the NLB, but that’s a topic for another post). With that in place, and using the DAG to take care of two copies of a single database, we expand our ability to perform maintenance with minimal downtime for our internal clients, while also providing a high amount of uptime in the case of a failure.
So, now I’ve talked about the DAG and what it can do, but there is quite a bit more. I’ll follow this up shortly with some more advanced features like lag copies, off-site replication and other fun things!