Exchange Safety Net

The Exchange Safety Net is one of the features that provide high availability during message transport. Most people probably don’t think about message transport and assume that it “just works”. While they might be correct, I think that to confidently say you have designed a highly available environment, you should understand the pieces that contribute to HA in Exchange.

In Exchange 2007/2010 this feature was called the Transport Dumpster. In the simplest of terms, the Transport Dumpster is a message queue that contains messages that have been successfully delivered to a mailbox. The purpose of the Transport Dumpster is to ensure messages aren’t lost in the case of a lossy failover. In the event of a lossy failover, the messages in the Transport Dumpster will be replayed to the new active database copy.

In Exchange 2013, the Transport Dumpster was upgraded and renamed to the Safety Net. While the basic function remains the same, there are several improvements with the Safety Net.

Enhancements:

  • DAGs are no longer a requirement. If the Exchange server is not a member of a DAG, then the Safety Net will store a copy of the delivered messages on another mailbox server in the same AD site.
  • Safety Net is no longer a single point of failure. There is now a primary Safety Net and a shadow Safety Net. If the primary Safety Net is unavailable for more than 12 hours, the shadow Safety Net will redeliver the messages.
  • In a DAG environment, Safety Net works closely with shadow redundancy. Shadow redundancy no longer needs to keep a copy of the delivered message while it waits for replication to the passive copies to complete, because Safety Net already holds that copy. This helps cut down on network bandwidth for replication.
  • Safety Net is now a more robust service built to provide transport HA. Because of this, the maximum size of Safety Net can no longer be adjusted, since a size cap could leave holes in the HA. The only configurable setting is how long messages are kept before they expire.
  • Lastly, the Safety Net now applies to messages sent to public folders as well.

How it works:

The best way I’ve heard this put is simply “the Safety Net begins where shadow redundancy ends.” While shadow redundancy protects the message with redundant copies while it is in transit, Safety Net takes over after the message has been delivered and keeps a redundant copy after delivery.

The mailbox server that first accepts the message will ultimately end up acting as the primary Safety Net for the message. Keep in mind this isn’t necessarily the mailbox server that hosts the destination mailbox. As the message is processed by the primary server, it moves through the Transport Pipeline to the mailbox Transport service and is delivered to the mailbox. At this point, the initial server that accepted the message will move a copy of the message from the active queue to the primary Safety Net.

The server that was holding the shadow queue for the message polls the initial server for the status of the message. Once it is confirmed that the message was delivered, the message is moved from the shadow queue to the shadow Safety Net on the same server.
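
If you want to see the shadow queues that feed the shadow Safety Net, something like the following should list them. This is only a sketch: Mailbox01 is a made-up server name, and it assumes you are running it from the Exchange Management Shell with permission to view queues.

```powershell
# Mailbox01 is a hypothetical server name; replace it with one of your mailbox servers.
# Shadow queues report a DeliveryType of ShadowRedundancy.
Get-Queue -Server Mailbox01 |
    Where-Object { $_.DeliveryType -eq 'ShadowRedundancy' } |
    Format-Table Identity, MessageCount -AutoSize
```

An empty result simply means that server is not holding shadow copies for any other server at the moment.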

Depending on the environment, you may want to adjust the amount of time that a message can remain in the Safety Net.

To do that, we run the following PowerShell command:
[PS] c:\> Set-TransportConfig -SafetyNetHoldTime dd.hh:mm:ss

dd is days, hh is hours, mm is minutes, and ss is seconds. You will need to adjust this setting to at least match the ReplayLagTime of any lagged DAG copies in the environment.
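
As a concrete illustration, here is how checking and then changing the value might look. The 7-day figure is just an example I picked, not a recommendation; choose a value that fits your own lagged copies.

```powershell
# Check the current hold time (the default is 2.00:00:00, i.e. 2 days)
Get-TransportConfig | Format-List SafetyNetHoldTime

# Example value only: keep delivered messages in Safety Net for 7 days
Set-TransportConfig -SafetyNetHoldTime 7.00:00:00
```

Keep in mind that this is an organization-wide transport setting, so it applies to every mailbox server rather than to a single one.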

Message resubmission:

Message resubmission from the Safety Net or the Shadow Safety Net is handled automatically; no manual intervention is needed to initiate this process. The Active Manager component of the Microsoft Exchange Replication service is what initiates message resubmission. There are really only two scenarios that will cause message resubmission from the Safety Net.

  1. Automatic or manual failover of a mailbox database that is part of a DAG.
  2. Activation of a lagged mailbox database copy.

The process is the same for each scenario, with the exception of the time period the resubmission will cover. In a normal DAG failover, the new active copy of the mailbox database will at most be a few hours behind the old copy of the database, whereas a lagged copy will, by design, be several days behind the old copy. Because of this, I will reiterate that you need to make sure the SafetyNetHoldTime is greater than or, at minimum, equal to the ReplayLagTime set on the lagged database copy.
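
A quick way to sanity-check this is to list both values side by side and compare them by eye. This is a rough sketch rather than a full audit script; it assumes the Exchange Management Shell and that the ReplayLagTimes property on each database shows the lag configured per copy.

```powershell
# Replay lag configured for each copy of every mailbox database;
# a non-zero value indicates a lagged copy
Get-MailboxDatabase | Format-List Name, ReplayLagTimes

# Organization-wide Safety Net hold time to compare against
Get-TransportConfig | Format-List SafetyNetHoldTime
```

If any ReplayLagTimes entry is larger than SafetyNetHoldTime, messages delivered during the gap between the two could be missing when that lagged copy is activated.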

Just like the resubmission of messages from the Safety Net, the Shadow Safety Net resubmission process is completely automated. If the Primary Safety Net becomes unresponsive, Active Manager will attempt to contact it for 12 hours before identifying the Primary Safety Net as unavailable. At this point, a broadcast is sent to the Transport service on all of the mailbox servers within the transport high availability boundary, looking for any Safety Nets that contain messages from the required time period. A Shadow Safety Net will then reply and resubmit the messages matching the requested time period.

Any messages that are resubmitted through the Shadow Safety Net will require full processing through the Transport service on the mailbox server. If there are a large number of messages, this can put a heavy strain on the mailbox server’s resources. This process has been optimized so that only the messages in the Shadow Safety Net that match both the requested time period AND the requested mailbox database are resubmitted.
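
Even though resubmission needs no manual trigger, it can be handy to see what has been requested, for example after a failover. A minimal sketch, assuming the Get-ResubmitRequest cmdlet in the Exchange 2013 Management Shell:

```powershell
# List resubmit requests known to the transport service,
# with all of their properties
Get-ResubmitRequest | Format-List
```

No output means there are no resubmit requests currently recorded.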