Load balancing SMTP traffic is something that makes sense for a lot of organizations. They have an investment in load balancers for their CAS array, web server farm, etc., so SMTP seems like another logical protocol to run through the load balancers and gain the benefits they deliver.
However, it is also quite easy to create a situation where SMTP traffic is not being load balanced as intended. Worse still, there are scenarios where some load balanced configurations may actually diminish SMTP high availability, or even undermine security.
Let’s take a look at some of the issues and how they can be identified and resolved.
Issues with Load Balancer Configurations
The first issues are reasonably easy to correct if they exist. These are primarily related to the configuration and features of the load balancer itself, such as:
- the priority of the target servers
- the load balancing method/algorithm used
- whether source NATing is being used
- the health monitors/probes
Consider the following scenario where incoming internet email is passed through an email security server/appliance, which is configured to then send the traffic to a load balancer for distribution to the Hub Transport servers. Various internal applications and systems also use the load balancer as their SMTP target.
Priority of Target Servers
In most load balancer configurations you can configure a priority or weight for the servers that are the targets of the traffic. Different vendors use their own terminology for this, but the general idea is that it provides the option to have preferred servers that will be considered first for a new connection if they are available.
Now there are situations where this is a deliberate design choice, and if that is your case then you may not need to worry about this particular issue. However there are some considerations to be aware of if you find that your servers are weighted differently for no particular reason.
Here is a traffic graph of a typical day for two servers that were configured with different weightings/priorities in the load balancer. You can see that SERVER1 handled a higher volume of traffic than SERVER2.
This graph was created by gathering traffic stats from the message tracking logs. For more information see Calculate Daily Email Traffic using Message Tracking Logs and Log Parser.
Depending on your server resources and traffic load this may not be an issue for you, but in some environments it could lead to load issues that interrupt mail flow. So if your actual intention is to evenly distribute traffic across multiple Hub Transport servers, you should consider adjusting the server weight/priority accordingly.
In the above scenario when the weightings were adjusted the traffic became more evenly distributed (not perfectly, but that is due to other factors in that environment which I will cover next).
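To make the effect of weighting concrete, here is a minimal sketch of a weighted round robin scheduler. The weights (3 and 1) and the connection count are made up for illustration, and real load balancers implement weight and priority differently from vendor to vendor, so treat this as a model of the idea rather than of any particular product.

```python
from collections import Counter
from itertools import cycle


def weighted_round_robin(weights):
    """Return an endless iterator of server names, repeated in proportion to weight."""
    # Expand each server into `weight` slots, then cycle through the slots.
    slots = [name for name, weight in weights.items() for _ in range(weight)]
    return cycle(slots)


def distribute(weights, connections):
    """Count how many of `connections` new connections each server receives."""
    picker = weighted_round_robin(weights)
    return Counter(next(picker) for _ in range(connections))


# Hypothetical weights: SERVER1 is preferred three to one.
uneven = distribute({"SERVER1": 3, "SERVER2": 1}, 1000)
# Equal weights spread the same 1000 connections evenly.
even = distribute({"SERVER1": 1, "SERVER2": 1}, 1000)

print(uneven)
print(even)
```

With unequal weights SERVER1 takes three quarters of the connections, which is the kind of skew the graph showed before the weightings were adjusted.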
Load Balancing Method/Algorithm
Along similar lines to the previous issue, a load balancer will usually have multiple methods for deciding which server should be used for a connection. For example, the Kemp load balancers have quite a few scheduling options available.
If you’re seeing SMTP traffic imbalances similar to those in the previous example, and your server weighting/priority is not the cause, you should look at the load balancing method and investigate whether your current configuration is the best suited for that traffic.
As one specific example, if the load balancing is based on source IP it may inadvertently lead to traffic imbalances. In the example environment described at the beginning of this article, source IP-based load balancing would generally result in well-balanced traffic from the internal applications and systems, provided each internal IP sends roughly equal volumes of email. If they do not, some imbalance can still occur.
But that configuration may result in imbalanced email traffic coming from the internet (via the email security server/appliance), because that all appears to come from a single IP.
As the earlier graph showed, this was causing some imbalance in overall SMTP traffic even after the server weight/priority was reconfigured. That change resolved the imbalance for internal sources, which are all on different IP addresses, but the incoming internet email was still treated as coming from a single IP and was almost entirely being sent to a single Hub Transport server.
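The single-IP effect is easy to reproduce with a small sketch of source IP-based scheduling. The hashing scheme here is my own assumption (load balancers each use their own), and the internal address range is hypothetical. The point is only that any deterministic mapping from source IP to server sends every connection from one IP to the same place.

```python
import hashlib
from collections import Counter

SERVERS = ["SERVER1", "SERVER2"]


def pick_server(source_ip, servers=SERVERS):
    """Map a source IP to a server deterministically (assumed hashing scheme)."""
    digest = hashlib.sha256(source_ip.encode("ascii")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


def tally(source_ips):
    """Count connections per server, given one source IP per connection."""
    return Counter(pick_server(ip) for ip in source_ips)


# Internal applications: 200 connections from 200 different (hypothetical) IPs.
internal = [f"10.1.2.{host}" for host in range(1, 201)]
# Internet email: 200 connections that all arrive from the one gateway IP.
internet = ["192.168.0.31"] * 200

print("internal:", tally(internal))
print("internet:", tally(internet))
```

The internal connections are shared between both servers, while every internet connection lands on whichever server the gateway's IP happens to map to.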
The obvious reaction here may be to choose a different load balancing algorithm, however my recommendation for environments where incoming internet email all traverses a single host like that is to consider not using the load balancer for distribution of that incoming internet traffic.
I will explain my reasons for that in the next sections.
Source NATing
One of my concerns with source NATing and load balanced SMTP traffic is the impact it has on the protocol logs generated by the Hub Transport servers.
Note that much of the data presented in this section relies on protocol logging being turned on for all receive connectors on the Hub Transport servers.
    [PS] C:\>Get-ReceiveConnector -Server SERVER1 | select name,protocollogginglevel | ft -auto

    Name                 ProtocolLoggingLevel
    ----                 --------------------
    Default SERVER1      Verbose
    Client SERVER1       Verbose
    Internal Relay       Verbose
    Internet via Gateway Verbose
For more on protocol logging see Troubleshooting Email Delivery with Exchange Server Protocol Logging.
With all internal and incoming SMTP traffic going via the load balancer, which is source NATing the connections, the protocol logs only recorded traffic from the load balancer (IP 10.1.1.12 below) and no other IP addresses.
    IP             Name                    Hits
    -------------- ----------------------- -----
    10.1.1.12      10.1.1.12               25976

    Statistics:
    -----------
    Elements processed: 1428114
    Elements output:    1
    Execution time:     13.49 seconds
The above stats were collected using protocol logs and Log Parser. For more information see Report Top Sender IP’s on Exchange Server 2010 using Log Parser.
Looking at hits per receive connector (recorded as “connector-id” in protocol logs) there was no traffic being handled by the receive connector that was configured for internet traffic.
    Connector                                          Hits
    -------------------------------------------------- -------
    SERVER1\Internal Relay                             1422080
    SERVER1\Default SERVER1                            4363
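If you don't have Log Parser handy, the same kind of tally can be sketched in a few lines of Python. This assumes the standard comma-separated layout of Exchange receive protocol logs (comment lines starting with `#`, then the nine fields listed in `FIELDS`) and counts one hit per connect event (`+`), which may not be exactly how the Log Parser queries behind the figures in this article define a hit. The sample log lines, session IDs and the local endpoint 10.1.1.21 are invented for illustration.

```python
import csv
from collections import Counter

# Field order of Exchange SMTP receive protocol logs.
FIELDS = ["date-time", "connector-id", "session-id", "sequence-number",
          "local-endpoint", "remote-endpoint", "event", "data", "context"]


def connect_rows(lines):
    """Yield one parsed row per new session, skipping '#' header lines."""
    data_lines = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    for row in csv.DictReader(data_lines, fieldnames=FIELDS):
        if row["event"] == "+":  # "+" marks the connect event of a session
            yield row


def hits_by_remote_ip(lines):
    """Count sessions per remote IP (the port is stripped from ip:port)."""
    return Counter(r["remote-endpoint"].rsplit(":", 1)[0] for r in connect_rows(lines))


def hits_by_connector(lines):
    """Count sessions per receive connector."""
    return Counter(r["connector-id"] for r in connect_rows(lines))


# Invented sample data in the protocol log layout.
SAMPLE_LOG = [
    "#Fields: " + ",".join(FIELDS),
    r"2013-04-26T09:40:12.000Z,SERVER1\Internal Relay,08D0000000000001,0,10.1.1.21:25,10.1.1.12:51000,+,,",
    r'2013-04-26T09:40:12.010Z,SERVER1\Internal Relay,08D0000000000001,1,10.1.1.21:25,10.1.1.12:51000,>,"220 SERVER1.domain.local Microsoft ESMTP MAIL Service ready at Fri, 26 Apr 2013 09:40:12 +1000",',
    r"2013-04-26T09:41:03.000Z,SERVER1\Internal Relay,08D0000000000002,0,10.1.1.21:25,10.1.1.12:51001,+,,",
    r"2013-04-26T09:42:30.000Z,SERVER1\Internet via Gateway,08D0000000000003,0,10.1.1.21:25,192.168.0.31:40222,+,,",
]

print(hits_by_remote_ip(SAMPLE_LOG))
print(hits_by_connector(SAMPLE_LOG))
```

To run it against a real log, pass an open file object in place of `SAMPLE_LOG`.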
While this doesn’t necessarily result in an email disruption for your environment, if you have a receive connector for a specific purpose and it is not being used for that intended purpose then your environment is not operating as intended.
Aside from that, you also lose the ability to identify the relative traffic volume of internal vs internet email, if you’re relying on protocol log data to give you that information about your email traffic patterns.
Depending on your incoming email routes there are multiple ways to respond to this issue.
In the example scenario used in this article the email security server has its own load balancing capability for incoming email because you can specify multiple internal hosts to deliver email to. This would also apply to hosted email security services.
By configuring each Hub Transport as an internal delivery target instead of just using the load balancer, the protocol logs now log incoming internet email as coming from the IP addresses for the email security system, rather than the load balancer.
    IP             Name                    Hits
    -------------- ----------------------- -----
    10.1.1.12      10.1.1.12               24819
    192.168.0.32   192.168.0.32            115
    192.168.0.31   192.168.0.31            105

    Statistics:
    -----------
    Elements processed: 1397172
    Elements output:    3
    Execution time:     22.47 seconds
If you do not have an email security server/appliance or other hosted solution, and SMTP connections go directly from the internet to the load balancer, then you could look at using multiple MX records instead, although this would require the availability of multiple public IP addresses.
In addition, any traffic imbalance being caused by the use of source IP-based load balancing should no longer be present. This graph represents incoming internet SMTP connections per server, which began imbalanced and then evened out almost precisely once the load balancer was bypassed.
And importantly, with traffic bypassing the load balancer it should be getting handled by the intended receive connector (which I will explore more in the section further down on security implications).
    Connector                                          Hits
    -------------------------------------------------- -------
    SERVER1\Internal Relay                             1257702
    SERVER1\Internet via Gateway                       6374

    Statistics:
    -----------
    Elements processed: 1267529
    Elements output:    2
    Execution time:     3.23 seconds
Health Monitors and Probes
Yet another issue with load balancing SMTP is the nature of how load balancers detect service availability.
Most load balancers that are service-aware have a health monitor or probe that makes an SMTP connection to the Hub Transport server, waits for a sign that the service is responding, then disconnects. That sign may be simply waiting for the SMTP banner to be returned, or waiting for a response to HELO.
For example, here is the protocol log data for a health check by a load balancer:
    "220 SERVER1.domain.local Microsoft ESMTP MAIL Service ready at Fri, 26 Apr 2013 09:40:12 +1000"
    helo domain.com
    250 server1.domain.com Hello [10.1.1.10]
    quit
    221 2.0.0 Service closing transmission channel
That probe may detect complete service failures, but won’t necessarily detect back pressure if it only goes as far as a HELO.
For example, I pushed one of my test lab servers into “medium” back pressure and then used Telnet to connect and test the response.
As you can see below it was only when I progressed the SMTP conversation past HELO and into the “mail from:” stage that the server returned the familiar 452 4.3.1 Insufficient system resources error, but only for external senders.
    220 HO-EX2010-MB1.exchangeserverpro.net Microsoft ESMTP MAIL Service ready at Mon, 29 Apr 2013 19:55:12 +1000
    helo
    250 HO-EX2010-MB1.exchangeserverpro.net Hello [10.1.1.4]
    mail from: email@example.com
    452 4.3.1 Insufficient system resources
    mail from:firstname.lastname@example.org
    250 2.1.0 Sender OK
So this server would be rejecting incoming internet email (the sender from @gmail.com), even though the load balancer considers the server to be healthy and available.
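A probe that is meant to catch this condition has to go one step further than HELO. The following is a rough sketch of that idea using Python's standard `smtplib`, not a feature of any particular load balancer. The HELO name and sender address are placeholders, and the verdict labels are my own. Because the 452 response in the test above was only returned for an external sender, the probe needs to use a sender address that the server treats as external.

```python
import smtplib


def classify(helo_code, mail_code):
    """Turn the reply codes for HELO and MAIL FROM into a health verdict."""
    if helo_code != 250:
        return "down"
    if mail_code == 250:
        return "healthy"
    if mail_code == 452:  # e.g. 452 4.3.1 Insufficient system resources
        return "back-pressure"
    return "degraded"


def probe(host, sender, port=25, timeout=10):
    """Connect, send HELO and MAIL FROM, then reset and quit without sending mail."""
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            helo_code, _ = smtp.helo("probe.example.com")  # placeholder HELO name
            if helo_code != 250:
                return classify(helo_code, None)
            mail_code, _ = smtp.mail(sender)
            smtp.rset()  # abandon the transaction so nothing is queued
            return classify(helo_code, mail_code)
    except (OSError, smtplib.SMTPException):
        # Refused, timed out, or dropped mid-conversation.
        return "down"


if __name__ == "__main__":
    # Hypothetical target and external-looking sender address.
    print(probe("server1.domain.local", "probe@example.org"))
```

Whether your load balancer can run a custom check like this depends on the product, but the same logic could also run from a monitoring system that alerts on anything other than "healthy".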
If you combine this service-awareness issue with the problem of all email coming from one IP address (ie the email security server/appliance) being distributed only to the server that is suffering back pressure, you can end up with an email disruption for your end users.
Admittedly the combination of factors required to cause that problem scenario may be uncommon, but the potential impact is quite high.
Security Implications
Another issue with some load balanced SMTP configurations is how they can impact the security of your Exchange environment.
The first potential impact is for distribution groups that are configured to require that all senders be authenticated but are otherwise not restricted as to who can send to them (this is the default for distribution groups created in Exchange 2007 and later).
Because some administrators add the source NAT address(es) of the load balancers into the list of remote IP addresses on their internal relay connectors configured in Exchange, this results in any sender that is coming via the load balancer being considered as authenticated and therefore allowed to send to the distribution list.
For internal relay connectors that aren’t exposed to the outside world this may only be a minor inconvenience.
Where this becomes more serious is when incoming internet email traffic arrives via that same load balancer, and can send email to any recipient anywhere – in other words, you’ve got an open relay.
This is a Telnet session from outside of my test lab firewall, through to the load balancer’s IP address, and I am able to relay an email through my Exchange servers.
    250 HO-EX2010-MB2.exchangeserverpro.net Hello [10.1.1.12]
    mail from: email@example.com
    250 2.1.0 Sender OK
    rcpt to: firstname.lastname@example.org
    250 2.1.5 Recipient OK
    data
    354 Start mail input; end with <CRLF>.<CRLF>
    subject: test relay test
    .
    250 2.6.0 <546b08e1-fd0f-4baa-a473-03fba110a1af@HO-EX2010-MB2.exchangeserverpro.net> [InternalId=334267] Queued mail for delivery
This occurs because the source NATing causes Exchange to believe that the email is originating from the load balancer (10.1.1.12), and that IP address is configured as a remote IP address on the internal relay connector.
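The connector selection behind this can be modelled in a few lines. Exchange chooses the receive connector whose remote IP ranges most specifically match the connecting address. The sketch below reproduces only that rule and ignores local bindings and ports. The connector names come from this article, but the ranges assigned to them and the internet sender address (203.0.113.50) are assumptions for illustration.

```python
import ipaddress

# Assumed remote IP ranges for each receive connector (illustrative only).
CONNECTORS = {
    "Default SERVER1": ["0.0.0.0/0"],
    "Internal Relay": ["10.1.1.12/32", "10.1.2.0/24"],  # includes the SNAT address
    "Internet via Gateway": ["192.168.0.31/32", "192.168.0.32/32"],
}


def match_connector(remote_ip, connectors=CONNECTORS):
    """Return the connector with the most specific range containing remote_ip."""
    ip = ipaddress.ip_address(remote_ip)
    best_name, best_prefix = None, -1
    for name, ranges in connectors.items():
        for cidr in ranges:
            network = ipaddress.ip_network(cidr)
            if ip in network and network.prefixlen > best_prefix:
                best_name, best_prefix = name, network.prefixlen
    return best_name


def seen_by_exchange(client_ip, snat_ip=None):
    """With source NAT, Exchange sees the load balancer's address, not the client's."""
    return snat_ip if snat_ip else client_ip


# An internet sender arriving through the source NATing load balancer.
print(match_connector(seen_by_exchange("203.0.113.50", snat_ip="10.1.1.12")))
# The same sender when Exchange can see the real address.
print(match_connector(seen_by_exchange("203.0.113.50")))
# Mail delivered directly by the email security server.
print(match_connector(seen_by_exchange("192.168.0.31")))
```

In this model the internet sender is matched to the internal relay connector purely because of the source NAT address, which is the behaviour the Telnet session demonstrates.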
Ideally if internet email traffic is coming in directly to a load balancer, and the load balancer has no other mechanism for preventing an open relay scenario, then you should ensure that the receive connectors configured for internal applications and systems to relay email are not also handling the internet email traffic.
This could be achieved by using a different VIP and source NAT pool on the load balancer for that traffic, so that it does not get included in the remote IP range for the internal relay connector.
I’ve covered a lot of points in this article and before you get too alarmed I want to make a few things clear.
Firstly, not all of these scenarios are necessarily bad. A traffic imbalance may not be a concern for smaller networks, and may even be a deliberate configuration in some situations.
The impact on protocol logs may not be a concern for administrators who simply do not make any use of the data they contain.
Limitations around health probes/monitoring by the load balancer may not be a concern if you have other robust enterprise monitoring systems alerting you to those conditions already.
Distribution groups being emailed by unauthenticated senders may not be an issue if there is spam filtering in place, and if the organization actually engages in a lot of group email with external parties.
And the sharing of a relay connector for both internal (trusted) and incoming (untrusted) email may not be an immediate issue if the incoming traffic first passes through another device or host that blocks the relay attempt (eg an email security server/appliance).
However, if you do have any concerns about any of these issues I’ve raised then it would be wise to review your configurations, perform some testing, and consider whether there is a better configuration you could move to that mitigates any issues you are actually experiencing.