Issues With Load Balancing SMTP Traffic

Load balancing SMTP traffic makes sense for a lot of organizations. They have already invested in load balancers for their CAS array, web server farm, and so on, so SMTP seems like another logical protocol to run through the load balancers to get the same benefits.

However, it is also quite easy to create a situation where SMTP traffic is not being load balanced as intended. Worse still, some load balanced configurations may actually diminish SMTP high availability, or even undermine security.

Let’s take a look at some of the issues and how they can be identified and resolved.

Issues with Load Balancer Configurations

The first issues are reasonably easy to correct if they exist. These are primarily related to the configuration and features of the load balancer itself, such as:

  • the priority of the target servers
  • the load balancing method/algorithm used
  • whether source NATing is being used
  • the health monitors/probes

Consider the following scenario where incoming internet email is passed through an email security server/appliance, which is configured to then send the traffic to a load balancer for distribution to the Hub Transport servers. Various internal applications and systems also use the load balancer as their SMTP target.

exchange-smtp-load-balancing-02

Priority of Target Servers

In most load balancer configurations you can configure a priority or weight for the servers that are the targets of the traffic. Different vendors use their own terminology for this, but the general idea is that it provides the option to have preferred servers that will be considered first for a new connection if they are available.
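As a rough illustration of how a weight skews new connections, here is a minimal Python sketch of a weighted rotation. The server names, the 3:1 weights and the connection count are hypothetical (the weights used in the environment described in this article are not stated), and real load balancers implement weighting in their own vendor-specific ways.

```python
from collections import Counter
from itertools import cycle, islice


def weighted_rotation(weights):
    """Build a rotation in which each server appears once per unit of weight."""
    rotation = []
    for server, weight in weights.items():
        rotation.extend([server] * weight)
    return rotation


def distribute(connection_count, weights):
    """Hand out new connections by walking the weighted rotation in order."""
    rotation = weighted_rotation(weights)
    return Counter(islice(cycle(rotation), connection_count))


if __name__ == "__main__":
    # Hypothetical weights: SERVER1 is preferred three times as often as SERVER2.
    print(distribute(1000, {"SERVER1": 3, "SERVER2": 1}))
    # Equal weights spread the same connections evenly.
    print(distribute(1000, {"SERVER1": 1, "SERVER2": 1}))
```

With unequal weights the preferred server simply receives a proportionally larger share of new connections, which is the pattern to look for in your own traffic graphs.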

Now there are situations where this is a deliberate design choice, and if that is your case then you may not need to worry about this particular issue. However there are some considerations to be aware of if you find that your servers are weighted differently for no particular reason.

Here is a traffic graph of a typical day for two servers that were configured with different weightings/priorities in the load balancer. You can see that SERVER1 handled a higher volume of traffic than SERVER2.

exchange-smtp-load-balancing-weighting-01

This graph was created by gathering traffic stats from the message tracking logs. For more information see Calculate Daily Email Traffic using Message Tracking Logs and Log Parser.

Depending on your server resources and traffic load this may not be an issue for you, but in some environments it could lead to load issues that interrupt mail flow. So if your actual intention is to evenly distribute traffic across multiple Hub Transport servers then you should consider adjusting the server weight/priority accordingly.

In the above scenario when the weightings were adjusted the traffic became more evenly distributed (not perfectly, but that is due to other factors in that environment which I will cover next).

exchange-smtp-load-balancing-weighting-02

Load Balancing Method/Algorithm

Along similar lines to the previous issue, a load balancer will usually have multiple methods for deciding which server should be used for a connection. For example, the Kemp load balancers have quite a few scheduling options available.

exchange-smtp-load-balancing-scheduling

If you’re seeing SMTP traffic imbalances similar to those in the previous example, and your server weighting/priority is not the cause, you should look at the load balancing method and investigate whether your current configuration is the best suited for that traffic.

As one specific example, load balancing based on source IP may inadvertently lead to traffic imbalances. In the example environment shown at the beginning of this article, source IP-based load balancing would generally result in well balanced traffic from the internal applications and systems, assuming each internal IP is sending roughly equal volumes of email. If they are not, some imbalances can still occur.

But that configuration may result in imbalanced email traffic coming from the internet (via the email security server/appliance), because that all appears to come from a single IP.

exchange-smtp-load-balancing-03

As the earlier graph showed, this was causing some imbalance in overall SMTP traffic even after the server weight/priority was reconfigured. That change resolved the traffic imbalance from internal sources, which are all on different IP addresses, but the incoming internet email was still treated as coming from a single IP and was almost entirely being sent to a single Hub Transport server.

exchange-smtp-load-balancing-weighting-02
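To make the single-source-IP effect concrete, here is a minimal Python sketch that compares a source IP-based choice with plain round robin. Everything in it is an assumption for illustration: the IP addresses, the connection counts, and the use of a CRC32 hash as a stand-in for however a given load balancer maps a source IP to a server.

```python
import zlib
from itertools import cycle

SERVERS = ["SERVER1", "SERVER2"]


def by_source_ip(connections, servers):
    """Send every connection from the same source IP to the same server."""
    counts = dict.fromkeys(servers, 0)
    for src_ip in connections:
        index = zlib.crc32(src_ip.encode("ascii")) % len(servers)
        counts[servers[index]] += 1
    return counts


def by_round_robin(connections, servers):
    """Send each new connection to the next server, ignoring the source IP."""
    counts = dict.fromkeys(servers, 0)
    for _, server in zip(connections, cycle(servers)):
        counts[server] += 1
    return counts


def sample_connections():
    """Hypothetical mix: one busy gateway IP plus ten quieter internal hosts."""
    gateway = ["192.168.0.31"] * 1000
    internal = [f"10.1.1.{host}" for host in range(100, 110)] * 20
    return gateway + internal


if __name__ == "__main__":
    connections = sample_connections()
    print("source IP  :", by_source_ip(connections, SERVERS))
    print("round robin:", by_round_robin(connections, SERVERS))
```

Whichever server the gateway address happens to map to receives all of the gateway's connections, so the totals can never be closer than that block of traffic allows. Round robin without persistence spreads the same connections almost exactly evenly.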

The obvious reaction here may be to choose a different load balancing algorithm. However, my recommendation for environments where incoming internet email all traverses a single host like that is to consider not using the load balancer to distribute that incoming internet traffic.

I will explain my reasons for that in the next sections.

Source NATing

One of my concerns with source NATing and load balanced SMTP traffic is the impact it has on the protocol logs generated by the Hub Transport servers.

Note that much of the data presented in this section relies on protocol logging being turned on for all receive connectors on the Hub Transport servers.

[PS] C:\>Get-ReceiveConnector -Server SERVER1 | select name,protocollogginglevel | ft -auto

Name                                    ProtocolLoggingLevel
----                                    --------------------
Default SERVER1                                      Verbose
Client SERVER1                                       Verbose
Internal Relay                                       Verbose
Internet via Gateway                                 Verbose

For more on protocol logging see Troubleshooting Email Delivery with Exchange Server Protocol Logging.

With all internal and incoming SMTP traffic going via the load balancer, which is source NATing the connections, the protocol logs only recorded traffic from the load balancer (IP 10.1.1.12 below) and no other IP addresses.

IP             Name                    Hits
-------------- ----------------------- -----
10.1.1.12      10.1.1.12               25976

Statistics:
-----------
Elements processed: 1428114
Elements output:    1
Execution time:     13.49 seconds

The above stats were collected using protocol logs and Log Parser. For more information see Report Top Sender IP’s on Exchange Server 2010 using Log Parser.

Looking at hits per receive connector (recorded as “connector-id” in protocol logs) there was no traffic being handled by the receive connector that was configured for internet traffic.

Connector                                          Hits
-------------------------------------------------- -------
SERVER1\Internal Relay                             1422080
SERVER1\Default SERVER1                               4363
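The figures above were produced with Log Parser. As an alternative sketch only, the same kind of per-IP and per-connector counts could be approximated in Python. This assumes the receive protocol logs are CSV files whose "#Fields:" header names columns such as "connector-id" and "remote-endpoint" (with the remote endpoint written as IP:port). The column names are read from the header rather than hard-coded, and the folder path in the usage example is a placeholder for your own log location.

```python
import csv
import glob
from collections import Counter


def count_hits(lines, field):
    """Count protocol log rows grouped by one column, e.g. 'connector-id'."""
    fields = None
    hits = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("#Fields:"):
            # The header names the columns used by the rows that follow it.
            fields = [name.strip() for name in line[len("#Fields:"):].split(",")]
            continue
        if not line or line.startswith("#") or fields is None:
            continue
        row = dict(zip(fields, next(csv.reader([line]))))
        value = row.get(field, "")
        if field == "remote-endpoint":
            value = value.rsplit(":", 1)[0]  # keep the IP, drop the port
        hits[value] += 1
    return hits


def count_hits_in_folder(pattern, field):
    """Total the counts across every log file matching a glob pattern."""
    total = Counter()
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8", errors="replace") as handle:
            total.update(count_hits(handle, field))
    return total


if __name__ == "__main__":
    # Placeholder path: point this at your own SmtpReceive protocol log folder.
    for ip, hits in count_hits_in_folder(r"C:\Logs\SmtpReceive\*.log",
                                         "remote-endpoint").most_common():
        print(f"{ip:<16}{hits}")
```

Passing "connector-id" instead of "remote-endpoint" gives the per-connector view, which is a quick way to spot a receive connector that is not handling any traffic.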

While this doesn’t necessarily result in an email disruption for your environment, if you have a receive connector for a specific purpose and it is not being used for that intended purpose then your environment is not operating as intended.

Aside from that there is also the issue of being able to identify the relative traffic volume of internal vs internet email, if you’re relying on protocol log data to give you that information about your email traffic patterns.

Depending on your incoming email routes there are multiple ways to respond to this issue.

In the example scenario used in this article the email security server has its own load balancing capability for incoming email because you can specify multiple internal hosts to deliver email to. This would also apply to hosted email security services.

exchange-smtp-load-balancing-04

By configuring each Hub Transport as an internal delivery target instead of just using the load balancer, the protocol logs now log incoming internet email as coming from the IP addresses for the email security system, rather than the load balancer.

IP             Name                    Hits
-------------- ----------------------- -----
10.1.1.12      10.1.1.12               24819
192.168.0.32   192.168.0.32            115
192.168.0.31   192.168.0.31            105

Statistics:
-----------
Elements processed: 1397172
Elements output:    3
Execution time:     22.47 seconds

If you do not have an email security server/appliance or other hosted solution, and SMTP connections go directly from the internet to the load balancer, then you could look at using multiple MX records instead, although this would require the availability of multiple public IP addresses.

exchange-smtp-load-balancing-05
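Multiple MX records spread load because sending servers try the lowest preference value first and, per RFC 5321, randomize among records that share the same preference. A minimal Python sketch of that ordering follows. The host names and preference values are hypothetical, and real senders add their own retry and caching behaviour on top of this.

```python
import random
from itertools import groupby


def order_mx(records, rng=random):
    """Return host names in the order a sender would try them.

    records is a list of (preference, hostname) tuples. Lower preference
    values are tried first; hosts with equal preference are shuffled.
    """
    ordered = []
    for _, group in groupby(sorted(records), key=lambda record: record[0]):
        hosts = [host for _, host in group]
        rng.shuffle(hosts)
        ordered.extend(hosts)
    return ordered


if __name__ == "__main__":
    # Hypothetical records: two equal-preference hosts and one backup.
    records = [
        (10, "mail1.example.com"),
        (10, "mail2.example.com"),
        (20, "backup.example.com"),
    ]
    print(order_mx(records))
```

Giving both Hub Transport servers the same preference is what produces the spread. Different preference values would turn one of them into a fallback only.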

In addition, any traffic imbalance being caused by the use of source IP-based load balancing should no longer be present. This graph represents incoming internet SMTP connections per server, which began imbalanced and then evened out almost precisely once the load balancer was bypassed.

exchange-smtp-load-balancing-source-nat-01

And importantly, with traffic bypassing the load balancer it should be getting handled by the intended receive connector (which I will explore more in the section further down on security implications).

Connector                                          Hits
-------------------------------------------------- -------
SERVER1\Internal Relay                             1257702
SERVER1\Internet via Gateway                       6374

Statistics:
-----------
Elements processed: 1267529
Elements output:    2
Execution time:     3.23 seconds

Health Monitors and Probes

Yet another issue with load balancing SMTP is the nature of how load balancers detect service availability.

Most load balancers that are service-aware have a health monitor or probe that makes an SMTP connection to the Hub Transport server, waits for a sign that the service is responding, then disconnects. That sign may be simply waiting for the SMTP banner to be returned, or waiting for a response to HELO.

For example, here is the protocol log data for a health check by a load balancer:

"220 SERVER1.domain.local Microsoft ESMTP MAIL Service ready at Fri, 26 Apr 2013 09:40:12 +1000"
helo domain.com
250 server1.domain.com Hello [10.1.1.10]
quit
221 2.0.0 Service closing transmission channel

That probe may detect complete service failures, but won’t necessarily detect back pressure if it only goes as far as a HELO.

For example, I pushed one of my test lab servers into “medium” back pressure and then used Telnet to connect and test the response.

As you can see below it was only when I progressed the SMTP conversation past HELO and into the “mail from:” stage that the server returned the familiar 452 4.3.1 Insufficient system resources error, but only for external senders.

220 HO-EX2010-MB1.exchangeserverpro.net Microsoft ESMTP MAIL Service ready at Mon, 29 Apr 2013 19:55:12 +1000
helo
250 HO-EX2010-MB1.exchangeserverpro.net Hello [10.1.1.4]
mail from: exchangeserverpro@gmail.com
452 4.3.1 Insufficient system resources
mail from:alan.reid@exchangeserverpro.net
250 2.1.0 Sender OK

So this server would be rejecting incoming internet email (the sender from @gmail.com), even though the load balancer considers the server to be healthy and available.

exchange-smtp-load-balancing-health
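For monitoring outside the load balancer, a deeper probe can be sketched in Python with the standard smtplib module. It only reports a server as healthy if MAIL FROM is accepted, not just HELO. The sender address and timeout are assumptions, and the smtp_factory parameter is a hypothetical hook added here so the logic can be exercised without a live server. Because back pressure in the test above only rejected external senders, the probe's sender should be an address from outside your accepted domains.

```python
import smtplib


def probe_smtp(host, port=25, sender="probe@example.com", timeout=5.0,
               smtp_factory=smtplib.SMTP):
    """Return (healthy, detail); healthy only if MAIL FROM is accepted."""
    try:
        smtp = smtp_factory(host, port, timeout=timeout)
    except (OSError, smtplib.SMTPException) as exc:
        return False, f"connect failed: {exc}"
    try:
        code, _ = smtp.helo()
        if code != 250:
            return False, f"HELO rejected: {code}"
        # Back pressure shows up here (452 4.3.1), not at the banner or HELO.
        code, _ = smtp.mail(sender)
        if code != 250:
            return False, f"MAIL FROM rejected: {code}"
        smtp.rset()  # abandon the transaction so no message is queued
        return True, "MAIL FROM accepted"
    except (OSError, smtplib.SMTPException) as exc:
        return False, f"probe failed: {exc}"
    finally:
        try:
            smtp.quit()
        except (OSError, smtplib.SMTPException):
            pass


if __name__ == "__main__":
    # Hypothetical host name: replace with one of your Hub Transport servers.
    print(probe_smtp("server1.domain.local"))
```

A script like this could feed an external monitoring system, or serve as a reference when asking your load balancer vendor whether their SMTP health check can be extended past HELO.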

If you combine this service-awareness issue with the problem of all email coming from one IP address (i.e. the email security server/appliance) being distributed only to the server that is suffering back pressure, you can end up with an email disruption for your end users.

Admittedly the combination of factors required to cause that problem scenario may be uncommon, but the potential impact is quite high.

Security Implications

Another issue with some load balanced SMTP configurations is how they can impact the security of your Exchange environment.

The first potential impact is for distribution groups that are configured to require that all senders be authenticated but are otherwise not restricted as to who can send to them (this is the default for distribution groups created in Exchange 2007 and later).

exchange-smtp-load-balancing-auth-senders

Because some administrators add the source NAT address(es) of the load balancers into the list of remote IP addresses on their internal relay connectors configured in Exchange, this results in any sender that is coming via the load balancer being considered as authenticated and therefore allowed to send to the distribution list.

For internal relay connectors that aren’t exposed to the outside world this may only be a minor inconvenience.

Where this becomes more serious is when incoming internet email traffic arrives via that same load balancer, because external senders can then send email to any recipient anywhere. In other words, you’ve got an open relay.

This is a Telnet session from outside of my test lab firewall, through to the load balancer’s IP address, and I am able to relay an email through my Exchange servers.

250 HO-EX2010-MB2.exchangeserverpro.net Hello [10.1.1.12]
mail from: exchangeserverpro@gmail.com
250 2.1.0 Sender OK
rcpt to: paul@locklan.com.au
250 2.1.5 Recipient OK
data
354 Start mail input; end with <CRLF>.<CRLF>
subject: test relay
test
.
250 2.6.0 <546b08e1-fd0f-4baa-a473-03fba110a1af@HO-EX2010-MB2.exchangeserverpro.net> [InternalId=334267] Queued mail for delivery

This occurs because the source NATing causes Exchange to believe that the email is originating from the load balancer (10.1.1.12), and that IP address is configured as a remote IP address on the internal relay connector.

exchange-smtp-load-balancing-relayconnector

exchange-smtp-load-balancing-openrelay

Ideally if internet email traffic is coming in directly to a load balancer, and the load balancer has no other mechanism for preventing an open relay scenario, then you should ensure that the receive connectors configured for internal applications and systems to relay email are not also handling the internet email traffic.

This could be achieved by using a different VIP and source NAT pool on the load balancer for that traffic, so that it does not get included in the remote IP range for the internal relay connector.
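One way to audit for this overlap is to compare the remote IP ranges on the internal relay connector against the source NAT addresses that carry untrusted traffic. The Python sketch below does that comparison using the standard ipaddress module. The function names and sample addresses are hypothetical, and it assumes you have already exported the connector's remote IP ranges as strings in single-address, CIDR, or start-end form.

```python
import ipaddress


def parse_range(text):
    """Turn '10.1.1.12', '10.1.2.0/24' or '10.1.1.10-10.1.1.20' into networks."""
    text = text.strip()
    if "-" in text:
        first, last = (ipaddress.ip_address(part.strip())
                       for part in text.split("-", 1))
        return list(ipaddress.summarize_address_range(first, last))
    return [ipaddress.ip_network(text, strict=False)]


def exposed_snat_addresses(relay_remote_ranges, untrusted_snat_ips):
    """Return SNAT addresses for untrusted traffic that the relay connector trusts."""
    networks = [net for entry in relay_remote_ranges for net in parse_range(entry)]
    exposed = []
    for ip_text in untrusted_snat_ips:
        address = ipaddress.ip_address(ip_text)
        if any(address in net for net in networks):
            exposed.append(ip_text)
    return exposed


if __name__ == "__main__":
    # Hypothetical values modelled on the scenario above.
    relay_ranges = ["10.1.1.12", "10.1.2.0/24"]
    internet_snat = ["10.1.1.12"]
    print(exposed_snat_addresses(relay_ranges, internet_snat))
```

An empty result means the internet-facing source NAT pool is not trusted by the relay connector. Any address returned is one that would let internet traffic inherit the relay connector's permissions.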

Summary

I’ve covered a lot of points in this article and before you get too alarmed I want to make a few things clear.

Firstly, not all of these scenarios are necessarily bad. A traffic imbalance may not be a concern for smaller networks, and may even be a deliberate configuration in some situations.

The impact on protocol logs may not be a concern for administrators who simply do not make any use of the data they contain.

Limitations around health probes/monitoring by the load balancer may not be a concern if you have other robust enterprise monitoring systems alerting you to those conditions already.

Distribution groups being emailed by unauthenticated senders may not be an issue if there is spam filtering in place, and if the organization actually engages in a lot of group email with external parties.

And the sharing of a relay connector for both internal (trusted) and incoming (untrusted) email may not be an immediate issue if the incoming traffic first passes through another device or host that blocks the relay attempt (e.g. an email security server/appliance).

However, if you do have any concerns about any of these issues I’ve raised then it would be wise to review your configurations, perform some testing, and consider whether there is a better configuration you could move to that mitigates any issues you are actually experiencing.

About Paul Cunningham

Paul is a Microsoft Exchange Server MVP and publisher of Exchange Server Pro. He also holds several Microsoft certifications including for Exchange Server 2007, 2010 and 2013. Find Paul on Twitter, LinkedIn or Google+, or get in touch for consulting/support engagements.

Comments

  1. Benjamin Hodge says:

    Hi Paul,

    Great post on some of the decisions regarding whether to use HLB or just plain MX records for SMTP traffic. The Health Probe discussion is particularly interesting however for the issues in traffic load and security related to Source NAT by the HLB there are very easy ways to resolve this that I’d like to discuss for anyone who wishes to continue using their HLB.

    1. The problem with the load is actually with the “persistence” algorithm, not the “scheduling” algorithm. Something like Round Robin for scheduling would send an even number of new connections to each SMTP Server however with Source IP persistence enabled returning SMTP senders would be sent back to the same backend server. By keeping Round Robin enabled but setting your Persistence method to “None” you would ensure that each new SMTP connection (regardless of the Source IP) would be rebalanced and get an even spread.

    2. While many people might have a configuration where the requests to the SMTP servers come from the IP of the HLB (called Non-Transparent by KEMP) this is optional in most cases. Providing you meet the necessary network configuration requirements so traffic routes properly there is no reason why the load balancer can’t pass on the request using the original source IP of the client. This would ensure both logging and your SMTP connector configuration function as you would expect. If you’re a KEMP customer just contact Support and ask them for advice on configuring “Transparency”.

    3. If you can’t meet the necessary requirements for passing on the original source IP there is another way you can work around the problem of Source IP to ensure the security of your SMTP connectors (but you would have limited visibility in your logs as you mentioned). To do this you need to set up 2 separate virtual services, one for internal use and a 2nd for external use. It is possible to configure in KEMP (and I assume most HLB) for a Virtual Service to use a specific IP for connections to the Real Server, this would allow your SMTP servers to identify whether the internal or external VIP was being used for the connection. Your SMTP connectors would be set so that you had different security profiles depending on which of these 2 VIPs was used for the connection. By locking down access to the internally used VIP you can maintain security. I have also helped a customer use this method in order to allow unauthenticated SMTP traffic from 2 specific internal application servers where the application did not allow for SMTP authentication but where they still wanted all other connections to use auth. It worked perfectly in this situation.

    I’d be very interested in more information on how the SMTP health check can be improved, might be something for the Dev team to work on :-)

    Thanks again for a great discussion on some of the pitfalls in load balancing SMTP

    Cheers,
    Ben

    • Hi Ben, thanks for clarifying that persistence vs scheduling point.

      And for spelling out the options for working around these issues. Exactly the type of info I was hoping to draw out from the experts in this field.

      Regarding the health checks, I sent a note to Bhargav explaining one particular back pressure scenario. I think he’ll see straight away where potential improvements can be made but I’m happy to discuss further if you want to loop me in via email.

  2. Hello Sir,

    one of my hub Server(E2k10) is not telneting even locally (telnet localhost 25 ) so its not routing any mails.

    how can i find clues about the issue.?

    Regards,
    Sena
