Autosupport stopped working

I installed a Clustered ONTAP System about 4 months ago, and I’ve been working with the customer since then on migration of their several hundred workloads onto the system in a staged approach. While doing a regular checkin, I noticed that Autosupport had stopped working on three of their 4 nodes, despite working when I finished the initial build.

Some checks of logs and within the organization showed that it had stopped working at the same time that they had changed their mail server IPs. Easy, you think. Maybe I put in the IPs into the autosupport setup? Checked that, nope, it goes to the hostname. Well, maybe I put in an /etc/hosts entry? (system services hosts show) – nope, wasn’t that. Checked autosupport’s destinations were configured the same on all four nodes – and they were. Maybe there’s a firewall issue? Ping from the node management LIF to the SMTP servers all works. Maybe it’s a specific SMTP firewall block? Used debug mode systemshell and tcp_client (note: don’t try this at home..) – that all worked. I got their firewall and exchange admins to check logs for the node management LIFs trying to make connections, and no attempts, other than my tcp_client ones. Ran pktt on all interfaces with target IPs of the mailhosts, and found no attempts to send out from e0M (home of the node-mgmt LIF), only one of the data LIFs. NetApp KB 3012724 talks about LIFs, and has this to say on the topic:

Clustered Data ONTAP 8.2.x:

  • AutoSupport is delivered from the node-mgmt LIF per node.

Looking through the autosupport history, the attempts fail, and the last error recorded is “FTP: weird server reply”. Uhh.. transport can be either http, https or smtp. Why is it mentioning FTP?

NetApp KB 201727 shows how to access debug logs for autosupport. I did that and saw the error message of “421 Service Unavailable”. Remember the FTP error? Well, that dear readers is because your NetApp, at its heart, is a big FreeBSD box, and it uses curl to send autosupport emails. And when curl gets a “421 Service Not Available” response from the mail server, that’s what it does.

Looking at the pktt logs closer, it’s because the autosupport email is going out of one of the data LIFs for an SVM on the host. Why would you suddenly decide to do that?! Well, let’s look at KB 3012724 again..

By default, routes for the node mgmt LIF have a lower (more preferred) metric than routes of data LIFs. However, the metric is used as a tie-breaker. The more-specific route to the destination will always be picked regardless of the metric.

..

Case 3 – The node-mgmt LIF and data LIFs on different subnets, destination is on the same subnet as the data LIFs. The implicit subnet route of the data LIFs (which isn’t seen in ngsh) will be the most specific route to the destination, and will therefore be the selected route. A data LIF will be used.

So, despite the earlier assurance that autosupport uses the node-mgmt LIF, the actual story is somewhat more complicated. It uses the node-mgmt LIF, unless it likes another one better. As for why only one of the 4 nodes worked? Well that node didn’t have any SVM LIFs on the same subnet as the mail servers, so it didn’t try using them to send the ASUP email.

So what do you do? You can either create individual host routes (/32) in the routing group for the node admin SVMs, or create a subnet route in there to prevent it occuring if IPs change again. I also found (as did it seems another posted on NetApp Communities), that setting the metric lower didn’t solve the problem, you had to set the metric to “1”.

Going forward, part of my system installation will always include a route for the mail server in the routing group that the node management SVM uses.