Arista VLAN assignment, and MLAGs

I have done a few Arista deployments lately – they’re awesome, cheap, 10GbE switches. The EOS config is very similar to Cisco IOS, but there is one really important difference for my purposes, regarding VLAN assignments.

On a Cisco switch, you could run the following command:

switchport mode trunk
switchport trunk allowed vlan add 123,124,125

Aristas will let you run this command, without error, but it won’t do what you expect. As soon as you set a port to be a trunk, it allows all VLANs on it, without being told. So on an EOS switch, the configuration is:

switchport mode trunk
switchport trunk allowed vlan none
switchport trunk allowed vlan add 123,124,125
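
To make the difference concrete, here is a minimal Python sketch of the allowed-VLAN set as this post describes it. It is only a model of the behaviour above, not anything EOS itself runs; the function name and the shortened "none" / "add ..." command strings are made up for illustration.

```python
# Model of the trunk behaviour described above: a new trunk starts out
# allowing every VLAN, so "add" on its own changes nothing.
ALL_VLANS = frozenset(range(1, 4095))  # valid 802.1Q VLAN IDs, 1-4094


def allowed_after(commands):
    """Return the allowed-VLAN set after applying shortened commands.

    Each command is either "none" or "add <comma-separated VLAN IDs>".
    These strings are a hypothetical shorthand, not real EOS syntax.
    """
    allowed = set(ALL_VLANS)  # assumption from the post: trunks allow all VLANs
    for command in commands:
        action, _, vlan_list = command.partition(" ")
        if action == "none":
            allowed.clear()
        elif action == "add":
            allowed |= {int(v) for v in vlan_list.split(",")}
    return allowed


ios_habit = allowed_after(["add 123,124,125"])
eos_safe = allowed_after(["none", "add 123,124,125"])

print(len(ios_habit))    # 4094
print(sorted(eos_safe))  # [123, 124, 125]
```

The "add" on its own is a no-op because the three VLANs are already in the set, which is why the "none" line matters.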

The recommended way of configuring your VLANs is to define which “trunk groups” a VLAN is in (under vlan configuration), then assign ports to trunk groups, but this IOS-like method also works. You can (and should) verify the 802.1Q trunking configuration of a port (or port-channel) by adding “trunk” after it:

show interface Eth7 trunk

Arista has this very concise and well written page on how to set up their virtual chassis MLAG configuration (like Cisco vPC, Brocade TRILL, etc). One important point it doesn’t note clearly: the MLAG peer link needs to carry ONLY the peer-link VLAN, and the peer-link VLAN can’t go to the uplink switches, or you will get a spanning tree shutdown.

 

Clustered ONTAP 8.3 – No more dedicated root aggregate!

O frabjous day! Callooh! Callay! ONTAP 8.3 is out, and with it, the long promised demise of the dedicated root aggregate for lower end systems!

To re-cap – NetApp has always said – have a dedicated root aggregate. But until Clustered ONTAP, that was more of a recommendation, like, say, brush your teeth morning, noon and night. When you only have 24 drives in a system, throwing away 6 of them to boot the thing seems like a silly idea. The lower-end (FAS2xxx) systems represent a very large number of NetApp’s sales by controller count, and for these systems, Clustered ONTAP was not a great move because of it. With 8.3 being Clustered ONTAP only, there had to be a solution to this pretty serious and valid objection, and there is – Advanced Disk Partitioning (ADP).

What is ADP? Basically it’s partitioning drives, and being able to assign partitions to RAID groups and aggregates. Cool, right? Well, yes, mostly. ADP can be used on All-Flash-FAS (AFF), but that is out of scope for this post. There are some important things to be aware of for these lower end systems.

  1. Systems using ADP need an ADP formatted spare, and then non-ADP spares for any other drives
  2. ADP can only be used for internal drives on a FAS2[2,5]xx system
  3. ADP drives can only be part of a RAID group of ADP drives
  4. SSDs can now be pooled between controllers!

If a system is only using the internal drives, chances are, it is going to be a smaller system, and most of these don’t matter. The issue comes when it is time to add a disk shelf. Consider the following ADP layout system, assuming one data aggregate per controller:

[Figure: ADP layout across the 24 internal disks]

 

If we were to add a shelf of 24 disks, and split it evenly between controllers, we would need to do some thinking first. We can’t add it to the ADP RG, and we need a non-ADP spare, for each controller. With ADP, and our 42 (18+24) SAS drives (21 per controller), we have used them like this:

  • N1_aggr0
  • N1_aggr1_rg0 – 6 data, 2 parity
  • N1_aggr1_rg1 – 9 data, 2 parity
  • N1 ADP Spare – 1
  • N1 Non ADP Spare – 1
  • N2_aggr0
  • N2_aggr1_rg0 – 6 data, 2 parity
  • N2_aggr1_rg1 – 9 data, 2 parity
  • N2 ADP Spare – 1
  • N2 Non ADP Spare – 1

For a total of:

  • 8 parity
  • 4 spare
  • 30 data

If we didn’t use ADP, we’d be using them like this:

  • N1_aggr0 – 1 root, 2 parity
  • N1_aggr1_rg0 – 15 data, 2 parity
  • N1 Non ADP Spare – 1
  • N2_aggr0 – 1 root, 2 parity
  • N2_aggr1_rg0 – 15 data, 2 parity
  • N2 Non ADP Spare – 1

For a total of:

  • 8 parity
  • 2 spare (plus 2 root drives)
  • … annnd 30 data

I toyed with running the numbers on moving the SSD drives to the shelf, meaning we could have larger ADP partitions used in RAID groups, but that still bites you in the end, as you will end up with the same number of RAID groups, but less balanced sizes as more shelves are added.

If we move to 2 shelves – 66 (18+24+24) SAS drives, we could use them like this with ADP:

  • N1_aggr0
  • N1_aggr1_rg0 – 6 data, 2 parity
  • N1_aggr1_rg1 – 9 data, 2 parity
  • N1_aggr1_rg2 – 10 data, 2 parity
  • N1 ADP Spare – 1
  • N1 Non ADP Spare – 1
  • N2_aggr0
  • N2_aggr1_rg0 – 6 data, 2 parity
  • N2_aggr1_rg1 – 9 data, 2 parity
  • N2_aggr1_rg2 – 10 data, 2 parity
  • N2 ADP Spare – 1
  • N2 Non ADP Spare – 1

For a total of:

  • 12 parity
  • 4 spare
  • 50 data

Or this without ADP:

  • N1_aggr0 – 1 root, 2 parity
  • N1_aggr1_rg0 – 15 data, 2 parity
  • N1_aggr1_rg1 – 10 data, 2 parity
  • N1 Non ADP Spare – 1
  • N2_aggr0 – 1 root, 2 parity
  • N2_aggr1_rg0 – 15 data, 2 parity
  • N2_aggr1_rg1 – 10 data, 2 parity
  • N2 Non ADP Spare – 1

For a total of:

  • 12 parity
  • 50 data
  • 2 spare

At 3 shelves, the story changes. With 90 (18+24+24+24) SAS drives, we could use them like this with ADP:

  • N1_aggr0
  • N1_aggr1_rg0 – 6 data, 2 parity
  • N1_aggr1_rg1 – 9 data, 2 parity
  • N1_aggr1_rg2 – 10 data, 2 parity
  • N1_aggr1_rg3 – 10 data, 2 parity
  • N1 ADP Spare – 1
  • N1 Non ADP Spare – 1
  • N2_aggr0
  • N2_aggr1_rg0 – 6 data, 2 parity
  • N2_aggr1_rg1 – 9 data, 2 parity
  • N2_aggr1_rg2 – 10 data, 2 parity
  • N2_aggr1_rg3 – 10 data, 2 parity
  • N2 ADP Spare – 1
  • N2 Non ADP Spare – 1

For a total of:

  • 16 parity
  • 4 spare
  • 70 data

Or this without ADP:

  • N1_aggr0 – 1 root, 2 parity
  • N1_aggr1_rg0 – 19 data, 2 parity
  • N1_aggr1_rg1 – 18 data, 2 parity
  • N1 Non ADP Spare – 1
  • N2_aggr0 – 1 root, 2 parity
  • N2_aggr1_rg0 – 19 data, 2 parity
  • N2_aggr1_rg1 – 18 data, 2 parity
  • N2 Non ADP Spare – 1

For a total of:

  • 12 parity
  • 74 data
  • 2 spare
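
The totals above are easy to get wrong by hand, so here is a small Python sketch that re-adds the 3 shelf layouts. The function and its parameters are my own naming, and the non-ADP root aggregate is modelled as a RAID group with 0 data drives plus a separate root drive, which is just a bookkeeping assumption.

```python
def tally(raid_groups, spares, root_drives=0, nodes=2):
    """Sum drive roles for identical nodes.

    raid_groups is a list of (data, parity) tuples for one node.
    """
    per_node = {
        "data": sum(data for data, _ in raid_groups),
        "parity": sum(parity for _, parity in raid_groups),
        "spare": spares,
        "root": root_drives,
    }
    return {role: count * nodes for role, count in per_node.items()}


# 3 shelves with ADP: four data RAID groups, one ADP and one non-ADP spare.
with_adp = tally([(6, 2), (9, 2), (10, 2), (10, 2)], spares=2)

# 3 shelves without ADP: root aggregate (1 root + 2 parity), two data RGs.
without_adp = tally([(0, 2), (19, 2), (18, 2)], spares=1, root_drives=1)

print(with_adp)     # {'data': 70, 'parity': 16, 'spare': 4, 'root': 0}
print(without_adp)  # {'data': 74, 'parity': 12, 'spare': 2, 'root': 2}
```

Both layouts add up to the same 90 drives; the non-ADP one simply spends fewer of them on parity and spares.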

So, a couple of conclusions:

  1. ADP is good for internal shelf only systems
  2. ADP is neutral for 1 or 2 shelf systems
  3. ADP is bad for 3+ shelf systems
  4. ADP is awesome for Flashpools (not really a conclusion from this post, but trust me on it? 😉)

 

As a footnote: savvy readers will notice I’ve got unequally sized RAID groups in some of these configs. With ONTAP 8.3, the Physical Storage Management Guide (page 107) now says:

All RAID groups in an aggregate should have a similar number of disks. The RAID groups do not have to be exactly the same size, but you should avoid having any RAID group that is less than one half the size of other RAID groups in the same aggregate when possible.

This is in comparison to ONTAP 8.2 Physical Storage Management Guide (page 91) which says:

All RAID groups in an aggregate should have the same number of disks. If this is impossible, any RAID group with fewer disks should have only one less disk than the largest RAID group.
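
As a rough illustration of how much the guidance loosened, here is a Python sketch of the two rules as I read them. The function names are mine, and treating "less than one half the size" as a simple smallest-versus-largest check is my interpretation rather than NetApp's wording.

```python
def ok_under_8_2(sizes):
    """8.2 guidance: any smaller RAID group is at most one disk short."""
    return max(sizes) - min(sizes) <= 1


def ok_under_8_3(sizes):
    """8.3 guidance (my reading): no RAID group under half the largest."""
    return min(sizes) * 2 >= max(sizes)


# RAID group sizes (data + parity) from the 3 shelf ADP layout in this post.
adp_three_shelves = [8, 11, 12, 12]

print(ok_under_8_2(adp_three_shelves))  # False
print(ok_under_8_3(adp_three_shelves))  # True
```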

 

 

Out-of-band Management ports on NetApp – e0M vs SP vs Serial (and BMC!)

One of the things I’ve seen customers new to NetApp (and sometimes existing ones..) be most confused about is the various ways of connecting to the system for management.

Over the years, there have been a couple of different out of band management systems (RLM and BMC are the older systems, SP on the newer ones). This post focuses on systems with Service Processor, or SP, as used in the FAS2200, FAS2500, FAS3100, FAS3200, FAS6100, FAS6200 and FAS8000 families. Let’s start by going through the physical ports on the back of the controller. Where the ports are varies slightly by model, but the icons are consistent.

[Figure: NetApp management ports on the back of the controller]

A common question is “ok, so the wrench port is e0M, why doesn’t it just say that?”. The short answer is that it isn’t – although you could be forgiven for making that guess. Even NetApp’s label set for Clustered ONTAP includes an e0M cable label, despite their systems not having a specific port labelled e0M. Let’s look at how the ports connect up, from the point of view of an administrator:

[Figure: simplified block diagram of the NetApp management connections]

 

From this simplified block diagram, you can see how they all relate. The port on the outside of the box actually connects to a switch inside the box, and that has both ONTAP’s e0M and the Service Processor’s IP interface connected to it. It’s almost literally running Ethernet on the motherboard traces (it’s actually something called RMII, not normal 802.3, but close enough). The internal switch is unmanaged, which is why you can’t do VLANs over that port. To clarify some more – the service processor is an independent CPU, with its own RAM, flash and OS running on it. It talks to ONTAP very closely, obviously, and to sensors throughout the system, but it’s separate to the main kernel running on the x86 CPU that runs ONTAP.

On 7-mode systems, e0M is just another interface in ONTAP, but in Clustered ONTAP, it can only be used for management LIFs, not data LIFs (or Cluster LIFs). On the FAS2500 and FAS8000, the wrench port, and therefore e0M, are finally 1G, but on previous systems, it’s only 100M. On 7-mode systems, you have to be careful – you don’t want it on the same subnet as any of your data service IPs, or traffic might go out through it, instead of a 1G or 10G port. To stop this, set “options interface.blocked.mgmt_data_traffic on” for all systems (running ONTAP 8.0.2 or higher), but ideally put it on a different subnet. It’s best practice to have, at the very least, a different OOB subnet to data services.
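
If you want to sanity check an address plan against that advice before you build, a few lines of Python with the standard ipaddress module will do it. This is only a sketch: the function name is mine and the addresses are documentation-range placeholders, not from any real system.

```python
import ipaddress


def shares_subnet(mgmt, data_interfaces):
    """Return the data addresses whose subnet overlaps the e0M subnet.

    All arguments are "address/prefixlen" strings.
    """
    mgmt_net = ipaddress.ip_interface(mgmt).network
    return [
        data
        for data in data_interfaces
        if ipaddress.ip_interface(data).network.overlaps(mgmt_net)
    ]


# Hypothetical plan: e0M accidentally placed on one of the data subnets.
data_ips = ["192.0.2.50/24", "198.51.100.5/24"]
clashes = shares_subnet("192.0.2.10/24", data_ips)
clean = shares_subnet("203.0.113.10/24", data_ips)

print(clashes)  # ['192.0.2.50/24']
print(clean)    # []
```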

From our diagram again, if you need to do something like monitor boot/shutdown/reboot during an ONTAP upgrade, you can either connect to the Serial Console or the SP IP – the output is the same. I’ve done lots of remote upgrades this way. Once the system is up, and the SP is configured, there’s almost never a need to use the Serial Console again. The SPs don’t talk to each other, so if one node is online and the other is offline, you can’t use the online node to connect to the offline one.

If you’re the type who likes managing your 7-mode NetApp from the command line, you would normally SSH into the e0M IP address, while for Clustered ONTAP, you would normally SSH to the Cluster Management IP. You could go from the SP to the system console, but that will be limited to 9600 bps output through the serial connection, and if you’re looking at a lot of text, or pasting a lot of text, that can be limiting. For using GUI applications like OnCommand System Manager, you connect to the e0M IP on 7-mode, and the Cluster Management IP on Clustered ONTAP systems.

A final question I’ve heard is “what is that USB port for?”. Officially, for regular users, it’s unsupported. Unofficially, you can use it to charge your iPhone while it’s running in hotspot mode, or to power your Airconsole.

Could this all be made simpler? Well, there are good purposes for all of the different IPs and interfaces you might use, so I’m not 100% convinced it could be. Everything new is complex initially, but once you get a handle on it, it all makes sense. Hope this has helped you!

Edit: 2018-07-11

Since writing this article, we’ve released some new platforms, which enable the USB port while at the boot menu, have faster serial ports, and move from an SP to a BMC. They’re pretty similar, except the BMC doesn’t tap into a serial link to the ONTAP controller – it relays it over an internal network, and it doesn’t share the wrench port with e0M anymore.

Autosupport stopped working

I installed a Clustered ONTAP System about 4 months ago, and I’ve been working with the customer since then on migration of their several hundred workloads onto the system in a staged approach. While doing a regular check-in, I noticed that Autosupport had stopped working on three of their 4 nodes, despite working when I finished the initial build.

Some checks of logs and within the organization showed that it had stopped working at the same time that they had changed their mail server IPs. Easy, you think. Maybe I put the IPs into the autosupport setup? Checked that, nope, it goes to the hostname. Well, maybe I put in an /etc/hosts entry? (system services hosts show) – nope, wasn’t that. Checked autosupport’s destinations were configured the same on all four nodes – and they were. Maybe there’s a firewall issue? Ping from the node management LIF to the SMTP servers all works. Maybe it’s a specific SMTP firewall block? Used debug mode systemshell and tcp_client (note: don’t try this at home..) – that all worked. I got their firewall and Exchange admins to check logs for the node management LIFs trying to make connections, and there were no attempts other than my tcp_client ones. Ran pktt on all interfaces with target IPs of the mailhosts, and found no attempts to send out from e0M (home of the node-mgmt LIF), only from one of the data LIFs. NetApp KB 3012724 talks about LIFs, and has this to say on the topic:

Clustered Data ONTAP 8.2.x:

  • AutoSupport is delivered from the node-mgmt LIF per node.

Looking through the autosupport history, the attempts fail, and the last error recorded is “FTP: weird server reply”. Uhh.. transport can be either http, https or smtp. Why is it mentioning FTP?

NetApp KB 201727 shows how to access debug logs for autosupport. I did that and saw the error message “421 Service not available”. Remember the FTP error? Well, that, dear readers, is because your NetApp, at its heart, is a big FreeBSD box, and it uses curl to send autosupport emails. And when curl gets a “421 Service not available” response from the mail server, “FTP: weird server reply” is the error it reports.

Looking at the pktt logs closer, it’s because the autosupport email is going out of one of the data LIFs for an SVM on the host. Why would you suddenly decide to do that?! Well, let’s look at KB 3012724 again..

By default, routes for the node mgmt LIF have a lower (more preferred) metric than routes of data LIFs. However, the metric is used as a tie-breaker. The more-specific route to the destination will always be picked regardless of the metric.

..

Case 3 – The node-mgmt LIF and data LIFs on different subnets, destination is on the same subnet as the data LIFs. The implicit subnet route of the data LIFs (which isn’t seen in ngsh) will be the most specific route to the destination, and will therefore be the selected route. A data LIF will be used.

So, despite the earlier assurance that autosupport uses the node-mgmt LIF, the actual story is somewhat more complicated. It uses the node-mgmt LIF, unless it likes another one better. As for why only one of the 4 nodes worked? Well that node didn’t have any SVM LIFs on the same subnet as the mail servers, so it didn’t try using them to send the ASUP email.

So what do you do? You can either create individual host routes (/32) in the routing group for the node admin SVMs, or create a subnet route in there to prevent it occurring if IPs change again. I also found (as did, it seems, another poster on NetApp Communities) that just setting the metric lower didn’t solve the problem; you had to set the metric to “1”.
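
The route selection rule from the KB (most specific route wins, metric only breaks ties) is easy to sketch in Python. This is a toy model rather than ONTAP’s actual routing code: the function name, addresses and metric values are all placeholders I made up, and it doesn’t attempt to reproduce the metric “1” quirk mentioned above.

```python
import ipaddress


def pick_route(destination, routes):
    """Longest-prefix match; the lower metric wins only between equal prefixes."""
    dest = ipaddress.ip_address(destination)
    matching = [r for r in routes if dest in ipaddress.ip_network(r["prefix"])]
    return max(
        matching,
        key=lambda r: (ipaddress.ip_network(r["prefix"]).prefixlen, -r["metric"]),
    )


# Hypothetical routing view of one node (Case 3 from the KB).
routes = [
    {"prefix": "0.0.0.0/0", "metric": 10, "lif": "node-mgmt"},  # default route
    {"prefix": "192.0.2.0/24", "metric": 20, "lif": "data"},    # implicit subnet route
]
mail_server = "192.0.2.25"

before = pick_route(mail_server, routes)["lif"]

# The fix: a host route for the mail server via the node-mgmt LIF.
routes.append({"prefix": "192.0.2.25/32", "metric": 10, "lif": "node-mgmt"})
after = pick_route(mail_server, routes)["lif"]

print(before, after)  # data node-mgmt
```

The data LIF wins at first despite its worse metric, because its /24 is more specific than the default route; the /32 host route is more specific again, so the node-mgmt LIF takes over.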

Going forward, part of my system installation will always include a route for the mail server in the routing group that the node management SVM uses.

Go away and I will replace you with a very small shell script..

I recently did a migration of SAN to NAS for a client, and had to unmount all of their datastores.

This little one liner for the ESXi shell lists all SAN volumes, then gets rid of them..

# esxcfg-scsidevs -m | sed -e 's/\:1 //g' | awk '{ printf("esxcli storage filesystem unmount -l %s;\nsleep 1;\nesxcli storage core device set --state=off -d %s;\n",$4,$1);}' > /tmp/unmountluns.sh

Review the output for sanity, and run.

Setting clock from CLI is not allowed in this VDC

If you’re trying to set the time on a brand new out of box Cisco Nexus 5500 and you get the message “Setting clock from CLI is not allowed in this VDC.”, it’s because the clock protocol is set to ntp, even though you didn’t configure NTP. Go into config and type “clock protocol none”, and then it will let you set the time.

Then, when you’ve finished the config, set up NTP!

And while you’re at it, this page from Cisco is awesome for troubleshooting vPC.

Sometimes you can’t get to here from there.

Sometimes things look impossible. Like screwing in a screw with a handle directly above it (seriously, if I ever meet the person who designed this..)

If I ever meet the person responsible for this, I'm punching them in the face

But there’s always a way around things. In this case, I used my fingers to tighten it into place.

Or this screw, which was cross threaded and wouldn’t come out. It didn’t stand up to a pair of channel locks. Sometimes you have to take the hard way.

And sometimes? What you don’t know is a blessing. I don’t have any photos of this unfortunately. But let’s say there was a two storey building, and on the second storey, was a server room, with two 50U racks inside it. You would think – ok, I need to add another one, these two obviously got in here. You enlist some burly gentlemen to help you move the rack up the stairs, and find problem 1 – it doesn’t fit through the back stairs. They take it down the back stairs, and up the front. Problem 2 – it doesn’t fit through the front door standing up, so you lay it down and move it into the corridor in front of the server room. Problem 3 – there are fire sprinklers in the middle of the ceiling, so you can’t stand it up. I scratched my head for a while, and then started removing bits of drop ceiling, until I found a section big enough to get it standing up, without any sprinkler pipes under it. I stood it up, and then went to move it into the server room. Same issue – but found another part of the drop ceiling without pipes to angle it down again to fit through the lower server room door. And it’s done and in place. I went back to the company’s office and asked them how they got the original racks in there?

The older and wiser sysadmin, who I hadn’t been working with on this project, answered sagely: “we built the room around them”. Sometimes not knowing is the solution to your problem. I doubt I would have even tried if I knew that…