Digging around in virtual networks is a particular hobby of mine, especially when it comes to creating opportunistic traffic flows to specific networks. Recently, I came upon a scenario in which an environment had a mixture of 1 GbE and 10 GbE on their servers. The 1 GbE links were used for management traffic, while the 10 GbE links were used for data and virtual machine traffic flows. Because the management IP address is the de facto target for external connections, especially when the host is addressed by its hostname, there are instances where data flows across 1 GbE management links that would be better served across 10 GbE links.
Normally, I’d just use the routing table to control traffic flows. This can be done using TCP/IP stacks with multiple gateway addresses (routing use cases) or by adding a VMkernel port on the subnet where traffic needs to flow (switching use cases). In this situation, however, the use case was Network Block Device (NBD) transport mode to a Rubrik cluster for traffic flows initiated by the vSphere API for Data Protection (VADP) workflows. The external device (a Rubrik node) requests data from the ESXi host, meaning that the external device initiates the connection. Because this is a stateful TCP connection, the hypervisor can’t just pick any IP as its source address after the connection is made. I’d imagine that would confuse the socket.
I wondered if adding a VMkernel port on the same subnet as the NBD target would let me alter the routing table while keeping a stateful session alive. The hypervisor would have to use the selection logic presented by the routing table to choose its egress interface, while still using the destination IP address from the TCP handshake as its own source address for return traffic.
Now that I had enough information to set up a test, I went into the lab. Armed with Wireshark, I set up a topology in which an ESXi host was added to vCenter using its Fully Qualified Domain Name (FQDN), which resolves to the management IP assigned to vmk0. This is a very typical setup for most every environment I’ve run across.
Next, I added an additional VMkernel interface – vmk4 – and assigned it the same VLAN ID and an IP on the subnet used by my NBD endpoint. In this scenario, it’s the subnet used by a cluster of Rubrik nodes.
The result is an updated routing table: traffic destined for 172.17.28.0/22 now has a Local Subnet connection via vmk4. In theory, any traffic with a destination IP address inside 172.17.28.0/22 should use vmk4, which is assigned to the port group named VLAN28-RUBRIK_DATA. The teaming policy on this port group has the 10 GbE links as Active, and the 1 GbE links as Unused.
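To make the selection logic concrete, here is a minimal Python sketch of a longest-prefix-match lookup, which is how I'd expect a routing table like this one to be evaluated. The 172.17.28.0/22 entry comes from the lab above; the default route and the /22 mask on the management subnet are assumptions I added for illustration, and `egress_interface` is a made-up helper, not anything ESXi exposes.

```python
import ipaddress

# Hypothetical routing table modeled on the lab. Only the 172.17.28.0/22
# entry is taken from the post; the default route and the management
# subnet's prefix length are assumptions for illustration.
ROUTES = [
    ("0.0.0.0/0", "vmk0"),        # assumed default route via management
    ("172.17.4.0/22", "vmk0"),    # assumed management subnet
    ("172.17.28.0/22", "vmk4"),   # Local Subnet route added by vmk4
]


def egress_interface(destination, routes=ROUTES):
    """Return the interface of the most specific route matching destination."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, interface in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network.prefixlen, interface))
    # Longest prefix wins: the route with the largest prefix length.
    return max(matches)[1]


if __name__ == "__main__":
    for ip in ("172.17.28.12", "172.17.28.14", "172.17.4.50", "8.8.8.8"):
        print(ip, "->", egress_interface(ip))
```

Both Rubrik node addresses used in the tests fall inside 172.17.28.0/22, so the sketch picks vmk4 for them and falls back to vmk0 for everything else.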
Finally, I created a Port Mirror to span traffic from my ESXi host’s VMkernel ports over to a Windows Server running Wireshark with a dedicated virtual network adapter just for packet captures. The goal was to sniff ingress and egress packets that were traversing vmk0 and vmk4.
The results are interesting. The handshake process begins with a Rubrik node (172.17.28.12) reaching out to the ESXi management IP on vmk0 (172.17.4.102) to establish the TCP session. According to the VDDK documentation and the capture below, this is done using port 902. The first three packets – SYN, SYN/ACK, and ACK – are the handshake.
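If you're reading a capture like this by raw flag values rather than Wireshark's decoded column, the three handshake packets can be told apart by two bits in the TCP flags field. This is a small sketch of that check; the bit values are the standard TCP ones, while `handshake_step` is a name I made up for the example.

```python
# Standard TCP flag bits (from the TCP header definition).
SYN = 0x02
ACK = 0x10


def handshake_step(flags):
    """Label a packet by its SYN/ACK bits, ignoring any other flags."""
    syn = bool(flags & SYN)
    ack = bool(flags & ACK)
    if syn and not ack:
        return "SYN"        # client opens the connection
    if syn and ack:
        return "SYN/ACK"    # server acknowledges and opens its side
    if ack:
        return "ACK"        # client acknowledges; session established
    return "other"


if __name__ == "__main__":
    for flags in (0x02, 0x12, 0x10):
        print(hex(flags), handshake_step(flags))
```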
The ESXi host uses its routing table to determine which VMkernel port should respond to the requests coming from a Rubrik node. The rules of the port group are followed when determining which physical NIC to use for return flow traffic – in this case, the 10 GbE links are the only ones set to Active; the 1 GbE links remain Unused.
Even though vmk0 receives requests, it’s vmk4 that transmits responses with the backup data. Below is a different TCP session where I’ve limited the Port Mirror to only traffic traversing over vmk4. The packets are going to a different Rubrik node in the cluster (172.17.28.14) in this test, but the remaining network topology is unchanged. Notice how all of the packets are being forwarded from the ESXi host to the Rubrik node. This is because the reverse flow is still going to vmk0, which is not included in this capture.
The TCP session remains intact because the source IP address of the responses is still the management address on vmk0, which is where the session was established. This remains true even though the traffic is physically leaving through vmk4. This can be further expressed using esxtop. Notice how vmk0, highlighted in blue, is receiving several thousand packets per second (PKTRX/s) while vmk4 is transmitting (PKTTX/s) roughly the same amount per sample period.
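Here is how I picture that behavior, as a toy Python model rather than anything resembling the actual ESXi network stack. The session's local address is pinned to whatever the SYN was sent to, while the egress interface is chosen separately by looking up the remote address in the routing table. The ephemeral port 51000, the default route, and the management prefix length are assumptions, and all of the function names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical routing table; only 172.17.28.0/22 via vmk4 is from the lab.
ROUTES = {
    "0.0.0.0/0": "vmk0",        # assumed default route
    "172.17.4.0/22": "vmk0",    # assumed management subnet mask
    "172.17.28.0/22": "vmk4",
}


@dataclass(frozen=True)
class Session:
    local_ip: str
    local_port: int
    remote_ip: str
    remote_port: int


def accept(syn_src_ip, syn_src_port, syn_dst_ip, syn_dst_port):
    """Pin the session's local endpoint to the address the SYN targeted."""
    return Session(syn_dst_ip, syn_dst_port, syn_src_ip, syn_src_port)


def egress_interface(destination):
    """Longest-prefix match of the destination against ROUTES."""
    dest = ipaddress.ip_address(destination)
    networks = [(ipaddress.ip_network(p), i) for p, i in ROUTES.items()]
    matches = [(n.prefixlen, i) for n, i in networks if dest in n]
    return max(matches)[1]


def build_reply(session):
    """Source address comes from the session; interface comes from routing."""
    return {
        "src_ip": session.local_ip,
        "src_port": session.local_port,
        "dst_ip": session.remote_ip,
        "dst_port": session.remote_port,
        "egress": egress_interface(session.remote_ip),
    }


if __name__ == "__main__":
    # Rubrik node connects to the management IP on port 902.
    session = accept("172.17.28.12", 51000, "172.17.4.102", 902)
    print(build_reply(session))
```

The reply keeps 172.17.4.102 as its source, so the Rubrik node still sees the same 4-tuple, but the route lookup on 172.17.28.12 sends it out of vmk4. That matches what the capture and esxtop show: receive on vmk0, transmit on vmk4.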
For those with mixed NIC speeds or types, such as 1 GbE alongside 10 GbE, I can see value in being able to manipulate where data traffic is forwarded. Additionally, if certain networks were bound by a series of choke points – firewalls, weaker routing devices, or just a general distaste for routing elephant flows – there’s some value in being able to select a more opportunistic path / link.
All in all, it was a fun exercise to walk through packet flows and see what’s going on within the packets going back and forth. This logic should be valid for any product using VADP on a different subnet from the ESXi management interface (commonly vmk0), so I’m sharing it publicly. I used Rubrik in my example because it’s also in my lab and I think it’s cool technology. 🙂
If I managed to goof on any of the findings reported here, please let me know!