I didn’t get much of a response to the Pavlov’s dogs challenge: not even a single attempt at an explanation. Maybe it was too weird a problem (or, more embarrassing, it didn’t interest anyone!).
A quick reminder about the challenge: netstat shows almost no activity on an otherwise loaded interconnect (500 Mb/s to 1 Gb/s inbound, and similar values outbound, as seen on the Infiniband switches and as calculated from the product of the PX remote messages recv’d/sent statistics and parallel_execution_message_size).
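For the record, the back-of-the-envelope calculation mentioned above can be sketched like this. The statistic names are the real v$sysstat counters, but the sample deltas and the 60-second window are made-up numbers for illustration only:

```python
def px_throughput_mb_s(messages, message_size_bytes, interval_s):
    """Approximate throughput: PX messages x parallel_execution_message_size,
    averaged over the sampling interval, expressed in MB/s."""
    return messages * message_size_bytes / interval_s / (1024 * 1024)

# Hypothetical deltas measured over a 60-second window:
recvd = 2_500_000   # delta of "PX remote messages recv'd"
sent  = 2_400_000   # delta of "PX remote messages sent"
size  = 16384       # parallel_execution_message_size (16 KB in this sketch)

print(f"inbound : {px_throughput_mb_s(recvd, size, 60):.0f} MB/s")
print(f"outbound: {px_throughput_mb_s(sent, size, 60):.0f} MB/s")
```

With these invented numbers you land in the same hundreds-of-MB/s ballpark as the switch counters, which is exactly the traffic netstat failed to show.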
Well, anyway, here is what I think is the answer: the key piece of information I gave was that the clusterware was running RDS over Infiniband. Infiniband HCAs have an inherent advantage over standard Ethernet network interfaces: they support RDMA, which means that transfers are handled without interrupting the CPUs. The sending node reads and writes directly into the receiving node’s user-space memory, without going through the usual I/O path. TCP/IP NICs, by contrast, generate a stream of interrupts the CPUs have to service, because TCP segments have to be reassembled by the kernel while other threads are running.
The most likely cause of the netstat blindness is simply that netstat reports what the kernel’s TCP/IP stack sees, and with RDMA the packets bypass that stack entirely: the CPUs are unaware of them.
To quote the Wikipedia Pavlov’s dogs article, “the phrase ‘Pavlov’s dog’ is often used to describe someone who merely reacts to a situation rather than use critical thinking”. That’s exactly what I thought of myself when I found I was blaming the setup instead of thinking twice about the “obvious” way of measuring network packet throughput.