Schmorp's POD Blog a.k.a. THE RANT
a.k.a. the blog that cannot decide on a name

This document was published 2015-07-02 21:55:39, and since then has not been materially modified.

Tidbits - a Linux NFS client bug

Just had a weird new NFS client bug (with Linux 4.0.1 - I've been suffering from various NFS bugs since about 2.6.18, so debugging NFS issues is kind of routine).

I don't know what triggered it, but my girlfriend complained that her vim had frozen, and she couldn't explain why (and neither could I). As it turned out, it was stuck in fsync, and since she was editing on an NFS volume, I quickly guessed it was an NFS problem, as vim was probably trying to write its swapfile.

tcpdump quickly showed the problem - the NFS client tried to connect to the server every three seconds, but was rejected:

23:28:06.188527 IP > Flags [S], seq 8126
23:28:06.188576 IP > Flags [S], seq 9041
23:28:06.189458 IP > Flags [S.], seq 3788, ack 8127
23:28:06.189477 IP > Flags [R], seq 8127, win 0
23:28:06.189926 IP > Flags [R.], seq 3508, ack 916, win 0

After checking the firewall, conntrack etc., I looked more closely at the actual tcpdump output and realised what the problem was: the client sent out two SYNs in very quick succession (less than 50µs apart), but the SYNs were actually for two different connections!

The server replied to the first, and probably dropped the second. The NFS client then rejected the SYN ACK from the server with a RST, presumably because it had already forgotten the first connection request.
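
The gap between the two SYNs, and which of them the server answered, can be read straight off the capture. Here is a minimal Python sketch that does the arithmetic on the five lines shown above. The regular expression and the helper names (parse, syn_gap_us, answered_syn_seq) are my own and only fit this trimmed output, not tcpdump output in general:

```python
import re
from datetime import datetime, timedelta

# The five capture lines from above, copied as-is (addresses are missing there too).
CAPTURE = """\
23:28:06.188527 IP > Flags [S], seq 8126
23:28:06.188576 IP > Flags [S], seq 9041
23:28:06.189458 IP > Flags [S.], seq 3788, ack 8127
23:28:06.189477 IP > Flags [R], seq 8127, win 0
23:28:06.189926 IP > Flags [R.], seq 3508, ack 916, win 0
"""

# Hypothetical pattern: timestamp, TCP flags, seq and an optional ack.
LINE_RE = re.compile(
    r"^(\d\d:\d\d:\d\d\.\d{6}) .*Flags \[([^\]]+)\], seq (\d+)(?:, ack (\d+))?"
)


def parse(text):
    """Turn the capture text into a list of dicts, one per packet."""
    packets = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        packets.append({
            "time": datetime.strptime(m.group(1), "%H:%M:%S.%f"),
            "flags": m.group(2),
            "seq": int(m.group(3)),
            "ack": int(m.group(4)) if m.group(4) else None,
        })
    return packets


def syn_gap_us(packets):
    """Microseconds between the first two plain SYNs (flags exactly 'S')."""
    syns = [p for p in packets if p["flags"] == "S"]
    return (syns[1]["time"] - syns[0]["time"]) // timedelta(microseconds=1)


def answered_syn_seq(packets):
    """Sequence number of the SYN that the SYN ACK acknowledges (ack - 1)."""
    syn_ack = next(p for p in packets if p["flags"] == "S.")
    return syn_ack["ack"] - 1


if __name__ == "__main__":
    pkts = parse(CAPTURE)
    print("gap between SYNs (µs):", syn_gap_us(pkts))
    print("SYN answered by the server: seq", answered_syn_seq(pkts))
```

The SYN ACK acknowledges 8127, i.e. the first SYN (seq 8126), and the client's RST right after it carries seq 8127, so the reset comes from that same first connection.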

At this point I was already considering a reboot, as my experience with Linux NFS problems is that you can't solve them any other way: you usually can't umount or remount the mountpoint while it is stuck.


Fortunately, it turned out to be fixable, at least temporarily:

mount -noremount,udp /path/to/mountpoint

This took a few seconds, then returned, together with all the other commands that were previously stuck (vim, sync, ls...).

I had expected it to switch the client to UDP, but it stayed with TCP, so it seems the remount alone sufficed.
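
One way to check which transport a mount ended up with is to look at the proto= option in the mount table. The following is a rough Python sketch, assuming the kernel lists NFS mounts in /proc/mounts with an option such as proto=tcp or proto=udp. The function name nfs_transport and the sample lines (including server:/export) are made up for illustration and are not taken from my machine:

```python
def nfs_transport(mounts_text, mountpoint):
    """Return the value of the proto= option for an NFS mount, or None.

    mounts_text is expected to look like /proc/mounts:
    device, mountpoint, fstype, comma-separated options, dump, pass.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, path, fstype, options = fields[:4]
        # Matches nfs and nfs4; other filesystems are ignored.
        if path != mountpoint or not fstype.startswith("nfs"):
            continue
        for opt in options.split(","):
            key, _, value = opt.partition("=")
            if key == "proto":
                return value
    return None


# Made-up sample in /proc/mounts format, for illustration only.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
server:/export /path/to/mountpoint nfs rw,relatime,vers=3,proto=tcp,timeo=600 0 0
"""

if __name__ == "__main__":
    print(nfs_transport(SAMPLE, "/path/to/mountpoint"))
```

On a real system the first argument would be the contents of /proc/mounts instead of SAMPLE.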

I've never seen this failure mode before, so maybe it was a one-time glitch. Let's wait and see...