...so I can get to sleep before 5am.
Sorry for the provocative title – but the pain is real.
We’ve all been there, I’m sure. That project that was so important but, for some reason, the time put aside for testing got consumed by the implementation work – so now it goes live on Monday, and you realise it’s Sunday and nothing works.
I ran into a problem recently where a customer was working to load balance NFS for their scale-out NAS product. They said it mounted fine, but “showmount -e” just wouldn’t work properly from their Linux client…
It was really slow to respond, and they'd been trying to fix it for a while. Now time was running out. I took a look at what they'd already configured on the load balancer, and I could see that they'd set it up for TCP-based NFS connections on ports 111, 2049 and 10000 – which all looked fine on initial inspection. This was configured using HAProxy, which works as a reverse proxy and only supports TCP (QUIC aside).
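Their setup looked roughly like this – a minimal sketch, not their actual config; the section names, backend addresses and timeouts are all my own placeholders:

```
# haproxy.cfg sketch: plain TCP passthrough for the three configured ports.
# Addresses and names are placeholders, not the customer's real values.
defaults
    mode tcp
    timeout connect 5s
    timeout client  1h
    timeout server  1h

listen nfs_portmapper
    bind *:111
    server nas1 192.0.2.11:111 check
    server nas2 192.0.2.12:111 check

listen nfs_nfsd
    bind *:2049
    server nas1 192.0.2.11:2049 check
    server nas2 192.0.2.12:2049 check

listen nfs_extra
    bind *:10000
    server nas1 192.0.2.11:10000 check
    server nas2 192.0.2.12:10000 check
```

TCP-only forwarding like this is exactly where the trouble described below starts.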
Anyway, I’ve been in the industry for a long time, so I have a good knowledge of common protocols and how they work, including some of the older ones... Many times I’ve found myself explaining to a new engineer or a customer how some horrid old protocol works (think FTP – yuck!). Once again, this experience managed to save me...
As the conversation progressed I asked many questions, trying to limit the avenues we needed to explore. I asked how they were mounting the export and whether they could access files – mounting worked fine, and files were accessible. It was only the showmount command that actually failed to work properly...
It was then that the penny really dropped – they were using NFSv3, which optionally uses UDP ports too! After considering this for a moment, I decided to switch to an LVS-based method of load balancing, which supports UDP. With the extra ports included, everything sprang into life nice and fast!
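The LVS side can be driven via keepalived; here's a rough sketch of the UDP half of one service – the VIP, real-server addresses and scheduler are assumptions, and in practice you'd repeat a TCP/UDP pair of blocks per port:

```
# keepalived.conf sketch: an LVS virtual server forwarding UDP for the
# portmapper (111). Addresses are placeholders; a matching TCP block and
# equivalent blocks for the other NFS ports would sit alongside it.
virtual_server 203.0.113.10 111 {
    protocol UDP
    lb_algo rr
    lb_kind NAT
    real_server 192.0.2.11 111 { weight 1 }
    real_server 192.0.2.12 111 { weight 1 }
}
```

The key difference from the HAProxy setup is simply that LVS works at layer 4 and can forward UDP datagrams as well as TCP streams.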
Now, I understand that UDP is meant to be optional for NFSv3, but for me showmount only worked properly when UDP was also available… Maybe it was an older implementation or something (if anyone has any ideas, please share them). It did seem odd to me at the time. However, I’m not one to challenge a working solution late at night when tired, and showmount worked great once both TCP and UDP were forwarding traffic. So I took that as a win – and went back to bed!
However, the following day, while enjoying a well-earned coffee, this raised another question – why do I have a customer deploying a new scale-out NAS product from a major manufacturer in 2021 that is still stuck on NFSv3? The Request for Comments (RFC) for v3 dates from 1995, and newer revisions have long been around.
So what is actually wrong with NFSv3 anyway?
- stateless - leading to performance and lock-management issues
- multiple separate ports and protocols - harder to reverse proxy, load balance or allow through firewalls
- no locking support without the additional Network Lock Manager (NLM) protocol
- supports only POSIX Access Control Lists (ACLs)
- single operation per Remote Procedure Call (RPC)
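To make the firewall point concrete, here's a sketch of the openings NFSv3 typically needs. Note the assumptions: mountd, statd and lockd normally get dynamic ports from the portmapper, so the 20048/32765/32803 values below only hold if you've pinned them in the server's NFS configuration:

```
# nftables sketch: rules for a typical NFSv3 server. 111 (portmapper) and
# 2049 (nfsd) are fixed; 20048 (mountd), 32765 (statd) and 32803 (lockd)
# are assumed pinned values -- by default they float.
table inet filter {
    chain input {
        type filter hook input priority 0;
        tcp dport { 111, 2049, 20048, 32765, 32803 } accept
        udp dport { 111, 2049, 20048, 32765, 32803 } accept
    }
}
```

Ten port/protocol pairs just to serve files – and every one has to be replicated on any load balancer in the path.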
And what’s so great about NFSv4?
- properly stateful - maintains OPEN and CLOSE states
- single port - firewall, proxying, and load balancing support is easier due to a fixed port of 2049
- pseudo filesystem - only shows what users should actually see
- delegation - client and server read/write delegation support (mostly a performance improvement)
- integrated locking support - integrated solution without additional layers such as NLM
- improved integrated ACLs - richer ACL support (like Windows NTFS!) compared with NFSv3’s POSIX ACLs
- NFS referrals - better support for file systems over many nodes
- parallel NFS support! Better scale-out behaviour, a huge upgrade in performance at scale, and a great improvement to NFS as a whole.
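By contrast with the NFSv3 firewall sprawl, an NFSv4 client needs exactly one port and one protocol; a sketch of a mount entry (the server name and export path are placeholders):

```
# /etc/fstab sketch: NFSv4.1 over a single TCP port (2049). Specifying
# vers 4.1 or later also enables pNFS where the server supports it.
# Server name and paths are placeholders.
vip.example.com:/export  /mnt/nas  nfs4  nfsvers=4.1,proto=tcp,_netdev  0  0
```

No portmapper, no mountd, no NLM – just one TCP connection to 2049.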
Why do I care?
For me, the two things that matter most are the simplified port and protocol use – this makes it much easier to load balance! Because it’s a single port and a single protocol, I can use any method available, and it’s much easier to configure. Next is parallel NFS (pNFS), which allows the client to establish multiple connections directly with the backend NFS servers that hold the file – bypassing the load balancer and offering endless scale.
So why do we still see so much NFSv3?
Although some manufacturers do support NFSv4 and higher, it’s often disabled by default for better compatibility – NFSv3 is still the most widely supported version. So check whether you can actually turn it on: if you can, you’ll not only get a better, more modern protocol that’s easier to handle, but also much better performance.
Embrace the future and let NFSv3 die already!
To be fair, I’m sure supporting v4+ in your environment is a lot harder than I give it credit for. We all have old clients kicking around, and it’s not always easy to guarantee that legacy systems will never need to talk to your storage. I’m also sure some vendors have invested a lot in their very advanced implementations of version 3, making v4 less attractive. Still, I look forward to the day we let some of these older protocol implementations fall away – the way they were implemented can be hard to support, and hard to troubleshoot too.