
Recently, Bill Broadley of UC Davis presented the Beowulf mailing list with a survey on NFS and HPC. Bill has collected the results and given permission to post a summary on Cluster Monkey. Many thanks to all 27 respondents, who are seasoned HPC administrators. Note that not all respondents answered all questions.

1) Cluster OS:

  • 72% Red Hat/CentOS/Scientific Linux or derivative
  • 24% Debian/Ubuntu or derivative
  • 4% SUSE or derivative

2) Appliance/NAS or Linux server (respondents could select more than one, so totals exceed 100%):

  • 32% NFS appliance
  • 76% Linux server
  • 12% other (illumos/Solaris)

3) Appliances used (one each, free form answers):

  • Hitachi BlueARC, EMC Isilon, DDN/GPFS, x4540
  • Not sure - something that corporate provided. An F5, maybe...? Also a Panasas system for /scratch.
  • NetApp FAS6xxx
  • netapp
  • isilon x and nl
  • Isilon
  • NetApp
  • Synology

4) Which kernel do you use:

  • 88% one provided with the Linux distribution
  • 12% one that I compile/tweak myself

5) What kernel changes do you make:

  • CPU performance tweaking, network performance.
  • Raise ARP cache size; a newer kernel than the stock 3.2 was needed for newer hardware (3.14 at the moment)
  • ZFS
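
For reference, the ARP cache increase mentioned above is normally a sysctl change rather than a kernel rebuild. A minimal sketch, assuming a large flat cluster network; the threshold values are illustrative, not taken from the survey:

    # /etc/sysctl.d/90-arp-cache.conf -- raise the kernel's neighbor (ARP)
    # cache limits so a busy NFS server can track every node on a large
    # subnet; values are illustrative, scale them to your node count
    net.ipv4.neigh.default.gc_thresh1 = 4096    # below this, no pruning
    net.ipv4.neigh.default.gc_thresh2 = 8192    # soft limit
    net.ipv4.neigh.default.gc_thresh3 = 16384   # hard limit

    # apply without a reboot:
    #   sysctl --system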

6) Do you often see problems like "nfs: server 192.168.5.30 not responding, timed out":

  • 42.3% Never
  • 23.1% Sometimes
  • 19.2% Rarely
  • 7.7% Daily
  • 7.7% Often
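
The message in question is logged by the kernel's NFS client, so checking whether a given node has been affected is a simple log search. A minimal sketch; the persistent log path varies by distribution:

    # count occurrences in the kernel ring buffer
    dmesg | grep -c 'not responding'

    # or search the persistent log (/var/log/messages on RHEL/CentOS,
    # /var/log/syslog on Debian/Ubuntu)
    grep 'nfs: server' /var/log/messages | tail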

7) If you see NFS timeouts what do you do (free form answers):

  • nothing
  • nothing
  • Restart NFSd, look for performance intensive jobs, sometimes increase NFSd.
  • Look at what's going on on that server. That means looking at what the disks are doing and what network flows are going to/from that server, and determining if the load is something to take action on or to let be.
  • Not much
  • Reboot
  • Resolve connectivity issue if any and run mount command on nodes. If this doesn't fix it, then reboot.
  • Ignore them, unless they become a problem.
  • Look for the root cause of the issue; typically the system is suffering network issues or is overloaded by a user's abuse/misuse.
  • diagnose and identify underlying cause
  • Try to figure out who is overloading the NFS server (hard job)
  • Troubleshoot, typically a machine is offline or network saturation
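
Most of the diagnosis steps described above map onto a handful of standard commands. A minimal sketch, assuming a server with the placeholder hostname nfsserver:

    # client side: the mount options actually in effect, per mount
    nfsstat -m

    # client side: per-mount NFS I/O rates, sampled every 5 seconds
    nfsiostat 5

    # from any node: is the server still answering RPCs at all?
    rpcinfo -p nfsserver
    showmount -e nfsserver

    # on the server: nfsd thread statistics (the "th" line); threads that
    # are constantly busy suggest raising the thread count, as one
    # respondent above does
    cat /proc/net/rpc/nfsd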

8) Which NFS options do you use (free form):

  • tcp,async,nodev,nosuid,rsize=32768,wsize=32768,timeout=10
  • nfsvers=3,nolock,hard,intr,timeo=16,retrans=8
  • hard,intr,rsize=32768,wsize=32768
  • all default
  • async
  • async,nodev,nosuid,rsize=32768,wsize=32768
  • tcp,async,nodev,nosuid,timeout=10
  • -rw,intr,nosuid,proto=tcp (mostly. Could be "ro" and/or "suid")
  • rsize=32768,wsize=32768,hard,intr,vers=3,proto=tcp,retrans=2,timeo=600
  • rsize=32768,wsize=32768
  • -nobrowse,intr,rsize=32768,wsize=32768,vers=3
  • udp,hard,timeo=50,retrans=7,intr,bg,rsize=8192,wsize=8192,nfsvers=3,mountvers=3
  • RHEL defaults
  • default ones, they're almost always the best ones
  • rw,nosuid,nodev,tcp,hard,intr,vers=4
  • rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.5.6.7,local_lock=none,addr=10.5.6.1
  • defaults, netdev,vers=3
  • nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2
  • rw,hard,tcp,nfsvers=3,noacl,nolock
  • default rhel6 (+nosuid, nodev, and sometimes nfsver=3)
  • tcp, intr, noauto, timeout, rsize, wsize, auto
  • nfsvers=3,rsize=1024,wsize=1024,cto
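
Taken together, the most common answers above amount to an fstab entry along these lines. A minimal sketch; the server name and export path are placeholders, and every option should be tuned per site:

    # /etc/fstab -- NFSv3 client mount combining the options most
    # respondents reported ("nfsserver" and the paths are placeholders)
    nfsserver:/export/home  /home  nfs  nfsvers=3,tcp,hard,intr,rsize=32768,wsize=32768,timeo=600,retrans=2,nosuid,nodev  0 0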

9) Any explanations:

  • We have not yet made the change to NFSv4; we use nolock due to various application "issues". We do not hard set rsize/wsize, as they have been negotiating better values on their own under v3 for a number of years, and the timeout/retrans are a bit of a legacy set of values from working on this issue of server overload. Hard was a deliberate choice on our end: having things hang definitely seemed better than having things fail and go stale. We still agree with the choice of hard. Intr just helps to "interrupt" stuck things when needed.
  • We like to be able to ctrl-C hung processes. For some systems we use larger rsize/wsize if the vendor supports it.
  • works for me without tweaks
  • We didn't use tcp until the last couple of years.
  • Probably needs a revisit - block size was set up for 2.x series kernels
  • default of centos 7
  • nfsv4 was not stable enough last time out, don't fix rsize/wsize as client/server usually negotiate to 1M anyway
  • We have frequent power outages (5+ times a year), and noauto helps us avoid hanging on mounting NFS shares. The drawback is that you have to mount manually. The timeout option helps with this issue as well.
  • These are adjusted if necessary for particular workloads
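
Several of the explanations above note that rsize/wsize negotiate good values on their own under v3; this is easy to confirm on a running client without remounting:

    # show the options, including the negotiated rsize/wsize, per mount
    nfsstat -m

    # the same information in raw form
    grep ' nfs' /proc/mounts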

10) What parts of the file system do you use NFS for (free form):

  • /home
  • /home
  • /home
  • /home
  • /home
  • /home
  • /home
  • /home and /apps
  • We use NFS for the OS (NFSRoot), App tree, $HOME, Group dedicated space, as well as some of our scratch spaces. All of these come from different NFS servers.
  • /home, /apps
  • /home /opt /etc /usr /boot
  • /home, /apps
  • /home, /apps, /scratch - all of 'em
  • /home, long term project storage, shared software
  • /cluster/home,/cluster/local,/cluster/scratch,/cluster/data
  • home, apps, shared data
  • /usr/local, /home
  • /home, /apps
  • various
  • /home, /group, /usr/local
  • /home, parts of /opt, some specific top level auto-mountable dirs
  • What is called /apps and /home above, for a few medium-sized systems
  • /home, /local, /opt, /diskless
  • /home, /opt, diskless node images
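
On the server side, publishing /home and a shared application tree like those listed above takes one line each in /etc/exports. A minimal sketch; the export paths and the 10.0.0.0/16 cluster network are placeholders:

    # /etc/exports -- server-side exports for the two most common cases
    /export/home  10.0.0.0/16(rw,sync,no_subtree_check)
    /export/apps  10.0.0.0/16(ro,sync,no_subtree_check)

    # re-export after editing:
    #   exportfs -ra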
