disadvantages of linux cluster - admin

Wed Nov 6 09:11:21 EST 2002

On Tue, 5 Nov 2002 alvin at Maggie.Linux-Consulting.com wrote:

> - one person should be able to maintain 100-200 servers ... within an 8hr
>   period .... ( to hit reset if needed and boot it properly )
> 
> 	- system and user data can be 100% automated ...
> 	( just need someone to test it before hand )
> 
> 	- if it's truly automated properly, baby sitting 500 machines or
> 	100 machines won't matter much ...
> 
> - there is the initial hurdle of baby sitting 1 pc at home vs
>   10 pcs in the office and 20 pcs in the office ... after that 100 or more
>   won't matter as it's being automated along the way...
> 	- debian, redhat, suse make distro maintenance a no-brainer if
> 	it's done right and if you trust them  ( i don't trust 'em )
> 	so i have my own scripts that do the same for files i care about
> 
> -- keeping user data in sync between machines in the cluster is a little
>    trickier ... ( use your favorite method ?? -- multi-home, mirroring,
>    copying, nfs'ing )
> 
> c ya
> alvin
> 
> -- think google .. they have 10,000 machines... and at 5,000 servers,
>    they still had 5-10 people maintaining 5,000 servers

The limiting element (we've found) in either a LAN or a cluster is not
software scaling at all.  We have OS installation down to a few minutes
of work and, once a system is installed, tools like yum automate most
maintenance.

It is hardware, humans, and changes (HHC).  Hardware breaks, and the
expected number of failures is proportional to the number of systems.
Humans have problems with this package or application or that printer,
and those problems also scale with the number of systems.  Even if your
OS/software setup is "perfect", you cannot avoid spending minutes to
hours of systems person time every time a system breaks and every time
a human you're supporting needs help.  You also cannot avoid constantly
working on the future -- preparing for the next major revision upgrade,
installing new hardware, building a new tool to save yourself time on
some specific task that isn't yet scalably automated.
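
To put rough numbers on this scaling, here is a minimal Python sketch.
Every rate in it (annual failure rate, hours per failure, help requests
per user, hours per request) is an illustrative assumption, not a
measurement from any real site, and the function names are made up for
the example.  It only shows that both the hardware and the human terms
grow linearly with size, while the chance of at least one failure in a
given week climbs quickly toward certainty.

```python
import math


def expected_failures(n_systems, annual_failure_rate):
    """Expected hardware failures per year; linear in the number of systems."""
    return n_systems * annual_failure_rate


def prob_any_failure(n_systems, annual_failure_rate, days=7):
    """Chance of at least one failure within `days`.

    Assumes independent failures at a constant rate (Poisson model).
    """
    expected = expected_failures(n_systems, annual_failure_rate) * days / 365.0
    return 1.0 - math.exp(-expected)


def admin_hours_per_year(n_systems, n_users,
                         annual_failure_rate=0.05,   # assumed: 5% of boxes fail per year
                         hours_per_failure=3.0,      # assumed admin time per broken box
                         requests_per_user=10.0,     # assumed help requests per user per year
                         hours_per_request=0.5):     # assumed admin time per request
    """Admin time consumed by hardware failures plus user support."""
    hardware = expected_failures(n_systems, annual_failure_rate) * hours_per_failure
    humans = n_users * requests_per_user * hours_per_request
    return hardware + humans


if __name__ == "__main__":
    for n in (10, 100, 500):
        print(n,
              round(prob_any_failure(n, 0.05), 3),
              admin_hours_per_year(n_systems=n, n_users=n // 2))
```

Time spent on the third factor (changes) is deliberately left out; it
would be an additional term that does not shrink no matter how well the
installs are automated.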

These three limits teach us how to minimize administrative expense.

  * Buy high quality hardware with on-site service contracts (expensive
up front but cheap later on) OR be prepared to deal with the higher rate
of failure and the increase in local labor cost.  Note that either
strategy might be cost-benefit optimal, depending on the number of
systems in question, your local human resources, and how well, quickly,
and cheaply your vendor can provide replacement parts.  To achieve the
highest number of systems per admin person, though, you'll definitely
need to go with the high quality hardware option.

  * Shoot your users.  G'wan, admit it, you've thought about it.  They
just clutter up the computing landscape.  Well, OK, so we can't do that
<sigh>.  So user support costs are relatively difficult to control,
especially since it is a well known fact that all the things one might
think of to reduce user administrative costs (providing extensive online
documentation, providing user training sessions, providing individual
and personalized tutorial sessions) are metaphorically equivalent to
pissing into a category 5 hurricane.  

  * Don't upgrade.  Don't change.  Don't customize.  It is a well-known
fact that one could get as much work done with the original slackware or
RH 5.2 -- or even DOS -- as one can today with RH 8.0 (scaled for CPU
speed, of course).  A further advantage of never changing is that
eventually even the dullest of users figures out pretty much everything
that can be done with the snapshot you've stuck with for the last five
years.
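
The hardware trade-off in the first bullet above can be sketched as a
toy cost model.  All prices, failure rates, and hours below are
hypothetical numbers chosen only to illustrate the crossover, and the
names (HardwareOption, total_cost, free_admin_hours_per_year) are
invented for this example.  The one modelling assumption that matters
is that existing staff can absorb a few repair hours a year at no extra
cost, which is why the cheap hardware can win for a small cluster and
lose for a large one.

```python
from dataclasses import dataclass


@dataclass
class HardwareOption:
    unit_price: float               # purchase price per system
    contract_per_system: float      # service contract cost per system
    annual_failure_rate: float      # failures per system per year
    admin_hours_per_failure: float  # local admin time per failure
    parts_cost_per_failure: float   # parts you pay for yourself


def total_cost(option, n_systems, years=3,
               admin_hourly_cost=75.0,
               free_admin_hours_per_year=40.0):
    """Purchase plus repair cost over `years` (all inputs are assumptions).

    Repair hours up to `free_admin_hours_per_year` are treated as slack
    that existing staff can absorb; only hours beyond that are billed.
    """
    purchase = n_systems * (option.unit_price + option.contract_per_system)
    failures_per_year = n_systems * option.annual_failure_rate
    admin_hours = failures_per_year * option.admin_hours_per_failure
    paid_hours = max(0.0, admin_hours - free_admin_hours_per_year)
    labor = paid_hours * admin_hourly_cost * years
    parts = failures_per_year * option.parts_cost_per_failure * years
    return purchase + labor + parts


# Hypothetical options: vendor-serviced premium boxes vs. cheaper boxes
# that fail more often and are repaired locally.
premium = HardwareOption(2200.0, 150.0, 0.05, 1.0, 0.0)
budget = HardwareOption(2000.0, 0.0, 0.20, 8.0, 150.0)

if __name__ == "__main__":
    for n in (16, 400):
        print(n, total_cost(premium, n), total_cost(budget, n))
```

Which option wins depends entirely on the inputs, so substitute your
own prices, failure rates, and labor costs before drawing any
conclusion from it.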

So Google can manage with relatively few admin humans because they
probably hide hardware expenses behind a fancy service contract (so that
they REALLY have another ten full time Dell maintenance folks who do
nothing but pull and fix systems all day long) and because they don't
have any users.  Well, they have LOTS of users, but those users are all
far away, can't come ranting into the admins' offices, and don't expect
their hands to be held while learning a simple command like ls with no
more than a few dozen command line options.  And I'm sure that THEY
never change a thing they don't have to, and dread the day they have to.

More realistically, we're finding that in an active LAN/cluster
environment, two full time admins are a bit stretched when the total
number of LAN seats plus cluster nodes reaches 400-500 (over 200
apiece), with all of the above (hardware, humans, and changes) being
the limiting factors.  One reason the Tyans we chose for our last round
of cluster nodes have been a problem is the anomalously high cost of
installing them (see the ongoing discussion of their quirky BIOS) and
their relatively high rate of hardware failure.  We're now considering
going back to Intel Xeon duals and are evaluating a loaner -- they are a
bit more expensive, but if they reduce human costs they'll be worth it.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org