Dell Linux mailing list - testing

Tue Jul 29 21:52:43 EDT 2003

hi ya angel

On Wed, 30 Jul 2003, Angel Rivera wrote:

> We have a set of jobs we call beater jobs that beat memory, cpu, drives, nfs 
> etc. We have monitoring programs so we are always getting stats and when 
> something goes wrong they notify us. 

yup... and hopefully there is say 90-95% probability that the "notice of
failure" is in fact correct ... :-)
	- i know people that ignore those pagers/emails because the
	notices are NOT real .. :-0

	- i ignore some notices too ... they're now treated as "that's nice,
	that server is still alive" notices

> We had a situation where a rack of angstroms (64 nodes 128 AMD procs and 
> that means hot!) were all under testing.  The heat blasting out the rear was 
> hot enough to trigger an alarm in the server room so they had to come take 
> a look. 

yes.. amd gets hot ...

and i think the angstroms have that funky indented power supply and cpu
fans on the side, where the cpu and ps are fighting each other for the
4"x 4"x 1.75" air space .. pretty silly .. :-)

> > testing and diags
> > 	http://www.linux-1u.net/Diags/ 
> > 
> > and everybody has their own idea of what tests to do .. and "its
> > considered tested" ... or the depth of the tests.. 

...  
> This is really important when you get a demo box to test on for a month or 
> so.

i like to treat every box as if it was never tested/seen before ...
assuming time/budget allows for it 

..
> them over we have a heck of a time getting them off-line for anything less 
> than a total failure. 

if something went bad... that was a bad choice for that system/parts ??

> > - testing is very very expensive ...

..

> Ah, the voice of experience.  We are very loath to take a shortcut. 

shortcuts have never paid off in the long run ..  you usually
wind up doing the same task 3x-5x  instead of doing it once correctly
	( take apart the old system, build new one, test new one
	( and now we're back to the start ... and that's ignoring
	( all the tests and changes before giving up on the old
	( shortcut system

> Sometimes it is very hard. When we bought those 28TB of storage, the first 
> thing we heard was that we can test it in production.  Had we done that, we 
> may have lost data - we lost a box. 

i assume you have at least 3 identical 28TB storage mechanisms..
otherwise, old age tells me one day, 28TB will be lost.. no matter
how good your raid and backup is
	- nobody takes time to build/test the backup system from
	bare metal ... and confirm the new system is identical to the
	supposed/simulated crashed box including all data being processed
	during the "backup-restore" test period
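
the "confirm the new system is identical" step can be partly automated by
comparing checksums of the original tree against the restored tree. a rough
sketch, assuming both trees are mounted on the same machine; the function
names are made up for illustration, and it only compares regular file
contents, not ownership, permissions or timestamps:

```python
import hashlib
import os


def tree_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                # read in 1 MiB chunks so big files do not fill memory
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(full, root)] = h.hexdigest()
    return digests


def compare_trees(original_root, restored_root):
    """Return (missing, extra, changed) lists of relative paths, each sorted."""
    a = tree_digests(original_root)
    b = tree_digests(restored_root)
    missing = sorted(set(a) - set(b))   # in the original, not in the restore
    extra = sorted(set(b) - set(a))     # in the restore, not in the original
    changed = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return missing, extra, changed
```

three empty lists means the file contents match. files that were being written
during the test period will show up as "changed", so that case still needs a
human to look at it.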

> > 
> > - if it aint broke... leave it alone .. if its doing its job :-)
> 
> *LOL* Once it is live our entire time is spent not messing anything up. And 
> that can be very hard w/ those angstroms where you have two computers in a 
> 1U form factor and one goes down. :) 

you have those boxes that have 2 systems that depend on each other ??
	- ie ..turn off 1 power supply and both systems go down ???

	( geez.. that $80 power supply shortcut is a bad mistake 
	( if the number of nodes is important

	- lots of ways to get 4 independent systems into one 1U shelf
	and with mini-itx, you can fit 8-16 independent 3GHz machines
	into one 1U shelf
		- that'd be a fun system to design/build/ship ...
		( about 200-400 independent p4-3G cpu in one rack )

	- i think mini-itx might very well take over the expensive blade
	market  assuming certain "pull-n-replace" options in blades
	are not too important in mini-itx ( when you have 200-400 nodes
	anyway in a rack )
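
the rack numbers above work out as simple multiplication. a quick sketch; the
25 shelves per rack is an assumption inferred from 200/8 and 400/16, it is not
stated anywhere in this thread:

```python
def nodes_per_rack(nodes_per_shelf, shelves):
    """Total independent machines in a rack of identical 1U shelves."""
    return nodes_per_shelf * shelves


SHELVES = 25  # assumption: inferred from 200/8 and 400/16, not a stated figure

for per_shelf in (8, 16):
    total = nodes_per_rack(per_shelf, SHELVES)
    print(per_shelf, "per 1U shelf ->", total, "per rack")
```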

have fun
alvin

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org