Dell Linux mailing list - testing

Angel Rivera angel at wolf.com
Tue Jul 29 21:26:33 EDT 2003


Alvin Oga writes: 

[snip] 

>> Then we beat them using our suite of programs for a 
>> week. If there are any problems, the clock gets reset.
> 
> yes... that is the trick .... to get a good set of test suites

We have a set of jobs we call beater jobs that beat on memory, CPU, drives, 
NFS, etc. We have monitoring programs, so we are always getting stats, and 
when something goes wrong they notify us. 
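
A minimal sketch of the "clock gets reset" rule described above, not the 
actual tooling used here. The Event record, the function name, and the 
one-week default are illustrative assumptions; the idea is only that any 
problem reported during burn-in restarts the clean-run period from the 
moment of the failure.

```python
from dataclasses import dataclass
from typing import Iterable

# One week of clean running, per the post. The constant name is an assumption.
BURN_IN_SECONDS = 7 * 24 * 3600


@dataclass
class Event:
    """Hypothetical monitoring record emitted by a beater job."""
    timestamp: float  # seconds since some fixed origin
    ok: bool          # False means a problem was reported


def burn_in_passes_at(start: float,
                      events: Iterable[Event],
                      required: float = BURN_IN_SECONDS) -> float:
    """Return the time at which the node has run clean for `required` seconds.

    Every failure seen before the node has passed resets the clock to the
    time of that failure.
    """
    clock_start = start
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.timestamp < start:
            continue  # ignore anything recorded before testing began
        if ev.timestamp >= clock_start + required:
            break     # the node already ran clean for the full period
        if not ev.ok:
            clock_start = ev.timestamp  # problem seen: reset the clock
    return clock_start + required
```

For example, a node that starts at time 0 and reports a failure on day 3 
would not pass until day 10 under this rule.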

We had a situation where a rack of angstroms (64 nodes, 128 AMD procs, and 
that means hot!) was all under testing.  The heat blasting out the rear was 
hot enough to trigger an alarm in the server room, so they had to come take 
a look. 

>  
> 
> testing and diags
> 	http://www.linux-1u.net/Diags/ 
> 
> and everybody has their own idea of what tests to do .. and "its
> considered tested" ... or the depth of the tests.. 
> 
> 1st tests should be visual ..
> 	- check the bios time stamps and version
> 	- check the batch levels of the pcb
> 	- check the manufacturer of the pcb and the chips on sdrams
> 	- blah ... dozens of things to inspect

> then the power-up tests
> 	- run diags to read bios version numbers
> 	- run diags for various purposes

This is really important when you get a demo box to test on for a month or 
so. Between the time you get that box and the time your order starts landing 
on the loading dock, there will have been a lot of changes if you have a good 
vendor.  We test and test before they go into production, because once we 
turn them over we have a heck of a time getting them off-line for anything 
less than a total failure. 
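
One way to catch those changes is to record the versions read from the demo 
box and compare them against each delivered node. The sketch below is only 
an illustration: the function name and the dictionary keys are made up, and 
how the values are collected (by hand, or from a tool such as dmidecode on 
Linux) is left open.

```python
from typing import Dict, Optional, Tuple


def inventory_diff(reference: Dict[str, str],
                   delivered: Dict[str, str]
                   ) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {field: (reference_value, delivered_value)} for differing fields.

    A field missing on one side shows up with None for that side.
    """
    diffs = {}
    for key in sorted(set(reference) | set(delivered)):
        ref_val = reference.get(key)
        new_val = delivered.get(key)
        if ref_val != new_val:
            diffs[key] = (ref_val, new_val)
    return diffs


# Made-up example values, not taken from any real machine.
demo_box = {"bios_version": "1.02", "pcb_rev": "A"}
new_node = {"bios_version": "1.05", "pcb_rev": "A", "nic_firmware": "3.1"}

for field, (old, new) in inventory_diff(demo_box, new_node).items():
    print(f"{field}: demo box had {old!r}, delivered node has {new!r}")
```

Any non-empty result is a reason to rerun the test suite on the new 
hardware before it goes into production.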

> 
> - diagnostics and testing should be 100% automated including
>   generating failure and warning notices
> 	- people tend to get lazy or go on vacation 
> 	and most are not as meticulous about testing foo-stuff
> 	while the other guyz might care that bar-stuff works  
> 
> - testing is very very expensive ...
> 	- getting known good mb, cpu, mem, disk, fans
> 	( repeatedly ) is the key ... 
> 
> 	- problem is some vendors discontinue their mb in 2 months
> 	so the whole testing clock starts over again 
> 
> 	- in our case, its cheaper to find smaller distributors
> 	that have inventory of the previously tested known good mb
> 	that we like

Ah, the voice of experience.  We are very loath to take a shortcut. 
Sometimes it is very hard. When we bought those 28TB of storage, the first 
thing we heard was that we could test it in production.  Had we done that, 
we might have lost data, since we did lose a box. 

> 
> - if it aint broke... leave it alone .. if its doing its job :-)

*LOL* Once it is live, our entire time is spent not messing anything up. And 
that can be very hard w/ those angstroms, where you have two computers in a 
1U form factor and one goes down. :) 

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


