Testing (Was: Re: Dell Linux mailing list - testing)
angel at wolf.com
Wed Jul 30 08:33:14 EDT 2003
Alvin Oga writes:
> hi ya angel
> On Wed, 30 Jul 2003, Angel Rivera wrote:
>> We have a set of jobs we call beater jobs that beat memory, cpu, drives,
>> nfs etc. We have monitoring programs so we are always getting stats and
>> when something goes wrong they notify us.
> yup... and hopefully there is, say, a 90-95% probability that the "notice
> of failure" is in fact correct ... :-)
> - i know people that ignore those pagers/emails because the
> notices are NOT real .. :-0
We have very high confidence our emails and pages are real. Our problem is
information overload. We need to work on a methodology to make sure the
important ones are not lost in the forest of messages.
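One first pass at that methodology could be to rank notices by severity and suppress low-priority repeats. A minimal sketch, assuming hypothetical severity keywords and a one-hour repeat window (none of these names come from our actual monitors):

```python
import time

# Hypothetical severity keywords -- adjust to whatever your beater
# jobs and monitors actually emit.
SEVERITY = {"FAIL": 2, "ERROR": 2, "WARN": 1, "OK": 0}

def classify(message):
    """Return the highest severity level found in a notice."""
    return max((level for word, level in SEVERITY.items() if word in message),
               default=0)

class AlertFilter:
    """Suppress repeats of the same low-priority notice inside a time
    window, but always let high-severity notices through."""
    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.last_seen = {}  # message -> timestamp of last delivery

    def should_page(self, message, now=None):
        now = time.time() if now is None else now
        level = classify(message)
        if level == 0:
            return False  # "that server is still alive" notices
        last = self.last_seen.get(message)
        if level < 2 and last is not None and now - last < self.window:
            return False  # duplicate low-priority notice, drop it
        self.last_seen[message] = now
        return True
```

Something this simple would already keep the "still alive" chatter out of the pager stream while letting every hard failure through.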
> - i ignore some notices too ... it's now treated as a "thats nice,
> that server is still alive" notice
I try to at least scan them. We are making changes to help us gain
situational awareness without having to spend all our time hunched over the
console.
>> We had a situation where a rack of angstroms (64 nodes, 128 AMD procs, and
>> that means hot!) were all under testing. The heat blasting out the rear was
>> hot enough to trigger an alarm in the server room, so they had to come
>> take a look.
> yes.. amd gets hot ...
> and i think angstroms has that funky indented power supply and cpu
> fans on the side where the cpu and ps is fighting each other for the
> 4"x 4"x 1.75" air space .. pretty silly .. :-)
Each node has its own power supply. When everything is running right, it's
the bomb. When not, then you have to take down two nodes to work on one. Or,
until you get used to how it is built, you have to be very careful that the
reset button you hit is for the right node and not its neighbor. :)
>> This is really important when you get a demo box to test on for a month
>> or so.
> i like to treat all boxes as if it was never tested/seen before ...
> assuming time/budget allows for it
Before a purchase, we look at the top 2-3 choices and start testing them to
see how fast they are and how we can tweak them. One of the problems is that
between that testing and the order coming in the door there can be enough
hardware changes that your build tweaks no longer work properly.
> i assume you have at least 3 identical 28TB storage mechanisms..
> otherwise, old age tells me one day, 28TB will be lost.. no matter
> how good your raid and backup is
> - nobody takes time to build/tests the backup system from
> bare metal ... and confirm the new system is identical to the
> supposed/simulated crashed box including all data being processed
> during the "backup-restore" test period
They are 10 2.8TB units (dual 1.4TB 3ware 7500 cards in a 6-1-1
configuration). The vendor is right down the street. We keep on-site spares
ready to go, so we always have a hot spare on each card.
We don't back up very much from the cluster, just the two management
nodes that keep our stats. It would be impossible to back up that much data
in a timely manner.
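Back-of-the-envelope arithmetic shows why; assuming the full 28TB and a dedicated gigabit link to a backup host (both figures illustrative, not measured), one full pass takes days:

```python
# Rough time to stream 28 TB over gigabit Ethernet.
# All figures below are assumptions for illustration, not measurements.
tb = 28
bytes_total = tb * 10**12                     # 28 TB in bytes (decimal TB)
link_bits_per_s = 10**9                       # raw gigabit line rate
payload = 0.8                                 # assume ~80% usable throughput
bytes_per_s = link_bits_per_s * payload / 8   # -> 100 MB/s usable

hours = bytes_total / bytes_per_s / 3600
days = hours / 24
print(f"{hours:.0f} hours (~{days:.1f} days) for one full pass")
# -> 78 hours (~3.2 days) for one full pass
```

And that is before you account for disk contention on live compute nodes, so backing up only the management nodes is the practical choice.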
> you have those boxes that have 2 systems that depend on each other ??
> - ie ..turn off 1 power supply and both systems go down ???
> ( geez.. that $80 power supply shortcut is a bad mistake
> ( if the number of nodes is important
> - lots of ways to get 4 independent systems into one 1U shelf
> and with mini-itx, you can fit 8-16 independent 3GHz machines
> into one 1U shelf
> - that'd be a fun system to design/build/ship ...
> ( about 200-400 independent p4-3G cpu in one rack )
> - i think mini-itx might very well take over the expensive blade
> market assuming certain "pull-n-replace" options in blade
> is not too important in mini-itx ( when you have 200-400 nodes
> anyway in a rack )
No, they are two standalone boxes in a 1U with different everything. That
means it is very compact in the back, with the power and reset buttons close
together in the front, so you have to pay attention. But they rock as
compute nodes.
We are now going to explore blades. Anyone have recommendations?
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf