AMD Opteron memory bandwidth (was Re: CPUs for a Beowulf)

Jim Lux James.P.Lux at
Wed Sep 10 13:20:52 EDT 2003

At 08:18 AM 9/10/2003 -0600, Keith D. Underwood wrote:

> > That's an excellent design decision.
> > Putting a check word on each packet means
> >   - the physical encoding layer needs to know about packetization
> >   - a packet must be held until the check passes
> >   - the tiny packets grow
> >   - to do anything with the per-packet info, packet copies must be kept
> > These all add complexity and latency to the highest speed path.
> >
> > By putting the check on fixed block boundaries you can still detect and
> > fail an unreliable link
>All very true when you have 1, 2, 4, even 8 HT links that could cause a
>system to crash.  And I'm not suggesting that ECC would be better (that
>was Greg's statement), but....   if you had 10000 HT links running their
>maximum distance (if you used HT links to build a mesh at Red Storm
>scale) and any bit error on any of them causes an app to fail because
>you don't know which packet had an error...  That would be bad.

Any time you're looking at large distributed systems, you need to plan for 
and be able to handle failures.  If a single bit hit causes the entire 
shooting match to collapse, it's never going to work.  In fact, I'd claim 
that what "scalability" really means is that the inevitable errors have 
limited propagation.  Otherwise, as you increase the number of "widgets" 
in the system, the probability of at least one of them failing gets close 
to one.
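To put a number on that last point, here is a minimal sketch (the per-widget failure probability of 1e-4 is an illustrative assumption, not a figure from any real system):

```python
# Why error containment matters at scale: with N independent "widgets",
# each failing with small probability p over some interval, the chance
# that at least one fails is 1 - (1 - p)**N, which heads toward 1 as N
# grows.

def p_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent parts fails."""
    return 1.0 - (1.0 - p) ** n

for n in (1, 100, 10_000):
    print(n, p_any_failure(1e-4, n))
```

At p = 1e-4, a single widget almost never fails, but with 10,000 of them the odds of at least one failure per interval are better than even.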

The real design decisions come in when deciding at what level to handle the 
error.  Detection is straightforward at the bottom level, but error 
handling may be best dealt with at a higher level.  Perhaps the overall 
performance of the ensemble is better if everyone goes in lockstep rather 
than retrying the failed communication.  Compare, for example, 11-for-8 
Hamming coding at the byte level (low latency, poor rate efficiency); CRC 
error detection and retries at a higher level; and all the sorts of block 
interleaving schemes that turn burst errors into isolated (and correctable 
on the fly) errors.  Sometimes you trade determinism for performance, or 
latency for overall bit error rate.  A lot depends on your error 
statistics, which is grist for much communications theory analysis and 
keeps coding specialists employed.
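The block-interleaving idea is easy to see in a toy sketch.  Write a message row-by-row into a depth-by-width matrix and transmit it column-by-column; a burst of consecutive channel errors then lands in different rows, so each row (codeword) sees at most one error and a single-error-correcting code can fix it.  The 4x6 geometry and burst length below are illustrative choices, not from any particular link:

```python
# Toy block interleaver: rows are codewords, transmission order is
# column-major, so adjacent channel symbols belong to different rows.

def interleave(data, depth, width):
    assert len(data) == depth * width
    rows = [data[i * width:(i + 1) * width] for i in range(depth)]
    return [rows[r][c] for c in range(width) for r in range(depth)]

def deinterleave(data, depth, width):
    cols = [data[c * depth:(c + 1) * depth] for c in range(width)]
    return [cols[c][r] for r in range(depth) for c in range(width)]

msg = list(range(24))            # 4 codewords of 6 symbols each
tx = interleave(msg, 4, 6)
for i in range(8, 12):           # burst: 4 consecutive symbols corrupted
    tx[i] = -1
rx = deinterleave(tx, 4, 6)
for r in range(4):               # burst is spread to <= 1 error per codeword
    word = rx[r * 6:(r + 1) * 6]
    assert word.count(-1) <= 1
```

With the same 4-symbol burst and no interleaving, one codeword would have taken all four errors, which is beyond what a single-error-correcting code can repair.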

Consider also things like algorithm robustness to bit flips in RAM.  On 
one system I worked on, it was faster to do the calculations three times 
and vote the results, with no ECC, than to do them once with ECC, because 
of the increased latency of the ECC logic, and the adverse interaction 
between ECC and cache.
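The compute-three-times-and-vote idea is software triple modular redundancy.  A minimal sketch (the fault-injecting `flaky_add` is a hypothetical stand-in for a computation on unprotected memory, not anything from the system described above):

```python
# Software TMR: run the same calculation three times and take the
# majority result, so a transient fault in any single run is outvoted.

def vote3(a, b, c):
    """Majority vote of three results."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return a  # no majority: in practice, flag the error or retry

def tmr(compute, *args):
    return vote3(compute(*args), compute(*args), compute(*args))

def flaky_add(x, y, _fault=[True]):
    # Hypothetical faulty computation: inject one transient bit flip
    # on the first call only, to demonstrate masking.
    r = x + y
    if _fault[0]:
        _fault[0] = False
        r ^= 1 << 3
    return r

print(tmr(flaky_add, 2, 3))  # the two good runs outvote the bad one
```

Note the trade the post describes: this costs 3x the arithmetic but avoids ECC logic in the memory path, which on that system was the cheaper side of the bargain.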

There is a fair amount of literature on this... The Byzantine Generals 
problem (unreliable communication between multiple masters) is a good 
example.  Fault robustness/tolerance in large cluster configurations is a 
subject that is near and dear to my heart because I want to fly clusters in 
space, where errors are a given, and maintenance is impossible.

James Lux, P.E.
Spacecraft Telecommunications Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875
