AMD Opteron memory bandwidth (was Re: CPUs for a Beowulf)
James.P.Lux at jpl.nasa.gov
Wed Sep 10 13:20:52 EDT 2003
At 08:18 AM 9/10/2003 -0600, Keith D. Underwood wrote:
> > That's an excellent design decision.
> > Putting a check word on each packet means
> > - the physical encoding layer needs to know about packetization
> > - a packet must be held until the check passes
> > - the tiny packets grow
> > - to do anything with the per-packet info, packet copies must be kept
> > These all add complexity and latency to the highest speed path.
> > By putting the check on fixed block boundaries you can still detect and
> > fail an unreliable link
>All very true when you have 1, 2, 4, even 8 HT links that could cause a
>system to crash. And I'm not suggesting that ECC would be better (that
>was Greg's statement), but.... if you had 10000 HT links running their
>maximum distance (if you used HT links to build a mesh at Red Storm
>scale) and any bit error on any of them causes an app to fail because
>you don't know which packet had an error... That would be bad.
Any time you're looking at large distributed systems, you need to plan for
and be able to handle failures. If a single hit causes the entire shooting
match to collapse, it's never going to work. In fact, I'd claim that what
"scalability" really means is that the inevitable errors have limited
propagation. Otherwise, as you increase the number of "widgets" in the
system, the probability that at least one of them fails approaches one.
The real design decisions come in when deciding at what level to handle the
error. Detection is straightforward at the bottom level, but error
handling may be best dealt with at a higher level. Perhaps the overall
performance of the ensemble is better if everyone proceeds in lockstep
rather than retrying the failed communication. Compare, for example,
12-for-8 Hamming coding at the byte level (low latency, poor rate
efficiency); CRC error detection with retries at a higher level; and the
various block interleaving schemes that turn burst errors into isolated
(and correctable on the fly) errors. Sometimes you trade determinism for
performance, or latency for overall bit error rate. A lot depends on your
error statistics; such questions are grist for much communications-theory
analysis, and keep coding specialists employed.
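As an illustration of the byte-level option, here is a minimal single-error-correcting Hamming encoder/decoder, assuming the standard (12,8) layout with parity bits at power-of-two positions. It's a sketch for clarity, not a tuned implementation:

```python
def _parity_count(m: int) -> int:
    """Smallest r with 2**r >= m + r + 1 (so every error position is addressable)."""
    r = 0
    while (1 << r) < m + r + 1:
        r += 1
    return r

def hamming_encode(data: int, m: int = 8) -> list:
    """Encode m data bits into an SEC Hamming codeword (1-indexed bit list)."""
    r = _parity_count(m)          # 4 parity bits for a byte -> (12,8)
    n = m + r
    code = [0] * (n + 1)          # index 0 unused; positions are 1..n
    j = 0
    for i in range(1, n + 1):     # data goes in non-power-of-two positions
        if i & (i - 1):
            code[i] = (data >> j) & 1
            j += 1
    for p in range(r):            # each parity bit covers positions with that bit set
        pos = 1 << p
        parity = 0
        for i in range(1, n + 1):
            if i & pos:
                parity ^= code[i]
        code[pos] = parity        # make covered parity even
    return code

def hamming_decode(code_bits: list, m: int = 8) -> int:
    """Correct up to one flipped bit and return the m data bits."""
    r = _parity_count(m)
    n = m + r
    code = list(code_bits)
    syndrome = 0
    for p in range(r):
        pos = 1 << p
        parity = 0
        for i in range(1, n + 1):
            if i & pos:
                parity ^= code[i]
        if parity:
            syndrome |= pos
    if syndrome:                  # syndrome is the 1-indexed error position
        code[syndrome] ^= 1
    data = 0
    j = 0
    for i in range(1, n + 1):
        if i & (i - 1):
            data |= code[i] << j
            j += 1
    return data
```

Note the trade the post describes: correction is immediate and local (no retry round-trip), but you pay 4 extra bits on every byte whether or not an error occurred.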
Consider also things like algorithm robustness to bit flips in RAM. On
one system I worked on, it was faster to do the calculations three times
and vote the results, with no ECC, than to do them once with ECC, because
of the increased latency of the ECC logic, and the adverse interaction
between ECC and cache.
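The compute-three-times-and-vote scheme described above is classic triple modular redundancy. A minimal sketch (the voting function and the fault-injection demo are mine, illustrative only, not the actual system):

```python
def tmr(compute, *args):
    """Triple modular redundancy: run the computation three times and
    return the majority result, masking a single transient fault."""
    a, b, c = compute(*args), compute(*args), compute(*args)
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no two results agree; fault not maskable")

# demo: the second run suffers a simulated transient bit flip
runs = iter([42, 41, 42])
print(tmr(lambda: next(runs)))  # 42
```

The point of the anecdote is that this can win: three fast, unchecked runs plus a cheap vote can beat one run behind ECC logic that sits on the critical path of every access.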
There is a fair amount of literature on this... The Byzantine Generals
problem (unreliable communication between multiple masters) is a good
example. Fault robustness/tolerance in large cluster configurations is a
subject that is near and dear to my heart because I want to fly clusters in
space, where errors are a given, and maintenance is impossible.
James Lux, P.E.
Spacecraft Telecommunications Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf