AMD Opteron memory bandwidth (was Re: CPUs for a Beowulf)

Keith D. Underwood kdunder at sandia.gov
Tue Sep 9 23:31:54 EDT 2003


> I think that it's more precisely a longitudinal check.  "CRC" usually
> means a polynomial calculation over all of the bits, where any bit flip
> might change any bit in the final check word.  A longitudinal-only check
> means that a bit flip only impacts the check word in that bit position.
> 
> Longitudinal checks are much easier to implement in very high speed
> systems because you don't have to handle data skew combined with
> different length logic paths.  But they catch fewer errors precisely
> because they are easier to implement -- they don't combine as many
> source bits into the result.
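
To make the distinction in the quoted text concrete, here is a minimal
sketch comparing a longitudinal (column-wise XOR) check with a polynomial
CRC. It is not the HyperTransport algorithm: zlib.crc32 is only a stand-in
for "some 32-bit CRC", and the data words and function names are made up
for illustration.

```python
import zlib

def longitudinal_check(words, width=32):
    """XOR all words together. Bit i of the result depends only on
    bit i of each input word, so errors stay in their own column."""
    check = 0
    for w in words:
        check ^= w
    return check & ((1 << width) - 1)

def crc32_check(words):
    """Polynomial CRC over all bits (zlib's CRC-32 used as a stand-in)."""
    data = b"".join(w.to_bytes(4, "big") for w in words)
    return zlib.crc32(data)

# Arbitrary example data (hypothetical, not taken from any real link).
words = [0x12345678, 0x9ABCDEF0, 0x0F0F0F0F, 0xCAFEBABE]

# One flipped bit: the longitudinal check word changes only in that column.
single = list(words)
single[1] ^= 1 << 7
lrc_delta = longitudinal_check(words) ^ longitudinal_check(single)

# Two flips in the same bit position of different words cancel in the XOR,
# so the longitudinal check misses them; the CRC does not.
double = list(words)
double[0] ^= 1 << 7
double[2] ^= 1 << 7
lrc_detects = longitudinal_check(words) != longitudinal_check(double)
crc_detects = crc32_check(words) != crc32_check(double)

print(hex(lrc_delta), lrc_detects, crc_detects)
```

The double-flip case is the "fewer errors caught" trade-off described
above: the XOR never mixes bit positions, so paired errors in one column
are invisible to it.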

As I read the spec, it's actually a little weirder than that:

"A 32-bit cyclic redundancy code (CRC) covers all HyperTransport™
links. The CRC is calculated on each 8-bit lane independently and
covers the link as a whole, not individual packets."

So, for a 16-bit link (such as on an Opteron), there are two CRCs
running concurrently, one for each of the two 8-bit lanes.  Now, the
strange part is that the CRCs cover the link and not the packets.  A
CRC is transmitted every 512 bit times (and HyperTransport packets
aren't that big), so each CRC spans many packets.  That means that when
a CRC check fails, you don't know which packets had a bad bit.
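
A rough sketch of why that is, assuming a 16-bit link split into two
8-bit lanes and one CRC per lane per 512 bit times as described above.
The packet length, the test data, and the use of zlib.crc32 are my own
assumptions for illustration; the real HyperTransport polynomial and bit
ordering are not reproduced here.

```python
import zlib

WINDOW_BIT_TIMES = 512   # from the spec text quoted above
PACKET_BIT_TIMES = 4     # hypothetical packet length, illustration only

def lane_crcs(window):
    """window: one 16-bit value per bit time.
    Returns a CRC per 8-bit lane, each computed over the whole window."""
    lane0 = bytes(v & 0xFF for v in window)
    lane1 = bytes((v >> 8) & 0xFF for v in window)
    return zlib.crc32(lane0), zlib.crc32(lane1)

def bad_lanes(sent, received):
    """Return the indices of lanes whose window CRC does not match."""
    pairs = zip(lane_crcs(sent), lane_crcs(received))
    return [lane for lane, (s, r) in enumerate(pairs) if s != r]

# Arbitrary traffic for one CRC window (made-up data).
sent = [(i * 2654435761) & 0xFFFF for i in range(WINDOW_BIT_TIMES)]

# Flip one bit on the upper lane somewhere in the middle of the window.
received = list(sent)
received[301] ^= 0x0400

failed = bad_lanes(sent, received)
packets_in_window = WINDOW_BIT_TIMES // PACKET_BIT_TIMES

# The receiver learns which lane failed in this window, and nothing more:
# any of the packets_in_window packets could be the corrupted one.
print(failed, packets_in_window)
```

Nothing in the check result carries the bit time (301 here), so there is
no way to map the failure back to a specific packet.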

> > That's kind-of OK for small systems,
> > but doesn't scale.
> 
> Errrm, I have exactly the opposite viewpoint: ECC will fail to catch and
> correct most multibit errors, and most HT errors will be multibit.
> It's better to fail on corruption than to silently further corrupt.

The problem is that there is no sane mechanism to know which packets are
corrupted (and therefore which ones to retransmit).  At scale, that
doesn't really work.  For example, if you built a Red Storm scale system
using just these links, it would crash frequently because a CRC error
would happen somewhere and there wouldn't be a recovery mechanism.
(BTW, for those suggesting mimicking the T3E with these things, the T3E
wasn't cache coherent.  It just had a shared address space mechanism of
sorts.)
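
The scaling argument is just the usual "at least one failure" arithmetic.
A minimal sketch, assuming independent links with the same per-link
probability of a CRC error in some time interval. The probability and the
link counts below are placeholders I made up, not measured values for
HyperTransport or Red Storm.

```python
import math

def p_any_error(p_link, n_links):
    """Probability that at least one of n_links independent links sees
    an error, given per-link probability p_link: 1 - (1 - p)^n."""
    if p_link >= 1.0:
        return 1.0
    # expm1/log1p keep precision when p_link is very small.
    return -math.expm1(n_links * math.log1p(-p_link))

# Placeholder numbers, purely illustrative.
P_LINK = 1e-6
for n_links in (4, 1_000, 100_000):
    print(n_links, p_any_error(P_LINK, n_links))
```

With no recovery mechanism, every one of those events takes the machine
down, so the system-wide rate is what matters, not the per-link rate.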

Someone asked about Red Storm - here is a link to public presentations
on the topic:

http://www.cs.sandia.gov/presentations/

					Keith






