AMD Opteron memory bandwidth (was Re: CPUs for a Beowulf)

Donald Becker becker at scyld.com
Wed Sep 10 09:56:57 EDT 2003


On 9 Sep 2003, Keith D. Underwood wrote:
> > I think that it's more precisely a longitudinal check.  "CRC" usually
...
> > Longitudinal checks are much easier to implement in very high speed
.
> As I read the spec, it's actually a little weirder than that:
> 
> "A 32-bit cyclic redundancy code (CRC) covers all HyperTransport(TM)
> links. The CRC is calculated on each 8-bit lane independently and
> covers the link as a whole, not individual packets."

Ahhh, that implies that it uses a group longitudinal check.
That has much better checking than a serial single bit check, and by
limiting the width they reduce the skew and logic path problems of a wider
check.

> So, for a 16 bit link (such as on an Opteron), there are two
> concurrently running CRC's, one each for the two 8 bit lanes.  Now, the
> strange part is that the CRC's cover the link and not the packets.  So,
> a CRC is transmitted every 512 bit times (and hypertransport packets
> aren't that big).  That means that you don't know which packets had a
> bad bit. 

That's an excellent design decision.
Putting a check word on each packet means
  - the physical encoding layer needs to know about packetization
  - a packet must be held until the check passes
  - the tiny packets grow
  - to do anything with the per-packet info, packet copies must be kept
These all add complexity and latency to the highest speed path.

By putting the check on fixed block boundaries you can still detect and
fail an unreliable link.
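
A rough sketch of that per-lane, fixed-window checking, in Python.  This
is an illustration only, not the HyperTransport algorithm: zlib.crc32
stands in for the link CRC, the byte-to-lane interleaving is assumed, and
the names (split_lanes, window_crcs, failed_windows) are made up for the
example.  The only numbers taken from the thread are two 8-bit lanes and
one CRC per 512 bit-times.

```python
import zlib

WINDOW_BIT_TIMES = 512  # one CRC per 512 bit-times, per the quoted text
LANES = 2               # a 16-bit link treated as two 8-bit lanes

def split_lanes(stream):
    # Assumption: byte i of the stream travels on lane i % LANES,
    # one byte per lane per bit-time.
    return [stream[lane::LANES] for lane in range(LANES)]

def window_crcs(lane_bytes):
    # One CRC per fixed window, with no knowledge of packet boundaries.
    return [zlib.crc32(lane_bytes[i:i + WINDOW_BIT_TIMES])
            for i in range(0, len(lane_bytes), WINDOW_BIT_TIMES)]

def failed_windows(sent, received):
    # Return (lane, window) pairs whose CRCs disagree.
    bad = []
    lanes = zip(split_lanes(sent), split_lanes(received))
    for lane, (s, r) in enumerate(lanes):
        pairs = zip(window_crcs(s), window_crcs(r))
        for window, (a, b) in enumerate(pairs):
            if a != b:
                bad.append((lane, window))
    return bad

sent = bytes(range(256)) * 16   # 4096 bytes: 4 windows on each lane
received = bytearray(sent)
received[2051] ^= 0x10          # flip one bit somewhere on lane 1
print(failed_windows(sent, bytes(received)))
```

The result names a lane and a window, never a packet, which is all the
link needs in order to decide that it is unreliable.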

> > > That's kind-of OK for small systems, but doesn't scale.
> > Errrm, I have exactly the opposite viewpoint: ECC will fail to catch and
> > correct most multibit errors, and most HT errors will be multibit.

I'll repeat:
   ECC Bad!  ECC Slow!

> corrupted (and to therefore retransmit).  At scale, that doesn't really
> work.  e.g. if you built a Red Storm scale system using just these
> links, it would crash frequently because a CRC error would happen
> somewhere and there wouldn't be a recovery mechanism.

If you are getting CRC errors, you very likely have errors that ECC (*)
would silently pass or further corrupt.

* Any high-speed ECC implementation.  It's possible to keep adding
  check bits, but anything past SECDED (Single Error Correction, Double
  Error Detection) becomes time consuming and expensive.
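
  To make "silently pass or further corrupt" concrete, here is a toy
  SECDED decoder in Python.  It is an extended Hamming (8,4) code chosen
  only because it is small; real hardware uses much wider codes, and the
  function names are made up for the example.  The decoder assumes at
  most two bit errors, so a three-bit error looks to it like a single
  error and gets "corrected" into wrong data.

```python
def encode(data):
    # data: 4 bits -> 8-bit word; index 0 holds the overall parity bit,
    # indexes 1, 2 and 4 hold the Hamming check bits.
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code

def decode(word):
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos
    overall = sum(word) % 2
    if overall:
        # Odd parity: assume ONE error at position `syndrome`
        # (0 means the overall parity bit itself) and flip it back.
        word[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        status = "uncorrectable"   # even parity, nonzero syndrome: 2 errors
    else:
        status = "ok"
    return [word[3], word[5], word[6], word[7]], status

def flip(word, positions):
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out

data = [1, 0, 1, 1]
word = encode(data)
for positions in ([6], [3, 6], [3, 5, 6], [1, 2, 4]):
    decoded, status = decode(flip(word, positions))
    print(positions, status, "data intact" if decoded == data else "DATA WRONG")
```

  The last case flips only check bits, so the data arrives intact and
  the decoder then damages it.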


-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
914 Bay Ridge Road, Suite 220		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


