Off topic - G5 question

Thu Jun 26 18:12:22 EDT 2003

On Thu, 26 Jun 2003, Mark Hahn wrote:

> I find that many of my users spend most of their serious time inside
> tuned library functions, so the compiler is less relevant.  though Intel's
> compilers *do* auto-vectorize onto SSE2.  gcc can use SSE2, but doesn't
> currently autovectorize.  incidentally, gcc 3.3 *does* also support 
> intrinsics for Altivec vector operations; I'm not sure whether similar 
> intrinsics are available for SSE2.
> 
> > flags on both systems works for me, at least, as this is likely to be
> > the extent of the effort I want to make tuning per architecture, ever.
> 
> I find this irresponsible (sorry rgb!).  if you're going to consume 
> more than a few CPU hours, and those cycles are coming from a shared 
> resource, failing to optimize is just plain hostile.

Oh, that's ok.  I freely admit to utter irresponsibility.  Laziness too!

I was (of course) at least partly tongue in cheek, although the nugget
of truth in what I was saying is what I'd call the "CM5 phenomenon".
CM5's were a nifty architecture too -- basically (IIRC) a network of
sparcstations each with an attached vector unit.  Programming it was
nightmarish -- it went beyond instrumenting code, one basically had to
cross-compile code modules for the vector units and use the Sparcs as
computers to run these modules and handle all the nitty gritty for them
(like access to I/O and other resources).  Then the CM5 bellied up (Duke
sold its old one for a few thousand $$ to somebody interested in
salvaging the gold in its gazillion contacts).  All that effort totally
wasted.  We're talking years of a human life, not just a few weeks
porting a simple application, and that for just one of Duke's many CM5
users...:-)

Or if you prefer, the MPI phenomenon.  Back when the CM5 had to be hand
coded, so did nearly everything else.  Code maintenance was nightmarish,
as every parallel supercomputer manufacturer had their own message
passing API and so forth.  So the government (major supercomputer
purchaser that it is) finally said that enough was enough and that they
were only going to use one API in the future and if machines didn't
support it they weren't going to be bought, and lo, MPI was born so
people didn't have to rewrite their code.

Related examples abound in all realms of computedom.  The more you
customize or optimize per architecture, the less portable your code
becomes and the more expensive it becomes to change architectures until
you find yourself in a golden trap, still buying IBM mainframes because
you have all this custom stuff that your business relies on that (in its
day) was heavily customized to take advantage of all of the nifty
features of IBM mainframes.

Curiously, a cluster of G5's with their embedded Altivecs, or a cluster
of PlayStations with their embedded vector processors, bears a strong
resemblance to the CM5 architecturally and (in the case of the PS2's) in
how one has to program them with what amounts to a cross-compiler with
instrumented calls to send code and data on to the vector unit for
handling.

This kind of architecture makes me very nervous.  Optimizing is no
longer just a matter of picking good compiler flags or writing core
loops sensibly (things that can easily be changed or that are good ideas
on nearly any architecture); it starts to become an issue of writing
code "just for the Altivec" or "just for the PS2" -- hacking your
application (a process that can take a LONG time, not just hours or
days) irreversibly so that you have to spend hours or days again to
change it back to run on a more normal architecture.
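
A minimal sketch of what keeping that door open can look like, under
stated assumptions: the compiler defines __SSE2__ and ships
<emmintrin.h> when it targets SSE2 (current gcc on x86 does), and the
function names axpy_portable and axpy are hypothetical, made up for
illustration.  The vector path is fenced off behind the preprocessor so
that the plain loop is still there when the architecture changes.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __SSE2__
#include <emmintrin.h>  /* SSE2 intrinsics; only included when targeting SSE2 */
#endif

/* Portable reference: y[i] = a*x[i] + y[i].  Plain C, nothing per-architecture. */
void axpy_portable(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Same operation, with an optional hand-written SSE2 path (hypothetical sketch). */
void axpy(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    assert(x != NULL && y != NULL);
#ifdef __SSE2__
    /* Two doubles per iteration, using unaligned loads and stores. */
    __m128d va = _mm_set1_pd(a);
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(y + i, _mm_add_pd(_mm_mul_pd(va, vx), vy));
    }
#endif
    /* Leftover elements, or the whole array when SSE2 is not available. */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

On a non-SSE2 target the #ifdef block simply vanishes and only the plain
loop is compiled.  An Altivec version would need its own, separately
written branch, which is exactly the porting cost at issue here.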

It is this lack of portability that can, and I say can, be a killer to
productivity and ultimately to one's ability to select freely from COTS
architectures.  Once the code is modified and optimized for (say) a G5,
one has to take into account the ADDITIONAL cost of un/remodifying it
and reoptimizing it for (say) an Opteron or Itanium or
whateverisnextium.  This makes Apple very happy, in precisely the way
that Microsoft is made happy every time a businessman says "we have to
run Windows because everybody is used to it and it would cost us too
much time/money/stress to change" even when it is visibly less
cost-effective head to head with alternatives.

> I still see users fling their code at a machine with 
> 	g77 prog.f && ./a.out
> and wonder why it's so slow.  usually, I stifle the obvious answer,
> and (hey, this is Canada) apologize for not providing the information
> they need to make supra-moronic use of the machine...

<suit material='asbestos' action='don'>
Right.  They need to be running

   gcc -O3 -o prog prog.c
   ./prog

instead.  Obviously;-)
</suit>

(yoke, yoke, I was mayking a yoke to help you forget how screwed you
are...:-)

Still, you're right of course, especially where things like SSE2 or
Altivec aware compilers can do a lot of the optimizing for you if your
code isn't too complex, or the issue is using ATLAS-tuned BLAS instead
of a really shitty off-the-shelf BLAS.  Still, there is something to be
said for writing just plain sound and efficient code (in C:-) and using
mundane -O-level optimization as a basis for judging how good a machine is,
especially if you don't plan to study the microarchitecture layout and
figure out if you can get four flops in the pipeline per clock, but only
if you rearrange your code in a completely nonintuitive way so it does a
multiply/add in a certain way...
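
A rough sketch of the kind of rearrangement meant here, assuming nothing
about any particular CPU: a dot product written the obvious way, and
again with four independent accumulators so that successive
multiply/adds do not have to wait on one another.  The names dot_plain
and dot_unrolled4 are hypothetical, and whether the second form is
actually faster depends on the compiler, the flags, and the chip, so
measure before believing it.

```c
#include <assert.h>
#include <stddef.h>

/* Obvious version: a single accumulator, so every add depends on the last. */
double dot_plain(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Rearranged version: four independent partial sums, combined at the end. */
double dot_unrolled4(size_t n, const double *x, const double *y)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    assert(x != NULL && y != NULL);
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    /* Handle the elements left over when n is not a multiple of four. */
    for (; i < n; i++)
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}
```

Because the additions happen in a different order, the two versions can
differ in the last few bits on general floating-point data; they agree
exactly on small integer-valued inputs.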

> regards, mark 'surly' hahn.

(and to cheer you up..:-)

   rg-let-me-show-you-my-hand-tuned-8087-library-some-day-b

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf