Message-ID: <447C516A.1020908@ahpcrc.org>
Date: Tue, 30 May 2006 09:06:34 -0500
From: Richard Walsh <rbw@ahpcrc.org>
To: Eugen Leitl <eugen@leitl.org>
Subject: Re: [Beowulf] Cell in HPC

All,

This is an excellent review of the Cell measured against leading VLIW/EPIC (Itanium), superscalar (Opteron), and vector (Cray X1E) processors. From my reading, the messages are:

1. Vector operations deliver a higher percentage of peak at lower power than the alternatives on the HPC kernels (a comparison to an MTA-like highly multi-threaded architecture is missing). Most 32-bit numbers exceed the non-vector alternatives by substantially more than 2x in measured performance on the dense matrix, sparse matrix, stencil, and FFT kernels, so dual-core parts will not reach parity on sustained/measured (not peak) comparisons in the authors' view.

2. A three-tiered memory system with a simple local memory (a local store, like the old Cray-2's) that is user/software managed is preferable to cache in this context. Double buffering and prefetching into the local store reduce memory delays dramatically.

3. Cell's vector instructions from local memory need augmenting to include more "unaligned load" support, i.e., indexed and non-unit-stride capability (loads from main memory into the local store do appear to have these features).

4. Double precision (64-bit) operations are severely hampered by instruction issue delays, so performance at 64 bits drops off dramatically. The reviewers suggest a few minor modifications to the design to reduce this problem.

They also argue that the Cell chip will be produced in large enough quantity to compete on price with the multi-core superscalars. I am not so sure of this.
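The double buffering in point 2 is the key software trick: while the SPU computes on one local-store buffer, the memory flow controller streams the next chunk into the other, hiding transfer latency behind computation. A minimal sketch in plain C (buffer names and sizes are invented for illustration; the memcpy stands in for the asynchronous DMA get that a real SPE program would issue and later wait on):

```c
#include <string.h>

#define CHUNK   64          /* elements per local-store buffer (invented size) */
#define NCHUNKS 8           /* total chunks resident in "main memory"          */

/* Stand-in for the computation phase: reduce one resident chunk. */
static double process(const double *buf, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

double double_buffered_sum(const double *main_mem)
{
    double ls[2][CHUNK];    /* two halves of the software-managed local store */
    double total = 0.0;

    /* Prime the pipeline: fetch chunk 0 into buffer 0. */
    memcpy(ls[0], main_mem, CHUNK * sizeof(double));

    for (int c = 0; c < NCHUNKS; c++) {
        int cur = c & 1, nxt = cur ^ 1;

        /* Start "fetching" chunk c+1 into the idle buffer before computing.
         * On Cell this transfer would run asynchronously and overlap the
         * process() call below; here it is serialized but shows the issue
         * order and the two-buffer rotation. */
        if (c + 1 < NCHUNKS)
            memcpy(ls[nxt], main_mem + (c + 1) * CHUNK,
                   CHUNK * sizeof(double));

        total += process(ls[cur], CHUNK);   /* compute on the ready buffer */
    }
    return total;
}
```

With a truly asynchronous transfer, each fetch costs nothing as long as it finishes before process() does, which is why predictable access patterns matter so much in this model.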
Also, the issue of vector-type memory operations across a "commodity interconnect," in the context of the Beowulf distributed-memory architecture, is not addressed. Vector memory references are especially revealing of the limitations of the RDMA capabilities of current interconnects.

Cell is a data-parallel heavyweight pitted against the instruction-parallel multi-core alternatives, and the underlying question is how latency should be hidden: underneath stacks of independent/atomic instruction blocks (threads), which may or may not come from the same program, or within a pipeline of vector operations that stream data from memory. Applications with partitionable data and some kind of non-random reference pattern (most HPC applications) favor data parallelism and vectors, while workloads with more completely random references and large thread counts (graph algorithms), typical of the mixed-user environment of servers, favor thread-level instruction parallelism.

There is one microprocessor architecture I have seen, from MIT, that seems to combine both in a workable fashion: VTA (Vector Thread Architecture). I recommend the articles describing the VTA microprocessor out of Krste Asanovic's group at MIT. I think they have the ISA finished and are taping out the chip as I type.

Regards,

rbw

Eugen Leitl wrote:
> http://www.hpcwire.com/hpc/671376.html
>
> Researchers Analyze HPC Potential of Cell Processor
>
> Though it was designed as the heart of the upcoming Sony PlayStation3 game console, the STI Cell processor has created quite a stir in the computational science community, where the processor's potential as a building block for high performance computers has been widely discussed and speculated upon.
>
> To evaluate Cell's potential, computer scientists at the U.S.
> Department of Energy's Lawrence Berkeley National Laboratory evaluated the processor's performance in running several scientific application kernels, then compared this performance against other processor architectures. The results of the group's evaluation were presented in a paper at the ACM International Conference on Computing Frontiers, held May 2-6, 2006, in Ischia, Italy.
>
> The paper, "The Potential of the Cell Processor for Scientific Computing," was written by Samuel Williams, Leonid Oliker, Parry Husbands, Shoaib Kamil and Katherine Yelick, of Berkeley Lab's Future Technologies Group, and by John Shalf from NERSC.
>
> "Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency," the authors wrote in their paper. "We also conclude that Cell's heterogeneous multi-core implementation is inherently better suited to the HPC environment than homogeneous commodity multi-core processors."
>
> Cell, designed by a partnership of Sony, Toshiba, and IBM, is a high performance implementation of a software-controlled memory hierarchy in conjunction with the considerable floating point resources required for demanding numerical algorithms. Cell takes a radical departure from conventional multiprocessor or multi-core architectures. Instead of using identical cooperating commodity processors, it uses a conventional high performance PowerPC core that controls eight simple SIMD (single instruction, multiple data) cores, called synergistic processing elements (SPEs), where each SPE contains a synergistic processing unit (SPU), a local memory, and a memory flow controller.
>
> Despite its radical departure from mainstream general-purpose processor design, Cell is particularly compelling because it will be produced at such high volumes that it will be cost-competitive with commodity CPUs.
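[A quick aside on the arithmetic behind the peak double-precision figure the article cites below (eight SPEs at 3.2 GHz = 14.6 Gflop/s). One consistent reading, sketched here as a back-of-envelope check, is that each SPE's partially pipelined DP unit issues a 2-wide SIMD fused multiply-add (4 flops) only once every 7 cycles; that issue interval is my assumption for illustration, not a figure stated in this thread.]

```c
/* Back-of-envelope check of Cell's peak double-precision rate.
 * Assumption (not stated in this message): each SPE issues one 2-wide
 * SIMD fused multiply-add (4 flops) every 7 cycles, because the
 * first-generation DP unit is not fully pipelined. */
double cell_peak_dp_gflops(void)
{
    const double spes             = 8.0;       /* synergistic processing elements */
    const double clock_ghz        = 3.2;       /* SPE clock frequency             */
    const double flops_per_issue  = 2.0 * 2.0; /* 2-wide SIMD x FMA (mul + add)   */
    const double cycles_per_issue = 7.0;       /* assumed DP issue interval       */

    return spes * clock_ghz * flops_per_issue / cycles_per_issue;
}
```

[This works out to roughly 8 x 3.2 x 4 / 7 = 14.6 Gflop/s, matching the article's number; single precision, with a fully pipelined 4-wide FMA issuing every cycle, gives 8 x 3.2 x 8 = 204.8 Gflop/s, which is the fourteen-fold gap the article mentions.]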
> At the same time, the slowing pace of commodity microprocessor clock rates and increasing chip power demands have become a concern to computational scientists, encouraging the community to consider alternatives like STI Cell. The authors examined the potential of using the forthcoming STI Cell processor as a building block for future high-end parallel systems by investigating performance across several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations on regular grids, as well as 1D and 2D fast Fourier transformations.
>
> According to the authors, the current implementation of Cell is most often noted for its extremely high single-precision (32-bit) floating point performance, but the majority of scientific applications require double precision (64-bit). Although Cell's peak double precision performance is still impressive relative to its commodity peers (eight SPEs at 3.2GHz = 14.6 Gflop/s), the group quantified how modest hardware changes, which they named Cell+, could improve double precision performance.
>
> The authors developed a performance model for Cell and used it to show direct comparisons of Cell against the AMD Opteron, Intel Itanium2 and Cray X1 architectures. The performance model was then used to guide implementation development that was run on IBM's Full System Simulator in order to provide even more accurate performance estimates.
>
> The authors argue that Cell's three-level memory architecture, which decouples main memory accesses from computation and is explicitly managed by the software, provides several advantages over mainstream cache-based architectures. First, performance is more predictable, because the load time from an SPE's local store is constant. Second, long block transfers from off-chip DRAM can achieve a much higher percentage of memory bandwidth than individual cache-line loads.
> Finally, for predictable memory access patterns, communication and computation can be effectively overlapped by careful scheduling in software.
>
> "Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency," the authors wrote. While their current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking. On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors, despite the fact that Cell's peak double precision performance is fourteen times slower than its peak single precision performance. If Cell were to include at least one fully utilizable pipelined double precision floating point unit, as proposed in their Cell+ implementation, these speedups would easily double.
>
> The full paper can be read at: http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf
>
> The paper was written primarily by members of LBNL's Future Technologies Group, part of Berkeley Lab's Computational Research Division (http://crd.lbl.gov/), which creates computational tools and techniques that enable scientific breakthroughs by conducting applied research and development in computer science, computational science, and applied mathematics.
>
> -----
>
> Source: Lawrence Berkeley National Laboratory
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

--
Richard B. Walsh
Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
rbw@ahpcrc.org | 612.337.3467

-----------------------------------------------------------------------
This message (including any attachments) may contain proprietary or privileged information, the use and disclosure of which is legally restricted. If you have received this message in error, please notify the sender by reply message, do not otherwise distribute it, and delete this message, with all of its contents, from your files.
-----------------------------------------------------------------------