Message-ID: <447C516A.1020908@ahpcrc.org>
Date: Tue, 30 May 2006 09:06:34 -0500
From: Richard Walsh <rbw@ahpcrc.org>
To: Eugen Leitl <eugen@leitl.org>
Subject: Re: [Beowulf] Cell in HPC

All,

This is an excellent review of the Cell measured against leading VLIW/EPIC (Itanium), superscalar (Opteron), and vector (Cray X1E) processors. From my reading, the messages are:

1. Vector operations deliver a higher percentage of peak at lower power than the alternatives on the HPC kernels (a comparison to an MTA-like highly multi-threaded architecture is missing). Most 32-bit numbers exceed the non-vector alternatives by substantially more than 2x in measured performance on the dense matrix, sparse matrix, stencil, and FFT kernels, so dual-core parts will not reach parity on sustained/measured (not peak) comparisons in the authors' view.

2. A three-tiered memory system with a simple local memory (a local store, like the old Cray-2's) that is user/software managed is preferable to cache in this context. Double buffering and prefetching into the local store reduce memory delays dramatically.

3. Cell's vector instructions from local memory need augmenting to include more "unaligned load" support, i.e., indexed and non-unit-stride capability (loads from main memory into the local store do appear to have these features).

4. Double precision (64-bit) operations are severely hampered by instruction issue delays, so performance at 64 bits drops off dramatically. The reviewers suggest a few minor modifications to the design to reduce this problem.

They also argue that the Cell chip will be produced in large enough quantity to compete on price with the multi-core superscalars. I am not so sure of this.
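The double buffering in point 2 is the key software trick: while the SPU computes on one local-store buffer, the memory flow controller streams the next chunk into the other, hiding transfer latency behind computation. A minimal sketch in plain C (buffer names and sizes are invented for illustration; the memcpy stands in for the asynchronous DMA get that a real SPE program would issue and later wait on):

```c
#include <string.h>

#define CHUNK   64          /* elements per local-store buffer (invented size) */
#define NCHUNKS 8           /* total chunks resident in "main memory"          */

/* Stand-in for the computation phase: reduce one resident chunk. */
static double process(const double *buf, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

double double_buffered_sum(const double *main_mem)
{
    double ls[2][CHUNK];    /* two halves of the software-managed local store */
    double total = 0.0;

    /* Prime the pipeline: fetch chunk 0 into buffer 0. */
    memcpy(ls[0], main_mem, CHUNK * sizeof(double));

    for (int c = 0; c < NCHUNKS; c++) {
        int cur = c & 1, nxt = cur ^ 1;

        /* Start "fetching" chunk c+1 into the idle buffer before computing.
         * On Cell this transfer would run asynchronously and overlap the
         * process() call below; here it is serialized but shows the issue
         * order and the two-buffer rotation. */
        if (c + 1 < NCHUNKS)
            memcpy(ls[nxt], main_mem + (c + 1) * CHUNK,
                   CHUNK * sizeof(double));

        total += process(ls[cur], CHUNK);   /* compute on the ready buffer */
    }
    return total;
}
```

With a truly asynchronous transfer, each fetch costs nothing as long as it finishes before process() does, which is why predictable access patterns matter so much in this model.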
Also, the issue of vector-type memory operations across a "commodity interconnect," in the context of the Beowulf distributed-memory architecture, is not addressed. Vector memory references are especially revealing of the limitations of the RDMA capabilities of current interconnects.

Cell is a data-parallel heavyweight pitted against the instruction-parallel multi-core alternatives, and the underlying question is how latency should be hidden: underneath stacks of independent/atomic instruction blocks (threads), which may or may not come from the same program, or within a pipeline of vector operations that stream data from memory. Applications with partitionable data and some kind of non-random reference pattern (most HPC applications) favor data parallelism and vectors, while workloads with more completely random references and large thread counts (graph algorithms), typical of the mixed-user environment of servers, favor thread-level instruction parallelism.

There is one microprocessor architecture I have seen, from MIT, that seems to combine both in a workable fashion: VTA (Vector Thread Architecture). I recommend the articles describing the VTA microprocessor out of Krste Asanovic's group at MIT. I think they have the ISA finished and are taping out the chip as I type.

Regards,

rbw

Eugen Leitl wrote:
> http://www.hpcwire.com/hpc/671376.html
>
> Researchers Analyze HPC Potential of Cell Processor
>
> Though it was designed as the heart of the upcoming Sony PlayStation3 game console, the STI Cell processor has created quite a stir in the computational science community, where the processor's potential as a building block for high performance computers has been widely discussed and speculated upon.
>
> To evaluate Cell's potential, computer scientists at the U.S.
> Department of Energy's Lawrence Berkeley National Laboratory evaluated the processor's performance in running several scientific application kernels, then compared this performance against other processor architectures. The results of the group's evaluation were presented in a paper at the ACM International Conference on Computing Frontiers, held May 2-6, 2006, in Ischia, Italy.
>
> The paper, "The Potential of the Cell Processor for Scientific Computing," was written by Samuel Williams, Leonid Oliker, Parry Husbands, Shoaib Kamil and Katherine Yelick, of Berkeley Lab's Future Technologies Group, and by John Shalf from NERSC.
>
> "Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency," the authors wrote in their paper. "We also conclude that Cell's heterogeneous multi-core implementation is inherently better suited to the HPC environment than homogeneous commodity multi-core processors."
>
> Cell, designed by a partnership of Sony, Toshiba, and IBM, is a high performance implementation of a software-controlled memory hierarchy in conjunction with the considerable floating point resources required for demanding numerical algorithms. Cell takes a radical departure from conventional multiprocessor or multi-core architectures. Instead of using identical cooperating commodity processors, it uses a conventional high performance PowerPC core that controls eight simple SIMD (single instruction, multiple data) cores, called synergistic processing elements (SPEs), where each SPE contains a synergistic processing unit (SPU), a local memory, and a memory flow controller.
>
> Despite its radical departure from mainstream general-purpose processor design, Cell is particularly compelling because it will be produced at such high volumes that it will be cost-competitive with commodity CPUs.
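[A quick aside on the arithmetic behind the peak double-precision figure the article cites below (eight SPEs at 3.2 GHz = 14.6 Gflop/s). One consistent reading, sketched here as a back-of-envelope check, is that each SPE's partially pipelined DP unit issues a 2-wide SIMD fused multiply-add (4 flops) only once every 7 cycles; that issue interval is my assumption for illustration, not a figure stated in this thread.]

```c
/* Back-of-envelope check of Cell's peak double-precision rate.
 * Assumption (not stated in this message): each SPE issues one 2-wide
 * SIMD fused multiply-add (4 flops) every 7 cycles, because the
 * first-generation DP unit is not fully pipelined. */
double cell_peak_dp_gflops(void)
{
    const double spes             = 8.0;       /* synergistic processing elements */
    const double clock_ghz        = 3.2;       /* SPE clock frequency             */
    const double flops_per_issue  = 2.0 * 2.0; /* 2-wide SIMD x FMA (mul + add)   */
    const double cycles_per_issue = 7.0;       /* assumed DP issue interval       */

    return spes * clock_ghz * flops_per_issue / cycles_per_issue;
}
```

[This works out to roughly 8 x 3.2 x 4 / 7 = 14.6 Gflop/s, matching the article's number; single precision, with a fully pipelined 4-wide FMA issuing every cycle, gives 8 x 3.2 x 8 = 204.8 Gflop/s, which is the fourteen-fold gap the article mentions.]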
> At the same time, the slowing pace of commodity microprocessor clock rates and increasing chip power demands have become a concern to computational scientists, encouraging the community to consider alternatives like STI Cell. The authors examined the potential of using the forthcoming STI Cell processor as a building block for future high-end parallel systems by investigating performance across several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations on regular grids, as well as 1D and 2D fast Fourier transformations.
>
> According to the authors, the current implementation of Cell is most often noted for its extremely high single-precision (32-bit) floating point performance, but the majority of scientific applications require double precision (64-bit). Although Cell's peak double precision performance is still impressive relative to its commodity peers (eight SPEs at 3.2GHz = 14.6 Gflop/s), the group quantified how modest hardware changes, which they named Cell+, could improve double precision performance.
>
> The authors developed a performance model for Cell and used it to show direct comparisons of Cell against the AMD Opteron, Intel Itanium2 and Cray X1 architectures. The performance model was then used to guide implementation development that was run on IBM's Full System Simulator in order to provide even more accurate performance estimates.
>
> The authors argue that Cell's three-level memory architecture, which decouples main memory accesses from computation and is explicitly managed by the software, provides several advantages over mainstream cache-based architectures. First, performance is more predictable, because the load time from an SPE's local store is constant. Second, long block transfers from off-chip DRAM can achieve a much higher percentage of memory bandwidth than individual cache-line loads.
> Finally, for predictable memory access patterns, communication and computation can be effectively overlapped by careful scheduling in software.
>
> "Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency," the authors wrote. While their current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking. On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors, despite the fact that Cell's peak double precision performance is fourteen times slower than its peak single precision performance. If Cell were to include at least one fully utilizable pipelined double precision floating point unit, as proposed in their Cell+ implementation, these speedups would easily double.
>
> The full paper can be read at: http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf
>
> The paper was written primarily by members of LBNL's Future Technologies Group, part of Berkeley Lab's Computational Research Division (http://crd.lbl.gov/), which creates computational tools and techniques that enable scientific breakthroughs by conducting applied research and development in computer science, computational science, and applied mathematics.
>
> -----
>
> Source: Lawrence Berkeley National Laboratory
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

--
Richard B. Walsh
Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
rbw@ahpcrc.org | 612.337.3467

-----------------------------------------------------------------------
This message (including any attachments) may contain proprietary or privileged information, the use and disclosure of which is legally restricted. If you have received this message in error, please notify the sender by reply message, do not otherwise distribute it, and delete this message, with all of its contents, from your files.
-----------------------------------------------------------------------