From the do-you-want-fries-with-that department
As processor frequencies continue to level off and mainstream processors keep sprouting cores, the end of the "frequency free lunch" has finally arrived. That is, in the good old days each new generation of processor would bring with it a faster system clock that would result in a "free" performance bump for many software applications -- no reprogramming needed. Can we ever get back to the good old days?
Due to the preponderance of multi-core processors, many developers are now tasked with making applications run in parallel -- and that free lunch became really expensive. In the HPC world, many applications were already parallel due to the use of cluster architectures and MPI. The cost for lunch was not all that much because using multiple cores (local or remote) was built into many applications. There is a difference, however, between a core and memory living in the same neighborhood (in the same processor) vs living in the next town (server) along the InfiniBand highway, but this could be managed with things like OpenMP/MPI hybrid computing models (in effect the inner loops are threaded on the local multi-core processor using OpenMP and the outer loops are managed across the cluster with MPI).
The lunch got more expensive when the GPU restaurant came to town. Extraordinarily good food served fast, but expensive in terms of programming. The lunch was not free, but who could resist the SIMD buffet! Of course, the GPU restaurant was not full service, and a local core was needed to set the table and get the food to the right customer. And just the other night, a new many-core restaurant landed in town that offers both full service and lots and lots of tables. In terms of software, the lunch at the new place is quite reasonable because its menu looks a lot like the one in the established multi-core joints that are everywhere.
All this free lunch talk does make one wonder about the cost of a good meal in the future. There are also some other cafes just an arm's length away that may be worth visiting in the search for an almost free lunch. Then there are some others, not on Main Street, that really have some interesting dishes.
Welcome to the Epiphany Bistro
For years Andreas Olofsson (@adapteva) of Adapteva Inc (Lexington, MA) has been working on his many-core Epiphany chip. In 2011, Adapteva produced a 16-core 65nm System-On-Chip (“Epiphany-III”) that became the basis for the popular Parallella board. The chip worked beyond expectations and is still being produced today for use in many projects including the $99 Parallella board. The second Epiphany product was a 28nm 64-core SOC (“Epiphany-IV”) that was completed in the summer of 2011 and demonstrated 70 GFLOPS/Watt processing efficiency. It was one of the most energy-efficient processors available at that time. The chip was sampled to a number of customers and partners, but was not produced in volume due to lack of funding. At that time, Adapteva also created a physical implementation of a 1024-core RISC processor array, but it was never taped out due to funding constraints.
This situation has changed for the better thanks to a DARPA grant that has allowed Adapteva to tape out the 1024-core Epiphany-V. The processors are being manufactured by TSMC (Taiwan Semiconductor Manufacturing Co) on a 16nm FinFET process and should be available in 4-5 months.
For those wishing to investigate the Epiphany architecture, the Parallella board is still available and has become the nucleus around which a growing community of users has formed. Most notable is the number of open software tools that have been ported to the platform. The Parallella board has the following features:
- 18-core credit card sized computer
- #1 in energy efficiency @ 5W
- 16-core Epiphany RISC SOC
- Zynq SOC (FPGA + ARM A9)
- Gigabit Ethernet
- 1GB SDRAM
- Micro-SD storage
- Up to 48 GPIO pins
- HDMI, USB (optional)
- Open source design files
- Runs Linux
- Starting at $99
Epiphany-V: A 1024 processor 64-bit RISC System-On-Chip
As mentioned, the Epiphany technology has taken a major leap forward. At 1024 cores, the announced Epiphany-V is gaining interest from many directions. The new version is based on the distributed shared-memory Epiphany architecture, which is composed of an array of RISC processors communicating via a low-latency mesh Network-on-Chip (NOC). Each node in the processor array is a complete RISC processor capable of running an operating system as a Multiple Instruction, Multiple Data ("MIMD") device. Epiphany uses a flat cache-less memory model in which all distributed memory is directly readable and writable by all processors in the system. Memory traffic crosses the mesh as packets: Epiphany packets are 136 bits wide and are transferred between neighboring nodes in one and a half clock cycles. Packets consist of 64 bits of data, 64 bits of address, and 8 bits of control. A read request puts a second 64-bit address in place of the data to indicate the destination address for the returned read data.
The Epiphany-V introduces a number of new capabilities compared to previous Epiphany products, including 64-bit memory addressing, 64-bit floating point operations, 2X the memory per processor, and custom ISAs for deep learning, communication, and cryptography. Adapteva will not disclose final power and frequency numbers until silicon returns, but based on simulations they can confirm that the performance should be in line with the 64-core Epiphany-IV chip adjusted for process shrink, core count, and feature changes. The following is a summary of the Epiphany-V features:
- 1024 64-bit RISC processors
- 64-bit memory architecture
- 64/32-bit IEEE floating point support
- 64MB of distributed on-chip memory
- 1024 programmable I/O signals
- Three 136-bit wide 2D mesh NOCs
- 2052 Independent Power Domains
- Support for up to 1 billion shared memory processors
- Binary compatibility with Epiphany III/IV chips
- Custom ISA extensions for deep learning, communication, and cryptography
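To make the packet format described above a bit more concrete, here is a minimal Python sketch that packs and unpacks a 136-bit packet. Only the field widths (64 bits of data, 64 bits of address, 8 bits of control) and the read-request behavior come from the description above. The field order, the control code values, and all function names are assumptions made up for illustration and are not taken from Adapteva documentation.

```python
from typing import Tuple

# Field widths as described in the text.
DATA_BITS = 64
ADDR_BITS = 64
CTRL_BITS = 8
PACKET_BITS = DATA_BITS + ADDR_BITS + CTRL_BITS  # 136 bits in total

DATA_MASK = (1 << DATA_BITS) - 1
ADDR_MASK = (1 << ADDR_BITS) - 1
CTRL_MASK = (1 << CTRL_BITS) - 1

# Hypothetical control codes; the real encoding is not given in the article.
CTRL_WRITE = 0x01
CTRL_READ = 0x02


def pack_packet(ctrl: int, addr: int, payload: int) -> int:
    """Pack the fields into one integer (assumed order: ctrl | addr | payload)."""
    if not 0 <= ctrl <= CTRL_MASK:
        raise ValueError("control field must fit in 8 bits")
    if not 0 <= addr <= ADDR_MASK:
        raise ValueError("address must fit in 64 bits")
    if not 0 <= payload <= DATA_MASK:
        raise ValueError("payload must fit in 64 bits")
    return (ctrl << (ADDR_BITS + DATA_BITS)) | (addr << DATA_BITS) | payload


def unpack_packet(packet: int) -> Tuple[int, int, int]:
    """Return (ctrl, addr, payload) from a packed 136-bit integer."""
    payload = packet & DATA_MASK
    addr = (packet >> DATA_BITS) & ADDR_MASK
    ctrl = (packet >> (ADDR_BITS + DATA_BITS)) & CTRL_MASK
    return ctrl, addr, payload


def make_write(dst_addr: int, value: int) -> int:
    """A write carries the destination address and the data to store."""
    return pack_packet(CTRL_WRITE, dst_addr, value)


def make_read(src_addr: int, return_addr: int) -> int:
    """A read carries a second address in the data slot: where the reply goes."""
    return pack_packet(CTRL_READ, src_addr, return_addr)
```

A call such as `make_read(0x2000, 0x3000)` asks for the word at address `0x2000` and names `0x3000` as the place to deliver the result, which is the "second address in place of the data" idea.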
Besides the technical achievements, the design was completed at 1/100th the cost of the status quo and demonstrates an 80x advantage in processor density and a 3.6x-15.8x advantage in memory density compared to state-of-the-art processors.
Figure One: Epiphany Mesh that can extend to one billion processors.
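A little back-of-the-envelope arithmetic helps put the feature list and the mesh in perspective. The sketch below uses only numbers quoted above (1024 cores, 64MB of on-chip memory, one and a half clock cycles per neighbor-to-neighbor transfer). It assumes that 64MB means 64 MiB, that the cores sit in a square 32x32 array, and that a packet takes the shortest row-then-column path with no contention. Those assumptions and the function names are mine, not Adapteva's.

```python
import math

CORES = 1024
ON_CHIP_MEMORY_BYTES = 64 * 1024 * 1024  # 64MB from the feature list (assumed MiB)
CYCLES_PER_HOP = 1.5                     # neighbor-to-neighbor transfer time


def memory_per_core(total_bytes: int, cores: int) -> int:
    """Evenly divide the distributed on-chip memory among the cores."""
    return total_bytes // cores


def mesh_side(cores: int) -> int:
    """Side length of a square mesh (assumes a perfect-square core count)."""
    side = math.isqrt(cores)
    if side * side != cores:
        raise ValueError("core count is not a perfect square")
    return side


def hops(src, dst) -> int:
    """Manhattan distance between two (row, col) mesh coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def transfer_cycles(src, dst) -> float:
    """Idealized, contention-free cycles for one packet to cross the mesh."""
    return hops(src, dst) * CYCLES_PER_HOP


if __name__ == "__main__":
    side = mesh_side(CORES)
    far_corner = (side - 1, side - 1)
    print("bytes per core:", memory_per_core(ON_CHIP_MEMORY_BYTES, CORES))
    print("mesh side:", side)
    print("corner-to-corner hops:", hops((0, 0), far_corner))
    print("corner-to-corner cycles:", transfer_cycles((0, 0), far_corner))
```

Under these assumptions each core owns 64 KiB of memory, and the longest trip across the chip is 62 hops, or roughly 93 clock cycles. That is why keeping data close to the core that uses it matters on this kind of architecture.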
Perhaps just as impressive as the cost-effective design of the Epiphany processor is the range of software tools that is already available. Any software developed for previous Epiphany processors will run on the new 1024-core Epiphany version. Now that is a free lunch. In addition to support for the standard GNU tool chain, the following projects are also available:
- OpenMP from University of Ioannina
- MPI from BDT/ARL
- OpenSHMEM from ARL
- OpenCL from BDT
- Erlang from Uppsala University
- Bulk Synchronous Parallel (BSP) from Coduin
- ePython from Nick Brown
- PAL from Adapteva/community
And Now, A Point Worth Noting
The gist of this tale is not so much to suggest that the Epiphany chip will play a role in HPC (too soon to tell), although thinking about a billion cores is a fun exercise and the Epiphany architecture should start to gain more traction ($99 gets you in the game). The point is that when there is a true parallel design in both hardware and software, then the free lunch may make a comeback (sort of). Of course there are limits. The Epiphany architecture was designed for good performance across a broad range of applications, but really excels at applications with high spatial and temporal locality of data and code (i.e. distributed memory codes). Amdahl's law notwithstanding, instead of increasing megahertz, massively increasing the number of general purpose cores may be the way forward and a chance at cheaper lunches. Of course applications need to be amenable to such an architecture, but at the places I dine, these seem to be all over the menu.
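For those who want to see what "Amdahl's law notwithstanding" means in numbers, here is a small sketch of the standard Amdahl's law formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of cores. The function name and the example values of p are illustrative choices, not measurements from any Epiphany system.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup predicted by Amdahl's law for a fixed problem size."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative parallel fractions; real applications must be measured.
    for p in (0.90, 0.95, 0.99):
        for n in (16, 64, 1024):
            print(f"p={p:.2f} cores={n:5d} speedup={amdahl_speedup(p, n):7.2f}")
```

The takeaway is that the serial part sets the ceiling. A code that is 95% parallel can never run more than 20 times faster no matter how many cores are thrown at it, so the cheap lunch goes to applications whose serial fraction is very small.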