The Soul of Cell
If you've been following my ongoing commentary on the Cell Broadband Engine, then you realize that I've been fairly critical of the architecture and its implementation in the third iteration of Sony's PlayStation console. Not wanting to be an ignorant critic on the subject, I presented Dr. H. Peter Hofstee, chief architect of the Cell Synergistic Processor, with a handful of questions that he graciously answered.
Do you have an analogy for how data is manipulated by the Cell Broadband Engine? How is the Cell's process of internal data manipulation different from general purpose processors?
Stanford's Bill Dally has a nice analogy that explains the memory wall problem processors have run into. Imagine doing a plumbing project ... you start and you see you need a pipe ... so you drive to the store ... come back with a pipe, and you discover you need a fitting ... so you drive to the store ... come back with a fitting ... and discover you need solder ... (etc.) very inefficient! When microprocessors started, memory was just a few processor cycles away ... similar to having all you need in the cupboard. Today, main (DRAM) memory is hundreds of processor cycles away ... and getting things is like a drive across town to the plumbing store. What do you need to do when your supply is far away? Make a shopping list! This is exactly what the SPEs in Cell enable you to do. Instead of getting data from main memory right when you discover it is needed, you construct a list of what you need, and kick off a (DMA) processor that gets it for you. You can even create multiple lists, both of supplies you need, and stuff you're done with and want to put out there, so that you can always keep working.
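To put rough numbers on the shopping-list analogy, here is a minimal sketch in Python. It is an illustration added alongside the interview, not Dr. Hofstee's code and not a Cell API: the function names and both cycle counts (`TRIP_LATENCY`, `TRANSFER_COST`) are assumptions chosen only to show the shape of the trade-off.

```python
# Toy cost model of "drive to the store per item" versus "make a shopping list".
# Both constants are illustrative assumptions, not measured Cell figures.
TRIP_LATENCY = 500   # cycles for one round trip to main memory (assumed)
TRANSFER_COST = 8    # cycles to move one item once the trip is under way (assumed)


def fetch_on_demand(num_items: int) -> int:
    """Each item is requested only when it is found missing: one full trip per item."""
    return num_items * (TRIP_LATENCY + TRANSFER_COST)


def fetch_with_list(num_items: int) -> int:
    """All items are requested up front as one list: pay the latency once, then stream."""
    if num_items == 0:
        return 0
    return TRIP_LATENCY + num_items * TRANSFER_COST


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, fetch_on_demand(n), fetch_with_list(n))
```

In this model the two approaches cost the same for a single item, and the list pulls ahead as soon as more than one item is needed, because the trip latency is paid once rather than once per item.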
What do you see as the inherent problem with current architectures? PowerPC? x86? SPARC? How does the Cell Broadband Engine address these shortcomings?
The main problem with current architectures is the memory wall I explained above. Because the programs do not provide shopping lists, the only way to get more than one thing on the way from main memory is to guess ahead at what may be needed, a very difficult thing to do. Another analogy ... with, say, 512 cycles of latency to main memory, an 8-byte interface, a 64-byte memory access size, and a fully pipelined interface running at the processor frequency, you need 64 64-byte memory accesses in flight to fully utilize the available memory bandwidth. Most processors support only a handful. This looks like a situation where you have a bucket brigade with 64 people, but only a handful of buckets ... no way you will see efficient use of the people. This phenomenon is the reason that Cell achieves nearly two orders of magnitude better performance on applications where the problem comes down to collecting data from memory in a pattern that can be calculated, but isn't so trivial the hardware can guess it. A lot of problems are like that: fast Fourier transforms, volume rendering, raycasting and raytracing, and many others.
Some other problems, like the fact that single-thread processor performance isn't improving as fast as it used to, and the fact that almost all systems are really limited in their performance by the power the system allows, can be, and are being, addressed by building multi-core chips. Cell is multi-core, but what is unique about it is the fact that it has two different types of cores sharing memory, which allowed us to optimize each more for its own tasks.
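The bucket-brigade arithmetic in the answer above can be checked with a few lines of Python. This is a hedged sketch: the latency, interface width, and access size come from the interview, while the function names and the value of 4 used for "a handful" are assumptions made for illustration.

```python
def accesses_in_flight(latency_cycles: float,
                       bus_bytes_per_cycle: float,
                       access_bytes: float) -> float:
    """Outstanding accesses needed to keep a fully pipelined interface busy.

    Each access occupies the interface for access_bytes / bus_bytes_per_cycle
    cycles, so covering the whole latency window takes latency divided by that.
    """
    cycles_per_access = access_bytes / bus_bytes_per_cycle
    return latency_cycles / cycles_per_access


def bandwidth_utilization(outstanding: float,
                          latency_cycles: float,
                          bus_bytes_per_cycle: float,
                          access_bytes: float) -> float:
    """Fraction of peak bandwidth reachable with a given number of outstanding accesses."""
    needed = accesses_in_flight(latency_cycles, bus_bytes_per_cycle, access_bytes)
    return min(1.0, outstanding / needed)


if __name__ == "__main__":
    # Figures quoted in the interview: 512 cycles, 8-byte interface, 64-byte accesses.
    print(accesses_in_flight(512, 8, 64))
    # "A handful" is taken to be 4 here purely as an assumption.
    print(bandwidth_utilization(4, 512, 8, 64))
```

With the quoted figures each access ties up the interface for 8 cycles, so 512 / 8 = 64 accesses must be in flight. Under the assumed handful of 4, this simple model reaches about 6% of peak bandwidth.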
How many workable processor designs were discarded before resting on the final design of the Cell Broadband Engine? Did you find any patent issues constraining while working on the processor design?
In the first year in the design center we built a fully functional SPU (SPE minus the DMA unit). Much of it was ok, but there was still much to improve, so we changed a bunch of things, like the local store size. Towards the end of the first year we redid much of the chip architecture, making the chip much more programmable, fixing things (our first version of real-time partitioning was not what we wanted), and introducing new elements, like the security architecture. After that we mainly changed the chip configuration (like the number of SPEs) as we learned more about what would fit on a chip and how to best balance the chip and make it manufacturable.
IBM has vast experience in microprocessors and a deep patent portfolio, and much of what we did was new ... there are hundreds of new patents that resulted from this project ... not too many constraints. Some other things were really very old ... so not too much of a problem there either.
Could you elaborate on the internal workings and design of the Synergistic Processor Elements? Could you contrast the design of these specialized processors to the Power Processor Element?
The Power processor element (PPE) is a more conventional architecture ... it brings instructions and data in as they are needed. It relies on caches (cupboards) for good performance. A task like running the operating system runs well on the Power processor, and often we will use the PPE to define the work. The Power processor also guarantees us that it is easy to get started with Cell, as Cell is fully Power architecture compliant. The SPEs do not have a cache, but instead bring data from shared memory into a local store memory before operating on it. To make this work well, you tend to have to restructure your code (and bring out the shopping lists), but when you do so, the performance tends to be very high. So the two processors nicely complement one another.
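The restructuring described above, bringing data into a local store before operating on it, is often written as a double-buffered loop: request the next chunk, then compute on the one already fetched. The Python sketch below only shows that structure. `dma_get` is a hypothetical stand-in, not an SPE intrinsic, and in plain Python the "transfer" completes immediately rather than overlapping with computation as it would on real hardware.

```python
from typing import List, Sequence


def dma_get(main_memory: Sequence[int], start: int, size: int) -> List[int]:
    """Hypothetical stand-in for a DMA transfer from shared memory to a local buffer.

    Real hardware would run the copy asynchronously; here it finishes at once.
    """
    return list(main_memory[start:start + size])


def sum_of_squares_double_buffered(main_memory: Sequence[int], chunk_size: int) -> int:
    """Process main_memory chunk by chunk using two alternating local buffers."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    buffers: List[List[int]] = [[], []]
    current = 0
    offset = 0
    total = 0

    # Prime the pipeline: fetch the first chunk before any computation starts.
    buffers[current] = dma_get(main_memory, offset, chunk_size)

    while buffers[current]:
        nxt = 1 - current
        # Put the next chunk "on the shopping list" before computing on this one.
        buffers[nxt] = dma_get(main_memory, offset + chunk_size, chunk_size)
        # Compute only on data that is already in the local buffer.
        total += sum(x * x for x in buffers[current])
        offset += chunk_size
        current = nxt

    return total


if __name__ == "__main__":
    print(sum_of_squares_double_buffered(list(range(10)), 4))
```

The design point is that the compute step never touches `main_memory` directly. It works only on a buffer that was requested one iteration earlier, which is what would let a real transfer engine hide the memory latency behind the computation.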
How do you feel about backwards compatibility between processor generations? Does the need for backwards compatibility necessarily stifle innovation?
We had about 5 years to do the project, and we felt that without some form of backward compatibility, it would likely have taken us 10 years. By building on what already works well, the Power architecture and the operating systems, compilers, applications etc. that come with it, we could get a flying start. It may have seemed a little bit constraining in the very beginning, but we realized very quickly that building on Power really freed us up to work on the new things we cared about, like real-time, and the SPEs, and security etc., rather than hold us back. Again, the hundreds of patents I think are testimony to this.
What is your personal vision of the "ultimate processor design?" How does the current incarnation of the Cell Broadband Engine figure into your vision?
Actually, one of the nice things about the job of a processor architect is that the "ultimate" processor design does not exist! 25 years ago, when main memory was close, Cell would have been a really dumb idea, but when "main memory" was a spinning drum, decades before that, Cell could have looked pretty good! It is the job of the architect to respond to the ever changing physical realities, and best bridge these realities to where the programmers can take over and do their thing.
For now, I am really happy with Cell, and we should not change things too fast, so that we do not outrun the software community. Still, architects have to think pretty far ahead, since it takes a good while to build a new processor, so we are already beginning to think about what might come after this, and how to build on the ideas that were introduced in Cell. We believe it is very possible to build on these ideas, just like we built on Power when we introduced Cell, and get a very compelling result.