Is multicore hype or reality?
Jack Ganssle
2/1/2008 5:35 AM EST
Multicore processors are here to stay but memory is a bottleneck.
For many years, processors and memory evolved more or less in lockstep. Early CPUs like the Z80 required several clock cycles to execute even a NOP instruction. At the few-megahertz clock rates then common, processor speeds nicely matched EPROM and SRAM cycle times.
But for a time, memory speeds increased faster than CPU clock rates. The 8088/6 had a prefetcher to better balance fast memory to a slow processor. A very small (4 to 6 bytes) FIFO isolated the core from a bus interface unit (BIU). The BIU was free to prefetch the most-likely-needed next instruction if the core was busy doing something that didn't need bus activity. The BIU thus helped maintain a reasonable match between CPU and memory speeds.
Even by the late 1980s, processors were pretty well matched to memory. The 386, which (with the exception of floating-point instructions) has a programmer's model very much like Intel's latest high-end offerings, came out at 16 MHz. The three-cycle NOP instruction thus consumed 188 nsec, which partnered well with most zero wait-state memory devices.
But clock rates continued to increase while memory speeds started to stagnate. The 386 went to 40 MHz, and the 486 to over 100. Some of the philosophies of the reduced instruction set (RISC) movement, particularly single-clock instruction execution, were adopted by CISC vendors, further exacerbating the mismatch.
Vendors turned to Moore's Law as it became easier to add lots of transistors to processors to tame the memory bottleneck. Pipelines sucked more instructions on-chip, and extra logic executed parts of many instructions in parallel.
A single-clock 100 MHz processor consumes a word from memory every 10 nsec, but even today that's pretty speedy for RAM and impossible for flash. So on-chip cache appeared, again exploiting cheap integrated transistors. That, plus floating point and a few other nifty features, meant the 486's transistor budget was over four times as large as the 386's.
Pentium-class processors took speeds to unparalleled extremes, before long hitting two and three gigahertz. Memory devices at 0.33 nsec are impractical for a variety of reasons, not the least of which is the intractable problem of propagating those signals between chip packages. Few users would be content with a 3-GHz processor stalled by issuing 50 wait states for each memory read or write, so cache sizes increased more.
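The arithmetic behind these figures is easy to check. The short Python sketch below converts a clock rate to a cycle time and reproduces the numbers quoted so far: the 386's three-cycle NOP at 16 MHz, the 10 nsec word time of a single-clock 100 MHz part, and the 0.33 nsec cycle of a 3-GHz processor. The function names are illustrative only, and the last calculation assumes a deliberately simple model in which a memory access costs one clock plus one clock per wait state.

```python
def cycle_time_ns(clock_hz):
    """Return the duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def access_time_ns(clock_hz, wait_states):
    """Time for one memory access under a simplified model (an assumption):
    one clock for the access itself plus one clock per wait state."""
    return (1 + wait_states) * cycle_time_ns(clock_hz)


# 386 at 16 MHz executing a three-cycle NOP (quoted as 188 nsec)
nop_386_ns = 3 * cycle_time_ns(16e6)

# Single-clock 100 MHz processor: one word needed every cycle
word_100mhz_ns = cycle_time_ns(100e6)

# 3 GHz processor: memory would have to respond within one cycle
cycle_3ghz_ns = cycle_time_ns(3e9)

# The same 3 GHz processor stalled by 50 wait states per access
stalled_ns = access_time_ns(3e9, 50)

print(f"386 NOP at 16 MHz:        {nop_386_ns:.1f} ns")
print(f"100 MHz word time:        {word_100mhz_ns:.1f} ns")
print(f"3 GHz cycle time:         {cycle_3ghz_ns:.2f} ns")
print(f"3 GHz, 50 wait states:    {stalled_ns:.1f} ns per access")
```

Under that model, 50 wait states turn a third-of-a-nanosecond cycle into roughly 17 nsec per access, which is why a bigger cache is more attractive than a stalled core.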
But even on-chip, zero wait-state memory is expensive. Caches multiplied, with a small, fast L1 backed up by a slower L2 and in some cases even an L3. Yet more transistors implemented immensely complicated speculative branching algorithms, cache snooping and more, all in the interest of managing the cache and reducing inherently slow bus traffic.
And that's the situation today. Memory is much slower than processors and has been an essential bottleneck for fifteen years. Recently CPU speeds have stalled as well, limited now by power dissipation problems. As transistors switch, small inefficiencies convert a tiny bit of VCC to heat. And even an idle transistor leaks microscopic amounts of current. Small losses multiplied by hundreds of millions of devices mean very hot parts.
Ironically, vast numbers of the transistors on a modern processor do nothing most of the time. No more than a single line of the cache is active at any time, most of the logic to handle hundreds of different instructions stands idle till infrequently needed, and page translation units that manage gigabytes handle a single word at a time.
But those idle transistors do convert the power supply to waste heat. The "transistors are free" mantra is now stymied by power concerns. So limited memory speeds helped spawn hugely complex CPUs, but the resultant heat has curbed clock rates, formerly the biggest factor that gave us faster computers every year.



ESD editorial staff: SRambo
2/1/2008 12:58 PM EST
Dear Jack,
I read your article in EE Times and I couldn't agree more. (I write a direct reply because the website has no comment facility).
First of all, the term multicore is widely abused. SMP is not at all the same as putting a couple of CPUs together. As you point out clearly, even with a single CPU the bottleneck is the memory and, worse, for real-time it is the culprit. Even on this PC there is a factor of 100 between memory access to L1 vs. external DDRAM. Windows makes that even worse: we measured that the concurrency performance of a 1.6 GHz Windows PC is equivalent to that of a 15 MIPS microcontroller running a native RTOS. It is even rather unclear how Windows manages the scheduling and, as a word of consolation, Linux doesn't perform much better. While a dual core can help a little bit (if the code runs in a loop), I have seen benchmarks where the performance even goes down when using a quad-core. Very predictably, because of the shared memory issue.
A couple of things need to change:
- People / engineers should learn to distinguish between real-time on a desktop and real-time for an embedded device. You are doing an excellent job with e.g. your newsletter, but when are computer scientists going to start teaching it?
- Designers should stop developing shared memory architectures. Not only because of the speed mismatch: doing without shared memory has other benefits too, like physical decoupling between application tasks, lower power consumption and simplicity. No need for complex bus sharing protocols and cache coherency logic.
- Software engineers should learn that (embedded) software is concurrent by nature and that communication/interaction between "tasks" is as important as the tasks themselves. Communication means more than bus bandwidth, it also means latency.
What it comes down to is that embedded software should be designed from the beginning as concurrent programs. This fits well with a model driven architecture design process. The issue, I believe, is that computer scientists often don't see this concurrent and real-time aspect, and hardware designers often design synchronously. In other words, both groups think in terms of sequential loops.
You might say that this will never change because of legacy reasons. I believe that this is likely true for the IT market. But for the embedded world, there is little reason for that, as quite a lot of designs are started from scratch anyway. You might say that we don't have the programming model for it. This is true if one keeps searching for inspiration on the desktop. In reality we have the programming models. Just think about CSP. CSP has been associated with the INMOS transputer and its arcane occam programming language. It was very successful in a small group and failed because of wrong marketing, but its value remained.
I have spent most of my life applying this computing paradigm with success. But we called it a "pragmatic superset of CSP". Targets ranged from single chip micros to systems with a few thousand DSPs. We have now reinvented this concept and called it OpenComRTOS. It is a network centric RTOS, but it is also a programming paradigm. We used formal methods to develop it and the results are astonishing. We can fit it in 1 KB of code (SP), or a full-blown RTOS with MP support (events, semaphores, FIFOs, resources, memory pools, ...) in 5 KBytes. We have a demo where it runs distributed on two 16-bit micros, each with only a few KBytes. Another demo runs the same code (after recompilation) on top of a Windows node connected via the internet to a remote virtual server running Linux. We can transparently put a few tasks on these 16-bit micros and hook them into the network using a simple RS232 driver. The aim of this reply is not so much to promote OpenComRTOS as to show that "multicore" programming doesn't need to be an enigma. Most of the basic solutions were thought out some 30 years ago. Even Dijkstra had already solved most of the fundamental issues. If you design "parallel", there is no need to reverse engineer big sequential programs and you gain a lot. Actually, if some part is very compute intensive, you have two options. Either you split the data over multiple CPUs or, if you have a big vectorising CPU, you can sequentialise. But if you run out of cache, remember the first paragraphs above, as you will start losing performance rapidly.
Best regards,
Eric Verhulst
www.openLicenseSociety.org
mitME
2/4/2008 8:04 PM EST
Jack has given us another great article to ponder over. I continue to "think" of embedded systems as very small and limited in resources and application size/complexity. Only once in a while do I realize that some individuals live in a larger embedded world than I do.
I started in embedded apps with 8 bit uC's and have only progressed to 32 bit uC's in the 70MHz to 500-600MHz range. This article reminds me that GHz uC's and SMP/MMP design issues are just around the corner. Lastly, I remember when Amdahl's Law was for BIG IRON MAINFRAME issues only. Well, it seems I'm now working on them as an embedded hardware/software technocrat.
HolisticGlint
5/26/2008 4:23 AM EDT
Hi Jack - I'm not sure parallels can be drawn here between embedded and desktop multicore systems. Embedded software has always been closer to the hardware, so engineers can take advantage of interesting memory architectures such as in your Tensilica example. While many systems are SMP on the surface, in detail there is often a complex hierarchy of caches, TCMs and streaming ports that can reduce the memory bottleneck, allowing multiple processors to operate together very efficiently.
However, you are right about Amdahl's law - it is hard to see many embedded applications being able to take advantage of more than 4 or 8 processors, beyond running things like multiple independent network stacks in routers. That said, running several applications in parallel, each requiring several processors on chip (not too crazy - consider video VoIP on a mobile phone), with tens of cores could be incredibly efficient, if almost impossible to actually implement.
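For readers who want to see why Amdahl's law bites so quickly, here is a minimal Python sketch. It uses the standard form of the law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. The 80 percent parallel fraction is an assumption picked purely for illustration, not a figure from the article or from any measured embedded workload.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup predicted by Amdahl's law.

    parallel_fraction: share of the work (0.0 to 1.0) that can be parallelized.
    cores: number of processors working on the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed workload: 80 percent parallelizable (illustrative value only)
ASSUMED_PARALLEL_FRACTION = 0.8

for cores in (1, 2, 4, 8, 16, 64):
    speedup = amdahl_speedup(ASSUMED_PARALLEL_FRACTION, cores)
    print(f"{cores:3d} cores -> speedup {speedup:.2f}x")
```

With that assumed fraction, the speedup can never exceed 5x however many cores are added, because the 20 percent serial part always runs on one core. Most of the achievable gain is already in hand at 4 to 8 cores.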
gheorghe
5/26/2008 1:48 PM EDT
Multicore is not the solution. We are still thinking in terms of the current programming model, and that is the mistake. In the future we will work with a new software model -- the Informational Individual! The processor will be built on the structure and functionality of the Informational Individual, not on a language like C++. We are still thinking as we did in 1970. The problem now lies in the way we think about software. That is the barrier.
jonnybegoode
6/15/2009 2:00 PM EDT
There are pros and cons to shared memory architectures depending on the application, so I wouldn't say no one should design with it. It is, ultimately, a more versatile design if your workload varies significantly. If your processor cores act as nothing more than streaming pipelines -- data goes in at predictable rates, data comes out at predictable rates -- such as those in a video processor, then having separate memory interfaces works perfectly because that is an "embarrassingly parallel" task.
However, let's say you had a more dynamic execution environment. You have three wireless interfaces, each one capable of receiving anywhere from megabits to just a few kilobits of data. If two are active at once -- say, a WiFi connection and a Bluetooth one, humor me here -- then you could potentially split the processing task for each to one processor with its own dedicated memory.
But if all 3 are active? Moreover, what do you design for in terms of performance goals? Do you want to be able to handle simultaneous LAN + WiFi (both in the megabits/sec range) for tethering? Do you want to have 3 separate processors (taking up die space) and 3 separate memory interfaces (at least two of which will be sitting idle most of the time)?
Wouldn't it be great if the WiFi processor could have access to all the available memory bandwidth for packet buffering when only the WiFi connection is active?
A possible solution -- and the way forward as I see it -- is to virtualize memory access. Instead of the processor and peripheral devices directly accessing memory, have a crossbar or mux-style architecture where an arbiter translates accesses from each core device and does interleaving/spreading to multiple memory banks. This increases latency (caching can be used to reduce this, and we're talking core clock latencies) but is significantly more versatile as it offers the benefits of scalable bandwidth from multiple memory connections with the flexibility of shared memory architectures.
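The interleaving part of that idea can be sketched in a few lines of Python. This is only a model of the address mapping an arbiter might perform, not a description of any real crossbar: the bank count, the 64-byte interleave granularity and the function names are all hypothetical, and arbitration timing and the added latency are not modeled at all.

```python
from collections import Counter

NUM_BANKS = 4     # hypothetical number of memory banks behind the arbiter
LINE_BYTES = 64   # hypothetical interleave granularity in bytes


def route(address, num_banks=NUM_BANKS, line_bytes=LINE_BYTES):
    """Map a flat address to (bank index, byte offset inside that bank).

    Consecutive lines are dealt out to the banks round-robin, so a
    sequential stream touches every bank in turn.
    """
    line = address // line_bytes          # which interleave line
    bank = line % num_banks               # round-robin bank selection
    line_in_bank = line // num_banks      # position of the line in its bank
    offset = line_in_bank * line_bytes + address % line_bytes
    return bank, offset


def bank_load(addresses):
    """Count how many of the given accesses land on each bank."""
    return Counter(route(address)[0] for address in addresses)


# One active core streaming a 4 KB packet buffer, one access per line
stream = range(0, 4096, LINE_BYTES)
load = bank_load(stream)

for bank in sorted(load):
    print(f"bank {bank}: {load[bank]} accesses")
```

In this model a single active core sees its 64 accesses spread evenly over all four banks, so it can draw on the bandwidth of every memory interface rather than just one, which is the flexibility argued for above.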