Design Article
The Nulticore effect
Jack Ganssle
12/8/2008 1:00 AM EST
- A handful of "embarrassingly parallel" problems can derive great
performance benefits from SMP.
- In many applications one can reduce power consumption by using more
processors at slower clock rates.
Actually, there is a third problem that multicore solves: the vendors' need to sell us more transistors as they continue to exploit Moore's Law.
Now a study in IEEE Spectrum shows that even for the classic embarrassingly parallel problems, like weather simulations, multicore offers little benefit. The curve in that article is priceless. As the number of cores grows from two to 64, performance plummets by a factor of five. Additional processors nullify each other.
Call it the Nulticore Effect.
One might think that more CPUs equals faster systems, but in traditional symmetric multiprocessing, groups of cores share the same memory bus, a bus that even with a single core is already as congested as Highway 101 at rush hour. Memory simply can't keep up with a single-cycle machine that can swallow a couple of instructions per nanosecond.
We all know this; it's the reason a modern processor is crammed full of complex circuits like pipelines and cache. Every access to the bus entails numerous wait states which bring the system to a screeching halt. Add more cores, all demanding access to that same bus, and system performance is bound to drop.
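A back-of-the-envelope model makes the bus ceiling concrete. The sketch below is an illustration, not a measurement: the function name `smp_speedup` and the parameter `bus_fraction` (the share of one core's run time spent occupying the shared bus) are assumptions introduced here. It captures only the saturation point; it does not reproduce the decline reported in the Spectrum article, which depends on effects this toy model leaves out.

```python
def smp_speedup(cores: int, bus_fraction: float) -> float:
    """Best-case speedup of identical cores that share one memory bus.

    bus_fraction is an assumed figure: the share of a single core's run
    time during which that core occupies the bus. The bus can carry at
    most 1 / bus_fraction cores' worth of traffic, so speedup stops there.
    """
    if cores < 1:
        raise ValueError("cores must be >= 1")
    if not 0.0 < bus_fraction <= 1.0:
        raise ValueError("bus_fraction must be in (0, 1]")
    # Below saturation each core adds full value; above it the bus is the limit.
    return min(float(cores), 1.0 / bus_fraction)


if __name__ == "__main__":
    # Hypothetical workload: each core keeps the bus busy 25% of the time.
    for n in (1, 2, 4, 8, 16, 64):
        print(n, smp_speedup(n, bus_fraction=0.25))
```

With a core that needs the bus a quarter of the time, the model says nothing is gained past four cores, however many more are added.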
Other problems surface. We know that absent scheduling algorithms like RMA (rate monotonic analysis) - which itself is highly problematic - preemptive multitasking is not deterministic. Though most embedded systems use preemptive multitasking, there's no way to ensure the system won't fail from a perfect storm of interrupts and task switches.
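For readers who have not met RMA, the usual textbook starting point is the Liu and Layland utilization bound for a single processor: n independent periodic tasks, prioritized by rate and with deadlines equal to their periods, are schedulable if total utilization does not exceed n(2^(1/n) - 1). The sketch below is a minimal check of that bound, not a full analysis; the function names and the example task set are hypothetical. The test is sufficient but not necessary, so failing it does not prove a task set will miss deadlines.

```python
def rm_utilization_bound(n: int) -> float:
    """Liu and Layland bound for n tasks under rate monotonic priorities."""
    if n < 1:
        raise ValueError("need at least one task")
    return n * (2 ** (1.0 / n) - 1)


def passes_rm_bound(tasks) -> bool:
    """tasks is a list of (worst_case_execution_time, period) pairs.

    Returns True when total utilization is within the bound, which
    guarantees schedulability under the assumptions stated above.
    Returns False when the bound is exceeded; that result is
    inconclusive, not a proof of failure.
    """
    utilization = sum(c / t for c, t in tasks)
    return utilization <= rm_utilization_bound(len(tasks))


if __name__ == "__main__":
    # Hypothetical task set: (execution time, period) in the same time unit.
    example = [(1, 4), (1, 5), (2, 10)]
    print(passes_rm_bound(example))
```

Note what the bound leaves out: interrupts, blocking on shared resources, and memory stalls from other cores all fall outside its assumptions.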
And it's hard - really hard in a complex system - to get multitasking right. Add in multiple cores, each of which is constantly blocking the others from memory, and determinism looks about as likely as every school kid's plan to become an NBA star.
Reentrantly sharing memory is tough enough with a single processor; when many share the same data the demands on developers to produce perfectly locked and reentrant code become overwhelming.
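As a reminder of what "perfectly locked" means even in the smallest case, here is a minimal sketch of one shared counter updated by several threads. Python is used only for brevity; the class and function names are invented for this illustration, and on an embedded target the same pattern would use whatever mutex the RTOS provides. Every read-modify-write of the shared value happens under the lock.

```python
import threading


class SharedCounter:
    """A counter that several threads may update concurrently."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write must be atomic with respect to other threads.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run_workers(num_threads: int, increments_each: int) -> int:
    """Start num_threads workers that each increment one shared counter."""
    counter = SharedCounter()

    def work():
        for _ in range(increments_each):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers(4, 1000))
```

One counter and one lock are easy. The difficulty grows with every additional shared structure, because each needs the same discipline applied without exception.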
Then there's the little issue of parallelizing programs, an unsolved problem that is to supercomputing what the holy grail is to the Knights Templar - plenty of rumors, lots of speculation, but no hard results.
There are a lot of smart people working on these problems and I've no doubt they will be solved at some point. But today a generally better approach is asymmetric multiprocessing, where each core has its own memory space. More on that later.
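The idea of private memory per core can be sketched in a few lines. This is an assumption-laden illustration rather than a description of any particular AMP design: threads stand in for cores, `queue.Queue` stands in for a hardware mailbox or similar channel, and the names `accumulator_core` and `run_amp_sketch` are made up for the example. Python threads do share an address space, so the privacy here is by convention: the running total lives only inside the worker, and the only way in or out is a message.

```python
import queue
import threading


def accumulator_core(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Stand-in for one core: owns its state and talks only via messages."""
    total = 0  # private to this worker; no other thread touches it
    while True:
        msg = inbox.get()
        if msg is None:  # None is the agreed shutdown message
            outbox.put(total)
            return
        total += msg


def run_amp_sketch(samples) -> int:
    """Send samples to the worker and collect its final total."""
    inbox = queue.Queue()
    outbox = queue.Queue()
    core = threading.Thread(target=accumulator_core, args=(inbox, outbox))
    core.start()
    for s in samples:
        inbox.put(s)
    inbox.put(None)  # ask the worker to report and stop
    result = outbox.get()
    core.join()
    return result


if __name__ == "__main__":
    print(run_amp_sketch([1, 2, 3, 4]))
```

No lock appears in the application code because no data is shared; the synchronization is confined to the queue.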
Jack G. Ganssle is a lecturer and consultant on embedded
development issues. He conducts seminars on embedded systems and helps
companies with their embedded challenges. Contact him at jack@ganssle.com. His website is www.ganssle.com.



krwada
12/9/2008 8:44 PM EST
What we got heah ... is a failyah to communicate ... between processors and memory that is!
We have successfully beaten the path from processor bandwidth limitation to peripheral, or interconnect bandwidth limitation. It appears as if we hit this limit fairly recently too.
igrbt
12/10/2008 2:53 AM EST
Interestingly, the same scalability problem on SMP systems was encountered in the 1960s: above four CPUs, performance did not rise linearly. At the time this, too, was considered academic.
The moral is simple - architecturally we did not advance too far.
LFHeller
12/10/2008 1:02 PM EST
XMOS, with a similar architecture to the Inmos transputer, and also designed by David May, is a good example of the MIMD approach, with each core having its own memory.
bugeye
12/10/2008 2:20 PM EST
It depends on the problem. There are plenty of examples that do scale nicely on SMP.
TechnoMarketeer
12/11/2008 9:25 AM EST
Hooray - the first time I have seen some sense spoken about the realities of multicore in embedded for a long time.
While none of what Jack has said is new news, it's amazing to see how much rubbish has been claimed by so many embedded vendors in this regard, 'almost' all of whom frankly are delivering half a solution, either the hardware or the software half. The solution to the problem of gaining benefit in terms of performance or power reduction is indeed a hardware + software challenge. In the main it's the classic case of the hardware vendors 'solving' the problem from their PoV then throwing the rest of the problem over the wall to the software engineering community. Half a solution is no solution.
None of this is new news to anyone who has operated in the HPC market or seen or been involved in supercomputer design. So in this case listen and learn from that community; they have analysed this problem almost to death, especially the memory bottleneck problem, and there are some very interesting tools and solutions out there to help address the problem. As yet, though, there is no complete solution unless you are a supercomputer supplier. And our embedded systems can't bear those costs yet.
Certain IP/semi vendors in particular are guilty of explaining the potential power benefits of their multicore processors by referring to applications that are "embarrassingly" parallel as worked examples. It's time a few reality checks were brought into this whole debate. Thanks, Jack!
In the meantime, if you have a parallelizable application like graphics, or perhaps multiple data ports all doing the same job, then indeed multicore has potential for you. For the other 98% of embedded designs, the AMP approach at the hardware level combined with sw segmentation is the only way forward for now... and frankly that's not new news either... SoCs have been architected with 3+ cores on average for at least 5 years.
I look forward to the AMP discussion.
Geoff
Uberuber
12/11/2008 9:57 AM EST
No surprise. Compare with the new "stream processor" type of multicore, which has been commercially viable for embedded. Why? They have no conventional hw caches (memory is managed by compiler/software) and can exploit hundreds of cores per thread per clock cycle with fine-grained parallelism, often in SIMD, with low overhead. Radically different architecture compared to conventional multicore. Check out nVidia or Stream Processors websites for more details. Still, it's going to be about the productivity of the tools. At least stream processing gives the tools a better chance to manage the critical CPU to memory traffic, and CPU synchronization. Programmers should not have to bother about scheduling memory traffic and "parallelizing".
FBG
12/11/2008 11:16 AM EST
Jack,
This has been around for a while - ILLIAC IV in the early 70s, for example. This is also why multiprocessor machines need architectural changes to increase memory bandwidth and why real multiprocessor servers (as opposed to PCs on steroids) implement things like wider memory busses, interleaved memory, and dedicated processor memory. And of course then there is the issue of software, but that's a whole different can of worms.
sreaves22645
12/11/2008 3:45 PM EST
Hello Jack,
As usual you are a breath of fresh air. I worked on Univac systems (418III,1108,100/62,1100/94) and I believe that Univac beat that horse to death as well. I think that is why you never saw an 1100/96 or an 1100/98 (6 or 8 CPU) water cooled mainframe. I'm sure that SOTA machine that I worked on in the mid to late 80's is now in the landfill or was recycled into scrap metal in China.
Sam Reaves
ESD editorial staff: SRambo
12/15/2008 10:24 AM EST
"Reentrantly sharing memory is tough enough with a single processor; when many share the same data the demands on developers to produce perfectly locked and reentrant code become overwhelming."
Your comment brought to mind a quote in the book, "Programming Erlang: Software for a Concurrent World" by Joe Armstrong, that I have just finished reading.
"If you have multiple processes sharing and modifying the *same* memory, you have a recipe for disaster -- madness lies here."
Erlang is used to build highly fault-tolerant switching systems, such as phone switches. Seems that the telephone companies figured out years ago the correct way to do multi-core/processor systems. After all, when is the last time you had to press the reset button on your Plain Old Telephone (POT)?
Erlang has many features that are useful in Embedded Systems, such as the ability to update running code. Phone Switches have to run for years without ever being taken out of service.
http://www.pragprog.com/titles/jaerlang/programming-erlang
--Bob Paddock
ASQ Certified Quality Software Engineer.
mescusag
3/24/2009 6:05 PM EDT
It is very much to be expected in a multicore environment that overworked memory limits performance, especially in parallel computing, since much still needs to be learned to bring parallel computing to the desktop.