Design Article
Dodging Amdahl's Law with message passing, FPGA-based, parallel processing
Dave Strenski and Brian Durwood
2/24/2010 4:02 PM EST
Enter the unintended consequence of scaling. Amdahl's law says that as you add more processors, you get bogged down by more overhead. Basically, the Nth bricklayer you add to build a brick wall begins to slow things down because all the bricklayers are reaching for bricks off the same pile and get in each other's way. Add another N bricklayers and it just gets worse. So the idea is to complement the original processor (the first bricklayer) with a co-processor that makes that bricklayer more efficient (faster), independent of any other bricklayer. Imagine a machine that hands the bricklayer a pre-cemented brick, so all they need to do is place it. Or, there is always the old analogy:
"I know how to make 4 horses pull a cart - I don't know how to make 1024 chickens do it."
--Enrico Clementi
Using co-processors dodges Amdahl's law by using more powerful nodes, so fewer of them are needed to reach the same level of performance. While this approach is successful, it puts more burden on the programmer, who must adopt a heterogeneous programming model and implement it successfully both on a given node and across multiple nodes. How does the programmer deploy the algorithm in this new environment? Can it be emulated in one simulation? How does the programmer debug a multi-node program in which every node uses co-processors? This article will discuss these basics within the tool flow and then focus primarily on memory mapping issues, both at the low end of FPGA-enabled co-processing and at the high end of thousand-processor arrays.
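The "more powerful nodes, fewer of them" argument can be illustrated with the textbook form of Amdahl's law, speedup = 1 / (s + p/N), where s is the serial fraction of the runtime, p = 1 - s is the parallel fraction, and N is the node count. The sketch below is not taken from the authors' tool flow. The function names and the `node_speedup` parameter are hypothetical, and the model assumes, optimistically, that a co-processor accelerates a node's serial and parallel work by the same factor and that message passing costs nothing.

```python
import math
from typing import Optional


def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Textbook Amdahl's law: speedup over a single node when only
    `parallel_fraction` of the single-node runtime scales with node count."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


def nodes_needed(parallel_fraction: float, target: float,
                 node_speedup: float = 1.0) -> Optional[int]:
    """Smallest node count whose total speedup reaches `target`,
    or None if no node count can reach it.

    ASSUMPTION (hypothetical model): a co-processor makes each node
    `node_speedup` times faster on serial and parallel work alike, so the
    total speedup is node_speedup * amdahl_speedup(parallel_fraction, nodes).
    """
    serial_fraction = 1.0 - parallel_fraction
    # Solve node_speedup / (serial + parallel / N) >= target for N.
    headroom = node_speedup / target - serial_fraction
    if headroom <= 0:
        return None  # the serial fraction alone already exceeds the budget
    return max(1, math.ceil(parallel_fraction / headroom))


if __name__ == "__main__":
    # Illustrative numbers only: a program that is 95% parallel.
    print("plain nodes for 8x:      ", nodes_needed(0.95, 8.0))
    print("4x-faster nodes for 8x:  ", nodes_needed(0.95, 8.0, node_speedup=4.0))
    print("plain nodes for 25x:     ", nodes_needed(0.95, 25.0))
```

Under these assumptions, a program that is 95% parallel needs 13 plain nodes to reach an 8x speedup but only 3 nodes that are each 4x faster, while no number of plain nodes reaches 25x because the serial 5% caps the speedup below 20x. A real system would also pay communication overhead that grows with node count, which this sketch deliberately leaves out.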
In our vision of heterogeneous processing, FPGAs are tightly coupled with one or more microprocessors on a motherboard, sharing a common memory space. Distributed global memory cannot be accessed directly (by design); it is reached instead through a message passing interface such as MPI, across an interconnect like GigE or InfiniBand, or a custom high-performance interconnect like Cray's SeaStar network. There used to be multiple flavors of message passing interfaces, but MPI is now the most common. Other parallel programming models such as OpenMP and PThreads are alternatives, but they require shared memory on the nodes that host the FPGAs. There are also PGAS (Partitioned Global Address Space) models like SHMEM, UPC, and Co-Array Fortran. These give the programmer a one-sided messaging model that provides a global address space across the whole machine, but without cache coherency.




stephendoyle
3/1/2010 7:02 AM EST
I don't think that you are really dodging Amdahl's law. There is still overhead associated with the message passing, but by keeping the number of nodes down you are reducing the impact of the overhead on the overall system ... exactly in line with Amdahl's law.
TheMerc
3/11/2010 4:22 PM EST
Sorry guys. We solved this issue at NASA back in 1982. Unfortunately, we were so far ahead of the rest of the world (256 processor heterogeneous system) back then that no one ever read the papers we published and you probably cannot find them now.
The only thing that makes it worse is that software developers have become so reliant on "canned" Operating Systems and Development Environments (IDEs) that they don't think for themselves about how the overall system really plays together.
devbisme
3/11/2010 4:48 PM EST
I thought Amdahl's law related the maximum speedup of a program to the percentage that was inherently sequential. So a program that was 5% sequential could never be sped up more than 20x, no matter how many FPGAs, microprocessors or fast MPIs were thrown at it. Or is this article referring to some other form of Amdahl's law?