Multithreading spin offered for VLIW processors
Kariyatil Krishnadas
1/25/2002 5:31 PM EST
BANGALORE, India - Researchers are pursuing a twist on very-long-instruction-word (VLIW) processors that exploits standard architectures, compilers and hardware to provide multithreading support. The architecture, dubbed Weld, was unveiled in Hyderabad, India, at the Eighth International Conference on High Performance Computing.
North Carolina State University researchers said their VLIW-based technique inserts instructions in a way that allows compilers to create multiple threads from a single program. By inserting "bork" (branch-and-fork) instructions, the compiler creates multiple threads as acyclic regions of the control-flow graph. At run-time, special hardware called the operation welder "welds" operations from those threads together to fill the holes, the issue slots the compiler could not fill.
Program counters and fetch units are duplicated to support multithreading, according to a conference paper by Emre Ozer, Thomas M. Conte and Saurabh Sharma of the electrical and computer engineering department of North Carolina State University at Raleigh. "The experimental results show that the Weld architecture attains a maximum 27 percent speedup [over] a single-threaded VLIW architecture," they wrote.
Variable memory latencies are a major problem for VLIW processors. Multithreading tolerates long-latency instructions and such run-time events as cache misses, branch mispredictions and exceptions. It has also been used to improve single-program performance by spawning program threads, such as loop iterations or acyclic code.
Weld aims at two goals: better utilization of processor resources during unpredictable run-time events and dynamic filling of issue slots that cannot be filled at compile time.
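The first goal can be illustrated with a toy cycle count. The sketch below is not the researchers' simulator; it assumes a hypothetical model in which each thread is a list of MultiOps, each annotated with the number of stall cycles it causes (a cache miss, say), and in which the highest-priority ready thread issues one MultiOp per cycle. The function name and the stall figures are invented for illustration.

```python
def cycles_to_finish(threads):
    """Toy model: threads[i][k] is the stall (in cycles) caused by thread i's
    k-th MultiOp. Thread 0 is the main thread and has the highest priority."""
    pcs = [0] * len(threads)          # next MultiOp index, one per thread
    ready_at = [0] * len(threads)     # first cycle each thread may issue again
    cycle = 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        # Scan threads in priority order and issue from the first ready one.
        for i, t in enumerate(threads):
            if pcs[i] < len(t) and ready_at[i] <= cycle:
                stall = t[pcs[i]]
                pcs[i] += 1
                ready_at[i] = cycle + 1 + stall
                break
        cycle += 1
    # Account for a stall that is still outstanding after the last issue.
    return max(cycle, max(ready_at))


main_thread = [0, 3, 0]   # assumed: second MultiOp misses for three cycles
spec_thread = [0, 0, 0]   # assumed: no stalls

serial = cycles_to_finish([main_thread]) + cycles_to_finish([spec_thread])
overlapped = cycles_to_finish([main_thread, spec_thread])
print(serial, overlapped)
```

Under these assumed numbers, running the two threads back to back takes 6 + 3 = 9 cycles, while letting the speculative thread issue during the main thread's stall finishes both in 6.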
"Unpredictable events that cannot be detected at compile time, such as cache misses, may stall a VLIW processor for numerous cycles. The Weld architecture tolerates those wasted cycles by issuing operations from other active threads when the processor stalls for an unpredictable run-time event with the main thread," the researchers reported.
"VLIW processors are also limited by the fact that they use a discrete scheduling window," they wrote. "VLIW compilers partition the program into several scheduling regions, and each scheduling region is scheduled by applying different instruction-level parallelism (ILP) optimization techniques. . . . The VLIW compiler cannot fill all the schedule slots in every MultiOp [a group of instructions that can be potentially executed in parallel], because the compiler cannot migrate operations from different scheduling regions.
"A hardware mechanism called the operation welder is introduced in this work to achieve our second goal. It merges operations from different threads in the issue buffer during each cycle to potentially eliminate empty issue slots. A scheduling region is a potential thread in our model. Executing operations from different scheduling regions and filling the issue slots from these regions simultaneously at run-time can increase resource utilization and performance."
Architectural details
The Weld architectural model presupposes that threads are generated from a single program by the compiler. It also presupposes that a single main thread and potentially several speculative threads will exist during run-time but that the main thread will have topmost priority. Each thread has its own program counter, fetch unit and register file. All threads share the branch predictor and instruction and data caches.
Weld consists of a basic five-stage VLIW pipeline. The fetch stage fetches MultiOps from the instruction cache, and the decode/weld stage decodes and welds them together. The operation welder is integrated into the decode stage.
The operand-read stage reads operands into the buffer for each thread and sends them to the functional units. The execute stage executes operations, and the write-back stage writes the results into the register file and data cache.
A new instruction and some extensions to the instruction set architecture (ISA) were needed to support multithreading in a VLIW architecture. The branch and fork operation was introduced to spawn threads and create hardware contexts for those threads. The target address is the address of the speculative thread.
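As a rough picture of what spawning a thread involves, the sketch below models a bork as allocating a new hardware context whose program counter is the target address. The class names, the cap on contexts and the choice to start the new thread with a copy of its parent's registers are assumptions made for illustration; the article does not say how Weld initializes a new context or what happens when none is free.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Context:
    pc: int                                   # per-thread program counter
    regs: Dict[str, int] = field(default_factory=dict)  # per-thread register file
    speculative: bool = True


class Core:
    def __init__(self, entry: int, max_contexts: int) -> None:
        self.max_contexts = max_contexts      # assumed hardware limit
        self.contexts: List[Context] = [Context(pc=entry, speculative=False)]

    def bork(self, parent: Context, target: int) -> Optional[Context]:
        """Spawn a speculative thread at `target`, or return None if no
        hardware context is free (assumed behavior)."""
        if len(self.contexts) >= self.max_contexts:
            return None
        # Assumption: the child begins with a copy of the parent's registers.
        child = Context(pc=target, regs=dict(parent.regs))
        self.contexts.append(child)
        return child


core = Core(entry=0, max_contexts=2)
main = core.contexts[0]
main.regs["r1"] = 7
child = core.bork(main, target=0x40)
```

Because speculative threads can themselves execute a bork, any context could be passed as the parent in this model.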
A separability bit and a synchronization bit were added to each MultiOp in the ISA. The former marks each MultiOp as separable or inseparable. The latter is set in the first MultiOp of each thread at compile-time to help synchronize threads at run-time.
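One way to read the two bits is sketched below. The article does not define separability precisely, so the code assumes that a separable MultiOp may have its operations issued across several cycles while an inseparable one must issue all at once. The field and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultiOp:
    ops: List[str]
    separable: bool = True   # assumed meaning: ops may issue in different cycles
    sync: bool = False       # set by the compiler on a thread's first MultiOp


def take_ops(mop: MultiOp, free_slots: int) -> List[str]:
    """Remove and return the operations of `mop` that may issue this cycle."""
    if mop.separable:
        n = min(free_slots, len(mop.ops))                  # issue what fits
    else:
        n = len(mop.ops) if len(mop.ops) <= free_slots else 0  # all or nothing
    taken, mop.ops = mop.ops[:n], mop.ops[n:]
    return taken


sep = MultiOp(["a", "b", "c"], sync=True)
insep = MultiOp(["x", "y", "z"], separable=False)
first = take_ops(sep, 2)      # two ops fit, one stays behind
blocked = take_ops(insep, 2)  # too few slots, nothing issues
whole = take_ops(insep, 3)    # enough slots, issues as a unit
```

Under this reading, the separability bit tells the welder whether it may split a MultiOp to fill a partially empty issue buffer.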
Parallel efforts
The researchers compared Weld with related approaches. Single-program speculative multithreading (SPSM), for example, speculatively spawns multiple paths in one program and simultaneously executes those paths or threads on a superscalar core. "In SPSM there is a main thread that can spawn many speculative threads, whereas speculative threads can also spawn speculative threads in Weld," the paper states. SPSM is for dynamic (superscalar) architectures.
Dynamic multithreading (DMT) processors "provide simultaneous multithreading on a superscalar core, with threads created by the hardware from a single program. Each thread has its own program counter, rename tables, trace buffer and load and store queues. All threads share the same register file, Icache, Dcache and branch predictor." DMT is similarly proposed for dynamically scheduled processors.
Multiscalar processors have many processing units, each with its own register file, Icache and functional units. In Weld, by contrast, threads share the functional units and caches. Like SPSM and DMT, multiscalar is proposed for dynamically scheduled processors.
Threaded multiple-path execution (TME) executes multiple alternate paths on a simultaneous-multithreading superscalar processor. It assigns the alternate paths of conditional branches to free hardware contexts, and it allows speculative loads. In Weld, by contrast, threads are created at compile time.
XIMD, a VLIW-like architecture, "has many functional units and a big global register file like the VLIW's, with every functional unit having an instruction sequencer to fetch instructions. A program is partitioned into many threads by the compiler or a partitioning tool. The XIMD compiler takes each thread and schedules it separately. Those separately scheduled threads are merged statically to reduce static code density or to optimize for execution time." Weld merges threads at run-time by leveraging dynamic events.
The researchers are also working on a compiler model for Weld that eliminates the speculative memory operations buffer to avoid "squashes" from load/store conflicts at the thread level.
Other conference papers covered topics ranging from enhancing branch prediction accuracy to perfecting performance analysis of mobile packet data services.
Additional coverage of the Eighth International Conference on High Performance Computing is available at EETimes.com.



