
Building your own RISC simulator

Vijaya Sagar Vinnakota, Senior Software Engineer, Wipro Technologies Ltd., Bangalore, India

3/7/2002 4:44 PM EST

Designing, developing and testing a processor takes a long time, during which many models are made to fine-tune its functionality and performance. These models simulate the processor's behavior at various levels of detail. For instance, typical field-programmable gate array (FPGA) models match their processor's functionality but not its timing characteristics. Yet these models help the designers identify and correct most of the flaws.

The production of hardware models is usually discontinued after the processor is proven and accepted in the market. However, the software models, more popularly known as simulators, continue to be used, enhanced and produced as long as the processor is in use. Simulators have certain limitations: they cannot exactly reproduce time-critical behavior such as interrupt latency and bus cycles. Even so, they serve as close functional approximations of, and inexpensive alternatives to, the processors, their reference hardware boards and the associated environment.

It is fairly trivial to design a processor simulator as a simple transformation function, or mapping, between the processor's instruction set (Ip) and the instruction set (Ih) of the simulator's host machine. This mapping may simply be based on a lookup table if Ip is a functional subset of Ih, that is, if there is a one-to-one mapping between Ip and Ih, with allowance for differences in instruction formats. If the two instruction sets are significantly different from each other, a slightly more involved mapping has to be employed. In this case, each instruction of Ip has to be implemented in terms of two or more instructions from Ih.

These mappings can be implemented by designing the simulator as an interpreter for the instruction stream of a program written for the target processor. The simulator can take as input either the executable instructions of Ip or their assembler mnemonics. In either case, interpretation is easier if the simulator is written in an intermediate high-level language (HL) that is supported by the host. The translation from HL to Ih is best left to the host's HL translator.
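The table-driven interpreter described above might be sketched in C as follows. This is only an illustration: the opcode numbering, the `insn_t` layout and the handler names are assumptions made for the example and are not part of any Crisp specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes and decoded-instruction layout (illustrative only). */
enum { OP_ADD, OP_SUB, OP_HALT, OP_COUNT };

typedef struct {
    uint8_t op;   /* index into the handler table */
    uint8_t rd;   /* destination register number */
    uint8_t rs;   /* source register number */
} insn_t;

static uint32_t regs[16];

typedef void (*handler_t)(const insn_t *);

static void do_add(const insn_t *i) { regs[i->rd] += regs[i->rs]; }
static void do_sub(const insn_t *i) { regs[i->rd] -= regs[i->rs]; }

/* One entry per target opcode (Ip). The host compiler turns each handler
 * into host instructions (Ih), so the simulator never emits Ih itself. */
static const handler_t handlers[OP_COUNT] = {
    [OP_ADD]  = do_add,
    [OP_SUB]  = do_sub,
    [OP_HALT] = NULL,
};

/* Interpret until OP_HALT or the end of the program.
 * Returns the number of instructions executed. */
size_t interpret(const insn_t *prog, size_t len)
{
    size_t executed = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const insn_t *i = &prog[pc];
        assert(i->op < OP_COUNT && i->rd < 16 && i->rs < 16);
        if (i->op == OP_HALT)
            break;
        handlers[i->op](i);
        executed++;
    }
    return executed;
}
```

Adding an instruction to such a simulator then amounts to writing one handler and adding one table entry.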

The simple instruction-mapping approach fits most normal programming tasks. On a closer look, however, it becomes evident that there is more to simulating a processor than implementing its instruction set. A fine-grain behavioral simulation should model the key functional blocks and macro blocks that make up the processor.

Typical blocks that constitute a RISC processor include the arithmetic/logic unit (ALU), instruction decoder, processor control logic, register files, instruction pipeline, barrel shifters, multipliers, write buffers and internal buses. Depending on the target application or users, a simulator designer has to decide which of these blocks to model in the simulator. For instance, if the simulator is to be used for detailed clock-cycle-level profiling, it must include a good model of the instruction pipeline and its clock.

Rather than look at a specific commercial architecture, one way to focus on the architectural, functional and simulation issues involved is to build a behavior model of a hypothetical RISC processor, in this case, called Crisp. In any design, the key elements that must be modeled are the clock, memory interface, execution unit, ALU and the degree of pipelining and parallelism among the components.

For a real processor, a clock signal provides the heartbeat. Each instruction takes a pre-designed number of clock cycles to complete. Such a clock is not an essential requirement for building a software model of the processor. Yet, instruction-level profiling and fine-grain performance analysis of programs will be difficult if such a model makes no provision for a clock. Also, as will be seen later, a model with a clock eases simulating the behavior of an instruction pipeline.

While the hardware design of a system clock is fairly complicated and involves high precision engineering for the oscillator and phase-locked loops (PLL) for fine-tuning, its software equivalent can be modeled very easily. A system-wide counter can act as the clock with its value being updated at appropriate stages of executing each instruction.

Figure: Code framework for simulating a RISC microprocessor. The program maps RISC instructions to a subset of the host computer's instructions. (Source: Wipro Technologies Ltd.)

Letting instruction execution update the counter is the opposite of what happens on a real processor, where the clock drives instruction execution. However, letting the instruction execution phases drive the clock is a good enough approach for a software simulator.
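A counter-based clock of this kind might look like the following sketch. The accessor names `get_clock_ticks()` and `set_clock_ticks()` are the ones the article mentions for profiling; the per-phase cycle costs and the phase function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* System-wide clock: a plain counter, advanced by the execution phases. */
static uint64_t clock_ticks;

uint64_t get_clock_ticks(void)       { return clock_ticks; }
void     set_clock_ticks(uint64_t t) { clock_ticks = t; }

/* Assumed cost of each phase, in cycles (illustrative values only). */
enum { FETCH_CYCLES = 1, DECODE_CYCLES = 1, EXECUTE_CYCLES = 1 };

static void tick(unsigned cycles)
{
    assert(clock_ticks <= UINT64_MAX - cycles);   /* the counter must not wrap */
    clock_ticks += cycles;
}

/* Each phase would do its work (omitted here) and then advance the clock,
 * so execution drives the clock rather than the other way round. */
void fetch(void)   { tick(FETCH_CYCLES); }
void decode(void)  { tick(DECODE_CYCLES); }
void execute(void) { tick(EXECUTE_CYCLES); }

void step_instruction(void)
{
    fetch();
    decode();
    execute();
}
```

With these assumed costs, each call to `step_instruction()` advances the clock by three ticks.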

It might be worthwhile to consider using a floating-point value for the clock counter so as to represent half cycles, quarter cycles or any other intermediate points within a clock cycle for very fine-grain timing analysis. For instance, the RD and WR signals go high or low at set points in a cycle, and the data and address buses contain valid data only during a specific portion of the cycle. Our hypothetical CPU, Crisp, receives its clock from an external source such as a PLL.

Memory is best modeled as an array of data words. A more sophisticated approach would be to model memory as an abstract data type with features such as separate program and data memories, write protection and a storage hierarchy that includes a TLB, multilevel cache, primary memory and secondary memory.
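A minimal sketch of memory as an abstract data type with optional write protection is shown below. The memory size, the split between program and data words, and all of the names are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS  1024u   /* assumed memory size, in 32-bit words */
#define PROG_WORDS 256u    /* assumed: the low words hold the program */

typedef struct {
    uint32_t words[MEM_WORDS];
    bool     protect_program;   /* when true, stores below PROG_WORDS fail */
} memory_t;

/* Read the word at a word index. Returns false if the index is out of range. */
bool mem_read(const memory_t *m, uint32_t index, uint32_t *out)
{
    assert(m != NULL && out != NULL);
    if (index >= MEM_WORDS)
        return false;
    *out = m->words[index];
    return true;
}

/* Write a word. Returns false if the index is out of range or protected. */
bool mem_write(memory_t *m, uint32_t index, uint32_t value)
{
    assert(m != NULL);
    if (index >= MEM_WORDS)
        return false;
    if (m->protect_program && index < PROG_WORDS)
        return false;
    m->words[index] = value;
    return true;
}
```

Because callers only see `mem_read()` and `mem_write()`, a cache or TLB model could later be added behind the same interface without touching the execution unit.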

Registers can be treated as an extension to the memory model. Register files can be supported by a two-dimensional array of data words, with one column per register.

Crisp has 15 general-purpose registers named r0 through r14. By convention, r13 is used as the stack pointer and r14 as the link register for procedure calls. A special register, r15, serves as the program counter (instruction pointer). All of these registers are 32 bits wide.
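The Crisp register set could be represented roughly as follows. The register numbers come from the description above; the type and function names are hypothetical, only a single register bank is shown, and the 4-byte instruction size used in `call()` is an assumption, since the instruction width is not specified here.

```c
#include <assert.h>
#include <stdint.h>

enum {
    NUM_GPRS = 15,   /* r0 through r14 */
    REG_SP   = 13,   /* stack pointer, by convention */
    REG_LR   = 14,   /* link register, by convention */
    REG_PC   = 15,   /* program counter */
    NUM_REGS = 16
};

typedef struct {
    uint32_t r[NUM_REGS];   /* every register is 32 bits wide */
} regfile_t;

uint32_t reg_read(const regfile_t *rf, unsigned n)
{
    assert(n < NUM_REGS);
    return rf->r[n];
}

void reg_write(regfile_t *rf, unsigned n, uint32_t v)
{
    assert(n < NUM_REGS);
    rf->r[n] = v;
}

/* Procedure call: save the return address in the link register, then jump.
 * Assumes fixed 4-byte instructions (an assumption of this sketch). */
void call(regfile_t *rf, uint32_t target)
{
    rf->r[REG_LR] = rf->r[REG_PC] + 4u;
    rf->r[REG_PC] = target;
}
```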

The execution unit can be modeled as a mapping of the instruction set of the processor being modeled to that of the host processor, or as a simple translation of the semantics of a model instruction into a language construct that can be interpreted on the host processor.

An example of a model instruction:

operator operand_1 operand_2

A 'C' translation:

operator(operand_1, operand_2)

Introducing a function call as one more level of indirection between the model instruction and its translation may seem unnecessary. Its utility becomes evident once it is realized that different types of operators and operands involve different processor subsystems:

add r0,            ; involves only registers and ALU
add r0, [r1]       ; involves registers, memory and ALU
mov r0, 0x10       ; involves only registers (instruction register and r0)
mov [r0], 0x10     ; involves registers and memory
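One way to let the operator functions reach only the subsystems they need is to tag each operand with its kind, as in the sketch below. The operand encoding, the function names and the word-indexed memory that wraps around at `MEM_WORDS` are all simplifying assumptions of this example.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 256u   /* assumed size; addresses are word indexes here */

static uint32_t reg[16];
static uint32_t mem[MEM_WORDS];

/* Hypothetical operand encoding for this sketch. */
typedef enum { OPND_REG, OPND_MEM, OPND_IMM } opnd_kind_t;

typedef struct {
    opnd_kind_t kind;
    uint32_t    value;   /* register number, or the immediate itself */
} operand_t;

/* Fetch an operand's value. Only OPND_MEM touches the memory model. */
static uint32_t load(operand_t o)
{
    switch (o.kind) {
    case OPND_REG: assert(o.value < 16); return reg[o.value];
    case OPND_MEM: assert(o.value < 16); return mem[reg[o.value] % MEM_WORDS];
    default:       return o.value;   /* immediate: comes with the instruction */
    }
}

/* Store to a register, or to the memory word that a register points at. */
static void store(operand_t o, uint32_t v)
{
    assert(o.kind != OPND_IMM && o.value < 16);
    if (o.kind == OPND_REG)
        reg[o.value] = v;
    else
        mem[reg[o.value] % MEM_WORDS] = v;
}

/* Operator functions: the execution unit calls these per instruction. */
void op_mov(operand_t dst, operand_t src) { store(dst, load(src)); }
void op_add(operand_t dst, operand_t src) { store(dst, load(dst) + load(src)); }
```

Here `add r0, [r1]` would become `op_add` with a register destination and a memory source, so the memory model is consulted only for that form.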

In a typical program, ALU operations are second only to memory operations in frequency. The ALU can be modeled along the same lines as the execution unit. The operators of the processor being modeled are mapped onto those of the host processor or onto those of any language understood on the host processor, for example:

Model instruction:

add r0, r1

Execution Unit model:

_add(_reg_r0, _reg_r1)

ALU model:

return (_reg_r0 += _reg_r1);

Crisp does not have a multiplier but has a barrel shifter that performs shifts of 1 to 32 bits in a single cycle. Most Crisp instructions use a three-address format, and the assembler fills in any unspecified operands with default values.
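A barrel shifter with this 1-to-32-bit range might be modeled as below. The set of shift operations (logical left, logical right, rotate right) is an assumption, since the article does not list them. Note that C leaves a 32-bit shift by 32 positions undefined, so that case has to be handled explicitly rather than passed to the host's shift operator.

```c
#include <assert.h>
#include <stdint.h>

/* Logical shift left by 1..32 bits. A shift by 32 clears the word. */
uint32_t shl(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 32);
    return n == 32 ? 0u : x << n;
}

/* Logical shift right by 1..32 bits. A shift by 32 clears the word. */
uint32_t shr(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 32);
    return n == 32 ? 0u : x >> n;
}

/* Rotate right by 1..32 bits. Rotating by 32 returns x unchanged. */
uint32_t ror(uint32_t x, unsigned n)
{
    assert(n >= 1 && n <= 32);
    n &= 31u;
    return n == 0 ? x : (x >> n) | (x << (32u - n));
}
```

Since the hardware completes any of these in a single cycle, the model would charge one clock tick regardless of the shift length.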

Most modern processors have a three- to six-stage instruction execution pipeline. The pipeline helps maximize the utilization of the processor's components, which share the same clock but otherwise function in parallel and independently of each other.

A software model need not simulate parallelism in real-world time. It is necessary and sufficient that the various components of the processor run in parallel with respect to the software clock available in the model.

Crisp employs a three-stage fetch-decode-execute pipeline. The pipeline is clocked at the same speed as the external clock input.
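The sketch below illustrates how a three-stage pipeline can run "in parallel" with respect to the software clock. It tracks only which instruction occupies each stage; the structure and function names are hypothetical, hazards and branches are ignored, and an instruction word equal to `EMPTY` is assumed never to occur.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTY UINT32_MAX   /* marks a pipeline slot holding no instruction */

typedef struct {
    uint32_t fetched;   /* output latch of the fetch stage */
    uint32_t decoded;   /* output latch of the decode stage */
    uint64_t cycles;    /* software clock */
    size_t   retired;   /* instructions that have finished execute */
} pipeline_t;

void pipeline_init(pipeline_t *p)
{
    p->fetched = p->decoded = EMPTY;
    p->cycles = 0;
    p->retired = 0;
}

/* Advance every stage by one clock cycle. The stages are updated back to
 * front so that each consumes what its predecessor produced in the previous
 * cycle, which makes them appear to run in parallel on the software clock. */
void pipeline_cycle(pipeline_t *p, const uint32_t *prog, size_t len, size_t *pc)
{
    if (p->decoded != EMPTY)                            /* execute */
        p->retired++;
    p->decoded = p->fetched;                            /* decode  */
    p->fetched = (*pc < len) ? prog[(*pc)++] : EMPTY;   /* fetch   */
    p->cycles++;
}

/* Run a straight-line program to completion. Returns the cycles consumed. */
uint64_t pipeline_run(pipeline_t *p, const uint32_t *prog, size_t len)
{
    size_t pc = 0;
    assert(p != NULL && (prog != NULL || len == 0));
    pipeline_init(p);
    while (p->retired < len)
        pipeline_cycle(p, prog, len, &pc);
    return p->cycles;
}
```

In this idealized model a straight-line program of N instructions takes N + 2 cycles: two cycles to fill the pipeline, then one instruction retired per cycle.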

In reality, the assumption that every Crisp instruction completes in three cycles is impractical. Allowance has to be made for memory latencies, load/store delays, multiplier output delays and the like. A memory-interface module can abstract the details of the memory hierarchy, associated buffers and latencies. The pipeline behavior then has to be altered to stall according to the processor specification.
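As an illustration of such a memory-interface module, the following sketch reports how many cycles an access would take, using a tiny direct-mapped cache model. The cache geometry, the latency values and the names are invented for the example; a real model would take them from the processor specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed latencies and cache geometry (illustrative values only). */
enum { HIT_CYCLES = 1, MISS_CYCLES = 10, CACHE_LINES = 64 };

typedef struct {
    uint32_t tag[CACHE_LINES];
    bool     valid[CACHE_LINES];
} memif_t;

/* Returns the number of cycles a word access at byte address `addr` takes
 * and updates the state of a direct-mapped cache with one word per line. */
unsigned memif_access(memif_t *m, uint32_t addr)
{
    assert(m != NULL);
    uint32_t word = addr / 4u;
    uint32_t line = word % CACHE_LINES;
    uint32_t tag  = word / CACHE_LINES;

    if (m->valid[line] && m->tag[line] == tag)
        return HIT_CYCLES;

    m->valid[line] = true;   /* fill the line on a miss */
    m->tag[line]   = tag;
    return MISS_CYCLES;
}
```

The pipeline model could add the returned value to the software clock and hold the later stages for that many cycles.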

When several processors are simulated together, the individual processors can be designed as separate processes, and the inter-process communication facilities offered by the host OS can be used to carry data and control signals between them. This keeps the simulator modular and easy to implement.

Interrupts and exceptions such as data aborts can be handled by using setjmp (to set up the exception-handling code) and longjmp (to transfer control to that code when an exception occurs). User-defined signals can also be used for this purpose.
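The setjmp/longjmp approach can be sketched as follows. The exception codes, the function names and the rule that an out-of-range word index causes a data abort are assumptions made for the example.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS 256u

enum { EXC_NONE = 0, EXC_DATA_ABORT = 1 };   /* hypothetical exception codes */

static jmp_buf  exception_env;   /* armed by run_load() before each access */
static uint32_t mem[MEM_WORDS];

/* Memory read that raises a data abort on an out-of-range word index. */
static uint32_t checked_read(uint32_t index)
{
    if (index >= MEM_WORDS)
        longjmp(exception_env, EXC_DATA_ABORT);   /* does not return */
    return mem[index];
}

/* Execute one load. Returns EXC_NONE on success or the exception code. */
int run_load(uint32_t index, uint32_t *out)
{
    assert(out != NULL);
    if (setjmp(exception_env) != 0)
        return EXC_DATA_ABORT;    /* arrived here via longjmp */
    *out = checked_read(index);   /* normal path after setjmp returned 0 */
    return EXC_NONE;
}
```

The handler here simply reports the abort; a fuller model would also save the program counter and switch to the target's exception vector.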

Speculative branching can be implemented by pre-fetching the target based on the probability of the branch being taken. The probability can be computed by maintaining a history of "branch taken/not-taken" per branch instruction in the program.
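A per-branch history of this kind might be kept as in the sketch below. The table size, the indexing by branch address (with an assumed 4-byte instruction size) and the 0.5 threshold are assumptions; branches that map to the same slot share a history in this simple version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BP_ENTRIES 64u   /* assumed table size */

typedef struct {
    uint32_t taken[BP_ENTRIES];   /* times the branch was taken */
    uint32_t total[BP_ENTRIES];   /* times the branch was executed */
} predictor_t;

static unsigned slot(uint32_t branch_addr)
{
    return (branch_addr / 4u) % BP_ENTRIES;
}

/* Estimated probability that the branch is taken. 0.5 with no history. */
double bp_probability(const predictor_t *bp, uint32_t branch_addr)
{
    assert(bp != NULL);
    unsigned s = slot(branch_addr);
    return bp->total[s] == 0 ? 0.5 : (double)bp->taken[s] / bp->total[s];
}

/* Pre-fetch the branch target only when "taken" is the likelier outcome. */
bool bp_predict_taken(const predictor_t *bp, uint32_t branch_addr)
{
    return bp_probability(bp, branch_addr) > 0.5;
}

/* Record the actual outcome once the branch has been resolved. */
void bp_update(predictor_t *bp, uint32_t branch_addr, bool was_taken)
{
    assert(bp != NULL);
    unsigned s = slot(branch_addr);
    bp->total[s]++;
    if (was_taken)
        bp->taken[s]++;
}
```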

Some processors execute independent instructions out of order to improve throughput. The instruction stream can be converted into a dependency graph of code blocks and then be executed out of order based on the dependencies. A thorough understanding of the target processor's instruction-retirement policy is important for implementing this feature. The feature can be left out of the simulator if appropriate allowance is made for the resulting reduction in the simulated performance of the target processor.
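The dependency analysis can be illustrated at the granularity of single instructions rather than code blocks, which is a simplification of what is described above. The sketch assigns each instruction a level in the dependency graph; instructions on the same level do not depend on one another and could be executed in any order. The instruction layout and names are hypothetical, and no register renaming is assumed, so write-after-read and write-after-write conflicts count as dependencies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-address instruction: dst = src1 op src2. */
typedef struct {
    uint8_t dst, src1, src2;
} insn_t;

/* True if `later` must wait for `earlier` (RAW, WAR or WAW on a register). */
static bool depends(const insn_t *later, const insn_t *earlier)
{
    bool raw = later->src1 == earlier->dst || later->src2 == earlier->dst;
    bool war = later->dst == earlier->src1 || later->dst == earlier->src2;
    bool waw = later->dst == earlier->dst;
    return raw || war || waw;
}

/* Fill level[i] with the depth of instruction i in the dependency graph
 * (0 = depends on nothing earlier). Returns the number of levels. */
unsigned schedule(const insn_t *prog, size_t len, unsigned *level)
{
    unsigned depth = 0;
    assert(len == 0 || (prog != NULL && level != NULL));
    for (size_t i = 0; i < len; i++) {
        level[i] = 0;
        for (size_t j = 0; j < i; j++)
            if (depends(&prog[i], &prog[j]) && level[j] + 1 > level[i])
                level[i] = level[j] + 1;
        if (level[i] + 1 > depth)
            depth = level[i] + 1;
    }
    return depth;
}
```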

Fine-grain profiling can be performed by accessing the system-wide clock counter via appropriate interfaces (for example, get_clock_ticks() and set_clock_ticks()) at the required points of execution.
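Region-based profiling on top of these two accessors could look like the sketch below. The counter and accessors are redefined here only to keep the example self-contained; `profile_region_t`, `profile_enter()` and `profile_leave()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the simulator's system-wide clock counter. */
static uint64_t clock_ticks;

uint64_t get_clock_ticks(void)       { return clock_ticks; }
void     set_clock_ticks(uint64_t t) { clock_ticks = t; }

/* Accumulated cycle count for one region of the simulated program. */
typedef struct {
    const char *name;
    uint64_t    start;   /* clock value at the most recent entry */
    uint64_t    total;   /* cycles spent inside the region so far */
    unsigned    calls;   /* number of completed passes */
} profile_region_t;

void profile_enter(profile_region_t *r)
{
    assert(r != NULL);
    r->start = get_clock_ticks();
}

void profile_leave(profile_region_t *r)
{
    assert(r != NULL && get_clock_ticks() >= r->start);
    r->total += get_clock_ticks() - r->start;
    r->calls++;
}
```

The simulator would call `profile_enter()` and `profile_leave()` when the program counter crosses the boundaries of the region of interest, for example a loop body or a procedure.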

It is important, however, to understand the requirements of the users before adding complex features to the simulator. In the absence of a demonstrated current or future need to model specific processor features or functional units, it is better to abstract them and keep the simulator simple and functional. The aim of a simulator is not to replace an FPGA prototype.

This article is based on excerpts taken from ESC class #341, Build your own RISC processor simulator.




