- an N64 emulator port to the PSP. We're not gonna pretend we know what he's talking about here, so here is what he said...
quote:
DynaRec Status
Just a quick update on the dynarec status, as I know a lot of people
are more interested in this than the grizly details of branch delay
instructions :)
Last weekend (13/14 May) I managed to assemble the fragment buffers
into native x86 code, and execute this dynamically. I spent some time
debating whether to target MIPS or Intel initially, but I decided that
it would be a lot easier for me to debug the code generation on the PC
than it would be to debug code gen on the PSP.
In the end I'm glad I started with the PC as it allowed me to fix a
number of hairy problems without going down the torturous path of
debugging self modifying code on the PSP with just a few printf()
statements to help track down any problems.
To start with on the x86 code generation, all I did was convert my
fragment simulator loop directly into assembly. So the generated code
for each instruction in the fragment looked something like this:
set current pc
set branch delay flag
get op code in ECX
call handler for specified op code (from R4300.cpp)
if ( exception set ) exit to exception handler
if ( branch instruction and branch taken ) exit to branch handler
So a fragment with 100 instructions in would have this block repeated 100 times (with different pc, delay flag constants etc).
This generated a lot of assembly (i.e. 200KB or so of N64 instructions
would generate over 4MB of x86 assembly, i.e. an expansion of around
2000%). At this point I wasn't interested in performance though - I
wanted to make sure that I was preserving the behaviour of the fragment
simulator as much as possible. The exception and branch handlers
mentioned in the pseudo code above warrant more detailed description,
but I'll leave that for another post.
At this point I spent a few hours debugging, but generally everything
was working pretty well. The emulator was currently running around the
same speed with 'dynarec' enabled as with it disabled (I use quotes
because there isn't really much 'dynamic' about this code yet).
I spent the rest of the weekend and the early part of last week trying
to optimise the generated assembly and see what I could get away with
removing. One of the first things to go was setting the program counter
before executing each instruction. The only instructions that need to
know this tend to be branching instructions (which are relative and
need to know the current PC to work out their target address) and other
instructions that can potentially fire exceptions. The vast majority of
instructions don't need to know the current PC though (e.g. arithmetic
ops, logical ops etc).
Next I had a look at reworking things so I only needed to explicitly
set the branch delay flag if a branch delay slot was actually active. I
made the precondition that the branch delay slot was always clear, and
explicitly set/cleared it when I knew the state needed to change.
Finally I removed exception handling from all the instructions I knew
to be safe. For instance, I know ANDI (and immediate) can never throw
an exception. As I only perform counter updates at the end of the
block, an exception can never be fired when executing this instruction.
After all these changes I had an instruction execution block which looked something like this:
if ( pc needed ) set current pc
if ( branch delay instruction )set branch delay flag
get op code in ECX
call handler for specified op code (from R4300.cpp)
if ( can throw exception and exception set ) exit to exception handler
if ( branch instruction and branch taken ) exit to branch handler
This meant that the vast majority of instructions looked as follows:
get op code in ECX
call handler for specified op code (from R4300.cpp)
So I had nice big fragments like this:
...
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
...
Essentially I removed the vast majority of the instruction fetch/decoding overhead from the emulation.
With this version of the dynarec, 200KB of N64 code was now generating
just 2MB of x86 assembly (i.e. an expansion ratio of around 1000%). The
PC version was running around 60% faster with dynarec enabled than with
it disabled, which is a pretty significant speedup (although this is
still very early in the process).
What's also important is that this is before I've done any real
optimisation of the generated code. For each instruction I'm still
calling the generic instruction handler which has the overhead of
figuring out which source registers to use, which register is the
destination etc. The *real* speedup comes from generating code to
handle op codes explicitly, as you remove all this decoding overhead
along with the overhead of jumping to another function. Once you've
removed most of the generic instruction handling you can start looking
at caching register values to minimise the amount of memory that's
being moved around.
With the PC version up and running fairly successfully, I've spent this
weekend getting the PSP code generation working. I don't want to go
into too many details (as I want to go into more depth in future
posts), but I know people are keen to hear some news about how this is
going.
I got the basic code generation working on Saturday morning (thankfully
I'd already resolved most of the tricky issues in developing the x86
version the previous weekend). I spent most of Saturday afternoon
fixing some really horrible instruction cache related bugs. I'm still
not 100% sure I've fixed them, but it seems very stable at the moment.
At the moment I'm at the same stage with the PSP version of the dynarec
that I was with the PC version last weekend - the code generation is
running fine (and executing on the PSP without crashing more
importantly :) but I've only just started looking at optimising things.
It's still too early to speculate on numbers for the performance
improvement it will give. Currently it's running around 10% faster with
dynarec enabled, but it's still very early days.
More soon.
Branch delay instructions
Well it's been a little longer than I intended since my last post. I've
been busy with work, and when I've had free time I've found it hard to
drag myself away from the compiler for long enough to post an update :)
In my last post on the subject, I left saying that I was trying to iron
out bugs in the fragment 'simulator' and was about to start work on
native code generation. I want to pick up from that point, but first I
wanted to talk about about branch delay instructions and how they
effect the implementation of emulators and the new DynaRec work I've
been doing.
The couple of bugs in the fragment simulator were indeed to do with
exceptions and interrupts triggering in unexpected places. The problem
occurred if an exception was triggered in the branch delay slot
following a jump instruction.
In case you're not familiar with the MIPS architecture, after branch
instruction has been executed (but before control arrives at the target
instruction), the CPU executes the instruction immediately following
the branch. So to give an example, this is a block of mips code that
calls the function foo( 6, 3 ):
li a0, 6 # load the first argumentjal foo # call fooli a1, 3 # load the
second argument# the result is returned in v0 here...foo:jr ra #
returnadd v0, a0, a1 # add the arguments, store result in v0
If you've never read MIPS assembly before this will probably look a little strange.
When we call foo with 'jal foo' (Jump and Link), we don't set the
second argument until after the jump to 'foo'! Notice also that we
calculate the return value for foo after we return with the Jump
Register (JR) instruction.
The reason this code works is because of the branch delay slots. When
the call to foo is executed ('jal foo') the CPU keeps going for one
more instruction and executes 'li a1, 3'. Control then jumps to the
start of 'foo', where the CPU immediately executes 'jr ra', jumping
back to where it just came from. Again, the CPU executes the following
instruction, calculating the sum of the arguments and returning the
result in v0.
Although this seems rather pointless and unnecessarily complicated, it
serves a very good purpose. With most modern CPUs the instruction
execution is pipelined, which means that the processor breaks down the
work into discrete chunks (fetch instruction, decode instruction,
execute, commit etc), and executes multiple instructions in parallel.
Wikipedia (as always) explains this in a lot more detail.
With many architectures the CPU has to throw away the contents of the
pipeline when a branch is executed, as the subsequent instructions may
refer to data (register or memory contents) that is invalid until after
the call has completed. With certain architectures (including MIPS),
the engineers decided that it was wasteful to throw away all this work
on every branch, and designed the processor to continue processing the
pipeline until the subsequent instruction had completed.
What this means is that when writing code for MIPS processors, you have
to be careful that your branch delay slot doesn't have any unintended
side effects. For maximum efficiency you also have to be careful to try
and do useful work in the branch delay slot (rather than just filling
it with a NOP for instance). Normally this isn't an issue as the
compiler generates all the code for you, but it's certainly an issue if
you're writing assembly. It's also been a very important consideration
when I've been writing the code generation for the new DynaRec (I'll
cover this in a later post).
So, how do branch delay instructions effect emulators? Although they
help pipelined CPUs to improve performance, they require emulators to
do bit of extra bookeeping and this slows things down slightly and adds
complexity. Every time daedalus interprets an instruction it has to
check if a branch was taken, and if so set a flag to indicate that a
branch delay instruction is due to be executed. When the branch delay
instruction is executed, the flag is cleared and the emulator sets the
program counter to the original target of the branch.
This is all fairly straight forward, but complications arise when
exceptions fire as a branch delay slot is due to be executed. In the
example above, what would happen if an interrupt fired immediately
before the branch delay instruction 'li a1, 3' was executed? Normally
once the interrupt has been handled, the operating system restores
control by jumping to the instruction that was due to be executed,
allowing the program to run from that point. If this happened in our
example above, a1 would be loaded with the value of 3, but the jump to
'foo' that was originally delayed will never take place. The code would
continue running without ever calling 'foo'!
In order to handle this situation, when an exception (or interrupt) is
triggered, the CPU sets a flag in the 'cause' register. The operating
system keeps track of this flag, along with the program counter where
the exception fired and various other bits of information. When it's
done handling the exception, the operating system uses this information
to allow the CPU to correctly restore control to the code that was
originally executing. In our example above, the CPU would see that the
branch delay flag is set in the cause register, and restore control to
the jal instruction immediately preceeding the instruction where the
interrupt occured.
It should be fairly clear that there's quite a lot that must be taken
care of by the emulator to make sure that the program executes as
intended. This is all fairly simple to keep track of when processing
instructions individually, but it becomes more complicated when you
start to dynamically recompile the code.
The bugs that I mentioned at the start of this post were caused because
I wasn't correctly setting the branch delay flag for instructions
causing exceptions in branch delay slots. When I build fragments for
the dynamic recompiler I avoid triggering certain types of interrupts
(e.g. vertical blank and timer interrupts) in the middle of the
fragment to reduce the overhead of having to add the code to handle
these situations. Unfortunately there are many other types of
exceptions that can occur in the middle of a fragment, such as page
faults, I/O completion interrupts etc. It turns out these are very
rare, but unfortunately not rare enough to save me from a full day of
debugging :(
This was a little more detailed than I was originally planning. Branch
delay instructions are quite a simple concept on the surface, but they
can cause all sorts of complexities when it comes to dynamic
recompilation. Fortunately I feel very confident that I've now fixed
all the branch delay related bugs in the dynamic recompiler, so
hopefully I shouldn't have to think too much about them again in the
near future.
I'll follow up this post with a quick update on the state of the new dynarec engine.