Update on Daedalus Nintendo 64 PSP Emulator Work-In-Progres

Source: MaxConsole

StrmnNrmn has given us a mega tech update on the progress of Daedalus 64 for the PSP - an N64 emulator port to the PSP. We're not gonna pretend we know what he's talking about here, so here is what he said...



quote:


DynaRec Status

Just a quick update on the dynarec status, as I know a lot of people are more interested in this than the grizly details of branch delay instructions :)

Last weekend (13/14 May) I managed to assemble the fragment buffers into native x86 code, and execute this dynamically. I spent some time debating whether to target MIPS or Intel initially, but I decided that it would be a lot easier for me to debug the code generation on the PC than it would be to debug code gen on the PSP.

In the end I'm glad I started with the PC as it allowed me to fix a number of hairy problems without going down the torturous path of debugging self modifying code on the PSP with just a few printf() statements to help track down any problems.

To start with on the x86 code generation, all I did was convert my fragment simulator loop directly into assembly. So the generated code for each instruction in the fragment looked something like this:

set current pc
set branch delay flag
get op code in ECX
call handler for specified op code (from R4300.cpp)
if ( exception set ) exit to exception handler
if ( branch instruction and branch taken ) exit to branch handler


So a fragment with 100 instructions in would have this block repeated 100 times (with different pc, delay flag constants etc).

This generated a lot of assembly (i.e. 200KB or so of N64 instructions would generate over 4MB of x86 assembly, i.e. an expansion of around 2000%). At this point I wasn't interested in performance though - I wanted to make sure that I was preserving the behaviour of the fragment simulator as much as possible. The exception and branch handlers mentioned in the pseudo code above warrant more detailed description, but I'll leave that for another post.

At this point I spent a few hours debugging, but generally everything was working pretty well. The emulator was currently running around the same speed with 'dynarec' enabled as with it disabled (I use quotes because there isn't really much 'dynamic' about this code yet).

I spent the rest of the weekend and the early part of last week trying to optimise the generated assembly and see what I could get away with removing. One of the first things to go was setting the program counter before executing each instruction. The only instructions that need to know this tend to be branching instructions (which are relative and need to know the current PC to work out their target address) and other instructions that can potentially fire exceptions. The vast majority of instructions don't need to know the current PC though (e.g. arithmetic ops, logical ops etc).

Next I had a look at reworking things so I only needed to explicitly set the branch delay flag if a branch delay slot was actually active. I made the precondition that the branch delay slot was always clear, and explicitly set/cleared it when I knew the state needed to change.

Finally I removed exception handling from all the instructions I knew to be safe. For instance, I know ANDI (and immediate) can never throw an exception. As I only perform counter updates at the end of the block, an exception can never be fired when executing this instruction.

After all these changes I had an instruction execution block which looked something like this:

if ( pc needed ) set current pc
if ( branch delay instruction )set branch delay flag
get op code in ECX
call handler for specified op code (from R4300.cpp)
if ( can throw exception and exception set ) exit to exception handler
if ( branch instruction and branch taken ) exit to branch handler

This meant that the vast majority of instructions looked as follows:

get op code in ECX
call handler for specified op code (from R4300.cpp)


So I had nice big fragments like this:

...
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
get op code in ECX
call handler for specified op code (from R4300.cpp)
...

Essentially I removed the vast majority of the instruction fetch/decoding overhead from the emulation.

With this version of the dynarec, 200KB of N64 code was now generating just 2MB of x86 assembly (i.e. an expansion ratio of around 1000%). The PC version was running around 60% faster with dynarec enabled than with it disabled, which is a pretty significant speedup (although this is still very early in the process).

What's also important is that this is before I've done any real optimisation of the generated code. For each instruction I'm still calling the generic instruction handler which has the overhead of figuring out which source registers to use, which register is the destination etc. The *real* speedup comes from generating code to handle op codes explicitly, as you remove all this decoding overhead along with the overhead of jumping to another function. Once you've removed most of the generic instruction handling you can start looking at caching register values to minimise the amount of memory that's being moved around.

With the PC version up and running fairly successfully, I've spent this weekend getting the PSP code generation working. I don't want to go into too many details (as I want to go into more depth in future posts), but I know people are keen to hear some news about how this is going.

I got the basic code generation working on Saturday morning (thankfully I'd already resolved most of the tricky issues in developing the x86 version the previous weekend). I spent most of Saturday afternoon fixing some really horrible instruction cache related bugs. I'm still not 100% sure I've fixed them, but it seems very stable at the moment. At the moment I'm at the same stage with the PSP version of the dynarec that I was with the PC version last weekend - the code generation is running fine (and executing on the PSP without crashing more importantly :) but I've only just started looking at optimising things. It's still too early to speculate on numbers for the performance improvement it will give. Currently it's running around 10% faster with dynarec enabled, but it's still very early days.

More soon.

Branch delay instructions

Well it's been a little longer than I intended since my last post. I've been busy with work, and when I've had free time I've found it hard to drag myself away from the compiler for long enough to post an update :)

In my last post on the subject, I left saying that I was trying to iron out bugs in the fragment 'simulator' and was about to start work on native code generation. I want to pick up from that point, but first I wanted to talk about about branch delay instructions and how they effect the implementation of emulators and the new DynaRec work I've been doing.

The couple of bugs in the fragment simulator were indeed to do with exceptions and interrupts triggering in unexpected places. The problem occurred if an exception was triggered in the branch delay slot following a jump instruction.

In case you're not familiar with the MIPS architecture, after branch instruction has been executed (but before control arrives at the target instruction), the CPU executes the instruction immediately following the branch. So to give an example, this is a block of mips code that calls the function foo( 6, 3 ):


li a0, 6 # load the first argumentjal foo # call fooli a1, 3 # load the second argument# the result is returned in v0 here...foo:jr ra # returnadd v0, a0, a1 # add the arguments, store result in v0



If you've never read MIPS assembly before this will probably look a little strange.
When we call foo with 'jal foo' (Jump and Link), we don't set the second argument until after the jump to 'foo'! Notice also that we calculate the return value for foo after we return with the Jump Register (JR) instruction.

The reason this code works is because of the branch delay slots. When the call to foo is executed ('jal foo') the CPU keeps going for one more instruction and executes 'li a1, 3'. Control then jumps to the start of 'foo', where the CPU immediately executes 'jr ra', jumping back to where it just came from. Again, the CPU executes the following instruction, calculating the sum of the arguments and returning the result in v0.

Although this seems rather pointless and unnecessarily complicated, it serves a very good purpose. With most modern CPUs the instruction execution is pipelined, which means that the processor breaks down the work into discrete chunks (fetch instruction, decode instruction, execute, commit etc), and executes multiple instructions in parallel. Wikipedia (as always) explains this in a lot more detail.

With many architectures the CPU has to throw away the contents of the pipeline when a branch is executed, as the subsequent instructions may refer to data (register or memory contents) that is invalid until after the call has completed. With certain architectures (including MIPS), the engineers decided that it was wasteful to throw away all this work on every branch, and designed the processor to continue processing the pipeline until the subsequent instruction had completed.

What this means is that when writing code for MIPS processors, you have to be careful that your branch delay slot doesn't have any unintended side effects. For maximum efficiency you also have to be careful to try and do useful work in the branch delay slot (rather than just filling it with a NOP for instance). Normally this isn't an issue as the compiler generates all the code for you, but it's certainly an issue if you're writing assembly. It's also been a very important consideration when I've been writing the code generation for the new DynaRec (I'll cover this in a later post).

So, how do branch delay instructions effect emulators? Although they help pipelined CPUs to improve performance, they require emulators to do bit of extra bookeeping and this slows things down slightly and adds complexity. Every time daedalus interprets an instruction it has to check if a branch was taken, and if so set a flag to indicate that a branch delay instruction is due to be executed. When the branch delay instruction is executed, the flag is cleared and the emulator sets the program counter to the original target of the branch.

This is all fairly straight forward, but complications arise when exceptions fire as a branch delay slot is due to be executed. In the example above, what would happen if an interrupt fired immediately before the branch delay instruction 'li a1, 3' was executed? Normally once the interrupt has been handled, the operating system restores control by jumping to the instruction that was due to be executed, allowing the program to run from that point. If this happened in our example above, a1 would be loaded with the value of 3, but the jump to 'foo' that was originally delayed will never take place. The code would continue running without ever calling 'foo'!

In order to handle this situation, when an exception (or interrupt) is triggered, the CPU sets a flag in the 'cause' register. The operating system keeps track of this flag, along with the program counter where the exception fired and various other bits of information. When it's done handling the exception, the operating system uses this information to allow the CPU to correctly restore control to the code that was originally executing. In our example above, the CPU would see that the branch delay flag is set in the cause register, and restore control to the jal instruction immediately preceeding the instruction where the interrupt occured.

It should be fairly clear that there's quite a lot that must be taken care of by the emulator to make sure that the program executes as intended. This is all fairly simple to keep track of when processing instructions individually, but it becomes more complicated when you start to dynamically recompile the code.

The bugs that I mentioned at the start of this post were caused because I wasn't correctly setting the branch delay flag for instructions causing exceptions in branch delay slots. When I build fragments for the dynamic recompiler I avoid triggering certain types of interrupts (e.g. vertical blank and timer interrupts) in the middle of the fragment to reduce the overhead of having to add the code to handle these situations. Unfortunately there are many other types of exceptions that can occur in the middle of a fragment, such as page faults, I/O completion interrupts etc. It turns out these are very rare, but unfortunately not rare enough to save me from a full day of debugging :(

This was a little more detailed than I was originally planning. Branch delay instructions are quite a simple concept on the surface, but they can cause all sorts of complexities when it comes to dynamic recompilation. Fortunately I feel very confident that I've now fixed all the branch delay related bugs in the dynamic recompiler, so hopefully I shouldn't have to think too much about them again in the near future.

I'll follow up this post with a quick update on the state of the new dynarec engine.



Official Website: http://strmnnrmn.blogspot.com/
posted on Sunday, May 21, 2006 10:42 PM by Auri

Comments