Tidbits: Branch Delay Slot Fun with SPARC
Recently, while musing over some MIPS idiosyncrasies, I stumbled over this thread about a possible bug, and a simple workaround, for a particular branch delay slot problem in a MIPS CPU: a page fault could occur between the execution of a branch and its delay slot. This article is not about that bug, but about the follow-up discussion involving SPARC.
Recap: Delay Slots
One of the ideas that sounded super cool in the '80s and turned out to be just plain bad was delay slots. The idea was that you could make your CPU super cheap and efficient if the hardware ignored dependencies between instructions and left the compiler to handle them.
The MIPS architecture made this into an art form, which is why up to 30% of all instructions in a MIPS I binary are NOPs. When loading a word from memory, you had to execute one more instruction before you could use the value (... often a NOP), and likewise, the instruction after a branch was always executed, whether the branch was taken or not (even more often a NOP). The credo (= "unfounded hope") was that compilers would be able to fill these slots with something useful, and that nobody would ever manufacture a successor CPU with different delay slot requirements.
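To make "let the compiler handle it" a bit more concrete, here is a minimal sketch of the simplest delay-slot-filling trick a compiler can try: take the instruction just before a branch and move it into the NOP-filled delay slot, unless the branch reads the register that instruction writes. The Insn record, the register names and the fill_delay_slot helper are all made up for illustration. A real compiler also has to care about branch targets, loads and other hazards that this toy ignores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Insn:
    """Hypothetical toy instruction: a mnemonic, the register it writes, the registers it reads."""
    op: str
    dest: Optional[str] = None
    srcs: tuple = ()


NOP = Insn("nop")
BRANCHES = {"b", "bz"}  # toy branch mnemonics; they read registers but write none


def fill_delay_slot(block):
    """Try to replace the NOP after a trailing branch with the instruction before it.

    Expects `block` to end with [..., candidate, branch, NOP]. The branch evaluates
    its condition before the delay slot runs, so the candidate may only move if the
    branch does not read the register the candidate writes.
    """
    if len(block) < 3 or block[-1] != NOP or block[-2].op not in BRANCHES:
        return list(block)  # not the shape we know how to handle
    candidate, branch = block[-3], block[-2]
    if candidate.op in BRANCHES or candidate == NOP:
        return list(block)  # never move a branch (or a NOP) into a delay slot
    if candidate.dest is not None and candidate.dest in branch.srcs:
        return list(block)  # branch depends on the candidate: keep the NOP
    # Branch first, candidate in its delay slot: same effect, one instruction fewer.
    return list(block[:-3]) + [branch, candidate]


independent = [Insn("add", "r2", ("r3", "r4")), Insn("bz", None, ("r1",)), NOP]
dependent = [Insn("sub", "r1", ("r1", "r5")), Insn("bz", None, ("r1",)), NOP]
print(fill_delay_slot(independent))  # bz first, then the add sitting in its delay slot
print(fill_delay_slot(dependent))    # unchanged: bz reads r1, which sub writes
```

In the first block the add has nothing to do with the branch condition and ends up doing useful work in the slot; in the second one, the NOP has to stay.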
Here is an example using the hypothetical bz (branch if zero) instruction:
    10: bz 40
    20: foo
    30: bar
    40: baz
If the branch is taken, instruction 30 will be skipped, otherwise it will be executed. Instruction 20 will always be executed because it is in the branch delay slot.
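One way to picture this is a toy CPU that tracks two addresses: pc, the instruction executing now, and npc, the one executing next. A taken branch does not touch the instruction that is about to run (the one at npc, i.e. the delay slot); it only changes where execution goes after that. The sketch assumes exactly one delay slot, addresses in steps of 10 as in the listing, and a zero flag for the hypothetical bz that is simply passed in from outside. It is a model for intuition, not how any particular MIPS implementation works internally.

```python
def run(program, start, zero, step=10, max_steps=50):
    """Return the addresses executed, in order, by a toy CPU with one branch delay slot.

    program: dict mapping address -> (mnemonic, branch_target or None)
    zero:    value of the (hypothetical) zero flag tested by "bz"
    """
    pc, npc = start, start + step
    trace = []
    # Stop when execution leaves the listed addresses (or the toy loops forever).
    while pc in program and len(trace) < max_steps:
        mnemonic, target = program[pc]
        trace.append(pc)
        taken = mnemonic == "b" or (mnemonic == "bz" and zero)
        # The instruction at npc runs next no matter what; a taken branch only
        # decides what comes *after* it.
        pc, npc = npc, (target if taken else npc + step)
    return trace


program = {10: ("bz", 40), 20: ("foo", None), 30: ("bar", None), 40: ("baz", None)}
print(run(program, 10, zero=True))   # [10, 20, 40]
print(run(program, 10, zero=False))  # [10, 20, 30, 40]
```

Either way, 20 shows up right after 10; only 30 depends on the branch.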
The catch with the MIPS CPU in the mail thread I linked to was that the workaround would fail if two branches came in succession, i.e. if a branch sat in the delay slot of another branch. Fortunately, that is neither legal nor working well on MIPS anyway, so it doesn't happen in practice and can be ignored.
SPARC is better
Somebody mentioned there was an architecture where this was actually well defined and even used in practice, and somebody else quickly stepped in saying it was probably SPARC. Here is the example:
    10: b 90
    20: b 30
    30: foo
    ...
    90: bar
The actual execution order is 10, 20, 90, 30.
I am too lazy to read up on this, so this is pure guessing on my part, but what happens here is likely this: the branch at 10 is (always) taken, redirecting the CPU to 90. The instruction executed next is its delay slot, 20, which branches to 30. The instruction after that is the delay slot of this second branch, and that is 90, because the first branch had already sent the CPU there. Then everything reverts to normal and execution continues at 30.
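For what it's worth, SPARC does architecturally keep two program counters, PC and nPC, and a delayed branch only updates nPC. Feeding the example into a toy PC/nPC machine reproduces the order 10, 20, 90, 30. This is an assumption-laden sketch: only an always-taken, non-annulling b, no conditional branches, no traps, and the "..." between 30 and 90 is simply left out.

```python
def run_sparc_like(program, start, step=10, max_steps=50):
    """Trace a toy PC/nPC machine in which "b" is an always-taken, non-annulling branch.

    program: dict mapping address -> (mnemonic, branch_target or None)
    """
    pc, npc = start, start + step
    trace = []
    while pc in program and len(trace) < max_steps:
        mnemonic, target = program[pc]
        trace.append(pc)
        # The instruction at npc runs next regardless; a branch only sets the
        # address that follows it. Two branches in a row therefore "interleave".
        new_npc = target if mnemonic == "b" else npc + step
        pc, npc = npc, new_npc
    return trace


# The article's example; addresses between 30 and 90 are omitted, so the toy
# machine stops once it falls off the listed addresses after executing 30.
program = {10: ("b", 90), 20: ("b", 30), 30: ("foo", None), 90: ("bar", None)}
print(run_sparc_like(program, 10))  # [10, 20, 90, 30]
```

Stepping through the (pc, npc) pairs shows the guess above in action: after 10 executes the pair is (20, 90), after 20 it is (90, 30), and after 90 it is (30, 40), at which point things are back to normal.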
The mind-boggling (for me) effect is that this way, you can execute a single instruction just about anywhere, out of the blue, without having to create an <insn>/<return> sequence somewhere. Apparently this was used for exactly that purpose, e.g. in jump tables, and a few other applications come to mind as well, such as limited single-stepping.
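Here is a rough sketch of the jump table idea on the same kind of toy PC/nPC machine: the table holds bare one-instruction entries with no return, and a pair of back-to-back branches runs exactly one of them before resuming at 30. TABLE_BASE, the add instruction, the accumulator and the dispatch helper are all hypothetical; this is a guess at the general shape, not a reconstruction of any real SPARC code.

```python
def run(program, start, step=10, max_steps=50):
    """Toy PC/nPC machine: "b" always branches (delayed), "add" adds to an accumulator."""
    pc, npc = start, start + step
    acc, trace = 0, []
    while pc in program and len(trace) < max_steps:
        mnemonic, arg = program[pc]
        trace.append(pc)
        if mnemonic == "add":
            acc += arg
        new_npc = arg if mnemonic == "b" else npc + step
        pc, npc = npc, new_npc
    return trace, acc


TABLE_BASE = 100  # hypothetical address of the one-instruction table entries
TABLE = {TABLE_BASE + 10 * i: ("add", k) for i, k in enumerate((1, 10, 100))}


def dispatch(index):
    """Run exactly one table entry, then continue at 30 (hypothetical layout)."""
    program = {
        10: ("b", TABLE_BASE + 10 * index),  # jump to the selected entry...
        20: ("b", 30),                       # ...and this delay slot sends us back to 30
        30: ("nop", None),                   # normal code continues here
        **TABLE,
    }
    return run(program, 10)


print(dispatch(1))  # ([10, 20, 110, 30], 10)
```

Only the selected entry (here the one at 110) runs, and the table entries themselves need no return instruction.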
And this concludes this tidbit. Wow, SPARC, wow, and I thought I had seen everything, or so :)