PowerPC assembly language
beginners guide.
Basic Optimisations
Chapter 8
This chapter moves away from application specifics and examines the processor and assembly language programming techniques in more detail. This section discusses a generic PowerPC processor.
The PowerPC architecture is described as "super scalar". This means in simple terms that the processor can execute more than one instruction at once. A very basic PowerPC processor will have at least one integer unit which handles the integer operations; addi for example, one floating point unit which deals with the floating point operations and one branch unit which, can you guess?, deals with branch instructions.
Up until now we have considered instructions coming from memory one at a time as needed for execution; this can be considered the Instruction Stream. However, this isn't how it happens in practice. The processor has an "instruction queue", abbreviated to IQ. Generally the instruction queue hold about 8 instructions waiting to be "despatched" for execution. As instructions are dispatched from the queue all the remaining instructions move towards the front of the queue and new instructions are fetched from memory.
Instructions can be dispatched from the first four locations in the queue (termed IQ0 to IQ3). On every clock cycle the processor tries to dispatch as many instructions as possible from the queue, so if the branch unit and integer unit are free but the floating point unit is not it would be possible for the processor to dispatch one integer and one branch instruction from IQ0 to IQ3 if those types of instructions are present.
This knowledge provides the assembly language programmer with a valuable optimisation technique; that of interleaving branch, floating and integer instructions to ensure an even mix of instruction types in the instruction queue (and indeed is why Anvil categorises instruction by group for syntax colouring purposes).
Generally instructions execute in one cycle but there are exceptions. Conditional branches can take anywhere from five down to zero cycles. If you remember back to the beginning we said that condition code flags were not implicitly set when data is moved or the result of a calculation become availalable. There's a very good reason for this, and it's not because the designers were lazy, or they just wanted to save silicon.
If you know anything about traditional CISC (Complex Instruction Set Computers) such as the 68000 family you may know that there is one set of condition flags; the flags typically indicate zero, carry, overflow etc. These are the flags that are tested when we execute a conditional branch. For example a subtract might be followed by a conditional branch if the result is zero (the zero flag is tested). The PowerPC processor being superscalar may actually try to execute a conditional branch before the flags it needs are ready (remember the dispatching discussion at the start of this chapter). The general rule of thumb is that a conditional branch needs to placed about five instructions down from the instruction that affects the flags. This is possible because the programmer has to explicitly use instructions that change the flags. Using this technique the PowerPC processor can effectively process conditional branches in zero cycles! To further enhance performance instead of having just one set of flags we have eight; they are refered to as the cr flags. This means conditional branches can be tagged with the specific set of flags they need to test. In Fantasm we identify the required set of flags with the identifier crx where x is a number between 0 and 7. If you don't specify a cr then Fantasm defaults to 0 for integer instructions and 1 for floating point instructions (Oh yes, the FPU supports its own conditional branches! fcmp). Typical usage would be:
bne cr3,fred
whereas
bne fred
really means bne cr0,fred
Having all these cr fields means we can do some funky stuff:
cmpwi cr0,r3,0 cmpwi cr1,r3,10 3 other instructions ble cr0,fail bgt cr1,fail
By doing the compares well before we need to determine if the data is in-bounds we effectively get the two conditional branches for free (or looking it the other way, we get the 3 other instructions for free). The condition flags can be set manually with the move to condition register instruction which can manually set flags in the cr register (naturally there's also a move from condition register instruction allowing you to read them). This can be used to great advanatage to set cr flags well before they are needed without doing a compare - for example the MacOS BlockMove call is optimised in this way to calculate if the data is aligned and if not how many halfs and bytes it needs to move before and after it can move words or double words (see FPU below). Very efficient.
So, by not implicitly setting flags the PowerPC designers gave us a very efficient method of dealing with conditional branches which always make up a large proportion of any program. Another optimisation the processor can make is speculative processing. This is looking down the instruction stream for conditional branches and trying to decide whether they will be taken. If so then the processor can start getting instructions from the new address (that which would be executed if the branch is taken). If it turns out that the processor was correct and the conditional branch is taken then the instructions are already to run. If however the processor is incorrect then all the new instructions have to be discarded and instructions fetched from the address following the conditional branch instruction. This may not seem such a good idea but later PowerPC processors have very efficient mechanisms for recovering if the processor predicted incorrectly. The programmer can help this speculative processing by giving a hint to the processor that the conditional branch will normally be taken. A typical use for this is in conditional branches at the end of a counting loop (wherever possible one should use the ctr register for this, but sometimes its not possible). Normally the branch will be taken back to the start of the loop so this would be a good place to provide a hint. Generally in Fantasm the "-" character is used if the branch is backwards and the "+" character if the conditional branch is forward (but see the Fantasm reference manual!). So:
loop: subic. r3,1 ;sets cr0 flags some instructions bne+ loop ;test the cr0 flags - specifically the zero flag
Is the correct way to optimise this loop (if you can't use the count register). Remember that the further up the instruction stream you can put the instruction that sets the cr flags, the faster it will run (with a ceiling of about five instructions).
Useful tip: Anvil's Help menu will contain language help for the current language in use. In assembly language it's very useful if you've forgotten the mnemonic for a certain instruction. I for example can never remember some of the more useful rotates; so a quick search for what I want in the Help window generally provides the answer.
FPU
The PowerPC processor also benefits from having a Floating Point Unit (FPU) bestowed upon it as part of the architecture specification. This means that all PowerPC processors MUST have a floating point unit and as such programmers can rely on it being there. You may think of the FPU as being just a number cruncher where in fact it's also a great mover of data. Most PowerPC Macs have 64 bit wide data busses. The integer unit caters for integer sized data and as such its registers are 32 bits wide. The FPU on the other hand caters for floating point sized data and as such its registers are 64 bits wide (it can work with both 32 bit, or single sized data, or 64 bit, or double sized data). Note the corrolation between the bus width of 64 bits and the FPU register width of 64 bits.
Thus the largest amount of data we can move with the integer unit is 32 bits, or four bytes, but with the FPU we can move 64 bits or 8 bytes. The FPU provides instructions for both reading and writing data from memory and luckily provides bith 32 bit and 64 bit derivatives af nearly all of its instructions. In Fantasm the size of the data to be worked on is typically specified by appending either "s" or "d" onto the end of the instruction. For example:
stfs f0,(r5)
Stores the single sized data (32 bits) in f0 in the address pointed to by r5
stfd f0,(r5)
Stores the double sized data (64 bits) in f0 in the address pointed to by r5
It's a similar situation for loading data into an FPU register.
lfs f0,(r5)
Loads the single sized data (32 bits) at the address pointed to by r5 into floating point register 0.
lfd f0,(r5)
Loads the double sized data (64 bits) at the address pointed to by r5 into floating point register 0. Examine this loop:
line_loop: lfd f0,(r6) addi r6,r6,8 stfdu f0,8(r5) bdnz line_loop
These four instructions copy data from the address at r6 to the address at r5. I've introduced another operation here. We've seen the bdnz instruction before. It decrements the count register and branches if it isn't equal to zero. The stfdu is a variant of store floating double in that it automatically adds 8 to r5 and stores the result back in r5 before each execution. This means that in one instruction we store the data AND update the pointer by the size of the data; in this case it's a double so that's 8 bytes. You may realise that for this to work we need an extra instruction before executing the loop; the one that decrements r5 by 8 before we start! Thus the full code looks like:
subi r5,r5,8 line_loop: lfd f0,(r6) addi r6,r6,8 stfdu f0,8(r5) bdnz line_loop
You can find out more about the FPU in the Lightsoft Article The Magical World of PowerPC Floating Point
Alignment
Takes on a whole new meaning when you start moving 8 bytes at
a time! If either your source or destination address are not octal
(8 byte) aligned when you start moving data 8 bytes at a time
the code will run slower than cold treacle. The PowerPC processor
rarely has hardware support for mis-aligned addressing and so
the OS takes over and has to move the data as best it can (this
means slowly). Octal alignment means that the address you are
reading from or writing to produces a remainder of zero when divided
by 8. Everybody should deliberatley try misaligning their data
now and again just to get a feel for it; it's easily spottable
once you seen it once!
Copyright ©Lightsoft Software (Tools) 2000.
Reproduction in whole or part prohibited without permission.