1 Mar 99
(c) 1998/9 Lightsoft Software. All rights reserved. Auth: Robert Probin and Stuart Ball
Introduction to AltiVec assembly language.
Doc Ref: LSA0042a
As an introductory guide, it assumes a basic knowledge of PowerPC and the PowerPC integer instructions. A knowledge of the PowerPC floating point instructions is a definite advantage.
Lightsoft make no claim as to the accuracy of this information, and provide it on an as-is basis. Please check with the relevant official AltiVec documentation in all cases. As a guide we may distort the truth in order to convey information!
Also a warning - don't rely on the emulator as a sufficient testing ground for a real application. If you are going to release, try it on real AltiVec hardware.
Basic principles of operation
To the untrained eye, the PowerPC instructions can be a little cryptic, and the vector instructions only seem to complicate this.
So let's have some rules.
Nearly all AltiVec instructions have a "v" in them, most of them at the beginning. The only ones that don't have a "v" at the beginning are the vector load and store instructions, which have the "v" in second location ("lv....") and third location ("stv...") respectively.
The group which do NOT have "v" in them at all are the data stream (touch) instructions. These instructions allow you to let the processor know where to fetch the data from ahead of time. These minimize delays and stalls whilst data is brought into the processor. More about this later.
A simple AltiVec add instruction
vaddsws v0,v1,v2 ; vector add signed integer word saturate
This adds four pairs of words (32 bits each) simultaneously. The "target" is v0, and the "source" operands are v1 and v2. This is effectively v0=v1+v2, but remember this is FOUR additions, not one.
The word saturate means take the value to "maximum" or "minimum", but no more. So for a signed byte that's 127 maximum and -128 minimum. For 16-bit halfwords that's 32767 maximum and -32768 minimum, and for 32-bit words that's (2^31)-1 maximum and -(2^31) minimum.
That's why there is "signed" in the add instruction, something you may have missed. So when we are using saturate mode, we must specify signed or unsigned. Hence, there is an unsigned version of this instruction:-
vadduws v0,v1,v2 ; vector add unsigned integer word saturate
This saturate mode is not normally supplied on a microprocessor. But when we are dealing with large chunks of information it becomes a real bottleneck checking for overflow.
The other mode is modulo. This effectively means rollover. Values roll over (or roll under) as you would expect, with 255 plus 1 becoming 0 in a byte. With this mode the same instruction serves both signed and unsigned data.
Let's look at a typical modulo add:-
vadduwm v0,v1,v2 ; vector add unsigned integer word modulo
Hang on! Why the unsigned, this instruction works with both signed and unsigned numbers? We expect the assembler mnemonic creators wanted to keep it fairly symmetrical looking. Note, however, there is no "signed" equivalent.
These instructions can also perform half-word and byte operations, so the full set looks like this...
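vaddsbs v0,v1,v2 ; vector add signed integer byte saturate
vaddshs v0,v1,v2 ; vector add signed integer half-word saturate
vaddsws v0,v1,v2 ; vector add signed integer word saturate
vaddubs v0,v1,v2 ; vector add unsigned integer byte saturate
vadduhs v0,v1,v2 ; vector add unsigned integer half-word saturate
vadduws v0,v1,v2 ; vector add unsigned integer word saturate
vaddubm v0,v1,v2 ; vector add unsigned integer byte modulo
vadduhm v0,v1,v2 ; vector add unsigned integer half-word modulo
vadduwm v0,v1,v2 ; vector add unsigned integer word modulo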
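To make the difference between saturate and modulo concrete, here is a small scalar sketch in C of what happens to ONE byte element. The function names are our own inventions for illustration; the real instructions do this for all sixteen bytes at once.

```c
#include <stdint.h>

/* One element of vaddsbs: signed byte add, clamped to [-128, 127]. */
int8_t add_sbs(int8_t a, int8_t b) {
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}

/* One element of vaddubs: unsigned byte add, clamped to [0, 255]. */
uint8_t add_ubs(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return sum > UINT8_MAX ? (uint8_t)UINT8_MAX : (uint8_t)sum;
}

/* One element of vaddubm: byte add that wraps modulo 256. */
uint8_t add_ubm(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}
```

Note that add_ubm gives the right bit pattern for signed data too, which is why no separate signed modulo add is needed.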
There is an add instruction we have missed out.
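vaddcuw v0,v1,v2 ; vector add carryout unsigned word
This provides a way of getting the carry from several additions. The "amount carried" (0 or 1) for each element after addition is stored in the target register (v0 in this case), in the relevant places.
Unlike the other adds, the AltiVec instruction set defines the carry-out add for words only, so check the PEM before assuming byte or half flavors exist.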
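As a rough scalar model of the carry-out add for a single word element (the function name is hypothetical, and we assume the word form, vaddcuw):

```c
#include <stdint.h>

/* One element of a carry-out add: 1 if a + b overflows 32 bits, else 0. */
uint32_t add_carryout_uw(uint32_t a, uint32_t b) {
    uint64_t sum = (uint64_t)a + (uint64_t)b;
    return (uint32_t)(sum >> 32);
}
```

Adding this carry into the next word up is how you would build additions wider than 32 bits.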
A Little Note About Notation
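In the instruction descriptions that follow, vD is the destination register and vA, vB and vC are source registers. These can be any of the vector registers: v0, v1, v2, up to v31.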
There are several AltiVec special purpose registers, but by far the most important is the one called VRSAVE, which is special purpose register 256 (SPR 256).
The idea is fairly simple. When a context switch (such as an interrupt, or other preemptive task switch) occurs, the operating system must save all the registers currently being used. This is OK, but the amount of data being shipped out of the processor by the operating system can already be quite considerable, and with the vector registers it becomes even more. The 32 AltiVec registers together take up 512 bytes of memory (32 registers of 16 bytes each).
To minimize the overhead of the switch, the operating system needs only to save the registers currently being used, and this is where the VRSAVE register comes in.
Each bit represents whether a register is used or not used. Bit 0 of the 32-bit VRSAVE relates to v0, and bit 31 to v31. (Remember bits are defined from high to low in the PowerPC architecture, so bit 0 is the most significant bit!). A particular bit being zero means unused, and one means the particular vector register is being used.
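Because of the high-to-low bit numbering, the mask for a given register is easy to get backwards. A minimal sketch in C of the bit arithmetic (the helper names are ours, not Fantasm macros):

```c
#include <stdint.h>

/* Mask for vector register n (0..31); v0 maps to the most significant bit. */
uint32_t vrsave_bit(unsigned n) {
    return 0x80000000u >> (n & 31u);
}

/* Mark v[first]..v[last] as in use on top of an existing VRSAVE value. */
uint32_t vrsave_mark_range(uint32_t vrsave, unsigned first, unsigned last) {
    for (unsigned n = first; n <= last && n < 32u; n++)
        vrsave |= vrsave_bit(n);
    return vrsave;
}
```

So a routine using v0 to v2 would need 0xE0000000 ORed into VRSAVE.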
When you program in assembler, VRSAVE is maintained entirely by SOFTWARE (your program). If you do not mark the registers as in use, then the operating system will use them for other things, and ...bang... your program stops working. And it will turn out to be very difficult to find the problem. (If you are using the AltiVec unit from C, then the compiler automatically generates the necessary updates to VRSAVE).
You could of course define all registers as used, but that defeats the purpose of using the AltiVec unit, since you will just be slowing the processor down again.
The basic way to access the VRSAVE register is by the use of the mfspr and mtspr instructions, giving 256 as the SPR number.
There are (will be!) several Fantasm macros which allow you to define a particular register, or group of registers, as used or unused. They also include a debug mode, which ensures that all registers are marked as used. There are some notes on these macros, included with the macros, which you should read before assuming anything.
Finally, where do I put the VRSAVE changes? At the start of my program, at the start (and end) of each routine, or before each instruction that changes something? That very much depends upon how the register volatility rules work inside your program. The idea, however, is to make it so simple that it can't go wrong...
Loading and Saving vector registers
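The basic instructions are:-
lvebx vD,rA,rB ; load vector element byte indexed
lvehx vD,rA,rB ; load vector element half-word indexed
lvewx vD,rA,rB ; load vector element word indexed
lvx vD,rA,rB ; load vector indexed
lvxl vD,rA,rB ; load vector indexed, mark as least recently used
stvebx vS,rA,rB ; store vector element byte indexed
stvehx vS,rA,rB ; store vector element half-word indexed
stvewx vS,rA,rB ; store vector element word indexed
stvx vS,rA,rB ; store vector indexed
stvxl vS,rA,rB ; store vector indexed, mark as least recently used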
Some notes about these basic instructions:
ALL loads and stores will be aligned!
The idea of the lvxl and stvxl "mark as least recently used" is to stop data that you are only going to use once from hanging around in the cache and wasting valuable cache space that could be taken up by data that you will use more than once. Even in a fairly large cache, there aren't all that many sets of 16 bytes.
Also note that all these instructions are INDEXED. This means that the effective address is (rA|0) + (rB), or in English the sum of rA and rB, unless r0 is selected for rA, in which case it will be just rB.
A word about Alignment and Loads and Stores
All data must be aligned to its size, and if an unaligned address is supplied, the address will be aligned by ignoring the lowest bits, considering them 0 instead.
In the case of the LOAD ELEMENT and STORE ELEMENT instructions, a byte is always aligned. If it's a half then the bottom bit is ignored (making the address even). Words need to be 4 byte aligned, so the bottom 2 bits are ignored.
With whole vectors, LOAD VECTOR and STORE VECTOR need 16 byte aligned addresses. Without them the bottom 4 bits will be considered as 0.
This "considered as 0" means shifted DOWN (towards lower address) in memory to an aligned area. As you can imagine, stores can make a mess if you don't consider them carefully.
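The address rules above boil down to two small calculations. Here is a sketch of them in C, assuming 32-bit addresses; the function names and the ra_is_r0 flag are our own way of modelling the "(rA|0)" notation.

```c
#include <stdint.h>

/* (rA|0) + (rB): ra_is_r0 selects the literal 0 instead of rA's contents. */
uint32_t indexed_ea(int ra_is_r0, uint32_t ra, uint32_t rb) {
    return (ra_is_r0 ? 0u : ra) + rb;
}

/* Address actually used: low bits cleared to the access size (1, 2, 4 or 16). */
uint32_t aligned_ea(uint32_t ea, uint32_t size) {
    return ea & ~(size - 1u);
}
```

For example an lvx from 0x1027 really reads the 16 bytes starting at 0x1020, which is lower in memory than the address you asked for.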
This also applies to local variables and parameter variables, which we will talk about more below.
Two More "Load" instructions
These instructions, lvsl (load vector for shift left) and lvsr (load vector for shift right), don't do any loading from conventional memory. They actually calculate a "shift permutation vector" for use with unaligned data.
Reading unaligned data from memory usually means aligning it once it is in the vector registers. These instructions are used with the "vperm" instruction (see below) to do that, based upon the lowest 4 bits of the "effective address" you want to read from (calculated as an index from rA and rB). This 4-bit value is called the shift.
This allows you to read in two sets of 16 bytes, and then easily extract the 16 bytes in the middle that you are actually interested in.
The vperm instruction allows you to reorder the bytes of vector registers based upon another vector register, which contains the order you require them in - the permutation. We talk more about it below.
An example of this is as follows
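What follows is our own scalar model in C of the usual unaligned read: two aligned loads, a shift permutation vector, and a permute. The function names are hypothetical, mem stands for a block of memory whose start is 16-byte aligned, and ea is an offset into it.

```c
#include <stdint.h>
#include <string.h>

/* lvsl model: control bytes sh, sh+1, ..., sh+15 where sh = low 4 bits of EA. */
void lvsl_model(uint32_t ea, uint8_t ctl[16]) {
    for (int i = 0; i < 16; i++)
        ctl[i] = (uint8_t)((ea & 15u) + (uint32_t)i);
}

/* vperm model: each control byte selects from the 32 bytes of a followed by b. */
void vperm_model(const uint8_t a[16], const uint8_t b[16],
                 const uint8_t ctl[16], uint8_t d[16]) {
    for (int i = 0; i < 16; i++) {
        unsigned sel = ctl[i] & 31u;
        d[i] = sel < 16u ? a[sel] : b[sel - 16u];
    }
}

/* Unaligned 16-byte read at offset ea; mem must cover both aligned blocks. */
void load_unaligned(const uint8_t *mem, uint32_t ea, uint8_t d[16]) {
    uint8_t lo[16], hi[16], ctl[16];
    memcpy(lo, mem + (ea & ~15u), 16);          /* lvx of the first block      */
    memcpy(hi, mem + ((ea + 15u) & ~15u), 16);  /* lvx of the block with tail  */
    lvsl_model(ea, ctl);
    vperm_model(lo, hi, ctl, d);
}
```

Using ea+15 for the second load means an already aligned address just reads the same block twice, rather than touching memory beyond the data.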
For a start, the comparison instructions (both integer and floating point) are the only vector instructions that have a record bit, and can therefore set the condition register. Furthermore, setting the record bit can only affect CR6.
The comparison instructions take two input vectors, and produce an output "true or false for each element" vector. True is represented by all 1's in the bits of an element, and false by all 0's in the bits of an element.
The record bit gives a combined result for all elements in CR6. If all elements are true then bit 0 of CR6 is set; if all elements are false then bit 2 of CR6 is set. The other bits of CR6, bits 1 and 3, are not used and are set to 0.
Notice: there are no vector instructions for other conditions, but with these, in addition to inverting result vectors, it is possible to synthesise any condition required (see PEM).
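As an illustration, here is a scalar C sketch of a word "compare equal" with the record bit set. The function name is ours, array index 0 is the leftmost element, and the CR6 field is returned as a 4-bit number with bit 0 being the most significant (value 8).

```c
#include <stdint.h>

/* Writes the per-element mask to d and returns the 4-bit CR6 field:
   value 8 (bit 0) = all elements true, value 2 (bit 2) = all false. */
unsigned cmpequw_rc(const uint32_t a[4], const uint32_t b[4], uint32_t d[4]) {
    int all_true = 1, all_false = 1;
    for (int i = 0; i < 4; i++) {
        d[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
        if (d[i]) all_false = 0; else all_true = 0;
    }
    return (unsigned)((all_true << 3) | (all_false << 1));
}
```

A mixed result leaves CR6 at 0, so "any element true" is tested as "not all false".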
Other Arithmetic instructions.
It's time to go over several arithmetic instructions. Because we will just be giving a brief overview, we suggest you go back and re-read the section entitled "A simple AltiVec add instruction" which contains the basic rules for arithmetic instructions.
These are simpler than they appear. Three rules:-
1. Signed/Unsigned - defines the sign of the two source operands and the destination.
2. The destination is a size larger than the operands.
3. Even/Odd - Chooses which of the source elements are chosen as operands from the vector. Odd = element 1, 3, 5, etc. Even = element 0, 2, 4, etc. Remember elements, like bits, are numbered from MSB (most significant bit) to LSB (least significant bit). If you write the number down in traditional text format, that's from left to right.
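The three rules can be sketched in scalar C as follows, taking the unsigned half-word forms as the example. The function names are ours and array index 0 is the leftmost (most significant) element.

```c
#include <stdint.h>

/* Multiply even: half-words 0, 2, 4, 6 of each source give four word results. */
void mule_uh(const uint16_t a[8], const uint16_t b[8], uint32_t d[4]) {
    for (int i = 0; i < 4; i++)
        d[i] = (uint32_t)a[2 * i] * (uint32_t)b[2 * i];
}

/* Multiply odd: half-words 1, 3, 5, 7 of each source give four word results. */
void mulo_uh(const uint16_t a[8], const uint16_t b[8], uint32_t d[4]) {
    for (int i = 0; i < 4; i++)
        d[i] = (uint32_t)a[2 * i + 1] * (uint32_t)b[2 * i + 1];
}
```

Because the destination is twice the size of the operands, the product can never overflow, so there is no saturate or modulo choice to make here.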
The key points are (a) they multiply half-words, and (b) they do an add afterwards.
Basically these instructions form a way of multiplying two sets of 8 half-words, and adding a third vector. The vmladduhm deals with the LOW part (16 bits) of the result (since two half-words multiplied together really need a word to store the product), and therefore can use either signed or unsigned data.
The HIGH instructions ignore the low part, and store the high part. The rounding version is used if you are not interested in the low part, and will round up by adding 0x00004000 (half of the least significant bit of the destination once reduced to 16 bits).
These instructions obviously use the saturate mode, but it's when ADDing that this comes into play. "vC" is effectively added to the destination once it has been reduced to 16 bits.
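Here is our reading of these three instructions as scalar C, one half-word element at a time. The function names are hypothetical, we take the HIGH forms to be vmhaddshs and vmhraddshs, and the code assumes your C compiler shifts negative numbers right arithmetically (most do).

```c
#include <stdint.h>

/* Clamp a 32-bit value into the signed half-word range. */
int16_t sat16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* HIGH form: top part of a*b (product >> 15), plus c, saturated. */
int16_t mhadd_shs(int16_t a, int16_t b, int16_t c) {
    int32_t prod = (int32_t)a * (int32_t)b;
    return sat16((prod >> 15) + c);
}

/* HIGH and ROUND form: as above, but 0x4000 is added before the shift. */
int16_t mhradd_shs(int16_t a, int16_t b, int16_t c) {
    int32_t prod = (int32_t)a * (int32_t)b + 0x4000;
    return sat16((prod >> 15) + c);
}

/* LOW form (vmladduhm): low 16 bits of a*b + c, wrapping modulo 65536. */
uint16_t mladd_uhm(uint16_t a, uint16_t b, uint16_t c) {
    return (uint16_t)((uint32_t)a * (uint32_t)b + (uint32_t)c);
}
```

The HIGH forms are what you want for 1.15 fixed point fractions, where 0x4000 means 0.5.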
The next set are also multiply instructions, but this time it's multiply, then sum across, then add. Take heed of the saturate and modulo (last character in the assembler mnemonic) however!
These all do fairly similar things.
Using this configuration they:-
The only other thing to watch is that the MIXED BYTE form assumes signed in "vA" and unsigned in "vB" and adds and stores as signed.
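A scalar C sketch of ONE word's worth of the mixed byte form may help. The function name is ours; we assume this form is vmsummbm, which wraps modulo 2^32.

```c
#include <stdint.h>

/* Four signed bytes from vA times four unsigned bytes from vB, summed
   together with the signed word from vC, wrapping modulo 2^32. */
int32_t msum_mbm(const int8_t a[4], const uint8_t b[4], int32_t c) {
    uint32_t acc = (uint32_t)c;
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
    return (int32_t)acc;
}
```

The full instruction does this for each of the four words in the vector, so sixteen multiplies happen at once.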
The sum instructions
And some Logical instructions
These are all similar - they perform logical operations on
vector data held in two source registers and place the result
in the destination register.
Packing, Unpacking, Merge, splat and permute
Vector Pack instructions. These truncate either half-words to bytes or words to half-words.
The instructions take the data from two vector registers and
place the result in the destination vector register.
vpkuwum, vpkuwus, vpkswus, vpkswss - truncate the eight words
from two concatenated source operands producing a single result
of eight half words
A special form of pack, vpkpx, is provided that packs eight 32-bit (8/8/8/8) pixels from two concatenated source operands into a single result of eight 16-bit 1/5/5/5 aRGB pixels.
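For one element, the packs can be modelled in C roughly like this. The function names are ours, and the pixel version follows our reading of the PEM: the lowest bit of the top byte becomes the 1-bit field, and the top 5 bits of each of the other three bytes are kept.

```c
#include <stdint.h>

/* Saturating pack element (vpkswss style): signed word to signed half-word. */
int16_t pack_swss(int32_t w) {
    if (w > INT16_MAX) return INT16_MAX;
    if (w < INT16_MIN) return INT16_MIN;
    return (int16_t)w;
}

/* Modulo pack element (vpkuwum style): keep the low 16 bits. */
uint16_t pack_uwum(uint32_t w) {
    return (uint16_t)w;
}

/* Pixel pack element (vpkpx style): 8/8/8/8 down to 1/5/5/5. */
uint16_t pack_px(uint32_t p) {
    uint32_t a = (p >> 24) & 1u;
    uint32_t r = (p >> 19) & 0x1Fu;
    uint32_t g = (p >> 11) & 0x1Fu;
    uint32_t b = (p >> 3) & 0x1Fu;
    return (uint16_t)((a << 15) | (r << 10) | (g << 5) | b);
}
```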
Vector Unpack Instructions. These are the opposite of Pack. They expand either byte or half-word sized data to half-words or words respectively. Because the PowerPC instruction set never allows more than one destination, the instructions are split into low order and high order operations.
vupkhsb,vupkhsh - Vector Unpack High Signed Integer. Each
signed integer element in the high order 8 bytes of vB is sign
extended to fill the MSBs in a signed integer and then is placed
into vD.
vupklsb,vupklsh - Vector Unpack Low Signed Integer. Each signed
integer element in the low order 8 bytes of vB is sign extended
to fill the MSBs in a signed integer and then is placed into
vD.
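A scalar C sketch of the byte forms, with hypothetical function names and array index 0 as the leftmost (high order) element:

```c
#include <stdint.h>

/* vupkhsb model: sign extend the high order 8 bytes into 8 half-words. */
void unpack_high_sb(const int8_t b[16], int16_t d[8]) {
    for (int i = 0; i < 8; i++)
        d[i] = (int16_t)b[i];
}

/* vupklsb model: sign extend the low order 8 bytes into 8 half-words. */
void unpack_low_sb(const int8_t b[16], int16_t d[8]) {
    for (int i = 0; i < 8; i++)
        d[i] = (int16_t)b[8 + i];
}
```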
A special purpose form of vector unpack is provided, the Vector Unpack Low Pixel (vupklpx) and the Vector Unpack High Pixel (vupkhpx) instructions for conversion of 1/5/5/5 aRGB pixels to four 32 bit pixels.
Vector Merge Instructions
Vector Merge High Integer
Each integer element in the high order 8 bytes of vA is placed, in order, into an even-numbered element of vD. Each integer element in the high order 8 bytes of vB is placed into the odd-numbered element that follows it. The result interleaves the two sources: A0, B0, A1, B1 and so on.
Vector Merge Low Integer
Each integer element in the low order 8 bytes of vA is placed, in order, into an even-numbered element of vD. Each integer element in the low order 8 bytes of vB is placed into the odd-numbered element that follows it. Again the result interleaves the two sources.
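The interleaving is easiest to see in code. This is a scalar C model of the byte forms (we assume the mnemonics vmrghb and vmrglb; the function names are ours), with array index 0 as the leftmost element:

```c
#include <stdint.h>

/* Merge high, bytes: interleave the first 8 bytes of a and b. */
void merge_high_b(const uint8_t a[16], const uint8_t b[16], uint8_t d[16]) {
    for (int i = 0; i < 8; i++) {
        d[2 * i] = a[i];
        d[2 * i + 1] = b[i];
    }
}

/* Merge low, bytes: interleave the last 8 bytes of a and b. */
void merge_low_b(const uint8_t a[16], const uint8_t b[16], uint8_t d[16]) {
    for (int i = 0; i < 8; i++) {
        d[2 * i] = a[8 + i];
        d[2 * i + 1] = b[8 + i];
    }
}
```

Merging with a vector of zeros as the first source is a handy way to zero extend unsigned data, since the unpack instructions only sign extend.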
Vector Splat Instructions
These instructions replicate a single value into all elements of a vector register: either one element of another vector register, or a small signed immediate. This is generally used where some constant value is needed in an arithmetic operation.
Vector Splat Integer
Vector Splat Immediate Signed Integer
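A scalar C sketch of both kinds of splat follows. The function names are ours; we assume the immediate forms take a 5-bit signed field, so only the values -16 to 15 are available.

```c
#include <stdint.h>

/* Splat element: copy byte number idx of b into every byte of d. */
void splat_b(const uint8_t b[16], unsigned idx, uint8_t d[16]) {
    uint8_t value = b[idx & 15u];
    for (int i = 0; i < 16; i++)
        d[i] = value;
}

/* Value of a 5-bit signed immediate field once sign extended. */
int32_t simm5_value(unsigned field) {
    field &= 0x1Fu;
    return (field & 0x10u) ? (int32_t)field - 32 : (int32_t)field;
}

/* Splat immediate, words: the sign extended immediate in all four words. */
void splat_is_w(unsigned field, int32_t d[4]) {
    for (int i = 0; i < 4; i++)
        d[i] = simm5_value(field);
}
```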
Vector Permute Instruction - vperm
This is a very powerful instruction with applications ranging from data translation to alignment. It allows any byte from either of two source registers to be placed in any byte of the destination register. This instruction takes four vector registers as its operands:
vD is the destination register.
vA and vB are the two source registers.
vC is the permute control register.
Each byte of vC specifies which byte from vA or vB is copied into the corresponding byte of vD, working from left to right. The low 5 bits of the control byte select one of the 32 source bytes, with 0x00-0x0F meaning vA and 0x10-0x1F meaning vB. Thus if the most significant byte of vC is 0x10, then the most significant byte of vD is filled with the most significant byte of vB.
Vector Shift Left by Octet and Vector Shift Right by Octet
Vector Shift Left, Vector Shift Right
A pair of vslo and vsl or vsro and vsr instructions can be
used to shift the contents of a vector register left or right
by the number of bits specified in the shift count register (0-127):
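The pairing works because the octet shift handles the top 4 bits of the count and the bit shift handles the bottom 3. A scalar C model of the left shift (the function names are ours, and the shift count is passed as a plain number rather than in a register):

```c
#include <stdint.h>

/* vslo step: shift left by whole bytes (0..15), zero fill on the right. */
void shift_left_octets(const uint8_t v[16], unsigned nbytes, uint8_t d[16]) {
    for (unsigned i = 0; i < 16; i++)
        d[i] = (i + nbytes < 16u) ? v[i + nbytes] : 0;
}

/* vsl step: shift the whole 128-bit value left by 0..7 bits. */
void shift_left_bits(const uint8_t v[16], unsigned nbits, uint8_t d[16]) {
    for (unsigned i = 0; i < 16; i++) {
        unsigned next = (i < 15u) ? v[i + 1] : 0u;
        d[i] = (uint8_t)((((unsigned)v[i] << nbits) | (next >> (8u - nbits))) & 0xFFu);
    }
}

/* The pair together: a full 128-bit shift left by 0..127 bits. */
void shift_left_128(const uint8_t v[16], unsigned count, uint8_t d[16]) {
    uint8_t tmp[16];
    count &= 127u;
    shift_left_octets(v, count >> 3, tmp);
    shift_left_bits(tmp, count & 7u, d);
}
```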
Vector Rotate Left
All of the above shifts also have rotate versions, for example, Vector Rotate Left Half - vrlh.
There are also algebraic forms of vector shift right - vsrab is vector shift right algebraic byte. In this case the sign bit is replicated into the vacated most significant bits of the element.
A useful shift is Vector Shift Left Double by Octet Immediate.
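In scalar C terms this instruction (we assume the mnemonic vsldoi) joins two registers end to end and picks out 16 bytes starting at the immediate offset. The function name below is ours.

```c
#include <stdint.h>

/* Treat a followed by b as 32 bytes and copy out bytes sh..sh+15 (sh = 0..15). */
void sldoi(const uint8_t a[16], const uint8_t b[16], unsigned sh, uint8_t d[16]) {
    sh &= 15u;
    for (unsigned i = 0; i < 16; i++) {
        unsigned src = i + sh;
        d[i] = src < 16u ? a[src] : b[src - 16u];
    }
}
```

With the same register for both sources this becomes a rotate by bytes, and with consecutive blocks of a data stream it gives a sliding window.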
Floating point instructions
The vector floating-point arithmetic instructions are split into these groups:
Vector floating-point arithmetic instructions
Floating Point arithmetic
Vector Add Floating Point - Add the four 32 bit floating point
elements in vA with the four 32 bit floating point elements in
vB. Round the results to the nearest single precision floating
point results and store in vD
Vector Subtract Floating Point - Subtract the four 32 bit
floating point elements in vB from the four 32 bit floating point
elements in vA. Round the results to the nearest single precision
floating point results and store in vD
Vector Maximum Floating Point - For each pair of elements
select the largest and store in vD
Vector Minimum Floating Point - For each pair of elements
select the smallest and store in vD
FP multiply-add instructions
The full product from the multiply stage is used in the add operation and then the result is rounded. This leads to greater accuracy.
To perform a simple multiply with AltiVec use zero as the add operand. Floating point division and square root operations require multiple instructions and are detailed in the PEM.
Division is accomplished by multiplying the dividend by the reciprocal of the divisor (x/y = x * 1/y), and square root by multiplying the original number by its reciprocal square root (sqrt(x) = x * 1/sqrt(x)).
Vector Multiply Add Floating Point - vmaddfp
Vector Negative Multiply Subtract Floating Point - vnmsubfp
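As a sketch of how these two instructions combine, here is one element modelled in C, plus one Newton-Raphson step that sharpens a reciprocal estimate. The function names are ours. We compute in double and round once at the end, which only approximates the single rounding of the real hardware.

```c
/* vmaddfp element: (a * c) + b, rounded once at the end. */
float maddfp(float a, float c, float b) {
    return (float)((double)a * (double)c + (double)b);
}

/* vnmsubfp element: -((a * c) - b), rounded once at the end. */
float nmsubfp(float a, float c, float b) {
    return (float)-((double)a * (double)c - (double)b);
}

/* One Newton-Raphson step: improves est as an approximation of 1/d. */
float recip_refine(float d, float est) {
    float err = nmsubfp(d, est, 1.0f);  /* 1 - d*est, the remaining error */
    return maddfp(est, err, est);       /* est + est*err                  */
}
```

Each step roughly doubles the number of correct bits in the estimate; the PEM gives the official sequences.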
Floating Point Rounding Instructions
All these rounding instructions round four floating point singles held in vB and store the results in vD
Vector Round to Floating Point Integer Nearest
Vector Round to Floating Point Integer toward zero
Vector Round to Floating Point Integer toward Positive Infinity
Floating Point Conversion Instructions
Integer to floating point
Vector Convert from Unsigned Fixed Point Word
Vector Convert from Signed Fixed Point Word
Floating point to integer
Vector Convert to Unsigned Fixed Point Word Saturate
Vector Convert to Signed Fixed Point Word Saturate
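The conversions also take a small immediate that scales by a power of two, which is what makes them "fixed point". Below is our scalar C reading of the signed forms (we assume the mnemonics vcfsx and vctsxs; the function names and the uimm parameter name are ours).

```c
#include <stdint.h>

/* To signed fixed point, saturating: scale by 2^uimm, truncate toward
   zero, clamp to the 32-bit signed range. NaN is modelled as giving 0. */
int32_t cts_xs(float f, unsigned uimm) {
    double x = (double)f * (double)(1u << (uimm & 31u));
    if (x != x) return 0;
    if (x >= 2147483647.0) return INT32_MAX;
    if (x <= -2147483648.0) return INT32_MIN;
    return (int32_t)x;
}

/* From signed fixed point: convert to float, then divide by 2^uimm. */
float cf_sx(int32_t w, unsigned uimm) {
    return (float)((double)w / (double)(1u << (uimm & 31u)));
}
```

With uimm set to 0 these are plain integer conversions.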
Floating Point Compares
All these instructions can accept the record, or dotted form. In this case cr6 is set as per the integer compare instructions.
Vector Compare Greater Than Floating-Point [Record]
Vector Compare Equal to Floating-Point [Record]
Vector Compare Greater Than or Equal to Floating-Point [Record]
Vector Compare Bounds Floating-Point [Record]
If Rc=1, bit 2 of CR6 is set when all four elements in vA are within the bounds specified by the corresponding elements in vB, that is, -vB <= vA <= vB.
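The bounds compare is the odd one out, since its result is not a simple all 1's or all 0's mask. A scalar C sketch of our reading of it (hypothetical function names; bit 0 is the most significant bit):

```c
#include <stdint.h>

/* One element: bit 0 set if a is above b, bit 1 set if a is below -b,
   zero if a is within bounds. A NaN fails both tests and sets both bits. */
uint32_t cmpb_fp(float a, float b) {
    uint32_t r = 0;
    if (!(a <= b))  r |= 0x80000000u;
    if (!(a >= -b)) r |= 0x40000000u;
    return r;
}

/* Record form: 4-bit CR6 field, value 2 (bit 2) when all four are in bounds. */
unsigned cmpb_fp_cr6(const float a[4], const float b[4]) {
    for (int i = 0; i < 4; i++)
        if (cmpb_fp(a[i], b[i]) != 0u)
            return 0u;
    return 2u;
}
```

This makes it a one-instruction clip test, for example checking x, y and z against w.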
Floating Point Estimate Instructions
Vector Reciprocal Estimate Floating-Point
Vector Reciprocal Square Root Estimate Floating-Point
Vector Log2 Estimate Floating-Point
Vector 2 Raised to the Exponent Estimate Floating-Point
Data stream instructions and the Cache
The Data stream instructions allow you to let the processor know where to fetch vector data from. These minimise stalls whilst data is brought into the processor.
Why? Let's imagine we are processing a whole chunk of data. That data will probably not be in the cache, and so it will have to be fetched from main memory. This can be really slow, and in certain cases negate the advantage of having vector instructions. Hence, you can see the importance of getting the data into the cache.
For introduction purposes, the AltiVec Fact Sheet, and AltiVec White Paper give some ideas on the basic concepts.
For a reference, the AltiVec Technology PEM (programming environments manual) is fairly good.
Lightsoft's Assembler - Fantasm 5 - also supports AltiVec instructions.
All of these references and more are available on Lightsoft's AltiVec page.