AltiVec: Introduction to Programming (LSA0042a)

1 Mar 99
Introduction to AltiVec assembly language.

(c) 1998/9 Lightsoft Software. All rights reserved.

Auth: Robert Probin and Stuart Ball
Doc Ref: LSA0042a

Introduction
This guide is intended to provide an insight into programming the new exciting technology called AltiVec.

As an introductory guide it assumes a basic knowledge of PowerPC and the PowerPC integer instructions. A knowledge of the PowerPC floating-point instructions is a definite advantage.

Lightsoft make no claim as to the accuracy of this information, and provide it on an as-is basis. Please check with the relevant official AltiVec documentation in all cases. As a guide we may distort the truth in order to convey information!

Also a warning - don't rely on the emulator as a sufficient testing ground for a real application. If you are going to release, try it on real AltiVec hardware.

The registers
There are 32 vector registers, each 128 bits wide, labeled v0 to v31. EACH one breaks down into either four 32-bit words, eight 16-bit halfwords, or 16 bytes.

Basic principles of operation
As you probably know, the idea is to work on several sets of data simultaneously with one instruction. This is why one vector register "sub-divides" into multiple elements. Each of these subdivided pieces of data is worked on in parallel.

The Instructions
Rather than concentrating on instruction groupings (which you can get from the AltiVec PEM - see references), let's consider the format of the instructions.

To the untrained eye, the PowerPC instructions can be a little cryptic, and the vector instructions only seem to complicate this.

So let's have some rules.

Nearly all AltiVec instructions have a "v" in them, most of them at the beginning. The only ones that don't have a "v" at the beginning are the vector load and store instructions, which have the "v" in the second position ("lv....") and the third position ("stv....") respectively.

The one group which does NOT have a "v" in it at all is the data stream (touch) instructions. These instructions allow you to let the processor know where to fetch the data from ahead of time. They minimize delays and stalls whilst data is brought into the processor. More about this later.

A simple AltiVec add instruction
So let's get right down to it and examine a typical AltiVec instruction:-

`vaddsws v0,v1,v2 ; vector add signed integer word saturate`

This adds four pairs of words (32 bits each) simultaneously. The "target" is v0, and the "source" operands are v1 and v2. This is effectively v0=v1+v2, but remember this is FOUR additions, not one.

The word saturate means take the value to "maximum" or "minimum", but no more. So for a byte (signed) that's 127 maximum and -128 minimum. For 16-bit halfwords that's 32767 max and -32768 minimum, and for 32-bit words that's (2^31)-1 max to -(2^31) min.

That's why there is "signed" in the add instruction, something you may have missed. When we are using saturate mode, we must specify signed or unsigned. Hence, there is an unsigned version of this instruction:-

`vadduws v0,v1,v2 ; vector add unsigned integer word saturate.`

This saturate mode is not normally supplied on a microprocessor. But when we are dealing with large chunks of information it becomes a real bottleneck checking for overflow.

The other mode is modulo. This effectively means rollover: values roll over (or roll under) as you would expect, with an unsigned byte of 255 plus 1 becoming 0. In this mode the same instruction serves for both signed and unsigned data.

Let's look at a typical modulo add:-
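To make the difference between the two modes concrete, here is a minimal sketch in plain C of what happens in ONE 32-bit element. It is a scalar model only, not AltiVec code, and the function names are made up for illustration.

```c
#include <stdint.h>

/* One 32-bit element of vaddsws: signed add, clamped to the int32_t range. */
int32_t lane_add_signed_saturate(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* wide enough never to overflow */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* One 32-bit element of vadduws: unsigned add, clamped at 0xFFFFFFFF. */
uint32_t lane_add_unsigned_saturate(uint32_t a, uint32_t b)
{
    uint64_t sum = (uint64_t)a + (uint64_t)b;
    return sum > UINT32_MAX ? UINT32_MAX : (uint32_t)sum;
}

/* One 32-bit element of vadduwm: the result wraps around (modulo 2^32). */
uint32_t lane_add_modulo(uint32_t a, uint32_t b)
{
    return a + b;   /* unsigned arithmetic in C wraps by definition */
}
```

The modulo version produces the same bit pattern whether you think of the inputs as signed or unsigned, which is why one instruction is enough for both.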

`vadduwm v0,v1,v2 ;vector add unsigned integer word modulo.`

Hang on! Why the "unsigned", when this instruction works with both signed and unsigned numbers? We expect the assembler mnemonic creators wanted to keep it fairly symmetrical looking. Note, however, that there is no "signed" equivalent.

These instructions can also perform half-word and byte operations, so the full set looks like this...

vaddsbs v0,v1,v2 ; vector add signed integer byte saturate
vaddshs v0,v1,v2 ; vector add signed integer half saturate
vaddsws v0,v1,v2 ; vector add signed integer word saturate
vaddubs v0,v1,v2 ; vector add unsigned integer byte saturate
vadduhs v0,v1,v2 ; vector add unsigned integer half saturate
vadduws v0,v1,v2 ; vector add unsigned integer word saturate
vaddubm v0,v1,v2 ; vector add unsigned integer byte modulo
vadduhm v0,v1,v2 ; vector add unsigned integer half modulo
vadduwm v0,v1,v2 ; vector add unsigned integer word modulo

There is an add instruction we have missed out.

`vaddcuw v0,v1,v2 ; vector add carry unsigned integer word`

This provides a way of getting the carry from several additions. The "amount carried" for each element after addition is stored in the target register (v0 in this case), in the relevant places.

Unlike the other adds, this one is only available in the word size.
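As a rough illustration of the carry-out add, here is a plain C model of the word form. It is a sketch of the behaviour described above rather than real AltiVec code, and the helper names are invented for this example.

```c
#include <stdint.h>

/* Carry out of one 32-bit element: 1 if a + b does not fit in 32 bits, else 0. */
uint32_t lane_add_carry_out(uint32_t a, uint32_t b)
{
    uint64_t sum = (uint64_t)a + (uint64_t)b;
    return (uint32_t)(sum >> 32);
}

/* Hypothetical helper: apply the element model to all four words of a vector. */
void add_carry_out_words(uint32_t d[4], const uint32_t a[4], const uint32_t b[4])
{
    for (int i = 0; i < 4; i++)
        d[i] = lane_add_carry_out(a[i], b[i]);
}
```

Pairing this with a modulo add of the same operands gives you both halves of a wider addition.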

A Little Note About Notation
Notice, normally we'll specify vD (instead of v0) for the destination, and vS for the source, with vA,vB,vC as other operands.

These can be any of the vector registers, v0,v1,v2, up to v31.

VRSAVE
IMPORTANT! Without this your program will probably crash!

There are several AltiVec special purpose registers, but by far the most important is the one called VRSAVE, which is special purpose register 256 (SPR 256).

The idea is fairly simple. When a context switch (such as an interrupt, or other preemptive task switch) occurs, the operating system must save all the registers currently being used. This is OK, but the amount of data being shipped out of the processor by the operating system can already be quite considerable, and with the vector registers this becomes even more. The full set of 32 AltiVec registers takes up 512 bytes of memory.

To minimize the overhead of the switch, the operating system needs only to save the registers currently being used, and this is where the VRSAVE register comes in.

Each bit represents whether a register is used or not used. Bit 0 of the 32-bit VRSAVE relates to v0, and bit 31 to v31. (Remember bits are defined from high to low in the PowerPC architecture, so bit 0 is the most significant bit!). A particular bit being zero means unused, and one means the particular vector register is being used.
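Because of the high-to-low bit numbering it is easy to get the mask backwards. The following plain C sketch shows how the mask value for a given register could be worked out; the function names are our own and are not part of any Fantasm macro set.

```c
#include <stdint.h>

/* VRSAVE mask bit for vector register vN (N = 0..31).
   PowerPC bit 0 is the most significant bit, so v0 maps to 0x80000000. */
uint32_t vrsave_bit(unsigned n)
{
    return 0x80000000u >> (n & 31u);
}

/* Mark registers v<first>..v<last> (inclusive) as in use in an existing mask. */
uint32_t vrsave_mark_range(uint32_t mask, unsigned first, unsigned last)
{
    for (unsigned n = first; n <= last && n < 32u; n++)
        mask |= vrsave_bit(n);
    return mask;
}
```

For example, a routine that uses v0 to v3 would need the top four bits of VRSAVE set, which is the value 0xF0000000.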

As an assembler programmer this is maintained entirely by SOFTWARE (your program). If you do not mark the registers as in use, then the operating system is free not to preserve them across a context switch, and ...bang... your program stops working. And it will turn out to be very difficult to find the problem. (If you are using the AltiVec unit from C, then the compiler automatically generates the necessary updates to VRSAVE).

You could of course define all registers as used, but that defeats the purpose of using the AltiVec unit, since you will just be slowing the processor down again.

The basic way to access the VRSAVE register is by the use of the mfspr and mtspr instructions.

There are (will be!) several Fantasm macros which allow you to define a particular register, or group of registers, as used or unused. They also include a debug mode, which ensures that all registers are marked as used. There are some notes on these macros, included with the macros, which you should read before assuming anything.

Finally, where do I put the VRSAVE changes? At the start of my program, at the start (and end) of each routine, or before each instruction that changes something? That very much depends upon how the volatility rules work inside your program. The idea, however, is to make it so simple that it can't go wrong...

Loading and Saving vector registers
Let's talk about getting data in and out of the processor. There are some rules about alignment you should know, but we'll leave that for a minute.

The basic instructions are:-

lvebx vD,rA,rB ; load vector element byte indexed
stvebx vS,rA,rB ; store vector element byte indexed
lvehx vD,rA,rB ; load vector element halfword indexed
stvehx vS,rA,rB ; store vector element halfword indexed
lvewx vD,rA,rB ; load vector element word indexed
stvewx vS,rA,rB ; store vector element word indexed
lvx vD,rA,rB ; load vector indexed
stvx vS,rA,rB ; store vector indexed
lvxl vD,rA,rB ; load vector indexed, but mark as least recently used
stvxl vS,rA,rB ; store vector indexed, but mark as least recently used

Some notes about these basic instructions:
"vS" is vector source, "vD" is vector destination.
The LOAD ELEMENT and the STORE ELEMENT instructions transfer a single byte, half or word. The element does not go to the low-order end of the register: it occupies the position within the vector that matches its position within the aligned 16-byte block of memory (given by the low 4 bits of the address).
The LOAD VECTOR loads the entire 128 bit (16 bytes/8 halfs/4 words) vector from memory. The STORE VECTOR does the opposite.
The LOAD VECTOR ELEMENT leaves undefined data in the rest of the vector. This means that if you load a byte, only that one byte position holds anything useful, and you cannot rely on the rest of the vector register containing anything useful.

ALL loads and stores will be aligned!

The idea of the lvxl and stvxl "mark as least recently used" is to stop data that you are only going to use once from hanging around in the cache and wasting valuable cache space that could be taken up by data that you will use more than once. Even in a fairly large cache, there aren't all that many sets of 16 bytes.

Also note that all these instructions are INDEXED. This means that the effective address is (rA|0) + (rB), or in English the sum of rA and rB, unless r0 is selected for rA, in which case it will be just rB.

A word about Alignment and Loads and Stores
The vector unit is very rigid about its ability to accept unaligned data. It simply doesn't!

All data must be aligned to its size, and if an unaligned address is supplied, the address will be aligned by ignoring the lowest bits, considering them 0 instead.

In the case of the LOAD ELEMENT and STORE ELEMENT instructions, if it's a byte, then it's always aligned. If it's a half then the bottom bit is ignored (making it even). Words need to be 4 byte aligned.

With whole vectors, LOAD VECTOR and STORE VECTOR need 16 byte aligned addresses. Without them the bottom 4 bits will be considered as 0.

This "considered as 0" means shifted DOWN (towards lower address) in memory to an aligned area. As you can imagine, stores can make a mess if you don't consider them carefully.
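The address handling described above can be summarised in a few lines of plain C. This is only a model assuming 32-bit addresses, and the function names are invented for the illustration.

```c
#include <stdint.h>

/* Effective address of an indexed access: (rA|0) + rB.
   ra_is_r0 models the rule that selecting r0 for rA means "use zero". */
uint32_t effective_address(int ra_is_r0, uint32_t ra, uint32_t rb)
{
    return (ra_is_r0 ? 0u : ra) + rb;
}

/* Address actually used by lvx/stvx: the low 4 bits are treated as zero. */
uint32_t align_vector(uint32_t ea) { return ea & ~0xFu; }

/* Address actually used by lvewx/stvewx: the low 2 bits are treated as zero. */
uint32_t align_word(uint32_t ea) { return ea & ~0x3u; }

/* Address actually used by lvehx/stvehx: the low bit is treated as zero. */
uint32_t align_half(uint32_t ea) { return ea & ~0x1u; }
```

So a vector store aimed at address 0x1007 silently lands at 0x1000 and overwrites the 16 bytes from there, which is exactly the kind of mess mentioned above.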

This also applies to local variables and parameter variables, which we will talk about more below.

Two More "Load" instructions

`lvsl vD,rA,rB ; load vector for shift left`
`lvsr vD,rA,rB ; load vector for shift right`

These instructions don't do any loading from conventional memory. They actually calculate a "shift permutation vector" for use with unaligned data.

Basically, reading from memory is quite often a case of aligning the data into the vector processor's registers. These instructions are used with the "vperm" instruction (see below) to format that data, based upon the lowest 4 bits of the "effective address" you used to read from (calculated as an index from rA and rB). This value is called the shift.

This allows you to read in two sets of 16 bytes, and then easily extract the middle 16 bytes you are actually interested in.

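The vperm instruction is an instruction that allows you to rearrange the bytes in a vector register based upon another vector register, which contains what order you require them in - the permutation. We talk more about it below.

An example of this is as follows:
`; r3 contains the address of some target data. It isn't 16 byte aligned.`
`li r4,16`
`lvx v0,0,r3 ; the aligned block containing the first byte`
`lvx v1,r4,r3 ; the next aligned block`
`lvsl v2,0,r3 ; permutation built from the low 4 bits of r3`
`vperm v3,v0,v1,v2 ; the data is now in v3`

There is also a mirrored form, using lvsr with the two loads swapped. Be careful: as far as we can tell this is the form the PEM gives for little-endian mode, and in big-endian mode it does not select the same bytes as the sequence above, so check the PEM before treating the two as interchangeable.
`li r4,16`
`lvx v0,r4,r3`
`lvx v1,0,r3`
`lvsr v2,0,r3`
`vperm v3,v0,v1,v2`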
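To see why the lvsl and vperm pairing works, here is a small plain C model of the big-endian case, using a byte array to stand in for memory. It is a sketch of the behaviour as we understand it; the model_ names are invented and nothing here is real AltiVec code.

```c
#include <stdint.h>
#include <string.h>

/* lvsl model: the result is sh, sh+1, ..., sh+15,
   where sh is the low 4 bits of the effective address. */
void model_lvsl(uint8_t ctl[16], uint32_t ea)
{
    uint8_t sh = (uint8_t)(ea & 0xFu);
    for (int i = 0; i < 16; i++)
        ctl[i] = (uint8_t)(sh + i);
}

/* vperm model: the low 5 bits of each control byte pick one byte
   out of the 32-byte concatenation of a (left) and b (right). */
void model_vperm(uint8_t d[16], const uint8_t a[16], const uint8_t b[16],
                 const uint8_t ctl[16])
{
    uint8_t out[16];
    for (int i = 0; i < 16; i++) {
        unsigned sel = ctl[i] & 0x1Fu;
        out[i] = sel < 16u ? a[sel] : b[sel - 16u];
    }
    memcpy(d, out, 16);
}

/* Unaligned 16-byte load. mem must extend at least 16 bytes
   beyond the aligned block that contains ea. */
void model_unaligned_load(uint8_t d[16], const uint8_t *mem, uint32_t ea)
{
    uint32_t base = ea & ~0xFu;   /* what lvx does to the address */
    uint8_t ctl[16];
    model_lvsl(ctl, ea);
    model_vperm(d, mem + base, mem + base + 16, ctl);
}
```

With an address whose low 4 bits are 5, the permutation is 5, 6, ..., 20, so the result is bytes 5 to 15 of the first block followed by bytes 0 to 4 of the second.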

Comparisons
Let's talk a bit about integer comparisons.

For a start, the comparison instructions (both integer and floating point) are the only vector instructions that have a record bit, and can therefore set the condition register. Furthermore, the record form can only affect CR6.

`vcmpgtub[.] vD,vA,vB ; compare greater than unsigned byte (vA>vB)`
`vcmpgtuh[.] vD,vA,vB ; compare greater than unsigned half`
`vcmpgtuw[.] vD,vA,vB ; compare greater than unsigned word`

`vcmpgtsb[.] vD,vA,vB ; compare greater than signed byte`
`vcmpgtsh[.] vD,vA,vB ; compare greater than signed half`
`vcmpgtsw[.] vD,vA,vB ; compare greater than signed word`

`vcmpequb[.] vD,vA,vB ; compare equal to unsigned byte`
`vcmpequh[.] vD,vA,vB ; compare equal to unsigned half`
`vcmpequw[.] vD,vA,vB ; compare equal to unsigned word`

The comparison instructions take two input vectors, and produce an output "true or false for each element" vector. True is represented by all 1's in the bits of the element, and false by all 0's.

The record bit gives an all-elements-combined result in CR6. If all elements are true then bit CR6[0] is set; if all elements are false then bit CR6[2] is set. The other bits, CR6[1] and CR6[3], are not used and are set to 0.

Notice: there are no vector instructions for other conditions, but with these, in addition to inverting result vectors, it is possible to synthesise any condition required (see PEM).
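The result vector and the CR6 summary can be modelled in plain C as below. This is a sketch for the unsigned word compare only; the function name and the way the 4-bit CR6 field is returned are our own choices for the example.

```c
#include <stdint.h>

/* Model of the recording form of vcmpgtuw on four unsigned words.
   Fills d with all-ones or all-zeros elements and returns the 4-bit CR6
   field, written with CR6[0] as the most significant of the four bits. */
unsigned model_vcmpgtuw_rc(uint32_t d[4], const uint32_t a[4], const uint32_t b[4])
{
    int all_true = 1, all_false = 1;
    for (int i = 0; i < 4; i++) {
        int gt = a[i] > b[i];
        d[i] = gt ? 0xFFFFFFFFu : 0u;
        if (gt) all_false = 0; else all_true = 0;
    }
    /* CR6[0] = all true, CR6[1] = 0, CR6[2] = all false, CR6[3] = 0 */
    return (unsigned)(all_true << 3) | (unsigned)(all_false << 1);
}
```

A mixed result (some elements true, some false) leaves both bits clear, so a single conditional branch on CR6 can test for "all" or "none" without looking at the result vector.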

Other Arithmetic instructions.

It's time to go over several arithmetic instructions. Because we will just be giving a brief overview, we suggest you go back and re-read the section entitled "A simple AltiVec add instruction" which contains the basic rules for arithmetic instructions.

Subtract

`vsubuwm vD,vA,vB ; vector subtract unsigned word modulo (both for signed and unsigned)`
`vsubuws vD,vA,vB ; vector subtract unsigned word saturate`
`vsubsws vD,vA,vB ; vector subtract signed word saturate`
`vsubcuw vD,vA,vB ; vector subtract carry-out unsigned word`
These subtract instructions are all available in byte and half sizes as well (apart from the carry-out form, which is word only), and work identically to the add instructions (except they subtract rather than adding :-)

Multiplies

`vmuloub vD,vA,vB ; vector multiply odd unsigned byte`
`vmulouh vD,vA,vB ; vector multiply odd unsigned half`
`vmulosb vD,vA,vB ; vector multiply odd signed byte`
`vmulosh vD,vA,vB ; vector multiply odd signed half`
`vmuleub vD,vA,vB ; vector multiply even unsigned byte`
`vmuleuh vD,vA,vB ; vector multiply even unsigned half`
`vmulesb vD,vA,vB ; vector multiply even signed byte`
`vmulesh vD,vA,vB ; vector multiply even signed half`

These are simpler than they appear. Three rules:-

1. Signed/Unsigned - defines the sign of the two source operands and the destination.

2. The destination is a size larger than the operands.
byte operands = half destination.
half operands = word destination.

3. Even/Odd - Chooses which of the source elements are chosen as operands from the vector. Odd = element 1, 3, 5, etc. Even = element 0, 2, 4, etc. Remember elements, like bits, are numbered from MSB (most significant bit) to LSB (least significant bit). If you write the number down in traditional text format, that's from left to right.
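The three rules can be seen together in this plain C model of the unsigned byte forms. It is an illustrative sketch with invented function names; arrays are indexed with element 0 as the leftmost (most significant) element.

```c
#include <stdint.h>

/* vmuleub model: multiply the even-numbered bytes (0, 2, 4, ...) of a and b,
   giving eight 16-bit products. */
void model_vmuleub(uint16_t d[8], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++)
        d[i] = (uint16_t)((unsigned)a[2 * i] * (unsigned)b[2 * i]);
}

/* vmuloub model: the same, using the odd-numbered bytes (1, 3, 5, ...). */
void model_vmuloub(uint16_t d[8], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++)
        d[i] = (uint16_t)((unsigned)a[2 * i + 1] * (unsigned)b[2 * i + 1]);
}
```

Because the destination elements are twice the size of the operands, the largest possible product (255 x 255 = 65025) always fits, so no saturate or modulo choice is needed.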

`vmhaddshs vD,vA,vB,vC ; vector multiply-high and add signed half-word saturate`
`vmhraddshs vD,vA,vB,vC ; vector multiply-high round and add signed half-word saturate`
`vmladduhm vD,vA,vB,vC ; vector multiply-low and add unsigned half-word modulo`

The key points are (a) they are multiplies for half-words, and (b) they do an add afterwards.

Basically these instructions form a way of multiplying two sets of 8 half-words, and adding a vector. The vmladduhm deals with the LOW part (16 bits) of the result (since two halfs multiplied together really need a word to store), and therefore can use either signed or unsigned data.

The HIGH instructions ignore the low part, and store the high part. The rounding version is used if you are not interested in the low part, and will round up by adding 0x00004000 (half of the least significant bit of the destination once reduced to 16 bits).

These instructions obviously use the saturate mode, but it's when ADDing that this comes into play. "vC" is effectively added to the destination once it has been reduced to 16 bits.

The next set are also multiply instructions, but this time it's multiply, then sum across, then add. Take heed of the saturate and modulo (last character in the assembler mnemonic) however!

`vmsumubm vD,vA,vB,vC ; vector multiply sum unsigned byte modulo`
`vmsumuhm vD,vA,vB,vC ; vector multiply sum unsigned half modulo`
`vmsumshs vD,vA,vB,vC ; vector multiply sum signed half saturate`
`vmsumuhs vD,vA,vB,vC ; vector multiply sum unsigned half saturate`
`vmsummbm vD,vA,vB,vC ; vector multiply sum mixed-sign byte modulo`
`vmsumshm vD,vA,vB,vC ; vector multiply sum signed half modulo`

These all do fairly similar things.
(a) They group the vector into four sets of 2 (if half) or 4 (if byte) elements
(b) They deal with _each_of_the_four_sets_the_same, and...
(c) They all end up with four sets of words in the destination vector

Using this configuration they:-
(d) They multiply the elements (half/byte) between "vA" and "vB" and then...
(e) SUM all element results TOGETHER (as a word)
(f) finally add the respective word to the same word in "vC"
(g) and store it in "vD"

The only other thing to watch is that the MIXED BYTE form assumes signed in "vA" and unsigned in "vB" and adds and stores as signed.
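Steps (a) to (g) can be written out in plain C for the unsigned half modulo form. This is a sketch of the behaviour described above, with an invented function name; element 0 is the leftmost element.

```c
#include <stdint.h>

/* vmsumuhm model: for each of the four words, multiply the two pairs of
   halfwords that sit in that word position, sum the products, add the
   matching word of c, and keep the low 32 bits (modulo). */
void model_vmsumuhm(uint32_t d[4], const uint16_t a[8], const uint16_t b[8],
                    const uint32_t c[4])
{
    for (int i = 0; i < 4; i++) {
        /* cast first so the multiply is done in 32-bit unsigned arithmetic */
        uint32_t p0 = (uint32_t)a[2 * i]     * (uint32_t)b[2 * i];
        uint32_t p1 = (uint32_t)a[2 * i + 1] * (uint32_t)b[2 * i + 1];
        d[i] = p0 + p1 + c[i];   /* wraps modulo 2^32 */
    }
}
```

Feeding the result back in as the "vC" operand of the next call gives a running dot product, which is the typical use of this group.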

The sum instructions
NOTE: word 0 is bits 0-31, word 1 is bits 32-63, word 2 is bits 64-95, word 3 is bits 96-127.
half 0 is bits 0-15, half 1 is bits 16-31, etc.
byte 0 is bits 0-7, byte 1 is bits 8-15, etc.

`vsumsws vD,vA,vB ; vector sum across signed word saturate`
This sums all of the words in "vA" and word 3 (bits 96-127) of "vB" and places the result into word 3 of "vD", zeroing the other words in "vD".

`vsum2sws vD,vA,vB ; vector sum across partial (1/2) signed word saturate`
Performs the following:-
vD[word 1] = vA[word 0] + vA[word 1] + vB[word 1]
vD[word 3] = vA[word 2] + vA[word 3] + vB[word 3]

`vsum4ubs vD,vA,vB ; vector sum across partial (1/4) unsigned byte saturate`
`vsum4sbs vD,vA,vB ; vector sum across partial (1/4) signed byte saturate`
These perform the following (taking care of the relevant sign):-
vD[word 0] = vA[byte 0] + vA[byte 1] + vA[byte 2] + vA[byte 3] +vB[word 0]
vD[word 1] = vA[byte 4] + vA[byte 5] + vA[byte 6] + vA[byte 7] +vB[word 1]
vD[word 2] = vA[byte 8] + vA[byte 9] + vA[byte 10] + vA[byte 11] +vB[word 2]
vD[word 3] = vA[byte 12] + vA[byte 13] + vA[byte 14] + vA[byte 15] +vB[word 3]

`vsum4shs vD,vA,vB ; vector sum across partial (1/4) signed half saturate`
vD[word 0] = vA[half 0] + vA[half 1] +vB[word 0]
vD[word 1] = vA[half 2] + vA[half 3] +vB[word 1]
vD[word 2] = vA[half 4] + vA[half 5] +vB[word 2]
vD[word 3] = vA[half 6] + vA[half 7] +vB[word 3]

`vavgub vD,vA,vB ; vector average unsigned byte`
`vavguh vD,vA,vB ; vector average unsigned half`
`vavguw vD,vA,vB ; vector average unsigned word`
`vavgsb vD,vA,vB ; vector average signed byte`
`vavgsh vD,vA,vB ; vector average signed half`
`vavgsw vD,vA,vB ; vector average signed word`
These instructions add vector A to vector B then add on 1, for each element. They then store the high-order result (in effect dividing the result by 2), so that an average value between the two is achieved.
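The "add, add 1, then halve" behaviour of the average instructions is shown for one byte element in this plain C sketch. The function names are invented, and the signed version halves by rounding toward minus infinity, which is our reading of "store the high-order result".

```c
#include <stdint.h>

/* One element of vavgub: (a + b + 1) / 2 computed without losing the carry. */
uint8_t lane_avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* One element of vavgsb: the same, on signed values.
   The halving is written as a floor division so it does not depend on
   how the C compiler shifts negative numbers. */
int8_t lane_avg_s8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b + 1;
    int half = sum >= 0 ? sum / 2 : -((-sum + 1) / 2);
    return (int8_t)half;
}
```

The extra 1 means that an exact half rounds upward, so the average of 1 and 2 comes out as 2.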

`vmaxub vD,vA,vB ; vector maximum unsigned byte`
`vmaxuh vD,vA,vB ; vector maximum unsigned half`
`vmaxuw vD,vA,vB ; vector maximum unsigned word`
`vmaxsb vD,vA,vB ; vector maximum signed byte`
`vmaxsh vD,vA,vB ; vector maximum signed half`
`vmaxsw vD,vA,vB ; vector maximum signed word`
For each element place the maximum from either "vA" or "vB" into "vD".

`vminub vD,vA,vB ; vector minimum unsigned byte`
`vminuh vD,vA,vB ; vector minimum unsigned half`
`vminuw vD,vA,vB ; vector minimum unsigned word`
`vminsb vD,vA,vB ; vector minimum signed byte`
`vminsh vD,vA,vB ; vector minimum signed half`
`vminsw vD,vA,vB ; vector minimum signed word`
For each element place the minimum from "vA" and "vB" into "vD".

And some Logical instructions

These are all similar - they perform logical operations on vector data held in two source registers and place the result in the destination register.
`vand vD,vA,vB ; logical AND`
`vor vD,vA,vB ; logical OR`
`vxor vD,vA,vB ; logical exclusive OR`
`vandc vD,vA,vB ; logical AND vA with the complement of vB`
`vnor vD,vA,vB ; logical NOR`

Packing, Unpacking, Merge, splat and permute

Vector Pack instructions. These truncate either
a). half words to bytes
or
b). words to halfwords.

The instructions take the data from two vector registers and place the result in the destination vector register.
vpkuhum, vpkuhus, vpkshus, vpkshss - truncate the sixteen half words from two concatenated source operands producing a single result of sixteen bytes.
E.G. `vpkuhum vD,vA,vB`

vpkuwum, vpkuwus, vpkswus, vpkswss - truncate the eight words from two concatenated source operands producing a single result of eight half words
E.G. `vpkuwum vD,vA,vB`

A special form of pack, vpkpx, is provided that packs eight 32-bit (8/8/8/8) pixels from two concatenated source operands into a single result of eight 16-bit 1/5/5/5 aRGB pixels.

Vector Unpack Instructions. These are the opposite of Pack. They expand either byte or halfword sized data to either halfwords or words. Because the PowerPC instruction set never allows more than one destination, the instructions are split into low order and high order operations.

vupkhsb,vupkhsh - Vector Unpack High Signed Integer. Each signed integer element in the high order 8 bytes of vB is sign extended to fill the MSBs in a signed integer and then is placed into vD.
E.G. `vupkhsb vD,vB`

vupklsb,vupklsh - Vector Unpack Low Signed Integer. Each signed integer element in the low order 8 bytes of vB is sign extended to fill the MSBs in a signed integer and then is placed into vD.
E.G. `vupklsh vD,vB`

A special purpose form of vector unpack is provided, the Vector Unpack Low Pixel (vupklpx) and the Vector Unpack High Pixel (vupkhpx) instructions for conversion of 1/5/5/5 aRGB pixels to four 32 bit pixels.

Vector Merge Instructions

Vector Merge High Integer

`vmrghb vD,vA,vB`
`vmrghh vD,vA,vB`
`vmrghw vD,vA,vB`

The elements in the high order 8 bytes of vA and vB are interleaved into vD. Element 0 of vA goes into element 0 of vD, element 0 of vB into element 1 of vD, element 1 of vA into element 2 of vD, and so on, so vA supplies the even-numbered elements and vB the odd-numbered ones.

Vector Merge Low Integer
`vmrglb vD,vA,vB`
`vmrglh vD,vA,vB`
`vmrglw vD,vA,vB`

The elements in the low order 8 bytes of vA and vB are interleaved into vD in the same fashion: the first low-half element of vA goes into element 0 of vD, the first low-half element of vB into element 1 of vD, and so on.
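The interleave performed by the byte merges can be modelled in plain C as follows. This is a sketch with invented function names; element 0 is the leftmost (most significant) byte.

```c
#include <stdint.h>
#include <string.h>

/* vmrghb model: interleave the high-order 8 bytes of a and b. */
void model_vmrghb(uint8_t d[16], const uint8_t a[16], const uint8_t b[16])
{
    uint8_t out[16];
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = a[i];   /* a supplies the even-numbered bytes */
        out[2 * i + 1] = b[i];   /* b supplies the odd-numbered bytes  */
    }
    memcpy(d, out, 16);          /* copy last so d may alias a or b */
}

/* vmrglb model: the same interleave, using the low-order 8 bytes. */
void model_vmrglb(uint8_t d[16], const uint8_t a[16], const uint8_t b[16])
{
    uint8_t out[16];
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = a[i + 8];
        out[2 * i + 1] = b[i + 8];
    }
    memcpy(d, out, 16);
}
```

Merging a vector of bytes with a vector of zeros (zeros as the first operand) is a common way of widening unsigned bytes to halfwords.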

Vector Splat Instructions

These instructions allow you to load immediate data into all elements of a vector register. This is generally used where some constant value is needed in an arithmetic operation.

Vector Splat Integer
`vspltb vD,vB,UIMM`
`vsplth vD,vB,UIMM`
`vspltw vD,vB,UIMM`
Replicate the contents of element UIMM (unsigned immediate value) in vB and place into each element in vD.

Vector Splat Immediate Signed Integer
`vspltisb vD,SIMM`
`vspltish vD,SIMM`
`vspltisw vD,SIMM`
Sign-extend the value of the SIMM (signed immediate value) field to the length of the element, replicate that value and place it into each element in vD.

Vector Permute Instruction - vperm

This is a very powerful instruction with applications ranging from data translation to alignment. It allows any byte from either of two source registers to be placed in any byte of the destination register. It takes four vector registers as its operands:

`vperm vD,vA,vB,vC`

vD is the destination register.
vA and vB are the source registers and can be viewed as concatenated together, with vA on the left and vB on the right. The most significant byte of vA is byte number 0x00. The least significant byte of vB is byte number 0x1F.

The bytes of vC specify which bytes from vA and vB are copied and placed into vD from left to right (only the low 5 bits of each byte of vC are needed to select one of the 32 source bytes). Thus if the most significant byte of vC is 0x10, then the most significant byte of vD is filled with the most significant byte of vB.

Shifting, Rotating

Vector Shift Left by Octet and Vector Shift Right by Octet
(Why not just call it shift by byte?)

`vslo, vsro`
These instructions shift a vector register left or right by 0 to 15 bytes. The byte count is taken from bits 121-124 of another vector register (the four bits just above its 3 least significant bits):
`vslo vD,vA,vB`
vA is shifted left by the number of bytes specified in vB and the result stored in vD

Vector Shift Left, Vector Shift Right

`vsl,vsr`
These instructions shift a vector register left or right by up to 7 bits.
`vsr vD,vA,vB`
The shift count is specified in vB. Every byte in vB must have the same shift count, else the result is undefined.

A pair of vslo and vsl or vsro and vsr instructions can be used to shift the contents of a vector register left or right by the number of bits specified in the shift count register (0-127):
`vslo vZ,vX,vY`
`vsl vZ,vZ,vY`
The PEM gives other examples.

Vector Rotate Left

`vrlb,vrlh,vrlw`
These rotate each element of vA left by the number of bits specified by the low order bits of each element in vB.

`vrlb vD,vA,vB`
(Rotate right can be achieved with a rotate left.)

Vector Shift Left Byte, Vector Shift Right Byte, Vector Shift Left Half, Vector Shift Right Half, Vector Shift Left Word, Vector Shift Right Word
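The rotate-right trick relies on the fact that rotating an element right by n bits is the same as rotating it left by the element width minus n. A plain C sketch for one byte element, with invented function names:

```c
#include <stdint.h>

/* One element of vrlb: rotate a byte left by the low 3 bits of n. */
uint8_t lane_rotl8(uint8_t x, unsigned n)
{
    n &= 7u;
    return (uint8_t)((x << n) | (x >> ((8u - n) & 7u)));
}

/* Rotate right by n, expressed as a rotate left by (8 - n) modulo 8. */
uint8_t lane_rotr8(uint8_t x, unsigned n)
{
    return lane_rotl8(x, (8u - (n & 7u)) & 7u);
}
```

In vector code this means building a count vector holding the width minus n in each element and then using the ordinary rotate left.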
These instructions shift elements of a given size as integers. The source data is vA, the shift count is in vB.

`vslb vD,vA,vB`
Each byte of vA is shifted left by the number of bits specified in vB.

All of the above shifts also have rotate versions, for example, Vector Rotate Left Half - vrlh.

There are also algebraic forms of vector shift right - vsrab is vector shift right algebraic byte. In this case the sign bit (the most significant bit of the element) is replicated into the bits vacated by the shift.

A useful shift is Vector Shift Left Double by Octet Immediate.

`vsldoi vD,vA,vB,SH`
Concatenate vA (on the left) and vB (on the right) into a 32-byte value, shift it left by SH bytes (SH is a 4-bit immediate, 0 to 15), and place the leftmost 16 bytes of the result in vD. The PEM shows various coding options to allow vsldoi to perform various shifts and rotates.
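A plain C model of vsldoi makes the "double" part clearer: the two source registers are treated as one 32-byte value and a 16-byte window is taken out of it. This is a sketch with an invented function name; element 0 is the leftmost byte.

```c
#include <stdint.h>
#include <string.h>

/* vsldoi model: take 16 bytes starting at byte offset sh (0..15)
   from the 32-byte concatenation of a (left) and b (right). */
void model_vsldoi(uint8_t d[16], const uint8_t a[16], const uint8_t b[16],
                  unsigned sh)
{
    uint8_t cat[32];
    memcpy(cat, a, 16);
    memcpy(cat + 16, b, 16);
    memcpy(d, cat + (sh & 0xFu), 16);   /* cat is a copy, so d may alias a or b */
}
```

Using the same register for both sources turns the instruction into a rotate of the whole vector by SH bytes.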

Floating point instructions

The vector floating-point arithmetic instructions are split into these groups:

Vector floating-point arithmetic instructions
Vector floating-point multiply/add instructions
Vector floating-point rounding and conversion instructions
Vector floating-point compare instructions
Vector floating-point estimate instructions

Floating Point arithmetic

Vector Add Floating Point - Add the four 32 bit floating point elements in vA with the four 32 bit floating point elements in vB. Round the results to the nearest single precision floating point results and store in vD
`vaddfp vD,vA,vB`

Vector Subtract Floating Point - Subtract the four 32 bit floating point elements in vB from the four 32 bit floating point elements in vA. Round the results to the nearest single precision floating point results and store in vD
`vsubfp vD,vA,vB`

Vector Maximum Floating Point - For each pair of elements select the largest and store in vD
`vmaxfp vD,vA,vB`

Vector Minimum Floating Point - For each pair of elements select the smallest and store in vD
`vminfp vD,vA,vB`

FP multiply-add instructions

The full product from the multiply stage is used in the add operation and then the result is rounded. This leads to greater accuracy.

To perform a simple multiply with AltiVec use zero as the add operand. Floating point division and square root operations require multiple instructions and are detailed in the PEM.

Division is accomplished by multiplying the dividend by an estimate of the reciprocal of the divisor (x/y = x * 1/y), and square root by multiplying the original number by an estimate of its reciprocal square root (sqrt(x) = x * 1/sqrt(x)).
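The estimates on their own are only approximate, so they are normally refined before use. The plain C sketch below shows one Newton-Raphson refinement step for each case. We are assuming this is the kind of refinement the PEM has in mind; the function names are invented, and the initial estimate is passed in by hand where real code would get it from the estimate instructions.

```c
/* One Newton-Raphson step towards 1/y: est' = est + est * (1 - y * est). */
float refine_reciprocal(float y, float est)
{
    return est + est * (1.0f - y * est);
}

/* One Newton-Raphson step towards 1/sqrt(x):
   est' = est + 0.5 * est * (1 - x * est * est). */
float refine_rsqrt(float x, float est)
{
    return est + 0.5f * est * (1.0f - x * est * est);
}

/* x / y as x * (1/y), using two refinement steps on a rough estimate. */
float divide_via_reciprocal(float x, float y, float est)
{
    est = refine_reciprocal(y, est);
    est = refine_reciprocal(y, est);
    return x * est;
}
```

Each step is a multiply, a subtract from a constant and a multiply-add, which maps naturally onto vnmsubfp and vmaddfp.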

Vector Multiply Add Floating Point - vmaddfp
Multiply the four word floating point elements of vA with the four word floating point elements of vC. Add the four word floating point elements of vB to the intermediate result and store the results in vD
`vmaddfp vD,vA,vC,vB`

Vector Negative Multiply Subtract Floating Point - vnmsubfp
Multiply the four word floating point elements of vA with the four word floating point elements of vC. Subtract the four word floating point elements of vB from the intermediate result and invert the sign. Store the results in vD
`vnmsubfp vD,vA,vC,vB`

Floating Point Rounding Instructions

All these rounding instructions round the four single-precision floating point values held in vB to integral values (still in floating point format) and store the results in vD

Vector Round to Floating Point Integer Nearest
`vrfin vD,vB`

Vector Round to Floating Point Integer toward zero
`vrfiz vD,vB`

Vector Round to Floating Point Integer toward Positive Infinity
`vrfip vD,vB`

Vector Round to Floating Point Integer toward Minus Infinity
`vrfim vD,vB`

Floating Point Conversion Instructions

Integer to floating point

Vector Convert from Unsigned Fixed Point Word
`vcfux vD,vB,UIMM`
Convert each of the four unsigned integer words in vB to the nearest single precision value. Divide the result by 2 to the power of UIMM and place the results in vD.

Vector Convert from Signed Fixed Point Word
`vcfsx vD,vB,UIMM`
Convert each of the four signed integer words in vB to the nearest single precision value. Divide the result by 2 to the power of UIMM and place the results in vD.

Floating point to integer

Vector Convert to Unsigned Fixed Point Word Saturate
`vctuxs vD,vB,UIMM`
Multiply each of the four single precision word elements of vB by 2 to the power of UIMM. Now convert the products to unsigned fixed point integers using the round towards zero mode with saturate. Place the results in vD.

Vector Convert to Signed Fixed Point Word Saturate
`vctsxs vD,vB,UIMM`
Multiply each of the four single precision word elements of vB by 2 to the power of UIMM. Now convert the products to signed fixed point integers using the round towards zero mode with saturate. Place the results in vD.
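The scaling by 2 to the power of UIMM is what makes these fixed point rather than plain integer conversions. Here is a plain C sketch of one element of the signed pair. The function names are invented, and the treatment of a NaN input (converted to 0) is an assumption you should check against the PEM.

```c
#include <stdint.h>

/* One element of vcfsx: signed word to float, divided by 2^uimm (uimm 0..31). */
float model_vcfsx(int32_t x, unsigned uimm)
{
    return (float)x / (float)(UINT32_C(1) << (uimm & 31u));
}

/* One element of vctsxs: float times 2^uimm, truncated toward zero,
   saturated to the signed 32-bit range. */
int32_t model_vctsxs(float x, unsigned uimm)
{
    float v = x * (float)(UINT32_C(1) << (uimm & 31u));
    if (v != v) return 0;                        /* NaN: assumed to give 0 */
    if (v >= 2147483648.0f) return INT32_MAX;    /* too large: saturate */
    if (v <= -2147483648.0f) return INT32_MIN;   /* too small: saturate */
    return (int32_t)v;                           /* C casts truncate toward zero */
}
```

With UIMM set to 8, for instance, the integer 256 converts to 1.0, so the integer is being treated as a value with 8 fractional bits.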

Floating Point Compares

All these instructions can accept the record, or dotted form. In this case cr6 is set as per the integer compare instructions.

Vector Compare Greater Than Floating-Point [Record]
`vcmpgtfp[.] vD,vA,vB`
Compare each of the 4 single-precision word elements in vA to the corresponding 4 single-precision word elements in vB. For each element, if vA > vB then set the corresponding element in vD to all 1's, otherwise clear the element in vD to all 0's. If the record bit (Rc = 1) is set in the vector compare instruction, then:
if every element of vD is all 1's (all elements true), CR6[0] is set
if every element of vD is all 0's (all elements false), CR6[2] is set

Vector Compare Equal to Floating-Point [Record]
`vcmpeqfp[.] vD,vA,vB`
Compare each of the 4 single-precision word elements in vA to the corresponding 4 single-precision word elements in vB. For each element, if vA = vB then set the corresponding element in vD to all 1's, otherwise clear the element in vD to all 0's. If the record bit (Rc = 1) is set in the vector compare instruction, then:
if every element of vD is all 1's (all elements true), CR6[0] is set
if every element of vD is all 0's (all elements false), CR6[2] is set

Vector Compare Greater Than or Equal to Floating-Point [Record]
`vcmpgefp[.] vD,vA,vB`
Compare each of the 4 single-precision word elements in vA to the corresponding 4 single-precision word elements in vB. For each element, if vA >= vB then set the corresponding element in vD to all 1's, otherwise clear the element in vD to all 0's.
If the record bit (Rc = 1) is set in the vector compare instruction, then:
if every element of vD is all 1's (all elements true), CR6[0] is set
if every element of vD is all 0's (all elements false), CR6[2] is set

Vector Compare Bounds Floating-Point [Record]
`vcmpbfp[.] vD,vA,vB`
Compare each of the 4 single-precision word elements in vA to the corresponding single-precision word elements in vB. A 2-bit value is formed that indicates whether the element in vA is within the bounds specified by the element in vB, as follows.
Bit 0 of the two-bit value is cleared to 0 if the element in vA is <= to the element in vB, and is set otherwise.
Bit 1 of the two-bit value is cleared to 0 if the element in vA is >= to the negation of the element in vB, and is set otherwise.
The two-bit value is placed into the high-order two bits of the corresponding word element of vD and the remaining bits of the element are cleared to 0.

If Rc=1, CR6[2] is set to 1 when all four elements in vA are within the bounds specified by the corresponding element in vB
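The bounds compare is easier to follow when written out for one element. The plain C sketch below follows the description above; the function names are invented, and bit 0 is shown as the most significant bit of the 32-bit result word.

```c
#include <stdint.h>

/* One element of vcmpbfp: returns the result word for a compared against
   the bounds -b .. +b. Zero means "within bounds". */
uint32_t lane_vcmpbfp(float a, float b)
{
    uint32_t r = 0;
    if (!(a <= b))  r |= 0x80000000u;   /* bit 0: set unless a <= b  */
    if (!(a >= -b)) r |= 0x40000000u;   /* bit 1: set unless a >= -b */
    return r;
}

/* Summary used for CR6[2] in the record form: 1 if all four elements
   are within bounds, otherwise 0. */
int all_within_bounds(const float a[4], const float b[4])
{
    for (int i = 0; i < 4; i++)
        if (lane_vcmpbfp(a[i], b[i]) != 0)
            return 0;
    return 1;
}
```

Written this way, a NaN in either operand fails both tests and sets both bits, since neither comparison can be true.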

Floating Point Estimate Instructions

Vector Reciprocal Estimate Floating-Point
`vrefp vD,vB`
Place estimates of the reciprocal of each of the four word floating-point source elements in vB in the corresponding four word elements in vD. Can be used to perform a divide in conjunction with multiply.

Vector Reciprocal Square Root Estimate Floating-Point
`vrsqrtefp vD,vB`
Place estimates of the reciprocal square-root of each of the four word source elements in vB in the corresponding four word elements in vD. Can be used to perform square root in conjunction with multiply.

Vector Log2 Estimate Floating-Point
`vlogefp vD,vB`
Place estimates of the base 2 logarithm of each of the four word source elements in vB in the corresponding four word elements in vD.

Vector 2 Raised to the Exponent Estimate Floating-Point
`vexptefp vD,vB`
Place estimates of 2 raised to the power of each of the four word source elements in vB in the corresponding four word elements in vD.

Misc Instructions

`mtvscr vB`
Place the contents of the lower 32 bits of vB into the Vector Status and Control register

`mfvscr vD`
Place the contents of the Vector Status and Control register into the lower 32 bits of vD

`mfvrsave rD`
Place the contents of VRSAVE in rD

`mtvrsave rB`
Place the contents of rB in VRSAVE

Data stream instructions and the Cache

The Data stream instructions allow you to let the processor know where to fetch vector data from. These minimise stalls whilst data is brought into the processor.

Why? Let's imagine we are processing a whole chunk of data. Probably that data will not be in the cache, and so it will have to be fetched from main memory. This can be really slow, and in certain cases can negate the advantage of having vector instructions. Hence, you can see the importance of getting the data into the cache.

Where to get more information

There are several documents that are available from Motorola (www.motorola.com) and Apple (www.apple.com) that give more information.

For introduction purposes, the AltiVec Fact Sheet, and AltiVec White Paper give some ideas on the basic concepts.

For a reference, the AltiVec Technology PEM (programming environments manual) is fairly good.

The AltiVec emulator is currently available from Apple, as well as MrC, the PowerPC compiler, and the PowerMac debugger.

Lightsoft's Assembler - Fantasm 5 - also supports AltiVec instructions.

All of these references and more are available on Lightsoft's AltiVec page.