Frame   Angle on a 386   Angle on a pentium
  1            0                 0
  2            3                 1
  3            6                 2
  4            9                 3
  5            -                 4
  6            -                 5
  7            -                 6
  8            -                 7
  9            -                 8
 10            -                 9

Why? A pentium is fast enough to draw 10 fps where a 386 can only manage four. Now your object rotates at the same speed on all computers, just more smoothly on a pentium than on a 386.
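In code, the idea is simply to derive the angle from elapsed time instead of from the number of frames drawn. A minimal sketch in portable C (using the standard clock(); draw_object and the 90 degrees/second rate are made up for the example):

#include <time.h>

#define DEGREES_PER_SECOND 90.0        /* how fast the object should spin */

int main(void)
{
    clock_t start = clock();
    double  angle;

    for (int frame = 0; frame < 1000; frame++) {
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        angle = DEGREES_PER_SECOND * elapsed;  /* same speed on a 386 and a pentium */
        /* draw_object(angle);   -- a fast machine just gets more frames in */
    }
    return 0;
}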
Another technique: create an interrupt which updates all the variables 70 times per second, and let the routines draw as fast as they can without going over that 70 fps. Not bad, this one, either.
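Without writing an actual interrupt handler, the same idea can be sketched in portable C: a tick counter stands in for the 70 Hz interrupt, the variables are advanced once per tick, and nothing is drawn more often than the ticks arrive (the 70 is from the text above, everything else is my own guess):

#include <time.h>

#define TICK_RATE 70                   /* the interrupt would fire 70 times per second */

static double angle = 0.0;

static void update_variables(void)     /* what the 70 Hz handler would do */
{
    angle += 1.0;
}

int main(void)
{
    clock_t start = clock();
    long    ticks_done = 0;

    for (int frame = 0; frame < 1000; frame++) {
        long ticks_now = (long)((clock() - start) * (double)TICK_RATE / CLOCKS_PER_SEC);
        if (ticks_now > ticks_done) {
            while (ticks_done < ticks_now) {   /* catch up if a frame took long */
                update_variables();
                ticks_done++;
            }
            /* draw_object(angle);   -- drawn at most TICK_RATE times per second */
        }
    }
    return 0;
}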
(text debugged by Chem and translated by Ica)
Optimizing assembly code became quite complicated when the pentium came out. This piece of text was written to clarify pentium tricks and to show how to produce code that can be even twice as fast as you might think possible.
The pentium introduced the PC world to the concepts of pairing and a fast math coprocessor. Now I'll explain how pairing works and how you can take advantage of it in your own programs.
There are actually two parallel pipelines inside a pentium processor. Only one of them is complete: it is called the U pipe; the other, less complete one is the V pipe. The V pipe can only execute jumps and some basic instructions like mov, add and lea. Pairing means the two pipes executing instructions during the same clock cycle, and it only happens under special circumstances.
In the examples we assume that the situation is neutral at the beginning of each series of instructions, i.e. that the first instruction goes into the U pipe (this can be arranged by placing an sti before the series).
Now the theory should be quite clear, so it's time to try optimizing a real piece of code. In the following example we'll optimize a slow line-drawing inner loop.
; Clocks Pipe Pairing Comment
@@inner:
lea edi,[edx+edx*4] ; 1 U UV edi=edx*5
mov bl,ah ; 0 V UV ebx=ax/256
shl edi,6 ; 1 U NU edi*=64
add edi,ebx ; 1 U UV edi+=ebx
add edi,0a0000h ; 0 U UV edi+=screenstart
mov [edi],b 10d ; 1? U UV [edi]=10
add edx,[yp] ; 0? V UV edx+=[yp]
add eax,[xp] ; 1? U UV eax+=[xp]
dec ecx ; 0 U UV ecx-=1
jnz short @@inner ; 1 V NV jump if not zero
                             ; 6? clocks in total
Normally we use pmode in flat mode, which means that we must add the start of the screen memory to edi. The jump is best made short because the distance to @@inner is less than 128 bytes; that saves two bytes in the assembled instruction. Now we can get rid of both 'add edi,ebx' and 'add edi,0a0000h' by changing the store to 'mov [edi+ebx+0a0000h],b 10d'. We also arrange the instructions so that they pair whenever possible.
@@inner:
lea edi,[edx+edx*4] ; 1 U UV edi=edx*5
mov bl,ah ; 0 V UV ebx=ax/256
shl edi,6 ; 1 U NU edi*=64
add edx,[yp] ; 0? V UV edx+=[yp]
mov [edi+ebx+0a0000h],b 10d ; 1? U UV []=10
add eax,[xp] ; 0? V UV eax+=[xp]
dec ecx ; 1 U UV ecx-=1
jnz short @@inner ; 0 V NV jump
                             ; 4? clocks in total
The difference in speed is remarkable. The downside is that the code gets messier, but that's a cheap price for speed. Admittedly, the clock counts on the lines where a [variable] is added to a register are very questionable and probably something quite different from zero. In any case, the loop now pairs efficiently. Still, I find it pointless to multiply y by 320 on every iteration. So:
@@inner:
mov bl,ah ; 1 U UV ebx=ax/256
add eax,[xp] ; 0? V UV eax+=[xp]
mov [edi+ebx],b 10d ; 1? U UV []=10
add edi,[yp] ; 0? V UV edi+=[yp]
dec ecx ; 1 U UV ecx-=1
jnz short @@inner ; 0 V NV jump if not zero
                             ; 3? clocks in total
Fast? That is partly an illusion, because many other things affect the speed too, for example cache misses and interrupts, which stop pairing and often cause a cache miss as well. A cache miss is a situation in which the needed data is not found in the processor's internal memory (the level 1 cache) and must be brought in from the external cache (the level 2 cache). This usually costs a couple of extra clock cycles and stops pairing. If the data is not found in the level 2 cache either, it must be fetched from main memory, losing over 10 clocks. These fetch times depend directly on the kind of memory the computer has, and in certain cases the differences can be big: a fetch could take 5 clocks with 60ns multi-access EDO and 15 clocks with ordinary memory. With the level 2 cache, the difference between pipeline burst and some other type can be two clocks. So at this stage we notice that everything really isn't up to the code :(
Luckily we have one trick left: we can't affect the caches directly, but we can speed up memory handling by arranging the data. For example, variables that are used in the same loop should be placed consecutively in memory, in blocks of 32 bytes. This is because the pentium moves data between the cache and normal memory in 32-byte blocks, and such a block is never split. So it's worth aligning both code and data to 32 bytes; especially loops and the variables they use should fit into as few blocks as possible. It's always worth using your imagination when coding critical loops to get the best possible use out of the level 1 cache.
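For example (a sketch only; C11 alignas, and the field names are just my guess at what the line loop's variables might be):

#include <stdalign.h>
#include <stdint.h>

/* All the state the inner loop touches, packed into exactly one 32-byte
   block so a single cache line fetch brings in everything at once.      */
struct line_state {
    int32_t xp;        /* x step in fixed point                     */
    int32_t yp;        /* y step, pre-multiplied by the line width  */
    int32_t x, y;      /* current position                          */
    int32_t count;     /* pixels left to draw                       */
    int32_t pad[3];    /* pad the struct up to 32 bytes             */
};

static alignas(32) struct line_state line;  /* starts on a 32-byte boundary */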
7.3.2.1 An abstract approach
Abstractly, LK works as follows: the values of the picture, i.e. the colors to be quantized, are imagined as spheres in a cube-shaped color space (XYZ = RGB). The more there is of a color, the bigger its sphere. Into this color space we add palette spheres, which can move freely, unlike the color spheres, which don't move at all. The number of palette spheres equals the size of the desired palette (256 colors -> 256 spheres).
We then run the following process: every color sphere pulls the closest palette sphere towards it, and the bigger a color sphere is, the stronger its pull. Now almost all of the palette spheres are pulled by one or more color spheres. (The new coordinates of a palette sphere are calculated as the average of the color values of the color spheres pulling it, in other words the sum of the colors divided by the number of color spheres.) The palette spheres which are not pulled by any color sphere telewarp near some color sphere. The palette spheres keep moving around the space like this until their movement slows below a defined level (trust me, it really does slow down). The new palette can then be read from the coordinates of the palette spheres.
7.3.2.2 A more technical approach
Let's use the following example: we have a truecolor picture which should be reduced to 256 colors. First we create a histogram of the picture. The histogram uses 15-bit (or any other 3*x-bit) color values.
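A sketch of that first step, assuming 24-bit input pixels and 5 bits per component (the names are mine, not from the text):

#include <stdint.h>
#include <string.h>

/* Count how many pixels of the truecolor picture map to each 15-bit color. */
static void build_histogram(const uint8_t *rgb, long pixels, uint32_t hist[32768])
{
    memset(hist, 0, 32768 * sizeof hist[0]);
    for (long i = 0; i < pixels; i++) {
        unsigned r = rgb[i*3 + 0] >> 3;          /* 8 bits -> 5 bits */
        unsigned g = rgb[i*3 + 1] >> 3;
        unsigned b = rgb[i*3 + 2] >> 3;
        hist[(r << 10) | (g << 5) | b]++;        /* 5:5:5 index      */
    }
}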
Next we create another table which lists the colors the picture originally contains. We go through the histogram, and at every point where some color exists (the value is greater than zero), we store the color's amount and value in this new table. The table can, for example, look like this:
unsigned long palette[256][3]; // 3: R,G & B
We also need three other variables:
unsigned long colorSum[256][3]; // 256 colors, 3 = R,G & B
(the following one could be attached to colorSum, too)
unsigned long colorCount[256];
and then a variable in which we store how much the palette changed:
unsigned long variance;
Now we go through the following steps:
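Going by the abstract description above, one round of those steps could look roughly like this. It is only a sketch: the origColor/origCount arrays (the table of the picture's own colors and their amounts) and numColors are my names, not from the text, while palette, colorSum, colorCount and variance are the variables declared above.

extern unsigned long palette[256][3], colorSum[256][3], colorCount[256], variance;

#define PAL_SIZE 256

/* One round: each original color pulls its nearest palette sphere, then
   every palette sphere moves to the average of the colors pulling it.   */
void lk_round(unsigned long origColor[][3], unsigned long origCount[], int numColors)
{
    int i, p, c;

    variance = 0;
    for (p = 0; p < PAL_SIZE; p++) {
        colorCount[p] = 0;
        colorSum[p][0] = colorSum[p][1] = colorSum[p][2] = 0;
    }
    for (i = 0; i < numColors; i++) {
        int best = 0;
        unsigned long bestDist = ~0UL;
        for (p = 0; p < PAL_SIZE; p++) {         /* find the nearest palette sphere */
            long dr = (long)origColor[i][0] - (long)palette[p][0];
            long dg = (long)origColor[i][1] - (long)palette[p][1];
            long db = (long)origColor[i][2] - (long)palette[p][2];
            unsigned long d = dr*dr + dg*dg + db*db;
            if (d < bestDist) { bestDist = d; best = p; }
        }
        for (c = 0; c < 3; c++)                  /* ...and pull it */
            colorSum[best][c] += origColor[i][c] * origCount[i];
        colorCount[best] += origCount[i];
    }
    for (p = 0; p < PAL_SIZE; p++) {
        if (!colorCount[p]) continue;            /* unpulled spheres get telewarped elsewhere */
        for (c = 0; c < 3; c++) {
            unsigned long newc = colorSum[p][c] / colorCount[p];
            variance += (newc > palette[p][c]) ? newc - palette[p][c]
                                               : palette[p][c] - newc;
            palette[p][c] = newc;                /* move to the average */
        }
    }
}   /* repeat until variance drops below a chosen limit */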
Disclaimer: The information here is based on material I've read (can't remember any document names) and some sources I've poked at. As such I can't give any pointers to more information nor do I have the math background for this algorithm.
7.3.3.1 The definition
Since the palette we're going to use is not going to include every color in the colorspace, we only use a "subspace".
Let's say we cut the subspace in two.
The example image may give you a slightly wrong idea of the algorithm: there will be a gap between the new subspaces, and they will most probably shrink in the other dimensions as well.
7.3.3.2 The algorithm
Basically we do the following:
1. Analyze subspaces
You could cut the subspace in half (by leaving as many colors on one side as on the other, or as much color intensity on one side as on the other), but we'll cut it at the median of the color values. You have already calculated the sum of all color values of the component you sorted the subspace by; now you need to find the median, i.e. the position where the running sum of values reaches the middle point of the whole sum.
Example: we have the values 1, 5, 7, 9, 10, 11, 17, 21. Their sum is 81, and half of that is 40.5. To find the median we start accumulating a new sum until it reaches 40.5: 1+5+7+9+10+11=43, so the halfway point is reached at the sixth value; that value starts the second half, so the first half gets 5 values and the second half 3.
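As a sketch (my code, not from the text), the search for that split point in a subspace whose values are already sorted:

/* Find the split point: the running sum of the sorted values reaches half
   of the total inside values[i]; that value and everything after it go to
   the second half (so for 1 5 7 9 10 11 17 21 this returns 5, giving
   halves of 5 and 3 values, as in the example above).                     */
static int median_split(const unsigned long *values, int count)
{
    unsigned long total = 0, running = 0;
    int i;

    for (i = 0; i < count; i++)
        total += values[i];
    for (i = 0; i < count; i++) {
        running += values[i];
        if (2 * running >= total)   /* passed the halfway point (40.5 here) */
            return i;
    }
    return count - 1;               /* degenerate input */
}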
There are some problems, workarounds, and speed issues.
Many pictures tend to have black in them, and on VGA we usually want color 0 to be black so that the border won't annoy us too much. I solved this by separating all black colors from the input values into a separate list and forcing color 0 to be black. Otherwise there will not necessarily be a completely black color at all!
If you wish to remap graphics (sprites, textures, whatever), performing a nearest-color search for every color you feed into the reduction is a bit of a waste. You could instead drag the original color index along with the color itself (so that it gets sorted together with the color and so on), and then, when averaging the subspaces, you'd directly know which color indices should be mapped to that color.
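For instance, the sorted entries could carry the extra field along (a hypothetical struct, not from the text):

/* Each entry drags its origin along, so after a subspace is averaged we
   already know which original colormap entries get that palette color. */
struct color_entry {
    unsigned long rgb[3];     /* the color itself               */
    unsigned long count;      /* how many pixels use it         */
    int           origIndex;  /* index in the original colormap */
};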
Remember not to do things repeatedly without a reason. (After analyzing, store the values and re-analyze only when needed; likewise, sort only when it's really necessary.)
The most time-consuming part of the algorithm (60%+ in my implementation) is the sort. Killing off duplicates might help, but I didn't bother trying that.
My implementation used a singly-linked list for the colors since I wanted it to be dynamic. There were a couple of problems with this approach: allocating and freeing memory for 250000+ colors took more time than the color reduction itself, which was easily solved by allocating enough memory for one whole colormap at a time. Another problem was sorting, which I solved by making a list of pointers and re-linking the list after the sort. Making a static list of colors should be just as easy, and quite probably faster.
I used radix sorting. On a p150 it reduced the colors of a 640*400 picture to 255 in about 4.5 seconds. Coding it took about one day (most of it wasted on debugging).
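For reference, a plain byte-at-a-time (LSD) radix sort over 32-bit keys looks like this; it's a textbook sketch, not the code used above:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sort n 32-bit keys with four counting-sort passes, one per byte. */
static void radix_sort(uint32_t *keys, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;

    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        size_t pos = 0;

        for (size_t i = 0; i < n; i++)            /* histogram of this byte     */
            count[(keys[i] >> shift) & 0xff]++;
        for (int b = 0; b < 256; b++) {           /* turn counts into offsets   */
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }
        for (size_t i = 0; i < n; i++)            /* stable scatter             */
            tmp[count[(keys[i] >> shift) & 0xff]++] = keys[i];
        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
}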