Frame   Angle on a 386   Angle on a pentium
  1            0                 0
  2            3                 1
  3            6                 2
  4            9                 3
  5            -                 4
  6            -                 5
  7            -                 6
  8            -                 7
  9            -                 8
 10            -                 9

Why? A pentium is fast enough to draw 10 fps where a 386 can only manage four. Now your object rotates at the same speed on all computers, just more smoothly on a pentium than on a 386.
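In code, the idea is simply to derive the angle from elapsed time instead of from the number of frames drawn. A minimal sketch in portable C (using the standard clock(); draw_object and the 90 degrees/second rate are made up for the example):

#include <time.h>

#define DEGREES_PER_SECOND 90.0        /* how fast the object should spin */

int main(void)
{
    clock_t start = clock();
    double  angle;

    for (int frame = 0; frame < 1000; frame++) {
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        angle = DEGREES_PER_SECOND * elapsed;  /* same speed on a 386 and a pentium */
        /* draw_object(angle);   -- a fast machine just gets more frames in */
    }
    return 0;
}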
Another technique: create an interrupt which updates all the variables 70 times per second, and let the routines draw as fast as they can without going over that 70 fps. Not bad, this one, either.
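Without writing an actual interrupt handler, the same idea can be sketched in portable C: a tick counter stands in for the 70 Hz interrupt, the variables are advanced once per tick, and nothing is drawn more often than the ticks arrive (the 70 is from the text above, everything else is my own guess):

#include <time.h>

#define TICK_RATE 70                   /* the interrupt would fire 70 times per second */

static double angle = 0.0;

static void update_variables(void)     /* what the 70 Hz handler would do */
{
    angle += 1.0;
}

int main(void)
{
    clock_t start = clock();
    long    ticks_done = 0;

    for (int frame = 0; frame < 1000; frame++) {
        long ticks_now = (long)((clock() - start) * (double)TICK_RATE / CLOCKS_PER_SEC);
        if (ticks_now > ticks_done) {
            while (ticks_done < ticks_now) {   /* catch up if a frame took long */
                update_variables();
                ticks_done++;
            }
            /* draw_object(angle);   -- drawn at most TICK_RATE times per second */
        }
    }
    return 0;
}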
(text debugged by Chem and translated by Ica)
Optimizing assembly code became quite complicated when the pentium came out. This piece of text was written to clarify pentium tricks and to show how to produce code that can be even twice as fast as you might think possible.
The pentium introduced the PC world to the concepts of pairing and a fast math coprocessor. Now I'll explain how pairing works and how you can take advantage of it in your own programs.
There are actually two parallel pipelines inside a pentium processor. Only one of them is complete: it is called the U pipe; the other, less complete one is the V pipe. The V pipe can only execute jumps and some basic instructions like mov, add and lea. Pairing means the two pipes executing instructions during the same clock cycle, and it only happens under special circumstances.
In the examples we assume that the situation is neutral at the beginning of each series of instructions, i.e. that the first instruction goes into the U pipe (this can be arranged by placing an sti before the series).
Now the theory should be quite clear, so it's time to try optimizing a real piece of code. In the following example we'll optimize a slow line-drawing inner loop.
; Clocks Pipe Pairing Comment
@@inner:
lea edi,[edx+edx*4] ; 1 U UV edi=edx*5
mov bl,ah ; 0 V UV ebx=ax/256
shl edi,6 ; 1 U NU edi*=64
add edi,ebx ; 1 U UV edi+=ebx
add edi,0a0000h ; 0 U UV edi+=screenstart
mov [edi],b 10d ; 1? U UV [edi]=10
add edx,[yp] ; 0? V UV edx+=[yp]
add eax,[xp] ; 1? U UV eax+=[xp]
dec ecx ; 0 U UV ecx-=1
jnz short @@inner ; 1 V NV jump if not zero
                             ; 6? clocks in total
Normally we use pmode in flat mode, which means that we must add the start of the screen memory to edi. The jump is best made short because the distance to @@inner is less than 128 bytes; that saves two bytes in the assembled instruction. Now we can get rid of both 'add edi,ebx' and 'add edi,0a0000h' by changing the store to 'mov [edi+ebx+0a0000h],b 10d'. We also arrange the instructions so that they pair whenever possible.
@@inner:
lea edi,[edx+edx*4] ; 1 U UV edi=edx*5
mov bl,ah ; 0 V UV ebx=ax/256
shl edi,6 ; 1 U NU edi*=64
add edx,[yp] ; 0? V UV edx+=[yp]
mov [edi+ebx+0a0000h],b 10d ; 1? U UV []=10
add eax,[xp] ; 0? V UV eax+=[xp]
dec ecx ; 1 U UV ecx-=1
jnz short @@inner ; 0 V NV jump
                             ; 4? clocks in total
The difference in speed is remarkable. The downside is that the code gets messier, but that's a cheap price for speed. Admittedly, the clock counts on the lines where a [variable] is added to a register are very questionable and probably something quite different from zero. In any case, the loop now pairs efficiently. Still, I find it pointless to multiply y by 320 on every iteration. So:
@@inner:
mov bl,ah ; 1 U UV ebx=ax/256
add eax,[xp] ; 0? V UV eax+=[xp]
mov [edi+ebx],b 10d ; 1? U UV []=10
add edi,[yp] ; 0? V UV edi+=[yp]
dec ecx ; 1 U UV ecx-=1
jnz short @@inner ; 0 V NV jump if not zero
                             ; 3? clocks in total
Fast? That is partly an illusion, because many other things affect the speed too, for example cache misses and interrupts, which stop pairing and often cause a cache miss as well. A cache miss is a situation in which the needed data is not found in the processor's internal memory (the level 1 cache) and must be brought in from the external cache (the level 2 cache). This usually costs a couple of extra clock cycles and stops pairing. If the data is not found in the level 2 cache either, it must be fetched from main memory, losing over 10 clocks. These fetch times depend directly on the kind of memory the computer has, and in certain cases the differences can be big: a fetch could take 5 clocks with 60ns multi-access EDO and 15 clocks with ordinary memory. With the level 2 cache, the difference between pipeline burst and some other type can be two clocks. So at this stage we notice that everything really isn't up to the code :(
Luckily we have one trick left: we can't affect the caches directly, but we can speed up memory handling by arranging the data. For example, variables that are used in the same loop should be placed consecutively in memory, in blocks of 32 bytes. This is because the pentium moves data between the cache and normal memory in 32-byte blocks, and such a block is never split. So it's worth aligning both code and data to 32 bytes; especially loops and the variables they use should fit into as few blocks as possible. It's always worth using your imagination when coding critical loops to get the best possible use out of the level 1 cache.
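For example (a sketch only; C11 alignas, and the field names are just my guess at what the line loop's variables might be):

#include <stdalign.h>
#include <stdint.h>

/* All the state the inner loop touches, packed into exactly one 32-byte
   block so a single cache line fetch brings in everything at once.      */
struct line_state {
    int32_t xp;        /* x step in fixed point                     */
    int32_t yp;        /* y step, pre-multiplied by the line width  */
    int32_t x, y;      /* current position                          */
    int32_t count;     /* pixels left to draw                       */
    int32_t pad[3];    /* pad the struct up to 32 bytes             */
};

static alignas(32) struct line_state line;  /* starts on a 32-byte boundary */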
7.3.2.1 An abstract approach
Abstractly, LK works as follows: the values of the picture, i.e. the colors to be quantized, are imagined as spheres in a cube-shaped color space (XYZ = RGB). The more there is of a color, the bigger its sphere. Into this color space we add palette spheres, which can move freely, unlike the color spheres, which don't move at all. The number of palette spheres equals the size of the desired palette (256 colors -> 256 spheres).
We then run the following process: every color sphere pulls the closest palette sphere towards it, and the bigger a color sphere is, the stronger its pull. Now almost all of the palette spheres are pulled by one or more color spheres. (The new coordinates of a palette sphere are calculated as the average of the color values of the color spheres pulling it, in other words the sum of the colors divided by the number of color spheres.) The palette spheres which are not pulled by any color sphere telewarp near some color sphere. The palette spheres keep moving around the space like this until their movement slows below a defined level (trust me, it really does slow down). The new palette can then be read from the coordinates of the palette spheres.
7.3.2.2 A more technical approach
Let's use the following example: we have a truecolor picture which should be reduced to 256 colors. First we create a histogram of the picture. The histogram uses 15-bit (or any other 3*x-bit) color values.
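A sketch of that first step, assuming 24-bit input pixels and 5 bits per component (the names are mine, not from the text):

#include <stdint.h>
#include <string.h>

/* Count how many pixels of the truecolor picture map to each 15-bit color. */
static void build_histogram(const uint8_t *rgb, long pixels, uint32_t hist[32768])
{
    memset(hist, 0, 32768 * sizeof hist[0]);
    for (long i = 0; i < pixels; i++) {
        unsigned r = rgb[i*3 + 0] >> 3;          /* 8 bits -> 5 bits */
        unsigned g = rgb[i*3 + 1] >> 3;
        unsigned b = rgb[i*3 + 2] >> 3;
        hist[(r << 10) | (g << 5) | b]++;        /* 5:5:5 index      */
    }
}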
Next we create another table which lists the colors the picture originally contains. We go through the histogram, and at every point where some color exists (the value is greater than zero), we store the color's amount and value in this new table. The table can, for example, look like this:
unsigned long palette[256][3]; // 3: R,G & B
We also need three other variables:
unsigned long colorSum[256][3]; // 256 colors, 3 = R,G & B
(the following one could be attached to colorSum, too)
unsigned long colorCount[256];
and then a variable in which we store how much the palette changed:
unsigned long variance;
Now we go through the following steps:
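Going by the abstract description above, one round of those steps could look roughly like this. It is only a sketch: the origColor/origCount arrays (the table of the picture's own colors and their amounts) and numColors are my names, not from the text, while palette, colorSum, colorCount and variance are the variables declared above.

extern unsigned long palette[256][3], colorSum[256][3], colorCount[256], variance;

#define PAL_SIZE 256

/* One round: each original color pulls its nearest palette sphere, then
   every palette sphere moves to the average of the colors pulling it.   */
void lk_round(unsigned long origColor[][3], unsigned long origCount[], int numColors)
{
    int i, p, c;

    variance = 0;
    for (p = 0; p < PAL_SIZE; p++) {
        colorCount[p] = 0;
        colorSum[p][0] = colorSum[p][1] = colorSum[p][2] = 0;
    }
    for (i = 0; i < numColors; i++) {
        int best = 0;
        unsigned long bestDist = ~0UL;
        for (p = 0; p < PAL_SIZE; p++) {         /* find the nearest palette sphere */
            long dr = (long)origColor[i][0] - (long)palette[p][0];
            long dg = (long)origColor[i][1] - (long)palette[p][1];
            long db = (long)origColor[i][2] - (long)palette[p][2];
            unsigned long d = dr*dr + dg*dg + db*db;
            if (d < bestDist) { bestDist = d; best = p; }
        }
        for (c = 0; c < 3; c++)                  /* ...and pull it */
            colorSum[best][c] += origColor[i][c] * origCount[i];
        colorCount[best] += origCount[i];
    }
    for (p = 0; p < PAL_SIZE; p++) {
        if (!colorCount[p]) continue;            /* unpulled spheres get telewarped elsewhere */
        for (c = 0; c < 3; c++) {
            unsigned long newc = colorSum[p][c] / colorCount[p];
            variance += (newc > palette[p][c]) ? newc - palette[p][c]
                                               : palette[p][c] - newc;
            palette[p][c] = newc;                /* move to the average */
        }
    }
}   /* repeat until variance drops below a chosen limit */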
Disclaimer: The information here is based on material I've read (can't remember any document names) and some sources I've poked at. As such I can't give any pointers to more information nor do I have the math background for this algorithm.
7.3.3.1 The definition
Since the palette we're going to use is not going to include every color in the colorspace, we only use a "subspace".
Let's say we cut the subspace in two.
The example image may give you a slightly wrong idea of the algorithm: there will be a gap between the new subspaces, and they will most probably shrink in the other dimensions as well.
7.3.3.2 The algorithm
Basically we do the following:
1. Analyze subspaces
You could cut the subspace in half (by leaving as many colors on one side as on the other, or as much color intensity on one side as on the other), but we'll cut it at the median of the color values. You have already calculated the sum of all color values of the component you sorted the subspace by; now you need to find the median, i.e. the position where the running sum of values reaches the middle point of the whole sum.
Example: we have the values 1, 5, 7, 9, 10, 11, 17, 21. Their sum is 81, and half of that is 40.5. To find the median we start accumulating a new sum until it reaches 40.5: 1+5+7+9+10+11=43, so the halfway point is reached at the sixth value; that value starts the second half, so the first half gets 5 values and the second half 3.
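As a sketch (my code, not from the text), the search for that split point in a subspace whose values are already sorted:

/* Find the split point: the running sum of the sorted values reaches half
   of the total inside values[i]; that value and everything after it go to
   the second half (so for 1 5 7 9 10 11 17 21 this returns 5, giving
   halves of 5 and 3 values, as in the example above).                     */
static int median_split(const unsigned long *values, int count)
{
    unsigned long total = 0, running = 0;
    int i;

    for (i = 0; i < count; i++)
        total += values[i];
    for (i = 0; i < count; i++) {
        running += values[i];
        if (2 * running >= total)   /* passed the halfway point (40.5 here) */
            return i;
    }
    return count - 1;               /* degenerate input */
}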
There are some problems, workarounds, and speed issues.
Many pictures tend to have black in them, and on VGA we usually want color 0 to be black so that the border won't annoy us too much. I solved this by separating all black colors from the input values into a separate list and forcing color 0 to be black. Otherwise there will not necessarily be a completely black color at all!
If you wish to remap graphics (sprites, textures, whatever), performing a nearest-color search for every color you feed into the reduction is a bit of a waste. You could instead drag the original color index along with the color itself (so that it gets sorted together with the color and so on), and then, when averaging the subspaces, you'd directly know which color indices should be mapped to that color.
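For instance, the sorted entries could carry the extra field along (a hypothetical struct, not from the text):

/* Each entry drags its origin along, so after a subspace is averaged we
   already know which original colormap entries get that palette color. */
struct color_entry {
    unsigned long rgb[3];     /* the color itself               */
    unsigned long count;      /* how many pixels use it         */
    int           origIndex;  /* index in the original colormap */
};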
Remember not to do things repeatedly without a reason. (After analyzing, store the values and re-analyze only when needed; likewise, sort only when it's really necessary.)
The most time-consuming part of the algorithm (60%+ in my implementation) is the sort. Killing off duplicates might help, but I didn't bother trying that.
My implementation used a singly-linked list for the colors since I wanted it to be dynamic. There were a couple of problems with this approach: allocating and freeing memory for 250000+ colors took more time than the color reduction itself, which was easily solved by allocating enough memory for one whole colormap at a time. Another problem was sorting, which I solved by making a list of pointers and re-linking the list after the sort. Making a static list of colors should be just as easy, and quite probably faster.
I used radix sorting. On a p150 it reduced the colors of a 640*400 picture to 255 in about 4.5 seconds. Coding it took about one day (most of it wasted on debugging).
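For reference, a plain byte-at-a-time (LSD) radix sort over 32-bit keys looks like this; it's a textbook sketch, not the code used above:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sort n 32-bit keys with four counting-sort passes, one per byte. */
static void radix_sort(uint32_t *keys, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;

    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        size_t pos = 0;

        for (size_t i = 0; i < n; i++)            /* histogram of this byte     */
            count[(keys[i] >> shift) & 0xff]++;
        for (int b = 0; b < 256; b++) {           /* turn counts into offsets   */
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }
        for (size_t i = 0; i < n; i++)            /* stable scatter             */
            tmp[count[(keys[i] >> shift) & 0xff]++] = keys[i];
        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
}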