Journal:    Dr. Dobb's Journal  Dec 1992 v17 n12 p143(5)
-------------------------------------------------------------------------
Title:     Moving, faster lines, and page flipping. (Graphics
           Programming) (Column)
Author:    Abrash, Michael


Abstract:  Graphics programming can be optimized, however if the program
           is running well it may be better not to try and change it.
           Serge Mathieu of Concepteva Inc has a useful twist on
           animation.  Serge's system is a combination of page flipping
           and dirty-rectangle animation: set the start address to
           display page (), draw to page 1, set the start address to
           display page 1, wait for leading edge of vertical sync, copy
           from page 1 to page 0, set the address to display page 0, now
           identical to page 1, wait for leading edge of vertical sync.
           This method makes it necessary to maintain only page and
           eliminates the complications involved with maintaining two
           separate pages.
-------------------------------------------------------------------------
Full Text:

As I write this, the wife, the kid, and I are in the throes of yet
another lightning-quick transcontinental move, this time to Redmond to
work for You Know Who.  Moving is never fun, but what makes it worse for
us is the pets.  Getting them into kennels and to the airport is hard;
there's always the possibility that they might not be allowed to fly
because of the weather; and, worst of all, they might not make it.
Animals don't usually end up injured or dead, but it does happen.

In a (not notably successful) effort to cheer me up about the prospect of
shipping my animals, a friend told me the following story, which he
swears actually happened to a friend of his.  I don't know--to me, it has
the sound of an urban legend, which is to say it makes a good story, but
you can never track down the person it really happened to; it's always a
friend of a friend.  But maybe it is true, and anyway, it's a good story.

This friend of a friend (henceforth referred to as FOF), worked in an
air-freight terminal.  Consequently, he handled a lot of animals, which
was fine by him, because he liked animals; in fact, he had quite a few
cats at home.

You can imagine his dismay when, one day, he took a kennel off the plane
to find that the cat it carried was quite thoroughly dead.  (No, it
wasn't resting; this cat was bloody deceased.)

FOF knew how upset the owner would be, and came up with a plan to make
everything better.  At home, he had a cat of the same size, shape, and
markings.  He would substitute that cat, and since all cats treat all
humans with equal disdain, the owner would never know the difference, and
would never suffer the trauma of the loss of her cat.  So FOF drove home,
got his cat, put it in the kennel, and waited for the owner to show
up--at which point, she took one look at the kennel and said, "This isn't
my cat.  My cat is dead."

As it turned out, she had shipped her recently deceased feline home to be
buried.  History does not record how FOF dug himself out of this one.

Okay, but what's the point?  The point is, if it isn't broken, don't fix
it.  And if it is broken, maybe that's all right, too.  Which brings us,
neat as a pin, to the topic of drawing lines in a serious hurry.

Fast Run-length Slice Line Drawing

Last month, we examined the principles of run-length slice line drawing,
which draws lines a run rather than a pixel at a time, a run being a
series of pixels along the major (longer) axis.  I concluded by promising
a fast assembler version for this month.  Listing One (page 159) is the
promised code, in a form that's plug-compatible with the C code from last
month.

Your first question is likely to be the following:  Just how fast is
Listing One? Is it optimized to the hilt, or just pretty fast?  The quick
answer is:  It's fast.  Listing One draws lines at a rate of nearly 1
million pixels per second on my 486/33, and is capable of still faster
drawing, as I'll discuss shortly.  (The heavily optimized AutoCAD
line-drawing code that I mentioned last month drew 150,000 pixels per
second on an EGA in a 386/16, and I thought I had died and gone to
Heaven.  Such is progress.)  The full answer is a more complicated one,
and ties in to the principle that if it is broken, maybe that's okay--and
to the principle of looking before you leap, also known as profiling
before you optimize.

When I went to speed up run-length slice lines, I initially manually
converted the C code from last month into assembler.  Then I streamlined
the register usage and used REP STOS wherever possible.  Listing One is
that code.  At that point, line drawing was surely faster, although I
didn't know exactly how much faster.  Equally surely, there were
significant optimizations yet to be made, and I was itching to get on to
them, for they were a lot more interesting than a basic C-to-assembler
port.

Ego intervened at this point, however.  I wanted to know how much of a
speed-up I had already gotten, so I timed the performance of the C code
vs.  the assembler code.  To my horror, I found that I had not gotten
even a two-times improvement!  I couldn't understand how that could
be--the C code was decidedly unoptimized--until I hit on the idea of
measuring the maximum memory speed of the VGA to which I was drawing.

Bingo.  The Paradise VGA in my 486/33 is fast for a single display-memory
write, because it buffers the data, lets the CPU go on its merry way, and
finishes the write when display memory is ready.  However, the maximum
rate at which data can be written to the adapter turns out to be no more
than one byte every microsecond.  Put another way, you can only write one
byte to this adapter every 33 clock cycles on a 486/33.  Therefore, no
matter how fast I made the line-drawing code, it could never draw more
than 1,000,000 pixels per second in 256-color mode in my system.  The C
code was already drawing at about half that rate, so the potential
speed-up for the assembler code was limited to a maximum of two times,
which is pretty close to what Listing One did, in fact, achieve.  When I
compared the C and assembler implementations drawing to normal system
(nondisplay) memory, I found that the assembler code was actually four
times as fast as the C code.

In fact, Listing One draws lines at about 92 percent of the maximum
possible rate in my system--that is, it draws very nearly as fast as the
VGA hardware will allow.  All the optimization in the world would get me
less than 10 percent faster line drawing--and that only if I eliminated
all overhead, an unlikely proposition at best.  The code isn't fully
optimized, but so what?

Now it's true that faster line-drawing code would likely be more
beneficial on faster VGAs, especially local-bus VGAs, and in slower
systems.  For that reason, I'll list a variety of potential optimizations
to Listing One.  On the other hand, it's also true that Listing One is
capable of drawing lines at a rate of 2.2 million pixels per second on a
486/33, given fast enough VGA memory, so it should be able to drive
almost any non-local-bus VGA at nearly full speed.  In short, Listing One
is very fast, and, in many systems, further optimization is basically a
waste of time.

Profile before you optimize.

Further Optimizations

Following is a quick tour of some of the many possible further
optimizations to Listing One.

The run-handling loops could be unrolled more than the current two times.
 However, bear in mind that a two-times unrolling gets more than half the
maximum unrolling benefit with less overhead than a more heavily unrolled
loop.

BX could be freed up in the Y-major code by breaking out separate loops
for X advances of 1 and -1.  DX could be freed up by using AH as the
counter for the run loops, although this would limit the maximum line
length that could be handled.  The freed registers could be used to keep
more of the whole-step and error variables in registers.  Alternatively,
the freed registers could be used to implement more esoteric approaches
like unrolling the Y-major inner loop; such unrolling could take
advantage of the knowledge that only two run lengths are possible for any
given line.  Strangely enough, on the 486 it might also be worth
unrolling the X-major inner loop, which consists of REP STOSB, because of
the slow start-up time of REP relative to the speed of branching on that
processor.

Special code could be implemented for lines with integral slopes, because
all runs are exactly the same length in such lines.  Also, the X-major
code could try to write an aligned word at a time to display memory
whenever possible; this would improve the maximum possible performance on
some 16-bit VGAs.

One weakness of Listing One is that for lines with slopes between 0.5 and
2, the average run length is less than two, rendering run-length slicing
ineffective.  This can be remedied by viewing lines in that range as
being composed of diagonal, rather than horizontal or vertical runs.  I
haven't space to discuss this, but it's not very complicated, and it
guarantees a minimum run length of 2.  That renders run drawing
considerably more efficient, and makes techniques such as unrolling the
inner run-drawing loops more attractive.

Finally, be aware that run-length slice drawing is best for long lines,
because it has more and slower setup than standard Bresenham's, including
a divide.  Run-length slice is great for 100-pixel lines, but not
necessarily for 20-pixel lines, and it's a sure thing that it's not
terrific for 3-pixel lines.  Both approaches will work, but if
line-drawing performance is critical, whether you'll want to use
run-length slice or standard Bresenham's depends on the typical lengths
of the lines you'll be drawing.  For lines of widely varying lengths, you
might want to implement both approaches, and choose the best one for each
line, depending on the line length--assuming, of course, that your
display memory is fast enough and your application demanding enough to
make that level of optimization worthwhile.

An Interesting Twist on Page Flipping

I've spent a fair amount of time exploring various ways to do animation.
(See, for example, my July, August, and September 1991 DDJ columns, as
well as those in the January through April 1992 issues.)  I thought I had
pegged all the possible ways to do animation:  exclusive-ORing; simply
drawing and erasing objects; drawing objects with a blank fringe to erase
them at their old locations as they're drawn; page flipping; and,
finally, drawing to local memory and copying the dirty (modified)
rectangles to the screen.

To my surprise, someone threw me an interesting and useful twist on
animation the other day, a cross between page flipping and
dirty-rectangle animation.  That someone was Serge Mathieu of Concepteva
Inc., in Rosemere, Quebec, who informed me that he designs everything
"from a game 'point de vue'."

In normal page flipping, you display one page while you update the other
page.  Then you display the new page while you update the other.  This
works fine, but the need to keep two pages current can make for a lot of
bookkeeping and possibly extra drawing, especially in applications where
only some of the objects are redrawn each time.

Serge didn't care to do all that bookkeeping in his animation
applications, so he came up with the following approach (which I've
reworded, amplified, and slightly modified):

1.  Set the start address to display page 0.

2.  Draw to page 1.

3.  Set the start address to display page 1 (the newly drawn page), then
wait for the leading edge of vertical sync, at which point the page has
flipped and it's safe to modify page 0.

4.  Copy, via the latches, from page 1 to page 0 the areas that changed
from the last screen to the current one.

5.  Set the start address to display page 0, which is now identical to
page 1, then wait for the leading edge of vertical sync, at which point
the page has flipped and it's safe to modify page 1.

6.  Go to step 2.

The great benefit of Serge's approach is that the only page that is ever
actually drawn to (as opposed to block-copied to) is page 1.  Only one
page needs to be maintained, and the complications of maintaining two
separate pages vanish entirely.  The performance of Serge's approach may
be better or worse than standard page flipping, depending on whether a
lot of extra work is required to maintain two pages or not.  My guess is
that Serge's approach will usually be slower, owing to the considerable
amount of display-memory copying involved, and also to the double
page-flip per frame.  There's no doubt, however, that Serge's approach is
simpler, and the resultant display quality is every bit as good as
standard page flipping.  Given page flipping's fair degree of
complication, this approach is a valuable tool, especially for
less-experienced animation programmers.

An interesting variation on Serge's approach doesn't page flip or wait
for vertical sync:

1.  Set the start address to display page 0.

2.  Draw to page 1.

3.  Copy, via the latches, the areas that changed from the last screen to
the current one from page 1 to page 0.

4.  Go to step 2.

This approach totally eliminates page flipping, which can consume a great
deal of time.  The downside is that images may shear for one frame if
they're only partially copied when the raster beam reaches them.  This
approach is basically a standard dirty-rectangle approach, except that
the drawing buffer is stored in display memory, rather than in system
memory.  Whether this technique is faster than drawing to system memory
depends on whether the benefit you get from the VGA's hardware, such as
the Bit Mask, the ALUs, and especially the latches (for copying the dirty
rectangles) is sufficient to outweigh the extra display-memory accesses
involved in drawing, since display memory is notoriously slow.

Finally, I'd like to point out that in any scheme that involves changing
the display-memory start address, a clever trick can potentially reduce
the time spent waiting for pages to flip.  Normally, it's necessary to
wait for display enable to be active, then set the two start address
registers, and finally wait for vertical sync to be active, so you know
the new start address has taken effect.  The start-address registers must
never be set around the time vertical sync is active (the new start
address is accepted at either the start or end of vertical sync on the
EGAs and VGAs I'm familiar with), because it would then be possible to
load a half-changed start address (one register loaded, the other not yet
loaded), and the screen would jump for a frame.  Avoiding this condition
is the motivation for waiting for display enable, because display enable
is active only when vertical sync is not active and will not become
active for a long while.

Suppose, however, that you arrange your page start addresses so that they
both have a low-byte value of 0 (page 0 starts at 0000h, and page 1
starts at 8000h, for example).  Page flipping can then be done simply by
setting the new high byte of the start address, then waiting for the
leading edge of vertical sync.  This eliminates the need to wait for
display enable (the two bytes of the start address can never be
mismatched); page flipping will often involve less waiting, because
display enable becomes inactive long before vertical sync becomes active.
 Using the above approach reclaims all the time between the end of
display enable and the start of vertical sync for doing useful work.
(The steps I've given for Serge's animation approach assume that the
single-byte approach is in use; that's why display enable is never waited
for.)

Thanks

I took another bundle of reader contributions from the X-Sharp careware
over to the Vermont Association for the Blind the other day.  They were
very grateful.  Thanks to all of you who have helped so far.