Journal: Dr. Dobb's Journal Dec 1992 v17 n12 p143(5) ------------------------------------------------------------------------- Title: Moving, faster lines, and page flipping. (Graphics Programming) (Column) Author: Abrash, Michael Abstract: Graphics programming can be optimized, however if the program is running well it may be better not to try and change it. Serge Mathieu of Concepteva Inc has a useful twist on animation. Serge's system is a combination of page flipping and dirty-rectangle animation: set the start address to display page (), draw to page 1, set the start address to display page 1, wait for leading edge of vertical sync, copy from page 1 to page 0, set the address to display page 0, now identical to page 1, wait for leading edge of vertical sync. This method makes it necessary to maintain only page and eliminates the complications involved with maintaining two separate pages. ------------------------------------------------------------------------- Full Text: As I write this, the wife, the kid, and I are in the throes of yet another lightning-quick transcontinental move, this time to Redmond to work for You Know Who. Moving is never fun, but what makes it worse for us is the pets. Getting them into kennels and to the airport is hard; there's always the possibility that they might not be allowed to fly because of the weather; and, worst of all, they might not make it. Animals don't usually end up injured or dead, but it does happen. In a (not notably successful) effort to cheer me up about the prospect of shipping my animals, a friend told me the following story, which he swears actually happened to a friend of his. I don't know--to me, it has the sound of an urban legend, which is to say it makes a good story, but you can never track down the person it really happened to; it's always a friend of a friend. But maybe it is true, and anyway, it's a good story. This friend of a friend (henceforth referred to as FOF), worked in an air-freight terminal. Consequently, he handled a lot of animals, which was fine by him, because he liked animals; in fact, he had quite a few cats at home. You can imagine his dismay when, one day, he took a kennel off the plane to find that the cat it carried was quite thoroughly dead. (No, it wasn't resting; this cat was bloody deceased.) FOF knew how upset the owner would be, and came up with a plan to make everything better. At home, he had a cat of the same size, shape, and markings. He would substitute that cat, and since all cats treat all humans with equal disdain, the owner would never know the difference, and would never suffer the trauma of the loss of her cat. So FOF drove home, got his cat, put it in the kennel, and waited for the owner to show up--at which point, she took one look at the kennel and said, "This isn't my cat. My cat is dead." As it turned out, she had shipped her recently deceased feline home to be buried. History does not record how FOF dug himself out of this one. Okay, but what's the point? The point is, if it isn't broken, don't fix it. And if it is broken, maybe that's all right, too. Which brings us, neat as a pin, to the topic of drawing lines in a serious hurry. Fast Run-length Slice Line Drawing Last month, we examined the principles of run-length slice line drawing, which draws lines a run rather than a pixel at a time, a run being a series of pixels along the major (longer) axis. I concluded by promising a fast assembler version for this month. Listing One (page 159) is the promised code, in a form that's plug-compatible with the C code from last month. Your first question is likely to be the following: Just how fast is Listing One? Is it optimized to the hilt, or just pretty fast? The quick answer is: It's fast. Listing One draws lines at a rate of nearly 1 million pixels per second on my 486/33, and is capable of still faster drawing, as I'll discuss shortly. (The heavily optimized AutoCAD line-drawing code that I mentioned last month drew 150,000 pixels per second on an EGA in a 386/16, and I thought I had died and gone to Heaven. Such is progress.) The full answer is a more complicated one, and ties in to the principle that if it is broken, maybe that's okay--and to the principle of looking before you leap, also known as profiling before you optimize. When I went to speed up run-length slice lines, I initially manually converted the C code from last month into assembler. Then I streamlined the register usage and used REP STOS wherever possible. Listing One is that code. At that point, line drawing was surely faster, although I didn't know exactly how much faster. Equally surely, there were significant optimizations yet to be made, and I was itching to get on to them, for they were a lot more interesting than a basic C-to-assembler port. Ego intervened at this point, however. I wanted to know how much of a speed-up I had already gotten, so I timed the performance of the C code vs. the assembler code. To my horror, I found that I had not gotten even a two-times improvement! I couldn't understand how that could be--the C code was decidedly unoptimized--until I hit on the idea of measuring the maximum memory speed of the VGA to which I was drawing. Bingo. The Paradise VGA in my 486/33 is fast for a single display-memory write, because it buffers the data, lets the CPU go on its merry way, and finishes the write when display memory is ready. However, the maximum rate at which data can be written to the adapter turns out to be no more than one byte every microsecond. Put another way, you can only write one byte to this adapter every 33 clock cycles on a 486/33. Therefore, no matter how fast I made the line-drawing code, it could never draw more than 1,000,000 pixels per second in 256-color mode in my system. The C code was already drawing at about half that rate, so the potential speed-up for the assembler code was limited to a maximum of two times, which is pretty close to what Listing One did, in fact, achieve. When I compared the C and assembler implementations drawing to normal system (nondisplay) memory, I found that the assembler code was actually four times as fast as the C code. In fact, Listing One draws lines at about 92 percent of the maximum possible rate in my system--that is, it draws very nearly as fast as the VGA hardware will allow. All the optimization in the world would get me less than 10 percent faster line drawing--and that only if I eliminated all overhead, an unlikely proposition at best. The code isn't fully optimized, but so what? Now it's true that faster line-drawing code would likely be more beneficial on faster VGAs, especially local-bus VGAs, and in slower systems. For that reason, I'll list a variety of potential optimizations to Listing One. On the other hand, it's also true that Listing One is capable of drawing lines at a rate of 2.2 million pixels per second on a 486/33, given fast enough VGA memory, so it should be able to drive almost any non-local-bus VGA at nearly full speed. In short, Listing One is very fast, and, in many systems, further optimization is basically a waste of time. Profile before you optimize. Further Optimizations Following is a quick tour of some of the many possible further optimizations to Listing One. The run-handling loops could be unrolled more than the current two times. However, bear in mind that a two-times unrolling gets more than half the maximum unrolling benefit with less overhead than a more heavily unrolled loop. BX could be freed up in the Y-major code by breaking out separate loops for X advances of 1 and -1. DX could be freed up by using AH as the counter for the run loops, although this would limit the maximum line length that could be handled. The freed registers could be used to keep more of the whole-step and error variables in registers. Alternatively, the freed registers could be used to implement more esoteric approaches like unrolling the Y-major inner loop; such unrolling could take advantage of the knowledge that only two run lengths are possible for any given line. Strangely enough, on the 486 it might also be worth unrolling the X-major inner loop, which consists of REP STOSB, because of the slow start-up time of REP relative to the speed of branching on that processor. Special code could be implemented for lines with integral slopes, because all runs are exactly the same length in such lines. Also, the X-major code could try to write an aligned word at a time to display memory whenever possible; this would improve the maximum possible performance on some 16-bit VGAs. One weakness of Listing One is that for lines with slopes between 0.5 and 2, the average run length is less than two, rendering run-length slicing ineffective. This can be remedied by viewing lines in that range as being composed of diagonal, rather than horizontal or vertical runs. I haven't space to discuss this, but it's not very complicated, and it guarantees a minimum run length of 2. That renders run drawing considerably more efficient, and makes techniques such as unrolling the inner run-drawing loops more attractive. Finally, be aware that run-length slice drawing is best for long lines, because it has more and slower setup than standard Bresenham's, including a divide. Run-length slice is great for 100-pixel lines, but not necessarily for 20-pixel lines, and it's a sure thing that it's not terrific for 3-pixel lines. Both approaches will work, but if line-drawing performance is critical, whether you'll want to use run-length slice or standard Bresenham's depends on the typical lengths of the lines you'll be drawing. For lines of widely varying lengths, you might want to implement both approaches, and choose the best one for each line, depending on the line length--assuming, of course, that your display memory is fast enough and your application demanding enough to make that level of optimization worthwhile. An Interesting Twist on Page Flipping I've spent a fair amount of time exploring various ways to do animation. (See, for example, my July, August, and September 1991 DDJ columns, as well as those in the January through April 1992 issues.) I thought I had pegged all the possible ways to do animation: exclusive-ORing; simply drawing and erasing objects; drawing objects with a blank fringe to erase them at their old locations as they're drawn; page flipping; and, finally, drawing to local memory and copying the dirty (modified) rectangles to the screen. To my surprise, someone threw me an interesting and useful twist on animation the other day, a cross between page flipping and dirty-rectangle animation. That someone was Serge Mathieu of Concepteva Inc., in Rosemere, Quebec, who informed me that he designs everything "from a game 'point de vue'." In normal page flipping, you display one page while you update the other page. Then you display the new page while you update the other. This works fine, but the need to keep two pages current can make for a lot of bookkeeping and possibly extra drawing, especially in applications where only some of the objects are redrawn each time. Serge didn't care to do all that bookkeeping in his animation applications, so he came up with the following approach (which I've reworded, amplified, and slightly modified): 1. Set the start address to display page 0. 2. Draw to page 1. 3. Set the start address to display page 1 (the newly drawn page), then wait for the leading edge of vertical sync, at which point the page has flipped and it's safe to modify page 0. 4. Copy, via the latches, from page 1 to page 0 the areas that changed from the last screen to the current one. 5. Set the start address to display page 0, which is now identical to page 1, then wait for the leading edge of vertical sync, at which point the page has flipped and it's safe to modify page 1. 6. Go to step 2. The great benefit of Serge's approach is that the only page that is ever actually drawn to (as opposed to block-copied to) is page 1. Only one page needs to be maintained, and the complications of maintaining two separate pages vanish entirely. The performance of Serge's approach may be better or worse than standard page flipping, depending on whether a lot of extra work is required to maintain two pages or not. My guess is that Serge's approach will usually be slower, owing to the considerable amount of display-memory copying involved, and also to the double page-flip per frame. There's no doubt, however, that Serge's approach is simpler, and the resultant display quality is every bit as good as standard page flipping. Given page flipping's fair degree of complication, this approach is a valuable tool, especially for less-experienced animation programmers. An interesting variation on Serge's approach doesn't page flip or wait for vertical sync: 1. Set the start address to display page 0. 2. Draw to page 1. 3. Copy, via the latches, the areas that changed from the last screen to the current one from page 1 to page 0. 4. Go to step 2. This approach totally eliminates page flipping, which can consume a great deal of time. The downside is that images may shear for one frame if they're only partially copied when the raster beam reaches them. This approach is basically a standard dirty-rectangle approach, except that the drawing buffer is stored in display memory, rather than in system memory. Whether this technique is faster than drawing to system memory depends on whether the benefit you get from the VGA's hardware, such as the Bit Mask, the ALUs, and especially the latches (for copying the dirty rectangles) is sufficient to outweigh the extra display-memory accesses involved in drawing, since display memory is notoriously slow. Finally, I'd like to point out that in any scheme that involves changing the display-memory start address, a clever trick can potentially reduce the time spent waiting for pages to flip. Normally, it's necessary to wait for display enable to be active, then set the two start address registers, and finally wait for vertical sync to be active, so you know the new start address has taken effect. The start-address registers must never be set around the time vertical sync is active (the new start address is accepted at either the start or end of vertical sync on the EGAs and VGAs I'm familiar with), because it would then be possible to load a half-changed start address (one register loaded, the other not yet loaded), and the screen would jump for a frame. Avoiding this condition is the motivation for waiting for display enable, because display enable is active only when vertical sync is not active and will not become active for a long while. Suppose, however, that you arrange your page start addresses so that they both have a low-byte value of 0 (page 0 starts at 0000h, and page 1 starts at 8000h, for example). Page flipping can then be done simply by setting the new high byte of the start address, then waiting for the leading edge of vertical sync. This eliminates the need to wait for display enable (the two bytes of the start address can never be mismatched); page flipping will often involve less waiting, because display enable becomes inactive long before vertical sync becomes active. Using the above approach reclaims all the time between the end of display enable and the start of vertical sync for doing useful work. (The steps I've given for Serge's animation approach assume that the single-byte approach is in use; that's why display enable is never waited for.) Thanks I took another bundle of reader contributions from the X-Sharp careware over to the Vermont Association for the Blind the other day. They were very grateful. Thanks to all of you who have helped so far.