Display functions optimization

Libraries, utilities, bootloaders...

Display functions optimization

Postby rodot » Sun Jan 11, 2015 6:48 pm

I start this topic to discuss optimization of the graphical functions of the library, like drawPixel and drawBitmap, because it's getting critical if you are trying to draw a screen full of tiles. This conversation initially started in here.

The two main bottlenecks to draw a screen full of tiles seems to be the drawPixel and drawBitmapFunctions.

A few cpu load measurement I did using the getCpuLoad() function at the default 20FP (50ms/frame, so 1% is 0.5ms) :

- blank screen : 10% (which represents the time to send the buffer to the screen. Nokia 5110 maximum SPI bus frequency is the limitation here)
- gb.display.fillScreen() : 10% (it uses memset so it's almost instant)
- fillRect() to fill the screen : 76%
- gb.display.drawPixel() in a for loop to fill the screen : 50%
Code: Select all
for(byte x = 0; x < LCDWIDTH; x++){
        for(byte y = 0; y < LCDHEIGHT; y++){
          gb.display.drawPixel(x,y);
        }
      }


Right now, fillRect() call drawFastVline (which is not optimized at all) plenty of times to actually fill the rectangle, that's why it's slower than the for loop calling drawPixel.
I got a small improvement (76% to 69%) with the following change the function drawFastVline:
Code: Select all
void Display::drawFastVLine(int8_t x, int8_t y, int8_t h) {
    // stupidest/slowest version :
    //drawLine(x, y, x, y + h - 1);
    //new version :
   for(int8_t i=0; i<h; i++){
      drawPixel(x,y+i);
   }
}

(I will commit that change to the beta branch soon)

- I tried to place the drawFastVline function inline but it didn't change anything in this case
- drawPixel have be previously made inline for performance improvement, I tried to make it not inline and it doesn't make any difference this time. Maybe because of the optimization done by avr-gcc

What is surprising is that if I change the fillRect function to a for loop with drawPixel instead of a for loop of drawFastVline, it is slower ! (74%). Doing a for loop of drawPixel() directly in the .ino only takes 50%. I don't understand where the difference comes from.

- Replacing (y/8) by (y>>3) in the drawPixel doesn't affect performance, it seems that the optimizer does its job here.
- removing the INVERT color makes a 5% improvement when filling the whole screen. Not a big difference, and this color is pretty handy.
User avatar
rodot
Site Admin
 
Posts: 1290
Joined: Mon Nov 19, 2012 11:54 pm
Location: France

Re: Display functions optimization

Postby Myndale » Sun Jan 11, 2015 9:23 pm

From the tests I did a few months ago the biggest slowdowns seemed to be due to testing against the screen boundary and the overhead of the loop mechanics. Testing of screen boundary can be eliminated by adjusting the source and destination rectangles before the blt so that you only draw pixels in the overlapping region. Loop mechanics can be minimized by using loop unrolling e.g. something like Duff's Device.

The fastest speeds, however, can be obtained by storing the bitmap in the same pixel format as the screen i.e. rows 8 pixels high with one byte per column. If the destination y is a multiple of 8 then a naive implementation would look something like this:

Code: Select all
  for (int y=y1; y<y2; y++, dst+=84, src+=w)
   for (int x=x1; x<x2; x++)
      dst[x] |= src[x];


If destination y isn't a multiple of 8 then the pixels needed to be shifted up (down?) into position, which means you also need to include pixels from the next 8-pixel "row" as well:

Code: Select all
  for (int y=0; y<h; y++, dst+=84, src+=w)
   for (int x=0; x<w; x++)
      dst[x] |= (src[x] << shift1) | (src[x+w] >> shift2);


In conjunction with loop unrolling this would normally result in 8 pixels drawn for every 16 or so assembly instructions, or about a million pixels a second. Unfortunately the AVRs have very poor shift support, and shift operators like the ones I've just used are actually implemented with a loop. To get maximum performance you'd therefore have to replace those shifts with integer multiplications, something like this:

Code: Select all
  int scale1 = 1 << (shift1 + 8);
  int scale2 = 1 << (8 - shift2);
  for (int y=0; y<h; y++, dst+=84, src+=w)
   for (int x=0; x<w; x++)
      dst[x] |= (src[x] * scale1 | src[x+w] * scale2) >> 8;


The one problem with this latter method is that it would only work with sprites that haven't been rotated or mirrored in the y direction, otherwise you're back to doing it the per-pixel way. The question is whether you think it's worth adding such an optimized routine just for that specific case.
Myndale
 
Posts: 507
Joined: Sat Mar 01, 2014 1:25 am

Re: Display functions optimization

Postby rodot » Sun Jan 11, 2015 9:52 pm

Thanks for the explanation Myndale. It would be interesting to benchmark this new routine... do you have any idea of the performance gain we would get in the case of a y non multiple of 8?

Your routine wouldn't take much space, and we can add a setting using preprocessor instructions to disable it if someone doesn't use it and need a few extra byte of flash memory. I think it's really worth adding it (depending on the benchmark results of course).

A case where the y destination is always 8 would be an horizontal scrolling tilemap... which correspond exactly to Jamish's game, which is at the origin of this discussion. So again, it's worth it in my opinion.
User avatar
rodot
Site Admin
 
Posts: 1290
Joined: Mon Nov 19, 2012 11:54 pm
Location: France

Re: Display functions optimization

Postby Myndale » Mon Jan 12, 2015 12:34 am

Here's an example of using loop unrolling in the drawBitmap function, only took 10 minutes or so to throw together but it almost doubles the speed of the function. I'm not doing any screen bounds checking, that's a fringe case though so you could probably revert to the regular version for those. It also only supports color=BLACK, you'd need to add 2 extra loops for WHITE (i.e. ptr[0] &= ~mask) and INVERT (i.e. ptr[0] ^= mask):

Code: Select all
#include <SPI.h>
#include <Gamebuino.h>
Gamebuino gb;

const byte sprite[] PROGMEM = {
  16, 16, 0x1f,0xf8,0x1f,0xf8,0x1f,0xfc,0x1f,0xff,0x1f,0xff,0xf,0xff,0xf,0xff,0x7,0xff,0x87,0xff,0x3,0xff,0x1,0xff,0x0,0x7f,0x2,0x1f,0x0,0x0,0x0,0x0,0x40,0x0,};

void setup(){
  gb.begin();
}

void loop(){
  long start, finish;
  int drawTime;

  if(gb.update()){
    const int count = 100;

    start = millis();
    for (int i=0; i<count; i++)
      gb.display.drawBitmap(1, 31, sprite);
    finish = millis();
    drawTime = 1000L*(finish-start)/count;
    gb.display.print(F("drawBitmap: "));
    gb.display.print(drawTime);
    gb.display.println(F("ns"));

    start = millis();
    for (int i=0; i<count; i++)
      drawBitmapUnrolled(17, 31, sprite);
    finish = millis();
    drawTime = 1000L*(finish-start)/count;
    gb.display.print(F("unrolled: "));
    gb.display.print(drawTime);
    gb.display.println(F("ns"));
  }
}

inline void drawPixelFast(int8_t x, int8_t y) {
  gb.display.getBuffer()[x + (y / 8) * LCDWIDTH_NOROT] |= _BV(y % 8);
}

void drawBitmapUnrolled(int8_t x, int8_t y, const uint8_t *bitmap) {
  int8_t w = pgm_read_byte(bitmap);
  int8_t h = pgm_read_byte(bitmap + 1);
  bitmap = bitmap + 2; //add an offset to the pointer to start after the width and height
  int8_t i, j, byteWidth = (w + 7) / 8;
  uint8_t * line = gb.display.getBuffer() + x;
  const uint8_t * bitmap_line = bitmap;
  Display display = gb.display; 
 
  for (j = 0; j < h; j++, bitmap_line+=byteWidth) {
    int dsty = y + j;
    int ofs = (dsty / 8) * LCDWIDTH_NOROT;
    int dstx = x;
    const uint8_t * bitmap_src = bitmap_line;
    uint8_t * ptr = line + ofs;
    uint8_t mask = _BV(dsty % 8);
    int i = w;
    while (i >= 8)
    {
      unsigned pixels = pgm_read_byte(bitmap_src++);
      if (pixels & 0x80)
        ptr[0] |= mask;
      if (pixels & 0x40)
        ptr[1] |= mask;
      if (pixels & 0x20)
        ptr[2] |= mask;
      if (pixels & 0x10)
        ptr[3] |= mask;
      if (pixels & 0x08)
        ptr[4] |= mask;
      if (pixels & 0x04)
        ptr[5] |= mask;
      if (pixels & 0x02)
        ptr[6] |= mask;
      if (pixels & 0x01)
        ptr[7] |= mask;
      ptr += 8;
      i -= 8;
      dstx += 8;
    }
    if (i)
    {
      unsigned pixels = pgm_read_byte(bitmap_src);
      while (i--)
      {
        if (pixels & 0x80)
          gb.display.drawPixel(dstx, dsty);
        pixels <<= 1;
        dst++;
      }
    }
  }
}
Myndale
 
Posts: 507
Joined: Sat Mar 01, 2014 1:25 am

Re: Display functions optimization

Postby jonnection » Mon Jan 12, 2015 3:03 pm

Code: Select all
      int scale1 = 1 << (shift1 + 8);
      int scale2 = 1 << (8 - shift2);
      for (int y=0; y<h; y++, dst+=84, src+=w)
       for (int x=0; x<w; x++)
          dst[x] |= (src[x] * scale1 | src[x+w] * scale2) >> 8;


This was very very interesting. Again, thank you for a good tip Myndale, about a detail I was totally unaware of (ie. lack of dedicated shifter in AVR hardware). I took it for granted, having more experience in other kinds of CPU's.

I need to go over my code now.
User avatar
jonnection
 
Posts: 317
Joined: Sun May 04, 2014 8:21 pm

Re: Display functions optimization

Postby Myndale » Mon Jan 12, 2015 10:15 pm

Yeah, I was surprised about that myself, not shifting by another register I can understand but not even constant shift? The AVRs must be really pushing propagation delay to the limit or something.

Also I notice I did put a ">> 8" in the code above, I haven't specifically checked it but I would hope that the compiler is smart enough to do the right thing there. Worse case scenario we'd have to revert to a small amount of assembly.
Myndale
 
Posts: 507
Joined: Sat Mar 01, 2014 1:25 am

Re: Display functions optimization

Postby Myndale » Tue Jan 13, 2015 12:25 am

A few more optimizations, this version of the drawBitmap routine runs ~6.5 times faster than the library version (i.e. ~330ns vs ~2155ns):

Code: Select all
#include <SPI.h>
#include <Gamebuino.h>
Gamebuino gb;

const byte sprite[] PROGMEM = {
  16, 16, 0x1f,0xf8,0x1f,0xf8,0x1f,0xfc,0x1f,0xff,0x1f,0xff,0xf,0xff,0xf,0xff,0x7,0xff,0x87,0xff,0x3,0xff,0x1,0xff,0x0,0x7f,0x2,0x1f,0x0,0x0,0x0,0x0,0x40,0x0,};

void setup(){
  gb.begin();
}

void loop(){
  long start, finish;
  int drawTime;

  if(gb.update()){
    const int count = 100;

    start = millis();
     for (int i=0; i<count; i++)
     gb.display.drawBitmap(1, 31, sprite);
     finish = millis();
     drawTime = 1000L*(finish-start)/count;
     gb.display.print(F("drawBitmap: "));
     gb.display.print(drawTime);
     gb.display.println(F("ns"));

    start = millis();
    for (int i=0; i<count; i++)
      drawBitmapUnrolled(17, 31, sprite);
    finish = millis();
    drawTime = 1000L*(finish-start)/count;
    gb.display.print(F("unrolled: "));
    gb.display.print(drawTime);
    gb.display.println(F("ns"));
  }
}

void drawBitmapUnrolled(int8_t x, int8_t y, const uint8_t *bitmap) {

  int8_t w = pgm_read_byte(bitmap);
  int8_t h = pgm_read_byte(bitmap + 1);
  bitmap = bitmap + 2; //add an offset to the pointer to start after the width and height
  int8_t i, j, byteWidth = (w + 7) >> 3;
  uint8_t * screen_line = gb.display.getBuffer() + x;
  const uint8_t * bitmap_line = bitmap;

  uint8_t dsty = y;
  uint8_t mask = _BV(y & 7);
  for (j = 0; j < h; j++, dsty++, bitmap_line+=byteWidth, mask = (mask & 0x80) ? 1 : (mask<<1))
  {
    int ofs = (dsty >> 3) * LCDWIDTH_NOROT;
    int8_t dstx = x;
    const uint8_t * bitmap_src = bitmap_line;
    uint8_t * ptr = screen_line + ofs;   
    int8_t i = w;
    while (i >= 8)
    {
      uint8_t pixels = pgm_read_byte(bitmap_src++);
      if (pixels & 0x80)
        ptr[0] |= mask;
      if (pixels & 0x40)
        ptr[1] |= mask;
      if (pixels & 0x20)
        ptr[2] |= mask;
      if (pixels & 0x10)
        ptr[3] |= mask;
      if (pixels & 0x08)
        ptr[4] |= mask;
      if (pixels & 0x04)
        ptr[5] |= mask;
      if (pixels & 0x02)
        ptr[6] |= mask;
      if (pixels & 0x01)
        ptr[7] |= mask;
       
      ptr += 8;
      i -= 8;
      dstx += 8;     
    }
    if (i)
    {
       uint8_t pixels = pgm_read_byte(bitmap_src);
       while (i--)
       {
         if (pixels & 0x80)
         gb.display.drawPixel(dstx, dsty);
         pixels <<= 1;
       }
     }     
  }

}

Myndale
 
Posts: 507
Joined: Sat Mar 01, 2014 1:25 am

Re: Display functions optimization

Postby Jamish » Tue Jan 13, 2015 4:46 pm

Myndale wrote:A few more optimizations, this version of the drawBitmap routine runs ~6.5 times faster than the library version (i.e. ~330ns vs ~2155ns):


Image

Incredible. And you're drawing those at y = 31, meaning Y values don't have to be a multiple of 8. Thanks for introducing me to loop unrolling, I had no idea.

I wonder if the same kind of pointer arithmetic COULD be used to speed up the other cases, at least the flips. FLIPH seems easy enough. Just reverse mask and flip the order of
Code: Select all
 if (pixels & 0x80)
        ptr[0] |= mask;
      if (pixels & 0x40)
        ptr[1] |= mask;
      if (pixels & 0x20)
        ptr[2] |= mask;
[...]


to
Code: Select all
 if (pixels & 0x80)
        ptr[7] |= mask;
      if (pixels & 0x40)
        ptr[6] |= mask;
      if (pixels & 0x20)
        ptr[5] |= mask;

[...]


Then... in the "if (i)" case...

Uh...

I'm at work at the moment. Can't think straight, let alone fully grasp what your code is doing right now :lol:
User avatar
Jamish
 
Posts: 73
Joined: Wed Dec 17, 2014 6:52 pm
Location: California

Re: Display functions optimization

Postby jonnection » Wed Jan 14, 2015 1:24 am

I just ran Myndale's latest code snippet and put in my version of drawbitmap from the rboy lib I'm developing for the "Isle of Maniax".
I printed the results out from Serial.print as I only had a Arduino Nano at hand to test (the absence of a LCD makes no difference in this case).

drawBitmap: 1640 ns
unrolled: 340 ns
rboy: 220 ns

Thank you Myndale for a valuable tip on the AVR shifting slowness. I went from around 800 ns to 220 ns by cutting out the shifting. I had no idea that it was so bad.

The difference in speed (even vs. Myndales optimized putpixel) comes from the fact that I don't draw bit by bit. I operate on bytes at a time using shifts and masks. This means that all my drawing routines draw bytes at a time ie. there is no putpixel. It also means that I have different routines for horizontal and vertical orientation and that clipping with screen boundaries is more complicated.

As to why I'm not putting my code here is because its not a piece of singular code, its a lib of its own. In any case I want to release the work first in the form of a game. Unfortunately, I don't have much time to work on it at the moment.

Image
User avatar
jonnection
 
Posts: 317
Joined: Sun May 04, 2014 8:21 pm

Re: Display functions optimization

Postby NicK » Wed Jan 14, 2015 2:22 am

I would be really interested in getting some assembly language games going - if not just purely to understand the exact architecture of the Gamebuino. Any chance that anyone (Myndale??) on this board may be able to do a tutorial to explain the basics? I've read a fair few books on assembler for the AVR but also many for the old ZX Spectrum. Maybe I'm just trying to relive the old glory days.......
NicK
 
Posts: 27
Joined: Tue Dec 23, 2014 11:25 pm

Next

Return to Software Development

Who is online

Users browsing this forum: No registered users and 48 guests

cron