Having fun optimizing a sprite routine

Libraries, utilities, bootloaders...

Re: Having fun optimizing a sprite routine

Postby Myndale » Mon Aug 10, 2015 12:31 am

This is one of those cases where looking at the assembly that the compiler generates can help tremendously, even if you decide to stick with C. You can get the assembly by enabling verbose output for compile in the IDE and see where the temporary ELF is being generated, then run that through avr-objdump:

Code: Select all
C:/arduino/hardware/tools/avr/bin/avr-objdump -h -S  C:\Users\YOUR_USER_NAME_\AppData\Local\Temp\build12345678_or_something.tmp/yourapp.cpp.elf > yourapp.lst


This shows you the assembly locations of all the functions in your code, so you can load your HEX into Simbuino, place breakpoints at relevant points and time the exact number of cycles your code takes to run. To get accurate timing you need to disable interrupts while your code is running:

Code: Select all
noInterrupts();
ultraDraw3(face, x, y);
interrupts();


A quick search of the LST file shows where this is located in memory:

Code: Select all
     3be:   f8 94          cli
     3c0:   4d 2f          mov   r20, r29
     3c2:   6c 2f          mov   r22, r28
     3c4:   8a e0          ldi   r24, 0x0A   ; 10
     3c6:   91 e0          ldi   r25, 0x01   ; 1
     3c8:   0e 94 93 01    call   0x326   ; 0x326 <_Z10ultraDraw3Phcc>
     3cc:   78 94          sei


As for optimization, a few things stick out immediately:

  • the test for "if(y>=0)" is being done inside the loop, it only needs to be done at the start.
  • this loop body is only a few instructions long, so loop mechanics are going to have significant overhead. given that the loop executes exactly 8 times you'll be better off unrolling it.
  • keeping separate pointers for buf and buf2 isn't as useful on AVR as it is on other architectures due to the limited number of CPU registers that can be used for memory access. given that buf2 is always exactly 84 bytes ahead of buf you're better off using a single pointer with constant indexing.
  • arbitrary shifts on the AVRs are slooooow!

This last point is significant, look at the assembly shift statement for "*(buf2) |= data[i] >> (8-y%8);":

Code: Select all
     394:   8d 91          ld   r24, X+      ; load data[i] into r24
     396:   90 e0          ldi   r25, 0x00   ; signed-extend to 16-bits
     398:   02 2e          mov   r0, r18     ; load r0 with (8-y%8)
     39a:   02 c0          rjmp   .+4        ; skip next instruction (WHY??? pseudo-optimization perhaps?)
     39c:   95 95          asr   r25         ; shift r25:r24 right by one bit
     39e:   87 95          ror   r24
     3a0:   0a 94          dec   r0          ; we done with the loop yet
     3a2:   e2 f7          brpl   .-8        ; if not then keep going


AVRs can only shift one bit at a time, so the code has to use a loop to shift by (8-y%8). Secondly, the compiler is extending "data[i]" to 16-bits and doing the shift on that even though the top byte will always be 0. On AVRs you're much better off doing shifts with a multiplication instruction which takes only 2 cycles, and given that there are only 8 possible values we may as well store it in a look-up table:

Code: Select all
     const static byte rol_lut[] = {1<<0, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7}; // store this somewhere globally
     ....
     buf[0] |= *data * rol_lut[y & 7]; // do this once per pixel


This can also be used for right shifts, although in this case the result will be 16-bit and we need the upper 8 bits, which the compiler is smart enough to implement with a byte swap:

Code: Select all
     const static unsigned ror_lut[] = {0x100>>0, 0x100>>1, 0x100>>2, 0x100>>3, 0x100>>4, 0x100>>5, 0x100>>6, 0x100>>7};
     ....
     buf[84+0] |= (data[0] * ror_lut[y & 7]) >> 8; // do this once per pixel


So put this all together and here's a new version:

Code: Select all
void ultraDraw5(register byte * data, const char x, register char y) {
    if(y>=0) {
      register uint8_t* buf = ((y&0xF8)>>1) * 21 + gb.display.getBuffer() + x;
      y &= 7;
      register uint8_t scale1 = ror_lut[y];
      register uint16_t scale2 = ror_lut[y];
      buf[0] |= data[0] * scale1;
      buf[84+0] |= (data[0] * scale2) >> 8;
      buf[1] |= data[1] * scale1;
      buf[84+1] |= (data[1] * scale2) >> 8;
      buf[2] |= data[2] * scale1;
      buf[84+2] |= (data[2] * scale2) >> 8;
      buf[3] |= data[3] * scale1;
      buf[84+3] |= (data[3] * scale2) >> 8;
      buf[4] |= data[4] * scale1;
      buf[84+4] |= (data[4] * scale2) >> 8;
      buf[5] |= data[5] * scale1;
      buf[84+5] |= (data[5] * scale2) >> 8;
      buf[6] |= data[6] * scale1;
      buf[84+6] |= (data[6] * scale2) >> 8;
      buf[7] |= data[7] * scale1;
      buf[84+7] |= (data[7] * scale2) >> 8;
    }
}


According to Simbuino the original ultraDraw3 function executes in 621 cycles. With these changes alone the function now executes in 313.
Myndale
 
Posts: 507
Joined: Sat Mar 01, 2014 1:25 am

Re: Having fun optimizing a sprite routine

Postby Sorunome » Mon Aug 10, 2015 8:18 am

Sounds nice!
I knew unrolling would save some more but I tried to do it without unrolling :)

Wouldn't it be better for you to do data++ though, else it has to add always 1 to it, then 2 etc. always twice so that if you do data++ and then &(data)

Also in my asm version I do have enough registers for three pointers, i use X for buf, Y for buf2 and Z for data, so it all fits.

EDIT: oh, also, to those interested, I made my asm routine support full clipping, I bet it isn't that optimized now, though.
Code: Select all
void ultraDraw4(byte data[], char x, char y){
        uint8_t* buf = (((y+8)&0xF8)>>1) * 21 + x + gb.display.getBuffer();
        asm volatile(
        "mov R20,%[y]\n\t"
        "ldi R17,7\n\t"
        "add R20,R17\n\t"
        "brmi End\n\t"
        "cpi %[y],48\n\t"
        "brpl End\n\t"
       
        "inc R20\n\t"
        "ldi R16,8\n\t"
        "andi R20,7\n\t"
        "cpi R20,0\n\t"
        "breq LoopAligned\n"
        "LoopStart:\n\t"
         
          "tst %[x]\n\t"
          "brmi LoopSkip\n\t"
          "cpi %[x],84\n\t"
          "brcc LoopSkip\n\t"
         
          "ld R17,Z\n\t"
          "eor R18,R18\n\t"
          "mov R19,R20\n\t"
          "clc\n\t"
          "LoopShift:\n\t" // carry is still reset from the cpi instruction or from the dec
            "rol R17\n\t"
            "rol R18\n\t"
            "dec R19\n\t"
            "brne LoopShift\n\t"

          "tst %[y]\n\t"
          "brmi LoopSkipPart\n\t"
         
          "ld R19,X\n\t"
          "eor R19,R17\n\t"
          "st X,R19\n\t"
        "LoopSkipPart:\n\t"

          "cpi %[y],40\n\t"
          "brpl LoopSkip\n\t"
         
          "ld R19,Y\n\t"
          "eor R19,R18\n\t"
          "st Y,R19\n\t"

        "LoopSkip:\n\t"
          "eor R18,R18\n\t"
          "ldi R19,1\n\t"
          "add R26,R19\n\t" // INC DOESN'T CHANGE CARRY!
          "adc R27,R18\n\t"
          "add R28,R19\n\t"
          "adc R29,R18\n\t"
          "add R30,R19\n\t"
          "adc R31,R18\n\t"
         
          "inc %[x]\n\t"
          "dec R16\n\t"
          "brne LoopStart\n\t"
        "rjmp End\n"
        "LoopAligned:\n\t"
          "tst %[x]\n\t"
          "brmi LoopAlignSkip\n\t"
       
          "cpi %[x],84\n\t"
          "brcc LoopAlignSkip\n\t"
         
          "ld R17,Z\n\t"
          "ld R18,X\n\t"
          "eor R18,R17\n\t"
          "st X,R18\n\t"
        "LoopAlignSkip:\n\t"
          "ldi R18,1\n\t"
          "add R26,R18\n\t"
          "adc R27,R20\n\t"
          "add R30,R18\n\t"
          "adc R31,R20\n\t"
          "inc %[x]\n\t"
          "dec R16\n\t"
          "brne LoopAligned\n"
        "End:\n\t"
        ::"x" (buf - 84),"y" (buf),"z" (data),[y] "r" (y),[x] "r" (x):"r16","r17","r18","r19","r20");
}
Last edited by Sorunome on Mon Mar 21, 2016 6:02 pm, edited 1 time in total.
User avatar
Sorunome
 
Posts: 629
Joined: Sun Mar 01, 2015 1:58 pm

Re: Having fun optimizing a sprite routine

Postby Myndale » Tue Aug 11, 2015 12:39 am

Sorunome wrote:Wouldn't it be better for you to do data++ though, else it has to add always 1 to it, then 2 etc. always twice so that if you do data++ and then &(data)


You'd think so, wouldn't you? But this is GCC/AVR, which is a poor compiler at the best of times. In practice it generates virtually identical code in both cases except for this little gem:

Code: Select all
    3be:   23 2f         mov   r18, r19         ; previous *buf is in r19, "save" it to r18
    3c0:   33 27         eor   r19, r19         ; clear r19...umm...for what purpose, exactly?
    3c2:   24 2b         or   r18, r20          ; "or" *data pixels with *buf pixels
    3c4:   28 83         st   Y, r18            ; save result back into *buf
    3c6:   12 96         adiw   r26, 0x02       ; advance data ptr
    3c8:   3c 91         ld   r19, X            ; load next *buf pixel into r19. (wtf?)


This looks very much to me like a bug in the optimizer. For some reason the compiler thinks it needs to clear r19 (which wastes 1 cycle) so it transfers its value into r18 (which wastes another). Net effect is a total of 16 wasted cycles. You'll also notice that it advances data (r26) by 2 bytes instead of 1, if you look at the code you'll see it actually modifies the value of data in your original version a total of 4 times: twice by +2, once by -2 and once more by -1 for a net total of +1 all up. And then it compensates for the ptr being out by 2 bytes by using the offset forms of LDD which are 3 cycles instead of 1. Madness!
Myndale
 
Posts: 507
Joined: Sat Mar 01, 2014 1:25 am

Re: Having fun optimizing a sprite routine

Postby jonnection » Tue Aug 11, 2015 5:13 am

Wow. horror.

I wonder if the fact that avr has no positive accumulation in hardware has any effect

If you set *data to data[8] and then data-- (draw bytes in reverse, right to left) will the compiler just dec the pointer or do fancy dance moves again? I'd try but unfortunately no time for that :(
User avatar
jonnection
 
Posts: 317
Joined: Sun May 04, 2014 8:21 pm

Re: Having fun optimizing a sprite routine

Postby Sorunome » Tue Aug 11, 2015 8:44 am

Myndale wrote:
Sorunome wrote:Wouldn't it be better for you to do data++ though, else it has to add always 1 to it, then 2 etc. always twice so that if you do data++ and then &(data)


You'd think so, wouldn't you? But this is GCC/AVR, which is a poor compiler at the best of times. In practice it generates virtually identical code in both cases except for this little gem:

Code: Select all
    3be:   23 2f         mov   r18, r19         ; previous *buf is in r19, "save" it to r18
    3c0:   33 27         eor   r19, r19         ; clear r19...umm...for what purpose, exactly?
    3c2:   24 2b         or   r18, r20          ; "or" *data pixels with *buf pixels
    3c4:   28 83         st   Y, r18            ; save result back into *buf
    3c6:   12 96         adiw   r26, 0x02       ; advance data ptr
    3c8:   3c 91         ld   r19, X            ; load next *buf pixel into r19. (wtf?)


This looks very much to me like a bug in the optimizer. For some reason the compiler thinks it needs to clear r19 (which wastes 1 cycle) so it transfers its value into r18 (which wastes another). Net effect is a total of 16 wasted cycles. You'll also notice that it advances data (r26) by 2 bytes instead of 1, if you look at the code you'll see it actually modifies the value of data in your original version a total of 4 times: twice by +2, once by -2 and once more by -1 for a net total of +1 all up. And then it compensates for the ptr being out by 2 bytes by using the offset forms of LDD which are 3 cycles instead of 1. Madness!

The compiler could have just used st Y+,r18 that way Y would increase in one cycle..... :P
User avatar
Sorunome
 
Posts: 629
Joined: Sun Mar 01, 2015 1:58 pm

Re: Having fun optimizing a sprite routine

Postby deeph » Wed Dec 23, 2015 4:22 pm

Is there an easy way to use it like that too :

Code: Select all
const byte tiles[]={
  0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
  0x00,0x00,0x04,0x20,0x00,0x00,0x08,0x00,
  0x40,0x82,0x14,0x20,0x00,0x11,0x22,0x00,
  0x00,0x00,0x00,0x00,0x20,0x12,0x54,0x00,
  0x00,0x00,0x00,0x00,0x10,0x12,0x54,0x00,
  0xff,0xfd,0xdf,0xaf,0xfb,0xf5,0xff,0xdf,
  0xfd,0xdf,0xaf,0xfb,0xf5,0xff,0xdf,0xff,
  0x18,0x24,0x4a,0x67,0xdb,0xbf,0xdf,0x7e,
  0x7e,0x81,0xa9,0xb5,0x81,0x7e,0x18,0x18,
  0x03,0x1c,0x28,0x20,0x54,0x48,0x60,0x80,
  0xc1,0xb0,0x6a,0x57,0x3b,0x1e,0x04,0x03,
  0xc0,0x38,0x14,0x04,0x0a,0xa2,0x16,0xab,
  0x55,0x29,0x56,0xbe,0xec,0x78,0x20,0xc0 };

  [...]
  ultraDraw4(tiles[tile_num*TILE_HEIGHT], x, y);
  [...]

I'm trying to speed up my tilemapper, and as I don't know avr asm this could really be useful :)
deeph
 
Posts: 52
Joined: Mon Jul 13, 2015 6:09 am
Location: France

Re: Having fun optimizing a sprite routine

Postby Sorunome » Mon Dec 28, 2015 3:08 am

No, sorry, this sprite routine is only working for 8x8 sprites, everything else would require a whole different routine, this routine is optimized ONLY for drawing 8x8 sprites ;)


Or what do you mean with "something like that"?
User avatar
Sorunome
 
Posts: 629
Joined: Sun Mar 01, 2015 1:58 pm

Re: Having fun optimizing a sprite routine

Postby deeph » Mon Dec 28, 2015 8:35 am

The sprites (tiles) are 8*8 but I'd like to give an array with an offset as a parameter of the routine, instead of a pointer. And by the way, I think it should be easy to allow the use of a custom height, no ?
deeph
 
Posts: 52
Joined: Mon Jul 13, 2015 6:09 am
Location: France

Re: Having fun optimizing a sprite routine

Postby Sorunome » Mon Dec 28, 2015 2:23 pm

Ironically a custom width would be easy, the way the screen is built.
As per multiple 8x8 prites, you should be able do do something like pointer+8 to get the next sprite.
User avatar
Sorunome
 
Posts: 629
Joined: Sun Mar 01, 2015 1:58 pm

Re: Having fun optimizing a sprite routine

Postby deeph » Sat Jan 02, 2016 12:36 pm

I get several errors while trying to compile :

Assembler messages:
Error: register number above 15 required
Error: symbol `LoopStart' is already defined
Error: register number above 15 required
Error: symbol `LoopShift' is already defined
Error: symbol `LoopSkipPart' is already defined
Error: register number above 15 required
Error: symbol `LoopSkip' is already defined
Error: symbol `LoopAligned' is already defined
Error: register number above 15 required
Error: symbol `LoopAlignSkip' is already defined
Error: symbol `End' is already defined
Error: register number above 15 required
Error: symbol `LoopStart' is already defined
Error: register number above 15 required
Error: symbol `LoopShift' is already defined
Error: symbol `LoopSkipPart' is already defined
Error: register number above 15 required
Error: symbol `LoopSkip' is already defined
Error: symbol `LoopAligned' is already defined
Error: register number above 15 required
Error: symbol `LoopAlignSkip' is already defined
Error: symbol `End' is already defined


It seems that some instructions are not understood ?

edit : I solved the labels errors using this : http://forum.arduino.cc/index.php?topic=51165.0 , so now I have this :

Code: Select all
void draw_sprite(const byte data[], char x, char y){  // routine by Sorunome
  uint8_t* buf = (((y+8)&0xF8)>>1) * 21 + x + gb.display.getBuffer();
  asm volatile(
  "mov R20,%[y]\n\t"
  "ldi R17,7\n\t"
  "add R20,R17\n\t"
  "brmi 6f\n\t"
  "cpi %[y],48\n\t"
  "brpl 6f\n\t"
 
  "inc R20\n\t"
  "ldi R16,8\n\t"
  "andi R20,7\n\t"
  "cpi R20,0\n\t"
  "breq 4f\n"
  "0:\n\t"
   
   "tst %[x]\n\t"
   "brmi 3f\n\t"
   "cpi %[x],84\n\t"
   "brcc 3f\n\t"
   
   "ld R17,Z\n\t"
   "eor R18,R18\n\t"
   "mov R19,R20\n\t"
   "clc\n\t"
   "1:\n\t" // carry is still reset from the cpi instruction or from the dec
   "rol R17\n\t"
   "rol R18\n\t"
   "dec R19\n\t"
   "brne 1b\n\t"

   "tst %[y]\n\t"
   "brmi 2f\n\t"
   
   "ld R19,X\n\t"
   "or R19,R17\n\t"
   "st X,R19\n\t"
  "2:\n\t"

   "cpi %[y],48\n\t"
   "brpl 3f\n\t"
   
   "ld R19,Y\n\t"
   "or R19,R18\n\t"
   "st Y,R19\n\t"

  "3:\n\t"
   "eor R18,R18\n\t"
   "ldi R19,1\n\t"
   "add R26,R19\n\t" // INC DOESN'T CHANGE CARRY!
   "adc R27,R18\n\t"
   "add R28,R19\n\t"
   "adc R29,R18\n\t"
   "add R30,R19\n\t"
   "adc R31,R18\n\t"
   
   "inc %[x]\n\t"
   "dec R16\n\t"
   "brne 0b\n\t"
  "rjmp 6f\n"
  "4:\n\t"
   "tst %[x]\n\t"
   "brmi 5f\n\t"
 
   "cpi %[x],84\n\t"
   "brcc 5f\n\t"
   
   "ld R17,Z\n\t"
   "ld R18,X\n\t"
   "eor R18,R17\n\t"
   "st X,R18\n\t"
  "5:\n\t"
   "ldi R18,1\n\t"
   "add R26,R18\n\t"
   "adc R27,R20\n\t"
   "add R30,R18\n\t"
   "adc R31,R20\n\t"
   "inc %[x]\n\t"
   "dec R16\n\t"
   "brne 4b\n"
  "6:\n\t"
  ::"x" (buf - 84),"y" (buf),"z" (data),[y] "r" (y),[x] "r" (x):"r16","r17","r18","r19","r20");
}


But I still get the register numbers errors... I think that the parameters x & y are put in a register numbered under 15, so is there a way to bypass this, knowing that every other register seems to be already used in the routine ?
deeph
 
Posts: 52
Joined: Mon Jul 13, 2015 6:09 am
Location: France

PreviousNext

Return to Software Development

Who is online

Users browsing this forum: No registered users and 2 guests

cron