I want you all to be aware that there is another limiting factor that needs to be taken into consideration before any big work on optimizations is done, especially optimizations that will affect everything from bitmap bit order to screen rotation.
That factor is the response time of a twisted-nematic (TN) liquid crystal display, which is what the Nokia 5110 LCD is. According to a Fujitsu spec sheet I have, it is around 60 ms (all TN displays are similar in this respect). http://www.fujitsu.com/downloads/MICRO/fma/pdf/LCD_Backgrounder.pdf
A 60 ms response time corresponds to 1 / 0.060 s = 16.67 Hz, or approximately 17 frames per second.
I have also seen this in practice: in "Isle of maniax" I am drawing white houses on top of the black horizon. The end result is that the black horizon is "showing through" the buildings. This is because the LCD response time (60 ms) means that the black pixels do not have time to "turn off" before they are supposed to be white. The result is a blurry mess and it doesn't look nice. You will not see this effect on the Simbuino emulator, only on the real hardware.
What I am basically getting at here is that at 350 ns (16x16 drawbitmap with Myndale's optimized putpixel routine), you can redraw the whole Gamebuino screen (84x48 pixels) so many times per second that the speed of the routine does not have any practical meaning:
total screen pixels = 84 * 48 = 4032
test bitmap pixels = 16 * 16 = 256
bitmaps needed to paint the whole screen once = 4032 / 256 = 15.75
time to paint the screen with 16x16 bitmaps = 15.75 * 350 ns = 0.0000055125 s
fps limited by the 16x16 bitmap painting routine = 1 s / 0.0000055125 s = 181405.9 fps
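For anyone who wants to plug in their own numbers, here is the same arithmetic as a small Python sketch. The function and constant names are made up for illustration, the 350 ns figure is simply taken from the measurement quoted above, and the 60 ms figure is the spec sheet value, so treat the output as a rough estimate rather than a benchmark.

```python
# Rough comparison of the two frame-rate limits discussed above.
# All names here are hypothetical; the numbers come from the post itself.

SCREEN_W, SCREEN_H = 84, 48   # Gamebuino / Nokia 5110 resolution in pixels
BMP_W, BMP_H = 16, 16         # size of the test bitmap in pixels
BMP_DRAW_S = 350e-9           # assumed time for one 16x16 drawbitmap call (350 ns)
LCD_RESPONSE_S = 60e-3        # assumed TN response time from the spec sheet (60 ms)


def draw_limited_fps(bmp_draw_s=BMP_DRAW_S):
    """Frames per second if only the bitmap routine were the bottleneck."""
    bitmaps_per_screen = (SCREEN_W * SCREEN_H) / (BMP_W * BMP_H)
    return 1.0 / (bitmaps_per_screen * bmp_draw_s)


def lcd_limited_fps(response_s=LCD_RESPONSE_S):
    """Frames per second the panel can show if each frame needs one full response time."""
    return 1.0 / response_s


def usable_fps(bmp_draw_s=BMP_DRAW_S, response_s=LCD_RESPONSE_S):
    """The slower of the two limits is the one that matters on real hardware."""
    return min(draw_limited_fps(bmp_draw_s), lcd_limited_fps(response_s))


print("draw-limited fps:", round(draw_limited_fps(), 1))
print("LCD-limited fps: ", round(lcd_limited_fps(), 2))
print("usable fps:      ", round(usable_fps(), 2))
```

Calling `usable_fps(150e-9)` instead of the default gives the same result as `usable_fps(350e-9)`, because in both cases the LCD limit is the smaller number.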
So, even with a 350 ns routine, you could achieve a theoretical 181,000 frames per second. Clearly, your LCD is not going to keep up with that. At this point (really, I am not kidding) whether you have a 350 ns or a 150 ns drawing routine doesn't make any difference. You won't be able to use that speed in a meaningful way.
These calculations all assume that Myndale's timing measurement in his demo is correct, which I think it is.