This has been a rather slow week: my exams are starting up, and I only found a bit of time to start looking into the slowness that still persists with this engine.
The major source of slowness at the moment stems from the fact that the SDL backend in ScummVM seems to want to down-convert 32bpp to 16bpp. Since the engine is still doing full-screen updates, that means a full-screen down-conversion for every frame, which takes quite a bit of time. This can be avoided by using the OpenGL backend. That backend still has problems of its own (like not supporting the PixelFormat I originally started out with, RGBA), but it does a decent job as a way of testing the other parts of the drawing pipeline without drowning them out in conversion.
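To give an idea of why a full-screen down-conversion hurts, here is a minimal sketch of what such a conversion has to do for every single pixel. This is not the actual ScummVM backend code: the function names are made up for illustration, and I'm assuming an RGBA8888 source (R in the top byte) and an RGB565 target, which may not be the exact formats involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: pack one 32bpp RGBA8888 pixel into 16bpp RGB565
// by keeping the top 5/6/5 bits of R/G/B and dropping alpha.
inline uint16_t rgba8888ToRgb565(uint32_t px) {
    const uint32_t r = (px >> 24) & 0xFF;
    const uint32_t g = (px >> 16) & 0xFF;
    const uint32_t b = (px >> 8) & 0xFF;
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Hypothetical full-frame conversion: with full-screen updates this loop
// runs over every pixel of the screen, every frame.
void convertFrame(const uint32_t *src, uint16_t *dst, size_t pixelCount) {
    assert(src != nullptr && dst != nullptr);
    for (size_t i = 0; i < pixelCount; ++i)
        dst[i] = rgba8888ToRgb565(src[i]);
}
```

Each pixel only costs a handful of shifts and masks, but the cost scales with the full screen area rather than with what actually changed.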
Getting the down-conversion out of the way revealed that the single blitting function I'd been using was indeed rather heavy. I had already split it into two separate functions earlier, one supporting colour-masking and one without, which made it easier to see in a profiler how much of the drawing actually used colour-masking (very little). I have now added a third function for handling opaque drawing, which helped quite a bit on the speed (skipping at least 3 multiplications, 3 shifts and 6 additions per pixel).
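The difference between the two paths can be sketched roughly like this. It is an illustration only, not the engine's real blitting code: the function names, the RGBA8888 layout (alpha in the low byte) and the particular blend formula are all my assumptions here, and the exact operation count depends on how the blend is formulated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Blend one 8-bit channel, approximating dst + (src - dst) * alpha / 256.
// Adding (dst << 8) before the shift keeps the intermediate non-negative.
inline int blendChannel(int src, int dst, int alpha) {
    return ((dst << 8) + (src - dst) * alpha) >> 8;
}

// Hypothetical alpha path: every pixel pays for three channel blends.
void blitAlpha(const uint32_t *src, uint32_t *dst, size_t count) {
    assert(src != nullptr && dst != nullptr);
    for (size_t i = 0; i < count; ++i) {
        const uint32_t s = src[i], d = dst[i];
        const int a = s & 0xFF;
        const uint32_t r = blendChannel((s >> 24) & 0xFF, (d >> 24) & 0xFF, a);
        const uint32_t g = blendChannel((s >> 16) & 0xFF, (d >> 16) & 0xFF, a);
        const uint32_t b = blendChannel((s >> 8) & 0xFF, (d >> 8) & 0xFF, a);
        dst[i] = (r << 24) | (g << 16) | (b << 8) | 0xFF;
    }
}

// Hypothetical opaque path: no per-channel arithmetic at all, just a copy.
void blitOpaque(const uint32_t *src, uint32_t *dst, size_t count) {
    assert(src != nullptr && dst != nullptr);
    std::memcpy(dst, src, count * sizeof(uint32_t));
}
```

When a surface is known to be fully opaque, picking the second function up front avoids all of the per-pixel blending work.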
For Dirty Split, the numbers I get on my computer are roughly as follows (using the OpenGL backend):
27 % Opaque-draws
20 % Scale
9 % Alpha-blitting
The percentages are of the total CPU load for the process (which can now actually differ from the CPU load for the entire core it runs on, at max FPS). There is obviously room for improvement here, in particular in the scaling. As a result of the changes I made to implement dirty rects, scaling ended up being done for every draw: the non-dirty-rect version would create renderTickets that were instantly drawn and then discarded, and since the renderTickets were responsible for keeping the scaled copies, everything was rescaled every frame. Thus I ended up partially adding renderTickets to the full-screen update system:
Upon a render call, the render-queue is checked for a matching render-ticket. If one exists, that ticket is used for drawing; otherwise, a new ticket is generated, drawn and added to the render-queue. Any unused ticket in the queue (a misnomer really, as it's really a list, but hey...) is deleted when the screen-buffer is flipped onto the screen.
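The ticket reuse described above can be sketched as follows. This is a simplified stand-in, not the engine's actual classes: the RenderTicket fields, the RenderQueue name and the matching criteria (surface id plus scaled size) are assumptions made for the example, and a counter stands in for the expensive rescale.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical ticket: identifies a draw and would own the scaled copy.
struct RenderTicket {
    int surfaceId;       // which source image is being drawn
    int width, height;   // scaled size on screen
    bool usedThisFrame;  // set whenever a render call hits this ticket
};

class RenderQueue {
public:
    // Reuse a matching ticket if there is one, otherwise create (and "scale") a new one.
    RenderTicket &request(int surfaceId, int width, int height) {
        assert(width > 0 && height > 0);
        for (RenderTicket &t : _tickets) {
            if (t.surfaceId == surfaceId && t.width == width && t.height == height) {
                t.usedThisFrame = true;
                return t;
            }
        }
        ++_scaleCount;  // stands in for the costly rescale of the source image
        _tickets.push_back({surfaceId, width, height, true});
        return _tickets.back();
    }

    // Called when the screen-buffer is flipped: drop tickets nobody asked for.
    void endFrame() {
        _tickets.remove_if([](const RenderTicket &t) { return !t.usedThisFrame; });
        for (RenderTicket &t : _tickets)
            t.usedThisFrame = false;
    }

    size_t size() const { return _tickets.size(); }
    int scaleCount() const { return _scaleCount; }

private:
    std::list<RenderTicket> _tickets;
    int _scaleCount = 0;
};
```

As long as the same image is requested at the same size on consecutive frames, the scaled copy survives the flip and the rescale only happens once.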
This gives the benefit of not having to rescale every frame. The results after doing that aren't exactly comparable, as I didn't set up a specific enough test (I simply watched a bit of the intro movie, and then let it settle for a while in the first frame), but at least the scaling is almost nowhere to be seen:
25.8 % Alpha-blits
This makes for quite an improvement, and may make the games playable on somewhat slower computers than has been the case so far.