8bit->32bit blit in 2.66 cycles/pixel on Pentium!
Jan Vondrak #1 / 10
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!
Quote:
> I've been optimizing a function that blits from an 8-bit bitmap to a
> 32-bit bitmap using a clut, and I've come up with the following code,
> optimized for the Pentium.
> I _think_ I've got the inner loop down to 8 cycles per group of 3 pixels,
> or 2.66 cycles/pixel, but this is the first time I've tried to optimize
> for pairing, so I'm not sure my calculations are correct, especially
> since I haven't done any instruction-level timing.
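[For reference, the operation under discussion is one palette lookup per
pixel. A minimal, unoptimized C restatement of such a blit might look like
the sketch below; the identifiers (src, dst, clut, len) are illustrative
and not taken from the original post.]

---8<----------------------
#include <stdint.h>
#include <stddef.h>

/* Plain 8-bit -> 32-bit CLUT blit: one table lookup per pixel. */
void blit8to32(const uint8_t *src, uint32_t *dst,
               const uint32_t clut[256], size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = clut[src[i]];
}
---8<----------------------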
I think your code pairs almost perfectly, with the exception of the end of
the loop:

        dec   ecx
        jgz   ...

doesn't pair on a P5. The only instructions pairing with a conditional
jump despite its dependency on their result are add & cmp. So you're
better off using an add or cmp instead of dec, or rearranging the code so
as to make the jump pair with an independent instruction.

Quote:
> Can anyone do this faster?
The conversion code itself requires 1.5 cycles per pixel (load, convert,
store). Unrolling the loop 4 times, I can do it in 1.75 cycles/pixel:

        ; EAX = 8-bit pixel #1
        ; EBX = 8-bit pixel #2
        ; ECX = negative counter
        ; EDX = 32-bit pixel
        ; ESI = source
        ; EDI = destination
        ; EBP = CLUT

        xor   eax, eax
        mov   al, [esi+ecx]
        xor   ebx, ebx
        mov   bl, [esi+ecx+1]
        mov   edx, [ebp+eax*4]
next4:
        mov   [edi+ecx*4], edx      ; cycle 1
        mov   al, [esi+ecx+2]
        mov   edx, [ebp+ebx*4]      ; cycle 2
        mov   bl, [esi+ecx+3]
        mov   [edi+ecx*4+4], edx    ; cycle 3
        mov   edx, [ebp+eax*4]
        mov   al, [esi+ecx+4]       ; cycle 4
        mov   [edi+ecx*4+8], edx
        mov   edx, [ebp+ebx*4]      ; cycle 5
        mov   bl, [esi+ecx+5]
        mov   [edi+ecx*4+12], edx   ; cycle 6
        add   ecx, 4
        mov   edx, [ebp+eax*4]      ; cycle 7
        jnz   next4

--
Jan Vondrak
Fri, 28 May 1999 03:00:00 GMT
Martin Sinot #2 / 10
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!
Quote:
> I think your code pairs almost perfectly, with the exception of the
> end of the loop:
>         dec   ecx
>         jgz   ...
> doesn't pair on a P5. The only instructions pairing with a conditional
> jump despite its dependency on their result are add & cmp.
> So you're better off using an add or cmp instead of dec, or
> rearranging the code so as to make the jump pair with an independent
> instruction.
According to Intel's spec, any instruction pairable in the u-pipe writing
to the flags can be paired with a Jcc instruction in the v-pipe. To prove
this, I ran the following test on a Pentium 90:

        mov   ecx, 90000000
L1:     add   eax, 1
        add   edx, 2
        dec   ecx
        jnz   L1

which took about 2 seconds to complete, as expected (the loop takes 2
cycles if all instructions are pairable, times 90e6 = 180e6 cycles, which
takes 2 seconds on a Pentium 90). Changing dec ecx into add ecx,-1 didn't
make any difference (the instructions are still pairable). If dec ecx were
not pairable with jnz, it would be a 3-cycle loop (and thus the program
would take 3 seconds), because Jcc is only pairable in the v-pipe, but it
would end up in the u-pipe because of its nonpairability with dec ecx.

--
Martin Sinot
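[A small harness can reproduce this kind of measurement. The sketch below
is my addition, not part of the original post: it wraps the same
four-instruction loop in GNU-style x86 inline assembly and times it with
clock(). The ~2-second figure only applies on a 90 MHz Pentium; on any
other CPU the absolute time will differ, but the pairing comparison is the
same.]

---8<----------------------
#include <stdio.h>
#include <time.h>

/* Hypothetical timing harness for the pairing test above.
   On a Pentium 90: ~2 s if dec/jnz pair, ~3 s if they do not. */
int main(void)
{
    unsigned iters = 90000000u;
    clock_t t0 = clock();
    __asm__ volatile(
        "1:                \n\t"
        "addl  $1, %%eax   \n\t"
        "addl  $2, %%edx   \n\t"
        "decl  %0          \n\t"
        "jnz   1b          \n\t"
        : "+r"(iters)
        :
        : "eax", "edx", "cc");
    clock_t t1 = clock();
    printf("%.2f seconds\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}
---8<----------------------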
Fri, 28 May 1999 03:00:00 GMT
Juan Carlos Arevalo Baeza #3 / 10
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!
Quote:
> I _think_ I've got the inner loop down to 8 cycles per group of 3 pixels,
> or 2.66 cycles/pixel, but this is the first time I've tried to optimize
> for pairing, so I'm not sure my calculations are correct, especially
> since I haven't done any instruction-level timing.
Cool.

Quote:
> Can anyone do this faster?
Not that I know. But on the PPro it would suck! (no offense intended)

---8<----------------------
dothree:                   ; Pentium pipes   PPro uops
 mov   eax, [ebp+eax*4]    ; U 1             1 (2)
 xor   ebx, ebx            ;                 1 (01)
 mov   bl, [esi+1]         ; U 1             1 (2)
 xor   edx, edx            ;                 1 (01)
 mov   dl, [esi+2]         ; U 1             1 (2)
 mov   [edi], eax          ;                 1 (3,4)
 mov   ebx, [ebp+ebx*4]    ; U 1             1 (2)
 add   esi, 3              ;                 1 (01)
 mov   [edi+4], ebx        ; U 1             1 (3,4)
 mov   edx, [ebp+edx*4]    ;                 1 (2)
 mov   [edi+8], edx        ; U 1             1 (3,4)
 xor   eax, eax            ;                 1 (01)
 mov   al, [esi]           ; U 1             1 (2)
 add   edi, 12             ;                 1 (01)
 dec   ecx                 ; U 1             1 (01)
 jg    dothree             ;                 1 (1)
---8<----------------------

This would do better in the PPro:

1 - The byte operations use one of the hardcoded-optimized sequences
    (otherwise it's awfully slow, probably worse than a Pentium of the
    same clock speed).

2 - Execution unit bottleneck is #2, with 6 uops per loop (6 cycles
    minimum), but there's 19 total uops (that's 6.3 cycles minimum), so
    there's no actual bottleneck. That means it's 2.1 c/pix on a PPro.
    Not so good, huh? :)

3 - A PPro-specific version would reduce to 3.75 the uses of EU #2
    (loads) by loading a DWORD at a time, and using EU #0 to move bytes
    into the pointers, using MOVZX instructions.

Anybody want to add or comment anything? I've been (evidently) messing
around with the PPro, and although the documentation is far worse than
for the Pentium, I think I figured out this much right.
Sat, 29 May 1999 03:00:00 GMT
Jan Vondrak #4 / 10
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!
Quote:
> According to Intel's spec, any instruction pairable in the u-pipe writing
> to the flags can be paired with a Jcc instruction in the v-pipe. To prove
> this, I ran the following test on a Pentium 90:
>         mov   ecx, 90000000
> L1:     add   eax, 1
>         add   edx, 2
>         dec   ecx
>         jnz   L1
> which took about 2 seconds to complete, as expected
Hmm, my apologies. I ran the same test and indeed, jnz does pair with dec,
as well as with some other instructions. May I ask which specs you're
referring to? I was misled by the Intel document "Optimizations for
Intel's 32-bit Processors", where only cmp+jcc & add+jne are mentioned as
special pairs.

--
Jan Vondrak
Sun, 30 May 1999 03:00:00 GMT
Terje Mathisen #5 / 10
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!
Quote:
> I've been optimizing a function that blits from an 8-bit bitmap to a
> 32-bit bitmap using a clut, and I've come up with the following code,
> optimized for the Pentium.
> I _think_ I've got the inner loop down to 8 cycles per group of 3 pixels,
> or 2.66 cycles/pixel, but this is the first time I've tried to optimize
> for pairing, so I'm not sure my calculations are correct, especially
> since I haven't done any instruction-level timing.
> Can anyone do this faster?

[snip]

> dothree:
>     mov   eax, [ebp+eax*4]  ; 1 lookup pix1 color
>     mov   bl, [esi+1]       ; 0 load pix2
>     mov   [edi], eax        ; 1 store pix1
>     xor   eax, eax          ; 0 clear pix1 for next time
>     mov   dl, [esi+2]       ; 1 load pix3
>     mov   ebx, [ebp+ebx*4]  ; 0 lookup pix2 color
>     add   esi, 3            ; 1 bump source pointer 3 pixels
>     mov   [edi+4], ebx      ; 0 store pix2
>     mov   edx, [ebp+edx*4]  ; 1 lookup pix3 color
>     xor   ebx, ebx          ; 0 clear pix2 for next time
>     mov   [edi+8], edx      ; 1 store pix3
>     xor   edx, edx          ; 0 clear pix3 for next time
>     mov   al, [esi]         ; 1 lookup pix1 color
>     add   edi, 12           ; 0 bump dest pointer 3 pixels
>     dec   ecx               ; 1 dec loop counter
>     jg    dothree           ; 0 repeat if not done
Your inner loop does use 16 paired ops/8 cycles. To figure out the best
possible result, we must check out what you're actually doing:

1. Load the source pixel
2. Look up the 32-bit value
3. Store the 32-bit result

Since this is 3 simple operations, it should be possible to get very
close to 1.5 cycles/pixel:

        xor   eax, eax
        xor   ebx, ebx
next_block:
SRC_OFFSET = 0
DST_OFFSET = 0
        REPT 4
        mov   [edi+DST_OFFSET], ebp    ; U Save a destination pixel
        mov   ebp, clut[eax*4]         ; V Lookup next 32-bit pixel (AL has 8-bit value)
        mov   al, [esi+SRC_OFFSET]     ; U Load the following pixel
        mov   [edi+DST_OFFSET+4], edx  ; V Next dest
        mov   edx, clut[ebx*4]         ; U Next CLUT value
        mov   bl, [esi+SRC_OFFSET+1]   ; V Second source pixel
SRC_OFFSET = SRC_OFFSET + 2
DST_OFFSET = DST_OFFSET + 8
        ENDM
        lea   esi, [esi+SRC_OFFSET]
        lea   edi, [edi+DST_OFFSET]
        sub   ecx, SRC_OFFSET
        jae   next_block

This version handles 8 (N*2) pixels per iteration of the inner loop and
uses N*3 + 2 = 14 cycles, so it corresponds to 1.75 cycles/pixel.

The key difference from your version is that I zero the byte-pixel
registers once, outside the inner loop. It is also possible to make the
inner block handle 6 instead of 2 pixels; this way the load-use distance
is increased further. The portable C code would look like this:

  uint32 b0, b1, *dst;
  uint8 a, b, c, *src;

  do {
      dst[0] = b0; b0 = clut[a]; a = src[0];
      dst[1] = b1; b1 = clut[b]; b = src[1];
      dst[2] = b0; b0 = clut[c]; c = src[2];
      dst[3] = b1; b1 = clut[a]; a = src[3];
      dst[4] = b0; b0 = clut[b]; b = src[4];
      dst[5] = b1; b1 = clut[c]; c = src[5];
      src += 6;
      dst += 6;
  } while ((len -= 6) >= 0);

The key bottleneck in this algorithm is the time from when a source pixel
is loaded until it can be used for addressing into the 8->32 CLUT. On a
Pentium this is just a single cycle, but a PPro, as well as (almost?) all
RISC CPUs, requires multiple cycles even when hitting the L1 cache.

Terje
--
Using self-discipline, see http://www.eiffel.com/discipline "almost all programming can be viewed as an exercise in caching"
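[One detail the C fragment above leaves implicit is the software-pipeline
skew: each dst[k] store writes a color that was looked up two statements
earlier, from a pixel that was loaded five statements earlier, so the
variables have to be primed before the loop and drained afterwards. Below
is one possible way to spell that out; it is my own sketch rather than
anything from the post, it assumes len >= 5, and all identifiers are
illustrative.]

---8<----------------------
#include <stdint.h>
#include <stddef.h>

/* Software-pipelined CLUT blit in the spirit of the loop above, with the
   priming and drain made explicit. Assumes len >= 5. */
void blit8to32_pipelined(const uint8_t *src, uint32_t *dst,
                         const uint32_t clut[256], size_t len)
{
    uint32_t b0, b1;
    uint8_t a, b, c;
    size_t n = len;           /* pixels not yet stored */

    /* Prime: b0/b1 hold the colors for the next two stores,
       a/b/c hold the next three source pixels. */
    a = src[0];
    b = src[1];
    c = src[2];
    b0 = clut[a]; a = src[3];
    b1 = clut[b]; b = src[4];

    /* Main loop: 6 pixels per iteration, loads run 5 pixels ahead of stores. */
    while (n >= 11) {
        dst[0] = b0; b0 = clut[c]; c = src[5];
        dst[1] = b1; b1 = clut[a]; a = src[6];
        dst[2] = b0; b0 = clut[b]; b = src[7];
        dst[3] = b1; b1 = clut[c]; c = src[8];
        dst[4] = b0; b0 = clut[a]; a = src[9];
        dst[5] = b1; b1 = clut[b]; b = src[10];
        src += 6; dst += 6; n -= 6;
    }

    /* Drain the five pixels still in flight. */
    dst[0] = b0;
    dst[1] = b1;
    dst[2] = clut[c];
    dst[3] = clut[a];
    dst[4] = clut[b];
    src += 5; dst += 5; n -= 5;

    /* Any remaining pixels, the simple way. */
    while (n--)
        *dst++ = clut[*src++];
}
---8<----------------------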
Sun, 30 May 1999 03:00:00 GMT
Terje Mathisen #6 / 10
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!
[snip]

Quote:
> This would do better in the PPro:
> 1 - The byte operations use one of the hardcoded-optimized sequences
>     (otherwise it's awfully slow, probably worse than a Pentium of the
>     same clock speed).
> 2 - Execution unit bottleneck is #2, with 6 uops per loop (6 cycles
>     minimum), but there's 19 total uops (that's 6.3 cycles minimum), so
>     there's no actual bottleneck. That means it's 2.1 c/pix on a PPro.
>     Not so good, huh? :)
> 3 - A PPro-specific version would reduce to 3.75 the uses of EU #2
>     (loads) by loading a DWORD at a time, and using EU #0 to move bytes
>     into the pointers, using MOVZX instructions.
> Anybody want to add or comment anything? I've been (evidently) messing
> around with the PPro, and although the documentation is far worse than
> for the Pentium, I think I figured out this much right.
You're right, almost all heavily optimized table-driven code is
load-limited on a PPro. It is unfortunately impossible to get around this
in a way that is still Pentium-optimized, but combining four source
pixels into a single load is a good start. I'd do something like this:

next4:
        mov   eax, [esi]
        xor   ebx, ebx
        mov   bl, al
        xor   ecx, ecx
        mov   cl, ah
        lea   esi, [esi+4]
        shr   eax, 16
        xor   edx, edx
        mov   dl, ah
        mov   ebx, clut[ebx*4]
        and   eax, 255
        ; Source pixels are in EBX, ECX, EAX, EDX
        mov   ecx, clut[ecx*4]
        mov   [edi], ebx            ; pixel 0
        mov   edx, clut[edx*4]
        mov   [edi+4], ecx          ; pixel 1
        mov   eax, clut[eax*4]
        mov   [edi+8], eax          ; pixel 2
        sub   ebp, 4                ; Loop counter, subtract 4 pixels
        mov   [edi+12], edx         ; pixel 3
        lea   edi, [edi+16]
        jae   next4

On a Pentium, this version uses 17+4 = 21 ops in 11 cycles for 4 pixels
(2.75 cycles/pixel); unrolling one more time would bring it down to
(34+4)/8 = 4.75 ops/pixel, or about 2.375 cycles/pixel.

With 4 pixels in the inner loop, I have 5 load operations, 4 stores, 2
address updates, 1 subtract/branch, 3 register-to-register MOVs, 4
logical ops (XOR and AND) and a single SHR. The 4 store instructions
generate 2 micro-ops each; the rest generate simple micro-ops, so I get a
total of 25 micro-ops. The 3 micro-ops/cycle retirement rate means that
the loop must take more than 8 cycles/iteration, or slightly over 2
cycles/pixel.

With some more unrolling, and using MOVZX instead of separate XOR and MOV
operations, the inner block can be reduced to 18 micro-ops plus the loop
overhead, so it is theoretically possible to get down to 6 cycles/4
pixels on a PPro, just like a Pentium, but it is impossible to write code
that is within 90% of optimal on both platforms simultaneously.

Terje
--
Using self-discipline, see http://www.eiffel.com/discipline "almost all programming can be viewed as an exercise in caching"
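[In portable terms, the point of the four-pixels-per-load variant above is
to fetch 32 bits of source at once and split the four pixel indices in
registers, so it needs 5 loads per 4 pixels (1.25 loads/pixel) instead of
2 loads per pixel. A rough C equivalent follows, as a sketch only and not
from the original post: it assumes a little-endian CPU and a pixel count
that is a multiple of 4, and all identifiers are illustrative.]

---8<----------------------
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One 32-bit source load per 4 pixels; bytes split with shifts and masks.
   Assumes little-endian byte order and len % 4 == 0. */
void blit8to32_dword(const uint8_t *src, uint32_t *dst,
                     const uint32_t clut[256], size_t len)
{
    for (size_t i = 0; i < len; i += 4) {
        uint32_t four;
        memcpy(&four, src + i, sizeof four);  /* one load, 4 source pixels */
        dst[i + 0] = clut[four & 0xff];
        dst[i + 1] = clut[(four >> 8) & 0xff];
        dst[i + 2] = clut[(four >> 16) & 0xff];
        dst[i + 3] = clut[four >> 24];
    }
}
---8<----------------------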
Sun, 30 May 1999 03:00:00 GMT
Juan Carlos Arevalo Baeza #7 / 10
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!
Quote:
> The conversion code itself requires 1.5 cycles per pixel
> (load, convert, store). Unrolling the loop 4 times,
> I can do it in 1.75 cycles/pixel:
:) Good. Isn't it funny: on the PPro, the minimum (in theory) would be
2 cpp (load and convert both use EU #2), which is slower than your
Pentium loop. It's annoying. At least the PPro has better memory
management, but...
Mon, 31 May 1999 03:00:00 GMT
Jan Vondrak #8 / 10
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!
Quote:
>> The conversion code itself requires 1.5 cycles per pixel
>> (load, convert, store). Unrolling the loop 4 times,
>> I can do it in 1.75 cycles/pixel:

> :) Good. Isn't it funny: on the PPro, the minimum (in theory) would
> be 2 cpp (load and convert both use EU #2), which is slower than your
> Pentium loop.
Well, I think someone in this thread showed how to do the conversion in
1.5 cpp on a P6; you must load a dword at a time, and then use MOVZX to
expand the bytes into pointers; this removes the EU #2 bottleneck. But
it's annoying indeed that the P5-optimal code performs disastrously on a
P6 (due to partial register stalls), while the P6-optimal code sucks on a
P5 (slow MOVZX). The best choice for both is probably something like:

        xor   eax, eax
        mov   al, [esi+ecx]
        xor   ebx, ebx
        mov   bl, [esi+ecx+1]
        mov   eax, [ebp+eax*4]
l:
        mov   [edi+ecx*4], eax
        xor   eax, eax
        mov   al, [esi+ecx+2]
        mov   ebx, [ebp+ebx*4]
        mov   [edi+ecx*4+4], ebx
        xor   ebx, ebx
        mov   bl, [esi+ecx+3]
        ...
        add   ecx, N
        mov   eax, [ebp+eax*4]
        jnz   l

which takes 2 cycles per pixel + 1 cycle overhead per loop on a P5, and
about the same number of cycles on a P6.

--
Jan Vondrak
Fri, 04 Jun 1999 03:00:00 GMT
Terje Mathisen #9 / 10
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!
Quote:
> Well, I think someone in this thread showed how to do the conversion
> in 1.5 cpp on a P6; you must load a dword at a time, and then use
> MOVZX to expand the bytes into pointers; this removes the EU #2
> bottleneck. But it's annoying indeed that the P5-optimal code performs
> disastrously on a P6 (due to partial register stalls), while the
> P6-optimal code sucks on a P5 (slow MOVZX). The best choice for both
> is probably something like:

[snip]

> l:
>         mov   [edi+ecx*4], eax
>         xor   eax, eax
>         mov   al, [esi+ecx+2]
>         mov   ebx, [ebp+ebx*4]
>         mov   [edi+ecx*4+4], ebx
>         xor   ebx, ebx
>         mov   bl, [esi+ecx+3]
>         ...
>         add   ecx, N
>         mov   eax, [ebp+eax*4]
>         jnz   l

> which takes 2 cycles per pixel + 1 cycle overhead per loop
> on a P5, and about the same number of cycles on a P6.
You can do a bit better by moving the XORs out of the main loop:

        xor   eax, eax
        xor   ebx, ebx
        ...
next:
        mov   [edi], ecx          ; U 1
        mov   ecx, clut[eax*4]    ; V 1
        mov   al, [esi]           ; U 2
        mov   [edi+4], edx        ; V 2
        mov   edx, clut[ebx*4]    ; U 3
        mov   bl, [esi+1]         ; V 3
        mov   [edi+8], ecx        ; U 4
        mov   ecx, clut[eax*4]    ; V 4
        mov   al, [esi+2]         ; U 5
        mov   [edi+12], edx       ; V 5
        mov   edx, clut[ebx*4]    ; U 6
        mov   bl, [esi+3]         ; V 6

This version has the same limitation on a P6: it needs 1.5 loads/cycle,
so on a P6 it will use 2 cycles/pixel, but it will still run optimally on
a P5. As noted above, a version that loads 4 pixels and splits them can
be faster on a P6.

Terje
--
Using self-discipline, see http://www.eiffel.com/discipline "almost all programming can be viewed as an exercise in caching"
Fri, 04 Jun 1999 03:00:00 GMT