8bit->32bit blit in 2.66 cycles/pixel on Pentium! 
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!


Quote:

>I've been optimizing a function that blits from an 8-bit bitmap to a
>32-bit bitmap using a clut, and I've come up with the following code,
>optimized for the Pentium.  

>I _think_ I've got the inner loop down to 8 cycles per group of 3 pixels,
>or 2.66 cycles/pixel, but this is the first time I've tried to optimize
>for pairing, so I'm not sure my calculations are correct, especially
>since I haven't done any instruction-level timing.
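
For reference, the operation under discussion is just one 256-entry
table lookup per source pixel; in portable C it is roughly the sketch
below (illustrative names, not code from the original post):

  #include <stdint.h>
  #include <stddef.h>

  /* Plain 8-bit -> 32-bit CLUT blit: one table lookup per pixel. */
  void blit8to32(uint32_t *dst, const uint8_t *src,
                 const uint32_t clut[256], size_t n)
  {
      size_t i;

      for (i = 0; i < n; i++)
          dst[i] = clut[src[i]];
  }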

I think your code pairs almost perfectly, with the exception of the
end of the loop:

        dec ecx
        jg ...

doesn't pair on a P5. The only instructions pairing with a conditional
jump despite its dependency on their result are add & cmp.
So you're better off using an add or cmp instead of dec, or
rearranging the code so as to make the jump pair with an independent
instruction.

Quote:
>Can anyone do this faster ?

The conversion code itself requires 1.5 cycles per pixel
(load, convert, store). Unrolling the loop 4 times,
I can do it in 1.75 cycles/pixel:

        ; EAX = 8-bit pixel #1
        ; EBX = 8-bit pixel #2
        ; ECX = negative counter
        ; EDX = 32-bit pixel
        ; ESI = source
        ; EDI = destination
        ; EBP = CLUT

        xor eax, eax
        mov al, [esi+ecx]
        xor ebx,ebx
        mov bl, [esi+ecx+1]
        mov edx, [ebp+eax*4]

next4:  mov [edi+ecx*4], edx    ; cycle 1
        mov al, [esi+ecx+2]
        mov edx, [ebp+ebx*4]    ; cycle 2
        mov bl, [esi+ecx+3]
        mov [edi+ecx*4+4], edx  ; cycle 3
        mov edx, [ebp+eax*4]
        mov al, [esi+ecx+4]     ; cycle 4
        mov [edi+ecx*4+8], edx
        mov edx, [ebp+ebx*4]    ; cycle 5
        mov bl, [esi+ecx+5]
        mov [edi+ecx*4+12], edx ; cycle 6
        add ecx, 4
        mov edx, [ebp+eax*4]    ; cycle 7
        jnz next4
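
The "negative counter" setup above is the usual trick of pointing ESI
and EDI just past the end of the buffers and letting ECX run from
-count up to zero, so a single ADD both advances the index and produces
the flags for the loop branch. A rough C equivalent (my own sketch,
illustrative names only):

  #include <stdint.h>
  #include <stddef.h>

  /* Sketch of the negative-index idiom: base pointers sit one past the
     end of the data and the index counts up from -n to 0. */
  void blit8to32_negidx(uint32_t *dst, const uint8_t *src,
                        const uint32_t clut[256], ptrdiff_t n)
  {
      const uint8_t *src_end = src + n;   /* one past the last source */
      uint32_t *dst_end = dst + n;        /* one past the last dest   */
      ptrdiff_t i;

      for (i = -n; i != 0; i++)
          dst_end[i] = clut[src_end[i]];
  }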

-- Jan Vondrak



Fri, 28 May 1999 03:00:00 GMT  
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!



Quote:
> I think your code pairs almost perfectly, with the exception of the
> end of the loop:

>    dec ecx
>    jg ...

> doesn't pair on a P5. The only instructions pairing with a conditional
> jump despite its dependency on their result are add & cmp.
> So you're better off using an add or cmp instead of dec, or
> rearranging the code so as to make the jump pair with an independent
> instruction.

According to Intel's spec, any instruction pairable in the u-pipe writing
to the flags can be paired with a Jcc instruction in the v-pipe. To prove
this, I ran the following test on a Pentium 90:

        mov ecx,90000000;
L1:     add eax,1;
        add edx,2;
        dec ecx
        jnz L1;

which took about 2 seconds to complete, as expected (the loop takes 2
cycles if all instructions pair, times 90e6 iterations = 180e6 cycles,
which is 2 seconds on a Pentium 90). Changing dec ecx into add ecx,-1
didn't make any difference (the instructions still pair). If dec ecx
were not pairable with jnz, it would be a 3-cycle loop (and thus the
program would take 3 seconds), because Jcc is only pairable in the
v-pipe, but here it would end up in the u-pipe because of its
non-pairability with dec ecx.
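
(If anyone wants to reproduce this kind of measurement, a rough C
harness along the lines below would do. This is my own sketch, not part
of the test above: it just wraps the same loop in GCC-style inline
assembly and times it with clock(). On anything newer than a Pentium
the absolute time will of course differ; only the methodology carries
over.)

  #include <stdio.h>
  #include <time.h>

  int main(void)
  {
      clock_t t0, t1;

      t0 = clock();

      /* The test loop: 90,000,000 iterations of add/add/dec/jnz,
         written as GCC inline assembly (AT&T syntax). */
      __asm__ volatile (
          "movl $90000000, %%ecx \n\t"
          "1:                    \n\t"
          "addl $1, %%eax        \n\t"
          "addl $2, %%edx        \n\t"
          "decl %%ecx            \n\t"
          "jnz  1b               \n\t"
          : /* no outputs */
          : /* no inputs */
          : "eax", "ecx", "edx", "cc");

      t1 = clock();

      printf("%.2f seconds\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
      return 0;
  }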

--
Martin Sinot



Fri, 28 May 1999 03:00:00 GMT  
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!

Quote:

>I _think_ I've got the inner loop down to 8 cycles per group of 3 pixels,
>or 2.66 cycles/pixel, but this is the first time I've tried to optimize
>for pairing, so I'm not sure my calculations are correct, especially
>since I haven't done any instruction-level timing.

   Cool.

Quote:
>Can anyone do this faster?

   Not that I know. But on the PPro it would suck! (no offense
intended)

---8<----------------------
    dothree:                    ; Pentium pipes         PPro uops
        mov eax, [ebp+eax*4]    ; U     1       1       (2)
        xor ebx, ebx            ;       1               (01)
        mov bl, [esi+1]         ; U     1       1       (2)
        xor edx, edx            ;       1               (01)
        mov dl, [esi+2]         ; U     1       1       (2)
        mov [edi], eax          ;       1               (3,4)
        mov ebx, [ebp+ebx*4]    ; U     1       1       (2)
        add esi, 3              ;       1               (01)
        mov [edi+4], ebx        ; U     1       1       (3,4)
        mov edx, [ebp+edx*4]    ;       1               (2)
        mov [edi+8], edx        ; U     1       1       (3,4)
        xor eax, eax            ;       1               (01)
        mov al, [esi]           ; U     1       1       (2)
        add edi, 12             ;       1               (01)
        dec ecx                 ; U     1       1       (01)
        jg dothree              ;       1               (1)
---8<----------------------

   This would do better in the PPro:

        1 - The byte operations use one of the hardcoded-optimized sequences
(otherwise it's awfully slow, probably worse than a Pentium of the
same clock speed).

        2 - The execution unit bottleneck is EU #2 (loads), with 6 uops per
loop (6 cycles minimum), but there are 19 uops in total (6.3 cycles
minimum at 3 uops/cycle), so the loads aren't the real limit. That means
about 2.1 c/pix on a PPro. Not so good, huh? :)

        3 - A PPro-specific version would reduce the EU #2 (load) uops to
3.75 per loop by loading a DWORD at a time and using MOVZX on EU #0 to
move the bytes into the index registers (a C sketch of the idea follows
below).
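
   Roughly, in C, that dword-splitting idea looks like the sketch below.
This is my own illustration with placeholder names, not code from the
thread; the shifts and masks stand in for the MOVZX-style byte
extraction after the single dword load, and it assumes x86 little-endian
byte order.

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* Illustration only: fetch four source pixels with one 32-bit load,
     then split the dword into four table indices.  Assumes a
     little-endian layout, as on x86. */
  void blit8to32_dword(uint32_t *dst, const uint8_t *src,
                       const uint32_t clut[256], size_t n)
  {
      size_t i;

      for (i = 0; i + 4 <= n; i += 4) {
          uint32_t four;

          memcpy(&four, src + i, 4);            /* one load, 4 pixels */
          dst[i + 0] = clut[four & 0xff];
          dst[i + 1] = clut[(four >> 8) & 0xff];
          dst[i + 2] = clut[(four >> 16) & 0xff];
          dst[i + 3] = clut[four >> 24];
      }
      for (; i < n; i++)                        /* 0-3 leftover pixels */
          dst[i] = clut[src[i]];
  }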

   Anybody want to add or comment on anything? I've been (evidently)
messing around with the PPro, and although the documentation is far
worse than for the Pentium, I think I figured out this much right.



Sat, 29 May 1999 03:00:00 GMT  
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!


Quote:
>According to Intel's spec, any instruction pairable in the u-pipe writing
>to the flags can be paired with a Jcc instruction in the v-pipe. To prove
>this, I ran the following test on a Pentium 90:

>        mov ecx,90000000;
>L1:     add eax,1;
>        add edx,2;
>        dec ecx
>        jnz L1;

>which took about 2 seconds to complete, as expected

Hmm, my apologies.  I ran the same test and indeed, jnz does
pair with dec, as well as some other instructions.

May I ask which specs you're referring to? I was misled by the document
"Optimizations for Intel's 32-bit processors" from Intel, where only
cmp+jcc & add+jne are mentioned as special pairs.

-- Jan Vondrak



Sun, 30 May 1999 03:00:00 GMT  
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!

Quote:

> I've been optimizing a function that blits from an 8-bit bitmap to a
> 32-bit bitmap using a clut, and I've come up with the following code,
> optimized for the Pentium.

> I _think_ I've got the inner loop down to 8 cycles per group of 3 pixels,
> or 2.66 cycles/pixel, but this is the first time I've tried to optimize
> for pairing, so I'm not sure my calculations are correct, especially
> since I haven't done any instruction-level timing.

> Can anyone do this faster?
[snip]
>     dothree:
>         mov eax, [ebp+eax*4]    ; 1  lookup pix1 color
>         mov bl, [esi+1]         ; 0  load pix2
>         mov [edi], eax          ; 1  store pix1
>         xor eax, eax            ; 0  clear pix1 for next time
>         mov dl, [esi+2]         ; 1  load pix3
>         mov ebx, [ebp+ebx*4]    ; 0  lookup pix2 color
>         add esi, 3              ; 1  bump source pointer 3 pixels
>         mov [edi+4], ebx        ; 0  store pix2
>         mov edx, [ebp+edx*4]    ; 1  lookup pix3 color
>         xor ebx, ebx            ; 0  clear pix2 for next time
>         mov [edi+8], edx        ; 1  store pix3
>         xor edx, edx            ; 0  clear pix3 for next time
>         mov al, [esi]           ; 1  load pix1 for next time
>         add edi, 12             ; 0  bump dest pointer 3 pixels
>         dec ecx                 ; 1  dec loop counter
>         jg dothree              ; 0  repeat if not done

Your inner loop does use 16 paired ops/8 cycles.

To figure out the best possible results, we must check out what you're
actually doing:

1. Load the source pixel
2. Look up the 32-bit value
3. Store the 32-bit result

Since this is 3 simple operations per pixel, and the Pentium can pair
two of them per cycle, it should be possible to get very close to 1.5
cycles/pixel:

  xor eax,eax
  xor ebx,ebx

next_block:
  SRC_OFFSET = 0
  DST_OFFSET = 0
  REPT 4
    mov [edi+DST_OFFSET],ebp    ; U Save a destination pixel
    mov ebp,clut[eax*4]         ; V Lookup next 32-bit pixel (AL has the 8-bit value)

    mov al,[esi+SRC_OFFSET]     ; U Load the following pixel
    mov [edi+DST_OFFSET+4],edx  ; V Next dest

    mov edx,clut[ebx*4]         ; U Next CLUT value
    mov bl,[esi+SRC_OFFSET+1]   ; V second source pixel

    SRC_OFFSET = SRC_OFFSET + 2
    DST_OFFSET = DST_OFFSET + 8
  ENDM

  lea esi,[esi+SRC_OFFSET]
  lea edi,[edi+DST_OFFSET]

  sub ecx,SRC_OFFSET
   jae next_block

This version handles 8 (N*2, with N=4) pixels per iteration of the inner
loop; it uses N*3 + 2 = 14 cycles, so this corresponds to 1.75 cycles/pixel.

The key difference from your version is that I zero the byte-pixel
registers once, outside the inner loop.

It is also possible to make the inner block handle 6 instead of 2 pixels;
this way the load-use distance is increased further.

The portable C code would look like this:

  uint32 b0, b1, *dst;
  uint8 a, b, c, *src;
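
  /* Note: in this sketch the stores write values computed a few steps
     earlier (the lookups and loads run ahead of the stores), so in a
     complete routine b0/b1 and a/b/c would be primed before the first
     iteration and the last few pixels flushed after the loop; that
     prologue/epilogue, and the declarations of clut and len, are
     omitted here. */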

  do {
    dst[0] = b0;
    b0 = clut[a];
    a = src[0];

    dst[1] = b1;
    b1 = clut[b];
    b = src[1];

    dst[2] = b0;
    b0 = clut[c];
    c = src[2];

    dst[3] = b1;
    b1 = clut[a];
    a = src[3];

    dst[4] = b0;
    b0 = clut[b];
    b = src[4];

    dst[5] = b1;
    b1 = clut[c];
    c = src[5];

    src += 6;
    dst += 6;
  } while ((len -= 6) >= 0);

The key bottleneck in this algorithm is the time from when a source pixel
is loaded until it can be used for addressing into the 8->32 CLUT. On a
Pentium this is just a single cycle, but a PPro, as well as (almost?) all
RISC CPUs, requires multiple cycles even when hitting the L1 cache.

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Sun, 30 May 1999 03:00:00 GMT  
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!


[snip]

Quote:
>    This would do better in the PPro:

>         1 - The byte operations use one of the hardcoded-optimized sequences
> (otherwise it's awfully slow, probably worse than a Pentium of the
> same clock speed).

>         2 - The execution unit bottleneck is EU #2 (loads), with 6 uops per
> loop (6 cycles minimum), but there are 19 uops in total (6.3 cycles
> minimum at 3 uops/cycle), so the loads aren't the real limit. That means
> about 2.1 c/pix on a PPro. Not so good, huh? :)

>         3 - A PPro-specific version would reduce the EU #2 (load) uops to
> 3.75 per loop by loading a DWORD at a time and using MOVZX on EU #0 to
> move the bytes into the index registers.

>    Anybody want to add or comment on anything? I've been (evidently)
> messing around with the PPro, and although the documentation is far
> worse than for the Pentium, I think I figured out this much right.

You're right, almost all heavily optimized table-driven code is
load-limited on a PPro.

It is unfortunately impossible to get around this in a way that is still
Pentium-optimized, but combining four source pixels into a single load
is a good start.

I'd do something like this:

next4:
  mov eax,[esi]
  xor ebx,ebx

  mov bl,al
  xor ecx,ecx

  mov cl,ah
  lea esi,[esi+4]

  shr eax,16
  xor edx,edx

  mov dl,ah
  mov ebx,clut[ebx*4]

  and eax,255           ; Source pixels are in EBX, ECX, EAX, EDX
  mov ecx,clut[ecx*4]

  mov [edi],ebx
  mov edx,clut[edx*4]

  mov [edi+4],ecx
  mov eax,clut[eax*4]

  mov [edi+8],eax       ; pixel 2 (EAX), keeping source order
  sub ebp,4             ; Loop counter, subtract 4 pixels

  mov [edi+12],edx      ; pixel 3 (EDX)
  lea edi,[edi+16]

  jae next4

On a Pentium, this version uses 17+4 = 21 ops (11 cycles) for 4 pixels
(2.75 cyc/pix); unrolling one more time would bring it down to (34+4)/8
= 38/8 ops/pixel, i.e. about 2.375 cycles/pixel.

With 4 pixels in the inner loop, I have 5 load operations, 4 stores, 2
address updates, 1 subtract/branch, 3 register-to-register MOVs, 4
logical ops (XOR and AND) and a single SHR.

The 4 store instructions generate 2 micro-ops each, and the rest
generate one micro-op apiece, so I get a total of 25 micro-ops. The 3
micro-ops/cycle retirement rate means that the loop must take more than
8 cycles/iteration, or slightly over 2 cycles/pixel.

With some more unrolling, and using MOVZX instead of separate XOR and
MOV operations, the inner block can be reduced to 18 micro-ops plus the
loop overhead, so it is theoretically possible to get down to 6 cycles/4
pixels on a PPro, just like a Pentium, but it is impossible to write
code that is within 90% of optimal on both platforms simultaneously.

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Sun, 30 May 1999 03:00:00 GMT  
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!

Quote:

>The conversion code itself requires 1.5 cycles per pixel
>(load, convert, store). Unrolling the loop 4 times,
>I can do it in 1.75 cycles/pixel:

   :) Good. Isn't it funny, in the PPro, the minimum (in theory) would
be 2 cpp (load and convert both use EU #2), which is slower than your
Pentium loop.

   It's annoying. At least the PPro has better memory management,
but...



Mon, 31 May 1999 03:00:00 GMT  
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!


Quote:


>>The conversion code itself requires 1.5 cycles per pixel
>>(load, convert, store). Unrolling the loop 4 times,
>>I can do it in 1.75 cycles/pixel:

>   :) Good. Isn't it funny, in the PPro, the minimum (in theory) would
>be 2 cpp (load and convert both use EU #2), which is slower than your
>Pentium loop.

Well, I think someone in this thread showed how to do the conversion
in 1.5 cpp on a P6; you must load a dword at a time, and then use
MOVZX to expand the bytes into pointers; this removes the EU #2
bottleneck. But it's annoying indeed that the P5-optimal code performs
disastrously on a P6 (due to partial register stalls), while the
P6-optimal code sucks on a P5 (slow MOVZX). The best choice for both
is probably something like:

        xor eax,eax
        mov al,[esi+ecx]
        xor ebx,ebx
        mov bl,[esi+ecx+1]
        mov eax,[ebp+eax*4]

l:      mov [edi+ecx*4],eax
        xor eax,eax
        mov al,[esi+ecx+2]
        mov ebx,[ebp+ebx*4]
        mov [edi+ecx*4+4],ebx
        xor ebx,ebx
        mov bl,[esi+ecx+3]
        ...
        add ecx,N
        mov eax,[ebp+eax*4]
        jnz l

which takes 2 cycles per pixel + 1 cycle overhead per loop
on a P5, and about the same number of cycles on a P6.

-- Jan Vondrak



Fri, 04 Jun 1999 03:00:00 GMT  
 8bit->32bit blit in 2.66 cycles/pixel on Pentium!

Quote:

> Well, I think someone in this thread showed how to do the conversion
> in 1.5 cpp on a P6; you must load a dword at a time, and then use
> MOVZX to expand the bytes into pointers; this removes the EU #2
> bottleneck. But it's annoying indeed that the P5-optimal code performs
> disastrously on a P6 (due to partial register stalls), while the
> P6-optimal code sucks on a P5 (slow MOVZX). The best choice for both
> is probably something like:
[snip]
> l:      mov [edi+ecx*4],eax
>         xor eax,eax
>         mov al,[esi+ecx+2]
>         mov ebx,[ebp+ebx*4]
>         mov [edi+ecx*4+4],ebx
>         xor ebx,ebx
>         mov bl,[esi+ecx+3]
>         ...
>         add ecx,N
>         mov eax,[ebp+eax*4]
>         jnz l

> which takes 2 cycles per pixel + 1 cycle overhead per loop
> on a P5, and about the same number of cycles on a P6.

You can do a bit better by moving the XOR's out of the main loop:

        xor eax,eax
        xor ebx,ebx
        ...

next:
        mov [edi],ecx           ; U 1
        mov ecx,clut[eax*4]     ; V 1
        mov al,[esi]            ; U 2

        mov [edi+4],edx         ; V 2
        mov edx,clut[ebx*4]     ; U 3
        mov bl,[esi+1]          ; V 3

        mov [edi+8],ecx         ; U 4
        mov ecx,clut[eax*4]     ; V 4
        mov al,[esi+2]          ; U 5

        mov [edi+12],edx        ; V 5
        mov edx,clut[ebx*4]     ; U 6
        mov bl,[esi+3]          ; V 6

This version has the same limitation on a P6: it needs more than one
load per cycle, so on a P6 it will take 2 cycles/pixel, but it will
still run optimally on a P5.

As noted above, a version that loads 4 pixels and splits them can be
faster on a P6.

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Fri, 04 Jun 1999 03:00:00 GMT  
 