word-wise memcmp for relatively unaligned arguments? 
Author Message
 word-wise memcmp for relatively unaligned arguments?

Since word-wise operations are so much faster than byte-wise operations,
but since often from C strings are not defined to be on word boundaries,
has anyone looked much at bit rotation tricks to maybe speed things up?
I'm looking to do memcmp() with 32bit-wise memory access or better.

The kind of thing I am thinking of is this:  Check the alignment of the
two memory operands before hand.  Based on that if alignment is
favorable (1/4 of the time) do the simple word-wise loop with the
standard pre-loop for bytes which aren't specifically on a word boundary
for the 3/4 cases where relative alignment is favorable but arguments
don't start on a word boundary.  If the relative alignment is
unfavorable jump to whichever of 3 different inner loops that do a
ror/rol to move one operand into relative alignment.  Each of those
would need their own pre-loops.

I could be overlooking something that requires 16 branches instead of 4,
but the basic idea should be obvious enough.  The only question I have
(before doing the crazy logic to figure out all those branches) is
whether this would really be faster.  Since "repz cmpsb" is doing 4
times as many loops to do things byte-wise and probably doing unaligned
cache access, my intuition says the rotate-compare inner loop should be
able to be pretty slow and still be a lot faster.  At least on Pentia+.

I know that the HP PA-RISC has instructions to directly support this
kind of trick.  I don't know the timing details for high-end Intel
chips.  I really only care about Pentia and Pentium Pros myself but
would be interested in the timings for lesser chips, too, though I
suspect superscalar chips are where this trick would really pay off.

Regards,
Chuck



Fri, 11 Dec 1998 03:00:00 GMT  
 word-wise memcmp for relatively unaligned arguments?

Quote:

>Since word-wise operations are so much faster than byte-wise operations,
>but since often from C strings are not defined to be on word boundaries,
>has anyone looked much at bit rotation tricks to maybe speed things up?
>I'm looking to do memcmp() with 32bit-wise memory access or better.
>The kind of thing I am thinking of is this:  Check the alignment of the
>two memory operands before hand.  Based on that if alignment is
>favorable (1/4 of the time) do the simple word-wise loop with the
>standard pre-loop for bytes which aren't specifically on a word boundary
>for the 3/4 cases where relative alignment is favorable but arguments
>don't start on a word boundary.  If the relative alignment is
>unfavorable jump to whichever of 3 different inner loops that do a
>ror/rol to move one operand into relative alignment.  Each of those
>would need their own pre-loops.
>I could be overlooking something that requires 16 branches instead of 4,
>but the basic idea should be obvious enough.  The only question I have
>(before doing the crazy logic to figure out all those branches) is
>whether this would really be faster.  Since "repz cmpsb" is doing 4
>times as many loops to do things byte-wise and probably doing unaligned
>cache access, my intuition says the rotate-compare inner loop should be
>able to be pretty slow and still be a lot faster.  At least on Pentia+.
>I know that the HP PA-RISC has instructions to directly support this
>kind of trick.  I don't know the timing details for high-end Intel
>chips.  I really only care about Pentia and Pentium Pros myself but
>would be interested in the timings for lesser chips, too, though I
>suspect superscalar chips are where this trick would really pay off.
>Regards,
>Chuck

If you are doing a memcpy() alignment would be a problem, but I don't
think it will be a problem for memcmp().  Reason: Memcmp does no
writes.  On a 486 or higher, the CPU won't be reading individual
bytes/longwords anyway; it will be reading from the cache, which will
be filled on longword aligned lines no matter what you do.  Any
attempt to align the arguments would probably slow down your code and
certainly be a waste of time.

The exception, however, is when you are doing a memcmp() with
non-cacheable memory, such as video memory.  In that case, you
certainly should align the compare to the slow memory.  All you would
have to do is just test the first few odd bytes, start the compare on
a longword boundary of the non-cached memory, then compare the last
few bytes if necessary.




Sat, 12 Dec 1998 03:00:00 GMT  
 word-wise memcmp for relatively unaligned arguments?

Quote:

> Since word-wise operations are so much faster than byte-wise operations,
> but since often from C strings are not defined to be on word boundaries,
> has anyone looked much at bit rotation tricks to maybe speed things up?
> I'm looking to do memcmp() with 32bit-wise memory access or better.

> The kind of thing I am thinking of is this:  Check the alignment of the
> two memory operands before hand.  Based on that if alignment is
> favorable (1/4 of the time) do the simple word-wise loop with the
> standard pre-loop for bytes which aren't specifically on a word boundary
> for the 3/4 cases where relative alignment is favorable but arguments
> don't start on a word boundary.  If the relative alignment is
> unfavorable jump to whichever of 3 different inner loops that do a
> ror/rol to move one operand into relative alignment.  Each of those
> would need their own pre-loops.

You have a good idea, however it turns out to be easier than you're suspecting
to get good performance: Just align your targets!

Even misaligned reads are pretty fast since most of them don't straddle cache
line boundaries, so a simple up-front loop to align the target, followed by REP
MOVSD for the main block, possibly followed by another loop for the tail, will
be the fastest you can get on both 486and Pentium, when the destination area is
already in the cache.

For uncached data on a Pentium, it can be faster to use the fp unit to handle
aligned 64-bit elements, using FILD and FISTP, possibly including a prefetch
(read) of a single byte from every 32-byte block. This depends mostly on
whether you're going to reuse the destination data soon or not.

If you really want to try rotates to align both source and destination, this
can still be fairly easy, you just need four versions of your code, one for
each relative alignment, or you can use the same code, using CL to encode the
shift count (8, 16 or 24 bits), with code somewhat like this:

  mov eax,[esi]
  add esi,4
next:
  mov edx,[esi]
  shrd eax,edx,8
  mov [edi],eax
  mov eax,[esi+4]
  add esi,8
  shrd edx,eax,8
  mov [edi+4],edx
  add edi,8
  sub ecx,8
   jae next

However, since SHRD is a fairly slow, non-pairable, opcode, it might be faster
to use regular shifts and masks to align the data:

next:
  shr eax,8
  mov edx,[esi]

  shl edx,24
  mov ebx,[esi]

  shr ebx,8
  or eax,edx

  mov edx,[esi+4]
  mov [edi],eax

  shl edx,24
  mov eax,[esi+4]

  or ebx,edx
  add esi,8

  mov [edi+4],ebx
  add edi,8

  sub ecx,8
   jae next

The loop above seems to use 8 cycles to move 8 bytes, or 4 cycles/dword.

As long as the penalty for a misaligned REP MOVSD averages less than 3
cycles/dword, which it does as long as the destination is aligned, it will be
better to use the compact string opcode instead.

The SHRD version might have been useful for moving things around in uncacheable
memory, like within the frame buffer, but the true answer is that you should
never read from the frame buffer at all.

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Sat, 12 Dec 1998 03:00:00 GMT  
 word-wise memcmp for relatively unaligned arguments?


Quote:

>Since word-wise operations are so much faster than byte-wise operations,
>but since often from C strings are not defined to be on word boundaries,
>has anyone looked much at bit rotation tricks to maybe speed things up?
>I'm looking to do memcmp() with 32bit-wise memory access or better.

>The kind of thing I am thinking of is this:  Check the alignment of the
>two memory operands before hand.  Based on that if alignment is
>favorable (1/4 of the time) do the simple word-wise loop with the
>standard pre-loop for bytes which aren't specifically on a word boundary
>for the 3/4 cases where relative alignment is favorable but arguments
>don't start on a word boundary.  If the relative alignment is
>unfavorable jump to whichever of 3 different inner loops that do a
>ror/rol to move one operand into relative alignment.  Each of those
>would need their own pre-loops.

>I could be overlooking something that requires 16 branches instead of 4,
>but the basic idea should be obvious enough.  The only question I have
>(before doing the crazy logic to figure out all those branches) is
>whether this would really be faster.  Since "repz cmpsb" is doing 4
>times as many loops to do things byte-wise and probably doing unaligned
>cache access, my intuition says the rotate-compare inner loop should be
>able to be pretty slow and still be a lot faster.  At least on Pentia+.

>I know that the HP PA-RISC has instructions to directly support this
>kind of trick.  I don't know the timing details for high-end Intel
>chips.  I really only care about Pentia and Pentium Pros myself but
>would be interested in the timings for lesser chips, too, though I
>suspect superscalar chips are where this trick would really pay off.

The mis-alignment penalty is, typically, around 25% - 50% for memcpy.
Probably less for memcmp. You won't do better with shifts. Shifts pair
only in the U pipe, and you'll need a MOV and two shifts, as a minimum
for each dword.

Mike Schmit

-------------------------------------------------------------------

408-244-6826             http://www.quantasm.com
-------------------------------------------------------------------



Sun, 13 Dec 1998 03:00:00 GMT  
 
 [ 4 post ] 

 Relevant Pages 

1. A Word Is Enough For The Wise!

2. Wise Installation Software - Reply From Wise re:Clarion Support

3. row-wise versus column-wise arrays

4. memcmp() using SSE2

5. Relatively New

6. Efficient Unaligned Bit-substring Search in Bitvectors

7. Efficient Unaligned Bit-substring Search in Bitvectors

8. Good and (relatively) cheap Ada code browsing software?

9. URGENT: NEED FOR A RELATIVELY CHEAP VHDL COMPILER FOR LINUX OR UNIX

10. (Relatively) new IBM COBOL compiler option

11. WAS: IVF continues to disappoint (relatively)

12. IVF continues to disappoint (relatively)

 

 
Powered by phpBB® Forum Software