Optimal SSE code for dot producting 4 vectors with another vector for the Athlon 
 Optimal SSE code for dot producting 4 vectors with another vector for the Athlon

Hi,
   I'm new to assembly, but I've been learning for about a month and
a half.  I'm building an optimized matrix * matrix ( AxB ) math
library for the Athlon.  However, I'm having trouble getting high
throughput, and I thought maybe someone here could help me out.

The kernel of my code involves multiplying a 64x64 submatrix of A by
a 64x64 submatrix of B.  The submatrices are prefetched into cache, so
this kernel should fly at the speed of light.  Both submatrices A and
B fit in L1.

I multiply 4 rows of submatrix A at a time by a column of submatrix
B, then move to the next 4 rows of submatrix A, and so on.  The entire
multiplication of submatrix A by a single column of B is completely
unrolled.  Then I loop over the columns of B.

It's pivotal that I get stellar performance dot-producting those 4
rows of submatrix A against the 64 floats in the column of the B
submatrix.  The data is arranged as follows:

** register "edi" points to the first element of submatrix A

** register "esi" points to the column of submatrix B

Notes:
======
I bias the edi and esi registers by 128 bytes so I can sweep through
the entire 64 floats (256 bytes) of each row of A while every
displacement stays in the signed 8-bit range (-128 to +112), which
keeps the instruction encodings short.  In this format:

[edi-128] == first element of the first row of submatrix A
[edi+112] == last element of the first row of submatrix A

SSE uses the xmm registers, and each holds 4 floats, or 16 bytes, so I
load 16 bytes at a time into the xmm registers.
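
So the surrounding loop is just over the 64 columns of submatrix B,
with the fully unrolled work as the loop body.  Roughly sketched (the
label, the use of ecx, the "add esi,256" step and the result handling
are illustrative, not my exact code, and it assumes the 64x64
submatrix of B is packed column after column):

mov ecx,64            ; 64 columns in submatrix B
next_col:
; ... fully unrolled multiply of all 64 rows of A (4 at a time)
;     against this column -- the "packages" shown below ...
; ... horizontal-sum the accumulators and store the 64 results ...
add esi,256           ; advance to the next 64-float column of B
dec ecx
jnz next_col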

Ok.. the code goes something like this:
=========================================================================
.
..
...

add edi,128
add esi,128
mov eax,256 ; size in bytes of a single row of submatrix A
mov ebx,768 ; size in bytes of 3 rows of submatrix A

xorps xmm5,xmm5
xorps xmm6,xmm6
xorps xmm7,xmm7
xorps xmm0,xmm0 ; 4th accumulator (only xmm0-xmm7 exist in 32-bit code)

movaps xmm1,XMMWORD PTR [edi-128]      ; First 4 floats of row 1 of A
movaps xmm2,XMMWORD PTR [edi+eax-128]  ; First 4 floats of row 2 of A
movaps xmm3,XMMWORD PTR [edi+eax*2-128]; First 4 floats of row 3 of A
movaps xmm4,XMMWORD PTR [edi+ebx-128]  ; First 4 floats of row 4 of A
mulps xmm1,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 1 with col
mulps xmm2,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 2 with col
mulps xmm3,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 3 with col
mulps xmm4,XMMWORD PTR [esi-128] ; multiply 4 #'s of row 4 with col
addps xmm5,xmm1            ; accumulate dot product of row 1 with col
addps xmm6,xmm2            ; accumulate dot product of row 2 with col
addps xmm7,xmm3            ; accumulate dot product of row 3 with col
addps xmm0,xmm4            ; accumulate dot product of row 4 with col

;  WE HAVE HANDLED THE FIRST 4 FLOATS OF EACH ROW.. so the next loads
;      must fetch data 16 bytes past our previous accesses

movaps xmm1,XMMWORD PTR [edi-112]
movaps xmm2,XMMWORD PTR [edi+eax-112]
movaps xmm3,XMMWORD PTR [edi+eax*2-112]
movaps xmm4,XMMWORD PTR [edi+ebx-112]
mulps xmm1,XMMWORD PTR [esi-112]
mulps xmm2,XMMWORD PTR [esi-112]
mulps xmm3,XMMWORD PTR [esi-112]
mulps xmm4,XMMWORD PTR [esi-112]
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
addps xmm0,xmm4

movaps xmm1,XMMWORD PTR [edi-96]
movaps xmm2,XMMWORD PTR [edi+eax-96]
movaps xmm3,XMMWORD PTR [edi+eax*2-96]
movaps xmm4,XMMWORD PTR [edi+ebx-96]
mulps xmm1,XMMWORD PTR [esi-96]
mulps xmm2,XMMWORD PTR [esi-96]
mulps xmm3,XMMWORD PTR [esi-96]
mulps xmm4,XMMWORD PTR [esi-96]
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
addps xmm0,xmm4

movaps xmm1,XMMWORD PTR [edi-80]
movaps xmm2,XMMWORD PTR [edi+eax-80]
movaps xmm3,XMMWORD PTR [edi+eax*2-80]
movaps xmm4,XMMWORD PTR [edi+ebx-80]
mulps xmm1,XMMWORD PTR [esi-80]
mulps xmm2,XMMWORD PTR [esi-80]
mulps xmm3,XMMWORD PTR [esi-80]
mulps xmm4,XMMWORD PTR [esi-80]
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
addps xmm0,xmm4

.
..
...
=========================================================================

I'm not getting stellar performance from each package above.  Each
package contains 32 floating point operations, and I believe it's
taking 13 cycles to execute each package.  Consequently, the maximum
throughput in FLOPS/CYCLE would be 32/13 = 2.46.  This is much too
low.  Does anybody see anything wrong with how I've set up these
instructions?  I do realize that the first move instruction in each
package is 4 bytes and the other 3 are 5 bytes, which means I cannot
decode more than 1 in any given clock cycle.  Is this a problem, or
can the Athlon only decode AND EXECUTE 1 movaps instruction per clock
cycle?
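
One way to sanity-check the per-package cycle count is to wrap the
kernel in an rdtsc harness along these lines (just a sketch: t0_lo and
t0_hi are illustrative temporaries, and cpuid is only there to
serialize around rdtsc):

xor eax,eax
cpuid                 ; serialize (clobbers eax,ebx,ecx,edx)
rdtsc                 ; edx:eax = time-stamp counter
mov [t0_lo],eax
mov [t0_hi],edx

; ... run the unrolled kernel a large, known number of times ...

xor eax,eax
cpuid
rdtsc
sub eax,[t0_lo]
sbb edx,[t0_hi]
; edx:eax = elapsed core cycles; divide by the total number of
; packages executed to get cycles per package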

Thanks for any assistance...

tim wilkens



Fri, 09 Apr 2004 15:04:12 GMT  
 Optimal SSE code for dot producting 4 vectors with another vector for the Athlon

<SNIP>
Quote:
> I'm not getting stellar performance from each package above.  Each
> package contains 32 floating point operations, and I believe it's
> taking 13 cycles to execute each package.  Consequently, the maximum
> throughput in FLOPS/CYCLE would be 32/13 = 2.46.  This is much too
> low.  Does anybody see anything wrong with how I've set up these
> instructions?  I do realize that the first move instruction in each
> package is 4 bytes and the other 3 are 5 bytes, which means I cannot
> decode more than 1 in any given clock cycle.  Is this a problem, or
> can the Athlon only decode AND EXECUTE 1 movaps instruction per clock
> cycle?

To quote the Athlon optimisation manual:

"Up to three DirectPath instructions can be selected for decode per cycle. Only
one VectorPath instruction can be selected for decode per cycle. DirectPath
instructions and VectorPath instructions cannot be simultaneously decoded."

movaps is a VectorPath instruction, so only one can be decoded per cycle.

So are xorps, mulps, addps, ... any instruction ending in "ps" is a vector
instruction, and hence only one *ps instruction can be decoded per cycle.

I would be very wary about using these instructions, as they have only been on
the Athlon die since late last month, so I doubt many people have these chips
yet ... Hell, how are you even getting them into your code, or are you
assembling by hand????

Note that many compilers will be interpreting them as the SSE instructions by
the same name.

For the time being, I would stick to standard and extended 3DNow! instructions
if you are optimising for the Athlon.
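
For example, the same multiply-accumulate in 3DNow! works on two
floats at a time in the MMX registers, roughly like this (a sketch
only; the mm0/mm1 usage is illustrative):

pxor mm0,mm0          ; clear the accumulator
movq mm1,[edi-128]    ; two floats from a row of A
pfmul mm1,[esi-128]   ; multiply by two floats of the column of B
pfadd mm0,mm1         ; accumulate the partial products
; ... repeat for the rest of the row ...
pfacc mm0,mm0         ; horizontal add: both halves of mm0 now hold
                      ;   the sum of its two floats
femms                 ; clear the MMX state before any x87/C code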

Michael



Fri, 09 Apr 2004 18:55:59 GMT  
 Optimal SSE code for dot producting 4 vectors with another vector for the Athlon

Quote:



> <SNIP>
> To quote the Athlon optimisation manual:

> "Up to three DirectPath instructions can be selected for decode per cycle. Only
> one VectorPath instruction can be selected for decode per cycle. DirectPath
> instructions and VectorPath instructions cannot be simultaneously decoded."

> movaps is a VectorPath instruction, so only one can be decoded per cycle.

> So are xorps, mulps, addps, ... any instruction ending in "ps" is a vector
> instruction, and hence only one *ps instruction can be decoded per cycle.

> I would be very wary about using these instructions, as they have only been on
> the Athlon die since late last month, so I doubt many people have these chips
> yet ... Hell, how are you even getting them into your code, or are you
> assembling by hand????

> Note that many compilers will be interpreting them as the SSE instructions by
> the same name.

> For the time being, I would stick to standard and extended 3DNow! instructions
> if you are optimising for the Athlon

> Michael

Yes, I'm hand-optimizing the assembly routines at the moment.  You can
download the Processor Pack from Microsoft and work from there,
inlining the assembly into the routine.  Basically, in a file that's
5000+ lines of code, only 17 lines are C syntax, so making a file
that can be assembled with MASM or NASM should be simple, though I
haven't done it.

Yes, the VectorPath instructions are the same on the Athlon and the P4,
so you are talking a minimum of 12 cycles per package, I believe.  Is
that correct?  So the theoretical max throughput would be 2.66
FLOPS/cycle, correct?  SSE seems nice; the problem is summing up the
numbers in a register, i.e. how to add the 4 floats within an xmm
register.  3DNow! has support for this, whereas SSE doesn't.  Doing
this on xmm4, xmm5, xmm6 and xmm7 takes a good bit of time, which
degrades performance slightly.
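
With plain SSE that horizontal add has to be built out of shuffles;
per accumulator it comes out to something like this (a sketch, using
xmm1/xmm2 as scratch and an illustrative [result] location):

movaps xmm1,xmm4       ; accumulator = [a0, a1, a2, a3]
movhlps xmm1,xmm4      ; low half of xmm1 = [a2, a3]
addps xmm1,xmm4        ; low half of xmm1 = [a0+a2, a1+a3]
movaps xmm2,xmm1
shufps xmm2,xmm2,1     ; element 0 of xmm2 = a1+a3
addss xmm1,xmm2        ; element 0 of xmm1 = a0+a1+a2+a3
movss DWORD PTR [result],xmm1 ; store the finished dot product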

Tim



Sat, 10 Apr 2004 02:33:04 GMT  
 Optimal SSE code for dot producting 4 vectors with another vector for the Athlon
I really don't know enough about the Athlon architecture to be able to help very
much. I work on a K6/2 so that's my area of expertise :) However, in case you
don't already have them I would recommend "AMD Athlon Processor: x86
Optimisation guide", downloadable from AMD, and "AMD CodeAnalyst", also
available from AMD's web site.  The CodeAnalyst is a pipeline analyser that shows
exactly how the CPU is executing the instructions and highlights places where
things are not running optimally.

Michael

PS: Sorry about the top-post, but I thought the rest should be kept for record
purposes ... nothing else follows this line.



Quote:


> > <SNIP>

> Yes, I'm hand-optimizing the assembly routines at the moment.  You can
> download the Processor Pack from Microsoft and work from there,
> inlining the assembly into the routine.  Basically, in a file that's
> 5000+ lines of code, only 17 lines are C syntax, so making a file
> that can be assembled with MASM or NASM should be simple, though I
> haven't done it.

> Yes, the VectorPath instructions are the same on the Athlon and the P4,
> so you are talking a minimum of 12 cycles per package, I believe.  Is
> that correct?  So the theoretical max throughput would be 2.66
> FLOPS/cycle, correct?  SSE seems nice; the problem is summing up the
> numbers in a register, i.e. how to add the 4 floats within an xmm
> register.  3DNow! has support for this, whereas SSE doesn't.  Doing
> this on xmm4, xmm5, xmm6 and xmm7 takes a good bit of time, which
> degrades performance slightly.

> Tim



Sat, 10 Apr 2004 15:55:52 GMT  
 