Hi,

I'm new to assembly, but have been learning for about a month and a
half. I'm building an optimized matrix * matrix (AxB) math library
for the Athlon. However, I'm having problems getting high
throughput, and I thought maybe someone here could help me out.

The kernel of my code involves multiplying a 64x64 submatrix of A by
a 64x64 submatrix of B. The submatrices are prefetched into cache,
so this kernel should fly at the speed of light: both submatrix A
and submatrix B are in L1.

I multiply 4 rows of submatrix A at a time by a column of
submatrix B, then move to the next 4 rows of submatrix A, and so
on. The entire multiplication of submatrix A by a "single" column
of B is completely unrolled; then I loop over the columns of B.

It's pivotal that I get "stellar performance" in the dot products of
4 rows of submatrix A against the 64 floats in a column of submatrix B.
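In C terms, here's roughly what each 4-row-by-one-column step computes
(a scalar sketch of my kernel; the names and layout are made up for
illustration, not from my actual library):

```c
#include <assert.h>

/* Scalar model of the kernel: multiply 4 rows of a 64x64 submatrix A
 * by one column of B, keeping one 4-wide partial-sum "register" per
 * row, just like one xmm accumulator per row in the SSE version. */
#define N 64

static void dot4rows(const float A[N][N], const float Bcol[N],
                     int row0, float out[4])
{
    float acc[4][4] = {{0}};           /* 4 accumulators, 4 lanes each */
    for (int k = 0; k < N; k += 4)     /* one "package" per 4 floats   */
        for (int r = 0; r < 4; r++)
            for (int lane = 0; lane < 4; lane++)
                acc[r][lane] += A[row0 + r][k + lane] * Bcol[k + lane];
    for (int r = 0; r < 4; r++)        /* horizontal add, once per column */
        out[r] = acc[r][0] + acc[r][1] + acc[r][2] + acc[r][3];
}
```

The point is that the horizontal add happens only once per column, at
the end; the inner loop is pure vertical multiply-accumulate.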

The data is arranged as follows:

** register "edi" points to the first element of submatrix A

** register "esi" points to the column of submatrix B

Notes:

======

I bias the edi and esi registers by 128 bytes so I can sweep through
the entire 64 floats (256 bytes) of each row of A. In this format:

edi-128 == address of the first 4 floats of the first row of submatrix A

edi+112 == address of the last 4 floats of the first row of submatrix A

SSE uses xmm registers, each of which holds 4 floats (16 bytes), so I
load 16 bytes at a time into the xmm registers.
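If I understand the instruction encoding right, the bias matters
because an x86 memory operand can encode its displacement in a single
signed byte only when it lies in [-128, +127]; after adding 128 to
edi/esi, every offset I need fits. A quick C check of that claim:

```c
/* Why bias edi/esi by 128: after the bias, the 16 offsets that cover
 * one 256-byte row (-128, -112, ..., +112) all fit in a signed byte
 * (disp8), so each movaps/mulps stays as short as possible instead of
 * needing a 4-byte disp32. */
int all_offsets_fit_in_disp8(void)
{
    for (int chunk = 0; chunk < 16; chunk++) { /* 16 loads x 16 bytes */
        int disp = -128 + 16 * chunk;          /* -128 .. +112 */
        if (disp < -128 || disp > 127)
            return 0;                          /* would need disp32 */
    }
    return 1;
}
```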

OK, the code goes something like this:

=========================================================================

.

..

...

add edi,128
add esi,128
mov eax,256 ; size in bytes of a single row of submatrix A
mov ebx,768 ; size in bytes of 3 rows of submatrix A

xorps xmm5,xmm5 ; accumulator for row 1
xorps xmm6,xmm6 ; accumulator for row 2
xorps xmm7,xmm7 ; accumulator for row 3
xorps xmm0,xmm0 ; accumulator for row 4 (only xmm0-xmm7 exist in 32-bit code)

movaps xmm1,XMMWORD PTR [edi-128]       ; first 4 floats of row 1 of A
movaps xmm2,XMMWORD PTR [edi+eax-128]   ; first 4 floats of row 2 of A
movaps xmm3,XMMWORD PTR [edi+eax*2-128] ; first 4 floats of row 3 of A
movaps xmm4,XMMWORD PTR [edi+ebx-128]   ; first 4 floats of row 4 of A
mulps xmm1,XMMWORD PTR [esi-128]        ; multiply 4 floats of row 1 with col
mulps xmm2,XMMWORD PTR [esi-128]        ; multiply 4 floats of row 2 with col
mulps xmm3,XMMWORD PTR [esi-128]        ; multiply 4 floats of row 3 with col
mulps xmm4,XMMWORD PTR [esi-128]        ; multiply 4 floats of row 4 with col
addps xmm5,xmm1 ; accumulate dot product of row 1 with col
addps xmm6,xmm2 ; accumulate dot product of row 2 with col
addps xmm7,xmm3 ; accumulate dot product of row 3 with col
addps xmm0,xmm4 ; accumulate dot product of row 4 with col

; we have handled 4 floats, so now load the xmm registers
; with data 16 bytes past our previous accesses
movaps xmm1,XMMWORD PTR [edi-112]
movaps xmm2,XMMWORD PTR [edi+eax-112]
movaps xmm3,XMMWORD PTR [edi+eax*2-112]
movaps xmm4,XMMWORD PTR [edi+ebx-112]
mulps xmm1,XMMWORD PTR [esi-112]
mulps xmm2,XMMWORD PTR [esi-112]
mulps xmm3,XMMWORD PTR [esi-112]
mulps xmm4,XMMWORD PTR [esi-112]
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
addps xmm0,xmm4

movaps xmm1,XMMWORD PTR [edi-96]
movaps xmm2,XMMWORD PTR [edi+eax-96]
movaps xmm3,XMMWORD PTR [edi+eax*2-96]
movaps xmm4,XMMWORD PTR [edi+ebx-96]
mulps xmm1,XMMWORD PTR [esi-96]
mulps xmm2,XMMWORD PTR [esi-96]
mulps xmm3,XMMWORD PTR [esi-96]
mulps xmm4,XMMWORD PTR [esi-96]
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
addps xmm0,xmm4

movaps xmm1,XMMWORD PTR [edi-80]
movaps xmm2,XMMWORD PTR [edi+eax-80]
movaps xmm3,XMMWORD PTR [edi+eax*2-80]
movaps xmm4,XMMWORD PTR [edi+ebx-80]
mulps xmm1,XMMWORD PTR [esi-80]
mulps xmm2,XMMWORD PTR [esi-80]
mulps xmm3,XMMWORD PTR [esi-80]
mulps xmm4,XMMWORD PTR [esi-80]
addps xmm5,xmm1
addps xmm6,xmm2
addps xmm7,xmm3
addps xmm0,xmm4

.

..

...

=========================================================================

I'm not getting stellar performance in each package above. Each
package contains 32 floating point operations, and I believe each
package takes 13 cycles to execute.

Consequently, the maximum throughput would be 32/13 = 2.46
FLOPS/cycle. This is much too low. Does anybody see anything wrong
with how I've set up these instructions? I do realize that the first
move instruction in each package is 4 bytes and the other 3 are 5
bytes, which means I cannot decode more than 1 in any given clock
cycle. Is this a problem, or can the Athlon only decode AND EXECUTE 1
movaps instruction per clock cycle?

Thanks for any assistance...

tim wilkens