
Performance problems with MSVC 4.2
The Parallel Computing and Imaging Laboratory at Johns Hopkins University
is developing a custom set of linear algebra routines for Pentium Pro
machines. We are evaluating their performance using a variety of compilers
and ran into this problem with the Microsoft Visual C++ Enterprise Edition
Version 4.2.
Sample C Code:
for (k=0; k<p; k++) {
    for (i=0; i<m; i++) {
        r = A[i+k*m];
        for (j=0; j<n-4; j+=5) {
            C[j+i*n] += r*B[j+k*n];
            C[1+j+i*n] += r*B[1+j+k*n];
            C[2+j+i*n] += r*B[2+j+k*n];
            C[3+j+i*n] += r*B[3+j+k*n];
            C[4+j+i*n] += r*B[4+j+k*n];
        }
        for (j2=j; j2<n; j2++)
            C[j2+i*n] += r*B[j2+k*n];
    }
}
where p=100, m=100, and n=100. Using the "maximize speed" optimization,
the code takes 131ms to run on a 133 MHz Pentium Pro. Using the
"minimize size" optimization, the code takes 45ms to execute.
We believe the discrepancy can be explained by looking at the inner loop of
the corresponding assembly code:
Maximize Speed:
00a6 eb 03 jmp L4
00a8 dd 5b f8 L3 fstp qword ptr -8H[ebx]
00ab dd 45 00 L4 fld qword ptr +0H[ebp]
00ae dc 4c 24 10 fmul qword ptr +10H[esp]
00b2 83 c5 28 add ebp,00000028H
00b5 83 c3 28 add ebx,00000028H
00b8 48 dec eax
00b9 dc 43 d8 fadd qword ptr -28H[ebx]
00bc dd 5b d8 fstp qword ptr -28H[ebx]
00bf dd 45 e0 fld qword ptr -20H[ebp]
00c2 dc 4c 24 10 fmul qword ptr +10H[esp]
00c6 dc 43 e0 fadd qword ptr -20H[ebx]
00c9 dd 5b e0 fstp qword ptr -20H[ebx]
00cc dd 45 e8 fld qword ptr -18H[ebp]
00cf dc 4c 24 10 fmul qword ptr +10H[esp]
00d3 dc 43 e8 fadd qword ptr -18H[ebx]
00d6 dd 5b e8 fstp qword ptr -18H[ebx]
00d9 dd 45 f0 fld qword ptr -10H[ebp]
00dc dc 4c 24 10 fmul qword ptr +10H[esp]
00e0 dc 43 f0 fadd qword ptr -10H[ebx]
00e3 dd 5b f0 fstp qword ptr -10H[ebx]
00e6 dd 45 f8 fld qword ptr -8H[ebp]
00e9 dc 4c 24 10 fmul qword ptr +10H[esp]
00ed dc 43 f8 fadd qword ptr -8H[ebx]
00f0 75 b6 jne L3
00f2 dd 5b f8 fstp qword ptr -8H[ebx]
Minimize Size:
009b eb 03 jmp L4
009d dd 59 f8 L3 fstp qword ptr -8H[ecx]
00a0 dd 06 L4 fld qword ptr [esi]
00a2 dc 4d f8 fmul qword ptr -8H[ebp]
00a5 83 c6 28 add esi,00000028H
00a8 dc 01 fadd qword ptr [ecx]
00aa dd 19 fstp qword ptr [ecx]
00ac dd 46 e0 fld qword ptr -20H[esi]
00af dc 4d f8 fmul qword ptr -8H[ebp]
00b2 83 c1 28 add ecx,00000028H
00b5 48 dec eax
00b6 dc 41 e0 fadd qword ptr -20H[ecx]
00b9 dd 59 e0 fstp qword ptr -20H[ecx]
00bc dd 46 e8 fld qword ptr -18H[esi]
00bf dc 4d f8 fmul qword ptr -8H[ebp]
00c2 dc 41 e8 fadd qword ptr -18H[ecx]
00c5 dd 59 e8 fstp qword ptr -18H[ecx]
00c8 dd 46 f0 fld qword ptr -10H[esi]
00cb dc 4d f8 fmul qword ptr -8H[ebp]
00ce dc 41 f0 fadd qword ptr -10H[ecx]
00d1 dd 59 f0 fstp qword ptr -10H[ecx]
00d4 dd 46 f8 fld qword ptr -8H[esi]
00d7 dc 4d f8 fmul qword ptr -8H[ebp]
00da dc 41 f8 fadd qword ptr -8H[ecx]
00dd 75 be jne L3
00df dd 59 f8 fstp qword ptr -8H[ecx]
In our opinion, the use of [esp] in the "Maximize Speed" code is the most
likely culprit. We suspect that using the stack as temporary storage for "r"
places the floating-point number so that it straddles an 8-byte boundary,
which makes every fmul in the loop a misaligned load. Is a non-aligned
access the problem, or are we missing something else?
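One way to test the alignment hypothesis from C is to check the address of the temporary directly. The following is only a sketch: is_aligned8 is a hypothetical helper (not part of our routines), and it assumes a compiler that provides <stdint.h>, which an older compiler may not.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if p lies on an 8-byte boundary,
   0 otherwise. Assumes uintptr_t is available (C99 <stdint.h>). */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p % 8u) == 0;
}
```

Calling is_aligned8(&r) inside the function would report where "r" lives, with the caveat that taking the address of "r" forces it into memory and may change where the compiler places it, so the result is a hint rather than proof.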
It turns out the problem can be avoided by not using the variable "r" at
all. Other compilers keep "r" in st(0) on the floating-point stack, thus
avoiding the possibility of non-aligned loads. We typically program in the
above fashion because the use of the variable "r" sometimes encourages
compilers on RISC machines not to reload that element of A every time,
though in this simple case most compilers could recognize that without
the "r."
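For reference, the variant without "r" might look like the sketch below. The function name matmul_no_r and its parameter list are hypothetical wrappers added for illustration; the loop body is the sample code with A[i+k*m] written out in place of "r".

```c
/* Hypothetical wrapper: same kernel as the sample code, but reading
   A[i+k*m] directly instead of caching it in the temporary "r".
   C is m-by-n, B is p-by-n, and A is indexed as A[i+k*m]. */
void matmul_no_r(const double *A, const double *B, double *C,
                 int m, int n, int p)
{
    int i, j, j2, k;

    for (k = 0; k < p; k++) {
        for (i = 0; i < m; i++) {
            /* Main loop, unrolled by 5 as in the original. */
            for (j = 0; j < n-4; j += 5) {
                C[j+i*n]   += A[i+k*m]*B[j+k*n];
                C[1+j+i*n] += A[i+k*m]*B[1+j+k*n];
                C[2+j+i*n] += A[i+k*m]*B[2+j+k*n];
                C[3+j+i*n] += A[i+k*m]*B[3+j+k*n];
                C[4+j+i*n] += A[i+k*m]*B[4+j+k*n];
            }
            /* Remainder loop for the last n mod 5 columns. */
            for (j2 = j; j2 < n; j2++)
                C[j2+i*n] += A[i+k*m]*B[j2+k*n];
        }
    }
}
```

Whether the compiler hoists the load of A[i+k*m] out of the inner loop here depends on its alias analysis, since C and A could in principle overlap.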
If a non-aligned access is the problem, has the 5.0 compiler addressed this?