Performance problems with MSVC 4.2 

The Parallel Computing and Imaging Laboratory at Johns Hopkins University
is developing a custom set of linear algebra routines for Pentium Pro
machines.  We are evaluating their performance using a variety of compilers
and ran into the following problem with Microsoft Visual C++ Enterprise
Edition, Version 4.2.

Sample C Code:

for (k=0; k<p; k++) {
    for (i=0; i<m; i++) {
        r = A[i+k*m];
        for (j=0; j<n-4; j+=5) {
            C[j+i*n] += r*B[j+k*n];
            C[1+j+i*n] += r*B[1+j+k*n];
            C[2+j+i*n] += r*B[2+j+k*n];
            C[3+j+i*n] += r*B[3+j+k*n];
            C[4+j+i*n] += r*B[4+j+k*n];
        }
        for (j2=j; j2<n; j2++)
            C[j2+i*n] += r*B[j2+k*n];
    }
}

where p=100, m=100, and n=100.  With the "maximize speed" optimization,
the code takes 131 ms to run on a 133 MHz Pentium Pro.  With the
"minimize size" optimization, it takes 45 ms.
We believe the discrepancy can be explained by looking at a portion of the
corresponding assembly code:

Maximize Speed:

 00a6  eb 03                             jmp     L4
 00a8  dd 5b f8          L3              fstp    qword ptr -8H[ebx]
 00ab  dd 45 00          L4              fld     qword ptr +0H[ebp]
 00ae  dc 4c 24 10                       fmul    qword ptr +10H[esp]
 00b2  83 c5 28                          add     ebp,00000028H
 00b5  83 c3 28                          add     ebx,00000028H
 00b8  48                                dec     eax
 00b9  dc 43 d8                          fadd    qword ptr -28H[ebx]
 00bc  dd 5b d8                          fstp    qword ptr -28H[ebx]
 00bf  dd 45 e0                          fld     qword ptr -20H[ebp]
 00c2  dc 4c 24 10                       fmul    qword ptr +10H[esp]
 00c6  dc 43 e0                          fadd    qword ptr -20H[ebx]
 00c9  dd 5b e0                          fstp    qword ptr -20H[ebx]
 00cc  dd 45 e8                          fld     qword ptr -18H[ebp]
 00cf  dc 4c 24 10                       fmul    qword ptr +10H[esp]
 00d3  dc 43 e8                          fadd    qword ptr -18H[ebx]
 00d6  dd 5b e8                          fstp    qword ptr -18H[ebx]
 00d9  dd 45 f0                          fld     qword ptr -10H[ebp]
 00dc  dc 4c 24 10                       fmul    qword ptr +10H[esp]
 00e0  dc 43 f0                          fadd    qword ptr -10H[ebx]
 00e3  dd 5b f0                          fstp    qword ptr -10H[ebx]
 00e6  dd 45 f8                          fld     qword ptr -8H[ebp]
 00e9  dc 4c 24 10                       fmul    qword ptr +10H[esp]
 00ed  dc 43 f8                          fadd    qword ptr -8H[ebx]
 00f0  75 b6                             jne     L3
 00f2  dd 5b f8                          fstp    qword ptr -8H[ebx]

Minimize Size:

 009b  eb 03                             jmp     L4
 009d  dd 59 f8          L3              fstp    qword ptr -8H[ecx]
 00a0  dd 06             L4              fld     qword ptr [esi]
 00a2  dc 4d f8                          fmul    qword ptr -8H[ebp]
 00a5  83 c6 28                          add     esi,00000028H
 00a8  dc 01                             fadd    qword ptr [ecx]
 00aa  dd 19                             fstp    qword ptr [ecx]
 00ac  dd 46 e0                          fld     qword ptr -20H[esi]
 00af  dc 4d f8                          fmul    qword ptr -8H[ebp]
 00b2  83 c1 28                          add     ecx,00000028H
 00b5  48                                dec     eax
 00b6  dc 41 e0                          fadd    qword ptr -20H[ecx]
 00b9  dd 59 e0                          fstp    qword ptr -20H[ecx]
 00bc  dd 46 e8                          fld     qword ptr -18H[esi]
 00bf  dc 4d f8                          fmul    qword ptr -8H[ebp]
 00c2  dc 41 e8                          fadd    qword ptr -18H[ecx]
 00c5  dd 59 e8                          fstp    qword ptr -18H[ecx]
 00c8  dd 46 f0                          fld     qword ptr -10H[esi]
 00cb  dc 4d f8                          fmul    qword ptr -8H[ebp]
 00ce  dc 41 f0                          fadd    qword ptr -10H[ecx]
 00d1  dd 59 f0                          fstp    qword ptr -10H[ecx]
 00d4  dd 46 f8                          fld     qword ptr -8H[esi]
 00d7  dc 4d f8                          fmul    qword ptr -8H[ebp]
 00da  dc 41 f8                          fadd    qword ptr -8H[ecx]
 00dd  75 be                             jne     L3
 00df  dd 59 f8                          fstp    qword ptr -8H[ecx]
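
For anyone who wants to reproduce the two timings, a harness along the following lines may help. It is only a sketch: the element type double is an assumption (inferred from the qword operands in the listings), and the names kernel and time_kernel_ms, the fill values, and the checksum argument are hypothetical, not taken from our actual test program.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* The loop nest from the post: A is stored p x m, B is p x n, C is m x n.
   Element type double is an assumption based on the qword operands. */
static void kernel(const double *A, const double *B, double *C,
                   int p, int m, int n)
{
    int i, j, j2, k;
    double r;

    for (k = 0; k < p; k++) {
        for (i = 0; i < m; i++) {
            r = A[i + k*m];
            for (j = 0; j < n - 4; j += 5) {      /* unrolled by 5 */
                C[j + i*n]     += r * B[j + k*n];
                C[1 + j + i*n] += r * B[1 + j + k*n];
                C[2 + j + i*n] += r * B[2 + j + k*n];
                C[3 + j + i*n] += r * B[3 + j + k*n];
                C[4 + j + i*n] += r * B[4 + j + k*n];
            }
            for (j2 = j; j2 < n; j2++)            /* remainder columns */
                C[j2 + i*n] += r * B[j2 + k*n];
        }
    }
}

/* Runs the kernel `reps` times and returns the average time per call in
   milliseconds, or -1.0 if allocation fails. The last element of C is
   written to *checksum (if non-NULL) so the work cannot be optimized away. */
double time_kernel_ms(int p, int m, int n, int reps, double *checksum)
{
    size_t na, nb, nc, t;
    double *A, *B, *C, ms;
    clock_t start, stop;
    int rep;

    assert(p > 0 && m > 0 && n > 0 && reps > 0);
    na = (size_t)p * m;
    nb = (size_t)p * n;
    nc = (size_t)m * n;
    A = malloc(na * sizeof *A);
    B = malloc(nb * sizeof *B);
    C = calloc(nc, sizeof *C);
    if (!A || !B || !C) {
        free(A); free(B); free(C);
        return -1.0;
    }
    for (t = 0; t < na; t++) A[t] = 1.0;   /* arbitrary fill values */
    for (t = 0; t < nb; t++) B[t] = 2.0;

    start = clock();
    for (rep = 0; rep < reps; rep++)
        kernel(A, B, C, p, m, n);
    stop = clock();

    if (checksum)
        *checksum = C[nc - 1];
    ms = 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC / reps;
    free(A); free(B); free(C);
    return ms;
}
```

Since clock() is coarse, reps should be large enough that the whole run takes a second or more; time_kernel_ms(100, 100, 100, 1000, &cs) built under each optimization setting gives a per-call figure that can be compared with the ones quoted above.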

The use of [esp] in the "Maximize Speed" code is the most likely culprit in
our opinion.  That version addresses "r" as +10H[esp], while the "Minimize
Size" version addresses it as -8H[ebp].  We suspect that using this stack
slot as temporary storage for "r" forces the floating-point number to
straddle an 8-byte boundary.  Is a non-aligned access the problem, or are
we missing something else?
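
The straddling we suspect can be checked with simple address arithmetic. The sketch below is only illustrative: crosses_boundary and is_aligned are hypothetical helper names, and the esp values in the note after it are made up for the example. It assumes only that esp is 4-byte aligned and that "r" sits at esp+10H, as in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an object of `size` bytes starting at `addr` crosses a
   multiple of `boundary` (which must be a power of two), 0 otherwise. */
int crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(size > 0 && boundary > 0 && (boundary & (boundary - 1)) == 0);
    return (addr / boundary) != ((addr + size - 1) / boundary);
}

/* Returns 1 if pointer `p` is a multiple of `alignment`, 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment > 0);
    return ((uintptr_t)p % alignment) == 0;
}
```

With a hypothetical esp of 1004H (4-byte aligned only), crosses_boundary(0x1004 + 0x10, 8, 8) returns 1, so an 8-byte "r" at +10H[esp] would straddle an 8-byte boundary; with esp at 1008H it returns 0. Passing 32 as the boundary gives the same check against a 32-byte cache line. Printing is_aligned(&r, 8) from inside the routine would show which case actually occurs, provided taking the address of "r" does not itself change where the compiler places it.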

It turns out the problem can be avoided by not using the variable "r" at
all.  Other compilers keep "r" in st(0) on the floating-point stack, thus
avoiding the possibility of non-aligned loads.  We typically program in the
above fashion because the variable "r" sometimes encourages compilers on
RISC machines not to reload that element of A every time, though in this
simple case most compilers could recognize that without "r".
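
For reference, the variant without "r" looks roughly like the sketch below. The function name kernel_no_temp, the parameter list, and the element type double are assumptions made for this example; only the loop body follows the sample code above.

```c
#include <assert.h>

/* Same loop nest as the sample code, but A[i+k*m] is written out in each
   statement instead of being copied into a temporary first. */
void kernel_no_temp(const double *A, const double *B, double *C,
                    int p, int m, int n)
{
    int i, j, j2, k;

    assert(A && B && C);
    for (k = 0; k < p; k++) {
        for (i = 0; i < m; i++) {
            for (j = 0; j < n - 4; j += 5) {      /* unrolled by 5 */
                C[j + i*n]     += A[i + k*m] * B[j + k*n];
                C[1 + j + i*n] += A[i + k*m] * B[1 + j + k*n];
                C[2 + j + i*n] += A[i + k*m] * B[2 + j + k*n];
                C[3 + j + i*n] += A[i + k*m] * B[3 + j + k*n];
                C[4 + j + i*n] += A[i + k*m] * B[4 + j + k*n];
            }
            for (j2 = j; j2 < n; j2++)            /* remainder columns */
                C[j2 + i*n] += A[i + k*m] * B[j2 + k*n];
        }
    }
}
```

One caveat with this form: when A and C are plain pointers, the compiler cannot assume that a store to C leaves A[i+k*m] unchanged, so it may reload that element after every store. That is the reloading the temporary was meant to discourage.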

If a non-aligned access is the problem, has the 5.0 compiler addressed this?



Tue, 14 Dec 1999 03:00:00 GMT  
 Performance problems with MSVC 4.2

It is difficult to track this down without more information. Are A, B, and
C int, float, or double? Are you working under NT or W95? Under W95 there
is a well-known problem with running 16-bit code on the PPro, so it is
desirable to align data on 32-bit boundaries. Size optimization aligns data
at the minimum allowed by your OS. To my knowledge there is no way to
generate optimal code for the PPro in VC++ 4.2.
I'm very interested in this problem. Please send me an e-mail with more
data (including this post) and I'll respond after a more in-depth analysis.

dr. Ovidiu Popa






Tue, 14 Dec 1999 03:00:00 GMT  
 