Slow floating point to integer conversion in VC++ -- Pentium Pro/II 

I have some code like this:

double x,y;
int ix,iy;

// do lots of floating point calculations
....

ix = (int)x;
iy = (int)y;

It turns out that 80% of the time is spent on the conversion to int.  VC++
5.0 makes each one of these into a call to the CRT function _ftol.  This
function does a lot of work (sorry for the lousy formatting):

_ftol:
push     ebp                        1
mov      ebp, esp                   1
add      esp, -12
wait                                1
fnstcw   WORD PTR [ebp-2]          2
wait                                1
mov      ax, WORD PTR [ebp-2]      1
or       ah, 12                     1
mov      WORD PTR [ebp-4], ax      2   ;; PPro_Partial_Stall_eax:12-16
fldcw    WORD PTR [ebp-4]          3+4      ;; PPro_Serialized
fistp    QWORD PTR [ebp-12]        10       ;; fdiv/fldcw_Stall:4
fldcw    WORD PTR [ebp-2]          3+4      ;;PPro_Serialized
mov      eax, DWORD PTR [ebp-12]  1
mov      edx, DWORD PTR [ebp-8]    1        ;; PPro_Mem_Stall:6-9
leave                               3
ret                                 3

What it is trying to do is set the floating point unit to truncate, rather
than use one of the other rounding modes, and then restore the original mode
after the fistp that actually stores the result.  Problem 1)  Lots of stalls
on the PPro/II.  Problem 2)  I'm doing a series of these conversions, so why
do I have to go through all the overhead of setting up and restoring this
state each time?
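
To make the bit manipulation in the listing concrete: `or ah, 12` sets bits 10 and 11 of the saved control word, which form the x87 rounding-control field. A minimal sketch in C of what that step computes, with hypothetical helper names (nothing here is taken from the CRT source):

```c
#include <stdint.h>

/* Bits 10-11 of the x87 control word are the rounding-control (RC) field:
   00 = nearest, 01 = down, 10 = up, 11 = toward zero (truncate). */
#define X87_RC_MASK 0x0C00u

/* Hypothetical helper: the value "or ah, 12" produces from the saved
   control word. 12 in AH is 0x0C00 when viewed as part of AX. */
uint16_t with_truncation(uint16_t cw)
{
    return (uint16_t)(cw | X87_RC_MASK);
}

/* Hypothetical helper: extract the RC field as a value from 0 to 3. */
unsigned rounding_control(uint16_t cw)
{
    return (cw & X87_RC_MASK) >> 10;
}
```

For example, the power-on default control word 0x037F becomes 0x0F7F; the unmodified value is what the second fldcw puts back.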

Does anyone know of something better to do?  My assembly skills are too rusty
to hand-code this stuff quickly.

John Gossman



Sun, 14 Jan 2001 03:00:00 GMT  
You can use inline assembler:

double xp = x + 0.5; // if you really need to trunc
__asm
    {
    fld    qword ptr xp
    fistp  dword ptr ix
    }



Mon, 15 Jan 2001 03:00:00 GMT  
That did it.  I was trying to get too complicated, worrying about the state
of the floating point registers at this point.

Basically, I substituted this inline assembly for the (int) casts and got my
function down from an average of 512 cycles to an average of 116 cycles.
Microsoft really should update their code generator.

    Thanks much,

    John Gossman



Tue, 16 Jan 2001 03:00:00 GMT  

Quote:

> That did it.  I was trying to get too complicated, worrying about the state
> of the floating point registers at this point.

> basically I substituted this inline assembly for the (int) casts and got my
> function down from an average
> 512 cycles to an average of 116 cycles.  Microsoft really should update
> their code-generator.

There's not that much MS (or any other) can really do with their C code
generators, not while still staying ANSI C compliant.

The real problem is that Andy Glew didn't manage to get Control Word
virtualization included in the PPro fpu core, mostly because he couldn't
find a good enough reason at the time.

Today, with all the 3D/image software out there that needs fp
coordinates converted to fixed-point pixel addresses, the need is
obvious. :-(

Quote:

>     Thanks much,

>     John Gossman


> >You can use inline assembler:

> >double xp = x + 0.5; // if you really need to trunc
> >__asm
> >    {
> >    fld    qword ptr xp
> >    fistp  dword ptr ix
> >    }

You can actually do even better than this, while avoiding inline asm, by
adding a 'magic' value instead. This works out especially well if you can
pipeline the conversion process:

  float temp = x + fpMagic;
  unsigned long ul = *(unsigned long *) &temp;
  ul &= 0x007fffff;

This uses one FADD, one FST(P), one MOV and an AND operation, taking just
5 cycles when properly pipelined.  The final AND can sometimes be
avoided by just adjusting for the fixed-value exponent residing in the
upper 9 bits.
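
The post does not give a value for fpMagic. A minimal sketch, assuming fpMagic is 2^23 (so the sum lands in [2^23, 2^24), where consecutive floats are exactly 1 apart) and using memcpy in place of the pointer cast to stay within the C aliasing rules. The function name is hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Assumed value of fpMagic: 2^23. The thread does not state it. */
static const float fpMagic = 8388608.0f;

/* Valid only for 0 <= x < 2^23. Rounds to nearest (ties to even)
   under the default rounding mode; it does not truncate. */
uint32_t float2uint(float x)
{
    volatile float temp = x + fpMagic;    /* force the sum to be stored as a float */
    float stored = temp;
    uint32_t bits;
    memcpy(&bits, &stored, sizeof bits);  /* copy out the raw IEEE-754 bits */
    return bits & 0x007fffffu;            /* keep the 23 mantissa bits */
}
```

The mask strips the sign and exponent (the upper 9 bits), leaving the integer that the addition shifted into the mantissa.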

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Fri, 19 Jan 2001 03:00:00 GMT  
    Thanks for the info, and the alternative, though since I'm working with
doubles rather than floats I think I'm back up to 7-8 cycles (even if it did
save me 2 cycles, that's pretty much buried in the 104 others the function
takes).
    As for Microsoft's code generator--here's the ftol() function code (with
some VTUNE notes):

Ftol:

push     ebp                        1
mov      ebp, esp                   1
add      esp, -12
wait                                1
fnstcw   WORD PTR [ebp-2]          2
wait                                1
mov      ax, WORD PTR [ebp-2]      1
or       ah, 12                     1
mov      WORD PTR [ebp-4], ax      2   ;; PPro_Partial_Stall_eax:12-16
fldcw    WORD PTR [ebp-4]          3+4      ;; PPro_Serialized
fistp    QWORD PTR [ebp-12]        10       ;; fdiv/fldcw_Stall:4
fldcw    WORD PTR [ebp-2]          3+4      ;;PPro_Serialized
mov      eax, DWORD PTR [ebp-12]  1
mov      edx, DWORD PTR [ebp-8]    1        ;; PPro_Mem_Stall:6-9
leave                               3
ret                                 3

    They could at least get rid of the stalls, and I'm not sure why they
couldn't in-line this code (at least as an optimizer option).

    -JG



Sat, 20 Jan 2001 03:00:00 GMT  

Quote:

>     Thanks for the info, and the alternate, though since I'm working with
> doubles rather than floats I think I'm back up to 7-8 cycles (even if it did
> save me 2-cycles, that's pretty much buried in the 104 others the function
> takes).

You can convert doubles just as easily, it just makes the code a little
less portable, because you have to have either int64 (i.e. long long)
support, or you must make it endian-dependent:

inline int64 double2int64(double x)
{
  double temp = x + dMagic;
  /* Read back the stored sum (temp, not x) and keep the low 51 mantissa bits. */
  return (*(int64 *) &temp) & 0x0007ffffffffffff;
}

This corresponds to the following asm code:

  fadd [dMagic]
  fstp [temp]
  mov eax,dword ptr [temp]
  mov edx,dword ptr [temp+4]
  and edx,0007ffffh

with the 64-bit result returned in the EDX:EAX.

To return just the lower 32 bits as an unsigned long is still portable:

inline uint32 double2uint32(double x)
{
  double temp = x + dMagic;
  /* Read back the stored sum (temp, not x) and keep only the low 32 bits. */
  return (uint32) (*(int64 *) &temp);
}

A good compiler will turn this into:

  fadd [dMagic]
  fstp [temp]
  mov eax,[temp]

The only stall here is due to the different memory access sizes: storing
a double and then loading a 32-bit register has to wait until the store
has retired; it cannot take advantage of the internal forwarding.

If you can convert several values at the same time, it should still be
possible to get very close to 4 cycles/conversion.
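
The value of dMagic is not given in the post. A minimal sketch, assuming dMagic is 1.5 * 2^52 (which is consistent with the 51-bit mask used above) and using memcpy plus the C99 fixed-width types instead of pointer casts. The function double2int32 is a hypothetical signed variant, not something from the thread:

```c
#include <stdint.h>
#include <string.h>

/* Assumed value of dMagic: 1.5 * 2^52. The thread does not state it. */
static const double dMagic = 6755399441055744.0;

/* For 0 <= x < 2^51: the rounded value sits in the low 51 mantissa bits. */
int64_t double2int64(double x)
{
    volatile double temp = x + dMagic;    /* force the sum to be stored as a double */
    double stored = temp;
    int64_t bits;
    memcpy(&bits, &stored, sizeof bits);  /* copy out the raw IEEE-754 bits */
    return bits & INT64_C(0x0007ffffffffffff);
}

/* For |x| < 2^31: the low 32 bits hold the rounded value in two's complement,
   so negative inputs work as well. */
int32_t double2int32(double x)
{
    volatile double temp = x + dMagic;
    double stored = temp;
    uint64_t bits;
    memcpy(&bits, &stored, sizeof bits);
    /* The narrowing to int32_t is implementation-defined in C but wraps
       on the usual two's-complement targets. */
    return (int32_t)(uint32_t)bits;
}
```

Both functions round to nearest (ties to even) under the default rounding mode, so they match a cast only for inputs that are already integers.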

Without int64 compiler support, you'll have to make do with the
endian-dependent version:

#define LITTLE_ENDIAN
inline uint32 double2uint32(double x)
{
  double temp = x + dMagic;
#ifdef LITTLE_ENDIAN
  return *(uint32*) &temp;       /* low 32 bits come first in memory */
#else
  return ((uint32*) &temp)[1];   /* low 32 bits are the second word */
#endif
}

>     As for Microsoft's code generator--here's the ftol() function code (with
> some VTUNE notes):

> Ftol:

> push     ebp                        1
> mov      ebp, esp                   1

The two instructions above are not needed; everything can be addressed
relative to ESP instead of EBP.

Quote:
> add      esp, -12
> wait                                1
> fnstcw   WORD PTR [ebp-2]          2
> wait                                1
> mov      ax, WORD PTR [ebp-2]      1
> or       ah, 12                     1

 OR AX,0C00h            ; Avoids the partial register stall (PRS)

Quote:
> mov      WORD PTR [ebp-4], ax      2   ;; PPro_Partial_Stall_eax:12-16
> fldcw    WORD PTR [ebp-4]          3+4      ;; PPro_Serialized
> fistp    QWORD PTR [ebp-12]        10       ;; fdiv/fldcw_Stall:4
> fldcw    WORD PTR [ebp-2]          3+4      ;;PPro_Serialized
> mov      eax, DWORD PTR [ebp-12]  1
> mov      edx, DWORD PTR [ebp-8]    1        ;; PPro_Mem_Stall:6-9
> leave                               3
> ret                                 3

>     They could at least get rid of the stalls, and I'm not sure why they
> couldn't in-line this code (at least as an optimizer option).

Inlining doesn't really help much for a ~100-cycle function; getting rid
of the call/ret overhead would save less than 10%.

Terje




Sat, 20 Jan 2001 03:00:00 GMT  
 