gcc inline asm, can't do even simple things :-( 

OK, I can't get my head around gcc, inline assembly, and the FPU.
The following has comments indicating what I'm trying to do:

#include <stdlib.h>
#include <stdio.h>

typedef unsigned long long i52;

static i52 p;
static double pr;

int main(int argc, char**argv)
{
  int i;
  i52 v=3,v2=3;
  p=argv[1]?atoll(argv[1]):1234567891234567;

  // I want pr = 1.0/p, but I also want to leave
  // pr on the FP stack as I'll be using it repeatedly

  __asm__ __volatile__
    ( "fild p                  \n\t"      // FPU: p
      "fld1                    \n\t"      // FPU: 1   p
      "fdivrp %%st(0), %%st(1) \n\t"      // FPU: 1/p
      "fst pr"
      : : : "memory" );

  // check - have I done the divide?
  printf("p=%llu, pr=%f\n", p, pr);

  for(i=1; i<10; ++i)
    {
      // do some funky stuff
    }

  // erm, I think I should pop that reciprocal
  __asm__ __volatile__
    ( "fstp pr" );

  return 0;
}

And here's the output:
<<<
bash-2.05b$ ./asm 12345
p=12345, pr=0.000000

So I can't do shit in assembly, which is depressing, as when I was younger
I could. I tried different things, but basically couldn't do anything with
floating point values, which given the purpose of the FPU is a bit of a
shame.

Begging, grovelling, can someone tell me what I'm doing wrong?

Cheers,
Phil



Sun, 26 Jun 2005 08:05:28 GMT  

If the name "i52" indicates that you are using only 52-bit integers then you
can solve your problem perfectly without any inline asm code at all.  Just
use a signed type, because the signed_integer-to-double conversion will
always be exact and the generated code will be optimal (a single machine
instruction):

  typedef long long i52;

  i52 p;
  double pr;

  /* initialise p to something != 0 */

  pr = 1.0 / (double) p;

If you are going to use "pr" very often afterwards, then gcc's optimizer will
likely keep it in an FPU register, anyway.  If in doubt, take a look at the
generated assembly code.
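
As a minimal sketch of the plain-C route described above (the helper names reciprocal_i52 and converts_exactly are made up for illustration and do not come from the thread), the following takes the reciprocal without any inline asm and checks that a value survives the trip through double unchanged:

```c
#include <assert.h>

/* Hypothetical helper: reciprocal of a value assumed to fit in 52 bits. */
double reciprocal_i52(long long p)
{
    assert(p != 0);                 /* caller must not pass zero */
    return 1.0 / (double) p;        /* plain C, no inline asm needed */
}

/* Hypothetical helper: returns 1 when p converts to double and back
   without changing, 0 otherwise. */
int converts_exactly(long long p)
{
    return (long long) (double) p == p;
}
```

For instance, reciprocal_i52(4) gives 0.25, and converts_exactly() returns 1 for the default value 1234567891234567 used in the original program, since that value is below 2^52.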

If you insist on the type "unsigned long long", but want to avoid all the
conversion code that gcc generates for an unsigned-to-double type cast, then
you need only a short asm snippet to load a 64-bit integer into the FPU.  But
beware, the FPU always treats it as a *signed* 64-bit quantity, so keep your
values below 2^63.  The solution is:

  typedef unsigned long long i63;

  i63 p;
  double pr;

  /* initialise p to something != 0 */

  /* Equivalent to:  pr = (double) p; */
  __asm__ ("fildq %1"
           : "=t" (pr)   /* result on _t_op of FPU stack (in variable pr) */
           : "m" (p));   /* load p from _m_emory */

  pr = 1.0 / pr;
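
The snippet above could be wrapped in a function roughly like this. It is only a sketch: the name load_i63, the range check, and the non-x86 fallback are assumptions added for illustration, and the asm path is only compiled when gcc-style inline asm and an x86 target are detected.

```c
#include <assert.h>

/* Hypothetical wrapper around the fildq load; p must stay below 2^63. */
double load_i63(unsigned long long p)
{
    double d;

    assert(p < (1ULL << 63));       /* the FPU would read larger values as negative */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("fildq %1"             /* push the 64-bit integer onto the x87 stack */
             : "=t" (d)             /* result is taken from st(0) */
             : "m" (p));            /* operand is read from memory */
#else
    d = (double) (long long) p;     /* assumed fallback: ordinary signed conversion */
#endif
    return d;
}
```

The reciprocal is then just 1.0 / load_i63(p), left to the compiler as in the post above.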

---
Joe Leherbauer             Leherbauer at telering dot at

"Somewhere something incredible is waiting to be known."
                                 -- Isaac Asimov



Sun, 26 Jun 2005 17:33:23 GMT  

Quote:

> If the name "i52" indicates that you are using only 52-bit integers then you

Absolutely, I have a suite of C functions that assume I can map them to
doubles with no loss of precision, and another set based on i64s, which
can't make that assumption. (I code mainly for Alpha, and only occasionally
x86, but I understand 80-bit floats will be useful there.)

Quote:
> can solve your problem perfectly without any inline asm code at all.  Just
> use a signed type, because the signed_integer-to-double conversion will
> always be exact and the generated code will be optimal (a single machine
> instruction):

Yeah, that part was just so that I could get started. The routine I wanted
to write loaded loads of ints into the FPU, did stuff, and then output an
i52. It was actually only the final float->int instruction that I wanted to
hand code, but that would mean that I'd have to get my hands dirty in the FP
stages too. The reason I wanted to change the builtin float->int was because
gcc output a complete dog's breakfast. Alas, it included 3 'cw' instructions
as it changed the default rounding mode (it makes no assumption about what
that is) into C's preferred (and, for my app, my preferred) rounding mode:
round to zero. After the single instruction, it then restores the rounding
mode to whatever it was before.

Quote:
> If you are going to use "pr" very often afterwards, then gcc's optimizer will
> likely keep it in a FPU register, anyway.  If in doubt, take a look at the
> generated assembly code.

It does seem quite smart about register allocation, and the 'cw' was gleaned
from the .s file. I may not know what to change, but I do know where to look...
The gcc mailing list archives provided me with what I needed, which was how
to do a float->int cast. I had to work out the 'long long' version myself,
and I'm still not sure I've got it right.

/* round float to long long in various precision modes */

static __inline long long
fastf2ll (double x)
{
  __volatile long long result;
  __asm __volatile ("fistpq %0" : "=m" (result) : "t" (x) );
  return result;
}

Quote:
> If you insist on the type "unsigned long long", but want to avoid all the
> conversion code that gcc generates for an unsigned-to-double type cast, then
> you need only a short asm snippet to load a 64-bit integer into the FPU.  But
> beware, the FPU always treats it as a *signed* 64-bit quantity, so limit your
> values to 2^63.  The solution is:

I think I'll be signed anyway. And I always have to stay 2 bits short of the
mantissa size anyway, so I'll never be above 62 bits in my i64 type.

Quote:
>   typedef unsigned long long i63;

>   i63 p;
>   double pr;

>   /* initialise p to something != 0 */

>   /* Equivalent to:  pr = (double) p; */
>   __asm__ ("fildq %1"
>            : "=t" (pr)   /* result on _t_op of FPU stack (in variable pr) */
>            : "m" (p));   /* load p from _m_emory */

>   pr = 1.0 / pr;

I'll add that to my x86 snippets file, thanks. I really can't get my head
around the exceptionally powerful, but learning curve from hell, gcc syntax.

After about 8 hours of googling, I'd still not come across the 'q' suffix.
I'm guessing it means 'quad', i.e. 64-bit. Some of the docs I read said that
"t" meant "temporary", i.e. 80-bit float, rather than "top".

About 2 hours later, I came across the code I posted above. And about an
hour later I came across references to the lrint() and llrint() functions,
which would have done what I wanted all along!
All in all, I spent an entire day working on _1_ instruction.

Worst of all - it only saved me 1% getting rid of the 'cw' instructions.

No, that's not worst - worst is that when I changed from gcc-2.95 to gcc-3.2
(worse!) to gcc-3.0, I managed to shave a further 2% off. So I could have had 3
times the effect on runtime simply by 5 seconds of makefile tweaking at the
outset.

So I now have a pure C version that's 3-1/2 times faster than my mate's asm
optimised version, but I want it 4 times faster, so I'm going to have to hit
assembly anyway.

Are there any other wide arithmetic types available on more advanced
processors? I don't mind assuming P!!! or Athlon, but I don't have a P4, so
if I were to try and code for that I'd be coding blind. I looked at MMX,
and it's nothing but silly byte/short nonsense, and 3D-Now's single precision
floats, again useless for ~64 bit maths (with >100 bit intermediates).
I've really not kept up with the modern processors, I'm ashamed to say.
(What am I talking about? I went 64-bit last millennium, though about 5
years after many alpha owners!)

Thanks for your help, expect some more ineffectual source to be posted
in the near future!

Cheers,
Phil



Mon, 27 Jun 2005 10:54:37 GMT  

Quote:

> /* round float to long long in various precision modes */

> static __inline long long
> fastf2ll (double x)
> {
>   __volatile long long result;
>   __asm __volatile ("fistpq %0" : "=m" (result) : "t" (x) );
>   return result;
> }

1. "volatile" is not necessary

2. Something else is missing:

You have to tell gcc that you have just popped the top of the FPU register
stack.  This is done with a "clobber" argument (: "st"):

  long long i;
  double d;

  __asm__ ("fistpq %0" : "=m" (i) : "t" (d) : "st");
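
Put back into function form, the corrected statement might look like the sketch below. The range check and the non-x86 fallback are assumptions added for illustration. Note that the fallback rounds halves away from zero, so it can differ from the x87 result on exact ties such as 2.5, where the FPU uses whatever rounding mode is currently set (round to nearest even by default).

```c
#include <assert.h>

/* Sketch of fastf2ll with the "st" clobber: converts using the FPU's
   current rounding mode instead of C's truncation. */
long long fastf2ll(double x)
{
    long long result;

    assert(x > -9.0e18 && x < 9.0e18);  /* stay inside the signed 64-bit range */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("fistpq %0"            /* store st(0) as a 64-bit integer and pop */
             : "=m" (result)        /* result is written to memory */
             : "t" (x)              /* x is placed in st(0) */
             : "st");               /* tell gcc that st(0) has been popped */
#else
    /* Assumed fallback for other targets: round half away from zero. */
    result = (long long) (x < 0 ? x - 0.5 : x + 0.5);
#endif
    return result;
}
```

With the default rounding mode, fastf2ll(2.7) gives 3, whereas the plain cast (long long) 2.7 truncates to 2.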

---
Joe Leherbauer             Leherbauer at telering dot at

"Somewhere something incredible is waiting to be known."
                                 -- Isaac Asimov



Mon, 27 Jun 2005 16:12:37 GMT  


Quote:
> OK, I can't get my head around gcc, inline assembly, and the FPU.
> The following has comments indicating what I'm trying to do:

> #include <stdlib.h>
> #include <stdio.h>

> typedef unsigned long long i52;

> static i52 p;
> static double pr;

> int main(int argc, char**argv)
> {
>   int i;
>   i52 v=3,v2=3;
>   p=argv[1]?atoll(argv[1]):1234567891234567;

>   // I want pr = 1.0/p, but I also want to leave
>   // pr on the FP stack as I'll be using it repeatedly

>   __asm__ __volatile__
>     ( "fild p                  \n\t"      // FPU: p

This should be:
|      ( "fildl p                \n\t"        // FPU: p
to save GAS any confusion (the l suffix meaning long)

Quote:
>       "fld1                    \n\t"      // FPU: 1   p
>       "fdivrp %%st(0), %%st(1) \n\t"      // FPU: 1/p
>       "fst pr"

This should be:
|        "fstl pr"
to save GAS any confusion (the l suffix here meaning double)

Quote:
>       : : : "memory" );

To be safe, this should be
|    __asm__ __volatile__
|        ("fildl %1\n\t"                                /* FPU: p */
|         "fld1\n\t"                                     /* FPU: 1    p */
|         "fdivrp %%st(0), %%st(1)\n\t"    /* FPU: 1/p */
|         "fstl %0"
|         : "=m" (pr) : "rm" (p) );

Quote:

>   // check - have I done the divide?
>   printf("p=%llu, pr=%f\n", p, pr);

>   for(i=1; i<10; ++i)
>     {
>       // do some funky stuff
>     }

>   // erm, I think I should pop that reciprocal
>   __asm__ __volatile__
>     ( "fstp pr" );

Are you wanting to save the reciprocal? If yes:
|      ( "fstpl pr" );
or
|      ( "fstpl %0" : "=m" (pr) );

If no, then:
|      ( "ffreep %st(0)" );

Quote:

>   return 0;
> }

> And here's the output:
> <<<
> bash-2.05b$ ./asm 12345
> p=12345, pr=0.000000

Well, GAS probably thinks

Quote:
> So I can't do shit in assembly, which is depressing, as when I was younger
> I could. I tried different things, but basically couldn't do anything with
> floating point values, which given the purpose of the FPU is a bit of a
> shame.

> Begging, grovelling, can someone tell me what I'm doing wrong?

> Cheers,
> Phil



Mon, 27 Jun 2005 20:03:36 GMT  

Quote:

> To be safe, this should be
> |    __asm__ __volatile__
> |        ("fildl %1\n\t"                                /* FPU: p */
> |         "fld1\n\t"                                     /* FPU: 1    p */
> |         "fdivrp %%st(0), %%st(1)\n\t"    /* FPU: 1/p */
> |         "fstl %0"
> |         : "=m" (pr) : "rm" (p) );

Woo woo! Every line wrong!
Dare I say it, but I'm trying to avoid assembly, I'd rather code C that
compiles into what I'd probably want to write (gcc on Alpha's like that, I write
asm using C statements, with usually a 1-1 mapping!).
So I'm now down to 1 __asm__, and one instruction at that, in the whole code.

I've

Quote:
> Are you wanting to save the reciprocal? If yes:
> |      ( "fstpl pr" );
> or
> |      ( "fstpl %0" : "=m" (pr) );

> If no, then:
> |      ( "ffreep %st(0)" );

You're a smart cookie, Ben, you worked out what the _real_ questions were as
well as answering the ones that I actually asked!

Quote:
> Well, GAS probably thinks
>> So I can't do shit in assembly,

It would probably be right! :-)

I know no one's interested, but I worked out that if I change my memory
usage (I have a hash-table, and I'm changing its size/associativity), then I
can increase my speed by 33%, maybe more. So _currently_ I'm looking more at
getting memory usage right, in any language (so in C), and I'll return to
assemblifying the critical bits when I've got that bit as good as I can.
(Terje's signature comes to mind here)

Thanks immensely for your input.
Cheers,
Phil



Tue, 28 Jun 2005 02:03:23 GMT  

Quote:


>> /* round float to long long in various precision modes */

>> static __inline long long
>> fastf2ll (double x)
>> {
>>   __volatile long long result;
>>   __asm __volatile ("fistpq %0" : "=m" (result) : "t" (x) );
>>   return result;
>> }

> 1. "volatile" is not necessary

> 2. Something else is missing:

> You have to tell gcc that you have just popped the top of the FPU register
> stack.  This is done with a "clobber" argument (: "st"):

>   long long i;
>   double d;

>   __asm__ ("fistpq %0" : "=m" (i) : "t" (d) : "st");

Joe - you're a miracle worker!

By posting that yesterday, you fixed a bug that didn't even come into
existence until about 10 minutes ago. I was lazy and continued without
your changes, until ~weird shit~ started happening, and as soon as it did
happen I decided perhaps it was time to follow your directions. Thanks
for looking at the entrails of my code, and for predicting the future from
them, you got it spot on!

(And my reworked memory architecture is coming on nicely, it should be the
fastest yet.)

Thanks again,
Phil



Tue, 28 Jun 2005 07:24:59 GMT  

Quote:
> > Well, GAS probably thinks
> >> So I can't do shit in assembly,

> It would probably be right! :-)

Sorry about above. I posted before I was finished. Silly me.
What I meant to say was:

By default, GAS (GNU Assembler) assumes that the type is single precision
float when dealing with a floating point operation with a memory operand and
without an explicit type suffix.
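
To see why a missing suffix matters, here is a hedged illustration in portable C rather than asm. It assumes a little-endian machine such as x86, and the function name store_as_single is hypothetical. It mimics a 4-byte single-precision store landing in a zeroed 8-byte double, which may be part of why the original program printed 0.000000:

```c
#include <string.h>

/* Hypothetical demo: write a value as a 4-byte float into the first
   bytes of an 8-byte double, the way a suffix-less "fst pr" would. */
double store_as_single(double value)
{
    double slot = 0.0;                      /* like the zero-initialised static pr */
    float narrow = (float) value;           /* the single-precision form that gets written */

    memcpy(&slot, &narrow, sizeof narrow);  /* only the first 4 bytes of slot change */
    return slot;
}
```

On a little-endian machine the four bytes land in the low half of the double's mantissa, so the result is a tiny denormal that prints as 0.000000 with %f instead of the intended value.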

And, whoops, I should have put fildq instead of fildl.

Strangely, I could not find atoll in libc.

OK. The following works on my system. Please don't flame me that you have
already fixed it.

#include <stdlib.h>
#include <stdio.h>

typedef unsigned long long i52;

static i52 p;
static double pr;

int main(int argc, char**argv)
{
  int i;
  i52 v=3,v2=3;
  p=argv[1]?strtoll(argv[1], NULL, 10):1234567891234567;

  // I want pr = 1.0/p, but I also want to leave
  // pr on the FP stack as I'll be using it repeatedly

    __asm__ __volatile__
        ("fildq %1\n\t"                                /* FPU: p */
         "fld1\n\t"                                    /* FPU: 1    p */
         "fdivp %%st(0), %%st(1)\n\t"                  /* FPU: 1/p */
         "fstl %0"
         : "=m" (pr) : "m" (p) );                      /* "m": fildq needs a memory operand */

  // check - have I done the divide?
  printf("p=%llu, pr=%f\n", p, pr);

  for(i=1; i<10; ++i)
    {
      // do some funky stuff
    }

  // erm, I think I should pop that reciprocal
  __asm__ __volatile__
    ( "fstpl _pr" );

  return 0;
}

And it produces the following result:

E:\DJGPP>gcc -o test1 test1.c

E:\DJGPP>test1 12345
p=12345, pr=0.000081



Tue, 28 Jun 2005 18:10:30 GMT  
 