Longevity of xor eax,eax on pII 

Hi,

Lots has been said about the benefits of using

xor eax,eax

or some other way of zeroing a 32-bit register to avoid partial register
stalls on the PII.

At what point does the PII suddenly decide that the reg in question is
now (not) being accessed in a partial manner?

If I access it only by byte parts, i.e. only ah,al, is that ok, or...
what?  When do I start incurring penalties again?

Jim



Tue, 10 Jul 2001 03:00:00 GMT  

Quote:



> > Hi,

> > Lots has been said about the benefits of using

> > xor eax,eax

> > or some other way of zeroing a 32-bit register to avoid partial register
> > stalls on the PII.

> > At what point does the PII suddenly decide that the reg in question is
> > now (not) being accessed in a partial manner?

> > If I access it only by byte parts, i.e. only ah,al, is that ok, or...
> > what?  When do I start incurring penalties again?

> As far as I understand, it works like this. All registers have a "dirty"
> flag associated with them. If you perform a XOR reg32,reg32 (and possibly
> a SUB reg32,reg32) this internal marker gets cleared. If you write to a
> partial register (word or byte access) and the flag is clear, the processor
> "knows" that it can just zero-extend the data and write the whole 32 bits
> of the register. If the flag is set on the other hand, the processor "knows"
> it needs to merge your new data with old data from the remaining parts of
> the register, and a stall occurs while the merge is in progress. Any
> instruction writing a register or parts of it (except of course XOR
> itself) causes the register to be marked as dirty.

> I am sure Agner Fog (or Vesa, or Terje) will set me straight if I explained
> this incorrectly.

No, we won't because your explanation is exactly right.

The only key issue you left out is the fact that these 'dirty' bits are
not a part of the visible processor state, i.e. there is no way to
save/restore them across an interrupt.

This means that the PRS-avoiding hack only works when the distance (in
time) between the XOR EAX,EAX and the MOV AL,data/use of EAX is so short
that there hasn't been any kind of interrupt in the meantime.

Andy Glew once told me that this very gotcha had caught some people at Intel:

The key register(s) were XOR'ed outside a loop with a very high
iteration count, so during execution a majority of the time spent in the
loop would be after the first external interrupt. This made the loop run
_much_ slower than it should have since every partial access/full-width
use combination generated another PRS.

I have actually seen behaviour like this for code which really shouldn't
have been able to run long enough to give a noticeable chance of an
external interrupt: This code ran exactly twice as fast after I made two
separate versions of it, using the original for Pentium and lower, and
the new for PPro+.

The PPro version used MOVZX exclusively, and it ran very close to the
theoretical speed.
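
To make the difference concrete, here is a minimal C sketch of the architectural results only; it does not model the stall itself. The helper names write_low_byte and movzx_byte are made up for illustration. A byte write such as MOV AL,data has to keep the upper 24 bits of EAX, which is the merge described above, while MOVZX replaces all 32 bits and so leaves nothing to merge.

```c
#include <stdint.h>

/* Hypothetical helper: the value EAX holds after MOV AL,b.
   The low 8 bits are replaced; bits 8..31 of the old value
   must be preserved, which is the merge the thread refers to. */
uint32_t write_low_byte(uint32_t eax, uint8_t b)
{
    return (eax & 0xFFFFFF00u) | (uint32_t)b;
}

/* Hypothetical helper: the value EAX holds after MOVZX EAX,b.
   All 32 bits are written, so the old contents do not matter. */
uint32_t movzx_byte(uint8_t b)
{
    return (uint32_t)b;
}
```

When the old value is zero, as it is right after XOR EAX,EAX, both helpers return the same result. That is the case the XOR trick lets the processor assume.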

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"



Fri, 13 Jul 2001 03:00:00 GMT  
On 25 Jan 1999 15:43:25 GMT, Terje Mathisen

Quote:



>> > Hi,

>> > Lots has been said about the benefits of using

>> > xor eax,eax

Thanks Terje et al.

Basically, the registers stay clean for as little time as possible.
Even an interrupt can affect the current status of the registers.

If I use al, and then access ah, I still suffer a stall, because eax
now sees that it has to combine the bits together.

Arse.  I'll be re-coding my clip flags routine again tomorrow.

Scenario:
I have a routine which clips homogeneous coordinates,
i.e. it has to do

x > z
x < -z
y > z
y < -z

I do this ATM with
xor eax,eax

cmp foo,bar
sets ah
or al,ah
add al,al
..repeat for each clip plane.
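
For readers who want the logic without the register details, the same accumulation can be sketched in C. This is only an illustration under assumptions: the names push_flag and clip_code are invented here, the coordinates are taken to be floats, and the post does not say how foo and bar map onto x, y and z. Whether a compiler turns the comparisons into SETcc-style branch-free code is also not guaranteed.

```c
#include <stdint.h>

/* Hypothetical helper mirroring one "or al,ah / add al,al" step:
   OR the new 0/1 flag into the code, then shift left by one. */
uint32_t push_flag(uint32_t code, int outside)
{
    return (code | (uint32_t)(outside != 0)) << 1;
}

/* Hypothetical helper: builds the clip code for the four tests
   listed in the post, in the same order. */
uint32_t clip_code(float x, float y, float z)
{
    uint32_t code = 0;                 /* xor eax,eax */
    code = push_flag(code, x >  z);
    code = push_flag(code, x < -z);
    code = push_flag(code, y >  z);
    code = push_flag(code, y < -z);
    return code;                       /* bit 0 is always 0 here */
}
```

Because every flag is shifted after it is ORed in, the first test ends up in bit 4 and the last in bit 1. A point inside all four planes gives 0.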

This has the (massive) advantage of having no jumps which kill the PII
worse than PRS - mainly because there is no way of branch predicting
this sort of code.

It would probably be better to move the flags into a new register, but
at a guess from what you say, the 'dirty' flag will be set if I
adc, shl etc.  on any register.

Jim



Sun, 15 Jul 2001 03:00:00 GMT  

Quote:

> > As far as I understand, it works like this. All registers have a "dirty"
> > flag associated with them. If you perform a XOR reg32,reg32 (and possibly
> > a SUB reg32,reg32) this internal marker gets cleared. If you write to a
> > partial register (word or byte access) and the flag is clear, the processor
> > "knows" that it can just zero-extend the data and write the whole 32 bits
> > of the register. If the flag is set on the other hand, the processor "knows"
> > it needs to merge your new data with old data from the remaining parts of
> > the register, and a stall occurs while the merge is in progress. Any
> > instruction writing a register or parts of it (except of course XOR
> > itself) causes the register to be marked as dirty.

> > I am sure Agner Fog (or Vesa, or Terje) will set me straight if I explained
> > this incorrectly.

> No, we won't because your explanation is exactly right.

> The only key issue you left out is the fact that these 'dirty' bits are
> not a part of the visible processor state, i.e. there is no way to
> save/restore them across an interrupt.

> This means that the PRS-avoiding hack only works when the distance (in
> time) between the XOR EAX,EAX and the MOV AL,data/use of EAX is so short
> that there hasn't been any kind of interrupt in the meantime.

> Andy Glew once told me that this very gotcha had caught some people at Intel:

> The key register(s) were XOR'ed outside a loop with a very high
> iteration count, so during execution a majority of the time spent in the
> loop would be after the first external interrupt. This made the loop run
> _much_ slower than it should have since every partial access/full-width
> use combination generated another PRS.

Ahahahahahaha!!!  I'm sorry but that is just too funny.  I can just see
Intel trying to explain this in their documentation.  It's an interesting
anomaly that perhaps Agner should add to his infamous guide.

Anyhow, as Agner indicates, if the register retires and leaves
the "forwarding space" I think that the partial registers are collected,
and I would assume that this flag is also turned off.  What this suggests
to me is that you can simply wait long enough before using the register
again, and it will automatically be recollected back into a uniform
register.  I have not confirmed this myself, however.

--
Paul Hsieh



Sun, 22 Jul 2001 03:00:00 GMT  

Quote:


> > The key register(s) were XOR'ed outside a loop with a very high
> > iteration count, so during execution a majority of the time spent in the
> > loop would be after the first external interrupt. This made the loop run
> > _much_ slower than it should have since every partial access/full-width
> > use combination generated another PRS.

> Ahahahahahaha!!!  I'm sorry but that is just too funny.  I can just see
> Intel trying to explain this in their documentation.  It's an interesting
> anomaly that perhaps Agner should add to his infamous guide.

> Anyhow, as Agner indicates, if the register retires and leaves
> the "forwarding space" I think that the partial registers are collected,

Yes, that's what retirement means: The renamed registers are written
back to the 'true' architectural registers.

Quote:
> and I would assume that this flag is also turned off.  What this suggests
> to me is that you can simply wait long enough before using the register
> again, and it will automatically be recollected back into a uniform
> register.  I have not confirmed this myself, however.

That is correct, but not very useful for fast code:

With 2-3 microops/cycle and 10-20 cycles before retirement, you'd have
to wait 20-60 instructions between writing the partial register and
using the full reg.

As I said, not very useful for fast code. :-(
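
The 20-60 figure is just the product of the two ranges quoted above. A small sketch of that arithmetic, using a made-up helper name and treating one micro-op as roughly one instruction, as the estimate does:

```c
/* Hypothetical helper: how many micro-ops are issued while an
   earlier result is still waiting to retire. */
unsigned uops_before_retirement(unsigned uops_per_cycle, unsigned cycles)
{
    return uops_per_cycle * cycles;
}
```

With 2 micro-ops per cycle and 10 cycles this gives the lower bound of 20; with 3 per cycle and 20 cycles it gives the upper bound of 60.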

Terje




Sun, 22 Jul 2001 03:00:00 GMT  


[ ... ]

Quote:
> Yes, that's what retirement means: The renamed registers are written
> back to the 'true' architectural registers.

except for exceptions -- on a K6 (or close relative thereof)
retirement is separate.  Retirement only happens when all four micro-
ops (oops, I mean RISC86 instructions) on an op-quad have had their
results written back.

Quote:
> > and I would assume that this flag is also turned off.  What this suggests
> > to me is that you can simply wait long enough before using the register
> > again, and it will automatically be recollected back into a uniform
> > register.  I have not confirmed this myself, however.

> That is correct, but not very useful for fast code:

> With 2-3 microops/cycle and 10-20 cycles before retirement, you'd have
> to wait 20-60 instructions between writing the partial register and
> using the full reg.

> As I said, not very useful for fast code. :-(

Depending -- in some cases, having 30 or 40 other instructions first
is perfectly reasonable, especially if you can put a loop in-between.
AAMOF, it seems some basic ideas just keep coming back around -- this
reminds me a lot of maximizing throughput with older floating point
units by putting LOTS of instructions between starting the FPU doing
something, and putting it to use.  As I recall, on a 387, fsin (for
example) could take over 500 cycles, which was typically around 100 to
150 integer instructions.

Of course, it IS really a pain to deal with this -- in general, the
more other "stuff" you do between an instruction and using the result,
the more difficult it makes maintenance later.  Every instruction
executed in the interim is an opportunity to mess things up...



Mon, 23 Jul 2001 03:00:00 GMT  

Quote:

>Ahahahahahaha!!!  I'm sorry but that is just too funny.  I can just see
>Intel trying to explain this in their documentation.  It's an interesting
>anomaly that perhaps Agner should add to his infamous guide.

>Anyhow, as Agner indicates, if the register retires and leaves
>the "forwarding space" I think that the partial registers are collected,
>and I would assume that this flag is also turned off.  What this suggests
>to me is that you can simply wait long enough before using the register
>again, and it will automatically be recollected back into a uniform
>register.  I have not confirmed this myself, however.

Thanks guys for your help.  It seems that there's little hope for
*real* code to avoid PRS, especially under OS controlled conditions.
Haven't rethought my code yet though!  It seems that it's not worth
relying on the fact that _your_ code knows what PRS is, because
something else in the OS has a good chance of screwing your
expectations.  Still, I still save time because of the lack of branches.
It must be horrible for Intel to realise they screwed up so badly -
PRS is deadly, but there seems to be no sensible way to avoid it over
lengthy code sequences running on multitasking OSes.

Now, has anyone got any answers for debugging FPU code under PII?

It's a nightmare.  The CPU pipeline is so deep that fpu exceptions
occur many cycles past when they would have on a 486.  Usually the
symptom is a crash on a docile instruction like float a = b * c, and I
can print b and c to the terminal and they are ok.  It's just that
something a few dozen cycles ago screwed it up.

Thanks for any further help.

Jim
PS.  Does anyone know where Agner Fog's page went?
It used to be at
www.announce.com/~agner
but it has died.



Tue, 24 Jul 2001 03:00:00 GMT  

Quote:

> PS.  Does anyone know where Agner Fog's page went?
> It used to be at
> www.announce.com/~agner
> but it has died.

www.agner.org

Terje


