P4/Athlon instruction speed 

I have a program that makes heavy use of MMX instructions/assembler code.
The code works fine on both P4 and Athlon XP processors; however, it takes
about 3-4 times longer on the P4 (P4: 2.4 GHz, Athlon XP 2000+).
None of the benchmarks I have tried on these computers showed a similar
result. The P4 is usually a bit faster, sometimes 10% slower, but never 3-4
times slower.
Now I'd like to find out why my code is so "AMD-optimized".
Normal profiling (line-by-line) does not work with assembler code, does it?
Back in the 80286 days I could look up in a book how many processor cycles
each instruction takes and see which of my instructions was the slow one,
but I have learned that no such specifications exist any more.
Does anyone know what kind of instruction is that much slower on the P4?

P4: 2.4 GHz, 256 MB (1660 Rambus), W2K SP2
Athlon XP: 2000+, 256 MB RAM, W2K SP2



Wed, 30 Mar 2005 00:55:08 GMT  

Quote:
> I have a program that makes heavy use of MMX instructions/assembler code.
> The code works fine on both P4 and Athlon XP processors; however, it takes
> about 3-4 times longer on the P4 (P4: 2.4 GHz, Athlon XP 2000+).
> None of the benchmarks I have tried on these computers showed a similar
> result. The P4 is usually a bit faster, sometimes 10% slower, but never 3-4
> times slower.
> Now I'd like to find out why my code is so "AMD-optimized".
> Normal profiling (line-by-line) does not work with assembler code, does it?
> Back in the 80286 days I could look up in a book how many processor cycles
> each instruction takes and see which of my instructions was the slow one,
> but I have learned that no such specifications exist any more.
> Does anyone know what kind of instruction is that much slower on the P4?

One way to tell is by using VTune. I've never used it, but apparently it can
tell you how long each instruction takes by simulating the P4's pipeline.
The other way is to post the code here and let people analyse it :) If
neither of these options is possible, try doing the pipeline analysis by hand
with the P4 optimisation manual from Intel (which unfortunately only has
timings for a small subset of instructions).

One thing to remember is that the P4 has a much longer pipeline than the
Athlon XP, so it suffers more when executing code with lots of closely spaced
dependencies. Take pmullw, for example. On a P4 it has a worst-case execution
time of 8 clocks if it is part of a register dependency chain, or 2 if it is
not. On an XP it takes 3 clocks regardless. Also, the XP can do MMX shifts
(pslld et al.) and some other operations at twice the speed of the P4 (if
written correctly), as it has almost two full MMX execution units
(nitpickers: yeah, I know, it's a little more complex than this :) ).
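
To make the dependency point concrete, here is a rough sketch in plain C (not
the original poster's code, which wasn't posted). mullo16 is a made-up helper
that mimics what pmullw does to a single 16-bit lane, and both function names
are hypothetical. The first loop is one long chain where every multiply waits
on the previous one; the second keeps four independent chains going, so a
multiply never needs the result of the one issued just before it. A compiler
may reorder or vectorise either loop, so treat this as an illustration of the
scheduling idea, not as a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: low 16 bits of a 16x16 multiply, i.e. what pmullw
 * computes for one lane. The cast to uint32_t avoids signed overflow. */
static uint16_t mullo16(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a * b);
}

/* One dependency chain: each multiply needs the previous result. */
uint16_t product_chained(const uint16_t *v, size_t n)
{
    uint16_t acc = 1;
    for (size_t i = 0; i < n; i++)
        acc = mullo16(acc, v[i]);
    return acc;
}

/* Four independent chains, combined only at the very end.
 * Assumes n is a multiple of 4 to keep the sketch short. */
uint16_t product_interleaved(const uint16_t *v, size_t n)
{
    assert(n % 4 == 0);
    uint16_t a0 = 1, a1 = 1, a2 = 1, a3 = 1;
    for (size_t i = 0; i < n; i += 4) {
        a0 = mullo16(a0, v[i]);
        a1 = mullo16(a1, v[i + 1]);
        a2 = mullo16(a2, v[i + 2]);
        a3 = mullo16(a3, v[i + 3]);
    }
    return mullo16(mullo16(a0, a1), mullo16(a2, a3));
}
```

Both functions return the same value, since multiplication modulo 2^16 is
associative and commutative; only the dependency structure differs. In
hand-written MMX the same idea means putting work on other registers between
a pmullw and the instruction that consumes its result.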

Finally, the caching is totally different on the P4 and the XP. If you're
dealing with datasets in the 32-58 KB range, the data will be coming from
the XP's L1 cache, but on the P4, with only 8 KB of L1, it will mostly be
coming from the much slower L2.
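
If you want to check whether the cache is what's hurting you, one crude
approach is to time the same simple loop over buffers of different sizes and
see where the time per byte jumps. The sketch below is only an assumption
about how one might do that in portable C: time_working_set is a hypothetical
name, clock() is coarse, and an optimising compiler may simplify the summing
loop, so use plenty of passes and read the numbers with some scepticism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Sum a buffer of `bytes` bytes `passes` times and return the CPU time
 * spent, in seconds (or -1.0 if allocation fails). The checksum is written
 * to *sum_out so the summing work has an observable result. */
double time_working_set(size_t bytes, unsigned passes, uint32_t *sum_out)
{
    size_t n = bytes / sizeof(uint32_t);
    assert(n > 0 && sum_out != NULL);

    uint32_t *buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint32_t)i;          /* touch every element once */

    uint32_t sum = 0;
    clock_t t0 = clock();
    for (unsigned p = 0; p < passes; p++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];             /* sequential read of the whole set */
    clock_t t1 = clock();

    free(buf);
    *sum_out = sum;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling it with, say, 4 KB, 32 KB and 256 KB (scaling the pass count so the
total number of bytes read stays the same) should show a step in the timings
wherever the working set stops fitting in a cache level.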

Hope this helps.

--
Michael Brown
My inbox is always open (remove the obvious):



Wed, 30 Mar 2005 10:56:12 GMT  
 