Cyrix/PPro pairing rules? 

Does anyone know if there is an online reference to the pairing rules
(which instructions work in which pipes, delays, etc...) on the Cyrix 6x86
and the PPro?  Cyrix's processor manual at www.cyrix.com doesn't seem to
have this info.

Dave



Sat, 21 Aug 1999 03:00:00 GMT  

Quote:
>Does anyone know if there is an online reference to the pairing rules
>(which instructions work in which pipes, delays, etc...) on the Cyrix 6x86
>and the PPro?  Cyrix's processor manual at www.cyrix.com doesn't seem to
>have this info.

>Dave

The Pentium Pro has no U/V pairing rules; it executes instructions out
of order. Branch prediction seems to be the big issue when optimizing
code for it. However, you may want to arrange your code so that all 5
uop ports are kept busy, align your jump/call labels on 16-byte
boundaries, and keep a close eye on partial register stalls (which did
not exist on the Pentium).

  "A compiler that generates better code than a good
   assembly programmer? That'll be the day."
  (Michael Abrash)




Sun, 22 Aug 1999 03:00:00 GMT  

Quote:

> Does anyone know if there is an online reference to the pairing rules
> (which instructions work in which pipes, delays, etc...) on the Cyrix 6x86
> and the PPro?  Cyrix's processor manual at www.cyrix.com doesn't seem to
> have this info.

> Dave

Here are all the pairing rules for the Cyrix 6x86 processor that I
have found. Most of them were obtained through my own extensive testing.
This information should cover most aspects of 6x86 performance,
although it may not be complete. I hope it helps, though.

On-line references to 6x86 pairing rules:

None that I know of.

Pipeline:
        F -- Fetch
        ID1 -- Instruction Decode 1 : Instruction Length Determination
        ID2 -- Instruction Decode 2 : Actual decode
        AD1 -- Address Decode 1 : Address calculation
        AD2 -- Address Decode 2 : TLB & cache reads, register reads
        EX -- Execute
        WB -- Write-back : Register/memory writes, flags maintenance,
                                conditional jump evaluation

Having register reads/writes outside the EX unit does NOT impair
performance, because the processor does register renaming and data
bypassing.

Instruction Timings:

Refer to the Cyrix docs. I haven't found any erroneous timings there yet.

Non-pairable instructions:

These instructions are NOT pairable:
PUSHA/PUSHAD, POPA/POPAD, IN/OUT, MUL/IMUL, DIV/IDIV,
LODS/STOS/MOVS/CMPS/SCAS/INS/OUTS,
CALL, intersegment JMP, BOUND, SMSW, XCHG, BSWAP (?)
Protected-mode segment loads +
all other privileged instructions.

All other instructions are pairable.

X and Y pipelines:

Do not fully correspond to the Pentium U and V pipelines.
6x86 is able to swap instructions between the pipelines
in the ID2 step. Normally, it works like this:
If the previous instruction in the stream was passed in the Y
pipeline, then the next instruction is passed in the X pipeline.
If the previous instruction in the stream was passed in the X
pipeline, then the next instruction is passed in the Y pipeline.
The exceptions are as follows:
 - Jump instructions are passed in the X pipeline
 - FPU instructions are passed in the X pipeline
 - Non-pairable instructions are generally passed in the X pipeline
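
The alternation rule above can be sketched as a small Python model. This
is only an illustration, not something from the original tests: the
function name, the instruction categories, and the choice of X for the
very first instruction are assumptions, and "generally" for non-pairable
instructions is simplified to "always".

```python
def assign_pipes(kinds, first="X"):
    """Assign each instruction to the X or Y pipeline.

    kinds: list of 'plain', 'jump', 'fpu' or 'nonpairable' (hypothetical
    categories used only for this sketch).
    first: pipeline assumed for the first plain instruction (assumption).
    """
    pipes = []
    prev = None
    for kind in kinds:
        if kind in ("jump", "fpu", "nonpairable"):
            pipe = "X"                      # the listed exceptions go to X
        elif prev is None:
            pipe = first                    # no history yet: use the assumed default
        else:
            pipe = "Y" if prev == "X" else "X"  # otherwise alternate
        pipes.append(pipe)
        prev = pipe
    return pipes
```

For example, assign_pipes(["plain", "plain", "jump", "plain"]) alternates
X, Y, forces the jump into X, and then continues in Y.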

Prefixes:

If the first byte of an instruction's opcode is not enough to determine
the type of instruction (e.g. whether it's an ADD or SUB) then
the 6x86 considers the byte to be a prefix.

The 6x86 can decode up to 2 prefixed instructions per clock cycle
as long as none of the instructions has more than one prefix
and none of them has any immediates (exceptions: rotate/shift
instructions and near conditional jumps, 0F 8x xx xx).

If the instruction contains 2 or more prefixes, then it will stall
the ID1 unit of the pipeline for (number of prefixes minus 1)
cycles.

If the instruction contains immediates (except those mentioned above)
AND prefixes, then the ID1 unit cannot determine the length of any other
instructions within the same clock cycle.

Prefixes affect the speed with which the instructions' lengths
are determined; they do NOT affect pairability in the EX unit.

Instruction Length:

If an instruction is 7 bytes or longer, then the ID1 unit cannot
determine the length of any other instructions within
the same clock cycle.

If the instructions are 6 bytes or shorter (and not riddled with
prefixes) then the ID1 unit can determine the length of 2
instructions per clock cycle.
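
The prefix and length rules above can be summarized in a short Python
sketch. The function names and parameters are made up for illustration;
the numbers (prefixes minus 1, 7 bytes) come straight from the text, and
nothing here has been checked against real hardware.

```python
def id1_prefix_stall(num_prefixes):
    """Extra ID1 cycles for one instruction: (number of prefixes - 1), never negative."""
    return max(0, num_prefixes - 1)


def id1_decodes_alone(length, num_prefixes, has_immediate, exempt_immediate=False):
    """True if ID1 cannot find the length of another instruction in the same clock.

    exempt_immediate marks the exceptions named in the text (rotate/shift
    instructions and near conditional jumps).
    """
    if length >= 7:
        return True                 # 7 bytes or longer
    if num_prefixes >= 2:
        return True                 # more than one prefix
    if num_prefixes >= 1 and has_immediate and not exempt_immediate:
        return True                 # prefix combined with a non-exempt immediate
    return False
```

A 4-byte instruction with one prefix and an ordinary immediate decodes
alone in this model, while the same instruction without the immediate
leaves room for a second one.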

6x86 stores NO predecode information in any of its caches.

Read-after-write (RAW) dependencies:

These appear if the first of two instructions (which we want to pair)
writes to a register and the second instruction reads the
same register.

For example -
        ADD AX,BX         ;; modifies AX
        MOV CX,AX         ;; reads AX

The 6x86 can pair the two instructions if:
        - only one of the instructions performs any arithmetic
        - the operands (register written and register read)
           have the same size.

Otherwise, the 6x86 executes only the first instruction and tries
to pair the other one with the next instruction in the stream.
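
As a rough Python sketch of the two RAW conditions above (the function
and parameter names are invented for this example, and "only one of the
instructions performs any arithmetic" is read here as "not both", which
is an assumption):

```python
def raw_pair_ok(first_is_arith, second_is_arith, written_bits, read_bits):
    """Model of the RAW pairing rule for a register written by the first
    instruction and read by the second.

    written_bits / read_bits: operand sizes of the write and the read.
    """
    not_both_arith = not (first_is_arith and second_is_arith)
    same_size = written_bits == read_bits
    return not_both_arith and same_size
```

With this model, ADD AX,BX followed by MOV CX,AX gives
raw_pair_ok(True, False, 16, 16), which is True, while an 8-bit write
followed by a 16-bit read of the same register does not pair.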

Some consequences of this are also:

        PUSH AX         ;; modifies SP
        PUSH BX         ;; also modifies SP
2 stack instructions can thus never pair on a 6x86.

        MOV CX,555      ;; modifies CX
        LOOP flag1      ;; also modifies CX
These instructions will not pair because both of them modify CX.

Write-after-read (WAR), write-after-write (WAW) dependencies:

WAR:    MOV AX,BX       ;; reads BX
        ADD BX,5        ;; writes to BX

WAW:    ADD BX,5        ;; writes to BX
        MOV BX,AX       ;; writes to BX

These do not affect pairability. The 6x86's register renaming
capabilities avoid potential collisions.

Memory accessing:

Address Generation Interlock - occurs when one instruction modifies
registers which another (later) instruction uses for memory accessing.
Can stall the 6x86 processor for a maximum of 2 cycles.

Unaligned accesses - unaligned reads need 2 cycles in the AD2 unit.
Unaligned writes need 2 cycles in the WB unit.
Unaligned memory accesses are defined as all memory accesses that
cross an 8-byte boundary.
A memory read that crosses a cache line boundary will NOT cause
any memory to be cached if it is not already cached.

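
The 8-byte definition above is easy to check for a given address and
operand size. A minimal Python sketch (names are hypothetical; the
1-cycle cost for an aligned access is an assumption, since the text only
gives the 2-cycle unaligned case):

```python
def crosses_8_byte_boundary(address, size):
    """True if the bytes address .. address+size-1 span two 8-byte blocks."""
    return (address // 8) != ((address + size - 1) // 8)


def access_cycles(address, size):
    """Cycles spent in AD2 (read) or WB (write) under this model."""
    return 2 if crosses_8_byte_boundary(address, size) else 1
```

Note that under this definition a dword at address 2 is not "unaligned"
(bytes 2..5 stay inside one 8-byte block), while a dword at address 6 is.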
Address generation - If the address expression is composed of 3
elements, something like
        MOV EAX,[ EBX + 8*ESI + 5555]
then the AD1 unit is stalled for 1 cycle. Otherwise, it is not stalled,
even with 2 registers as in  MOV AX,[BX+SI]. (The Cyrix documentation
is in error here.)
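
Expressed as a tiny Python sketch (the function name and the split into
base / index / displacement flags are assumptions made for this example):

```python
def ad1_stall_cycles(has_base, has_index, has_displacement):
    """AD1 stall for one address expression: 1 cycle only when all three
    components (base register, index register, displacement) are present."""
    parts = sum([has_base, has_index, has_displacement])
    return 1 if parts == 3 else 0
```

So [EBX + 8*ESI + 5555] costs one extra cycle in this model, and
[BX+SI] costs none.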

There are NO restrictions as to which pipeline can have access to which
cache line, i.e. 2 instructions can read from the same cache line and
still pair.

The cache can respond to a maximum of 2 accesses per clock cycle. The
priority seems to be:
       1: Code Fetch (if necessary)
       2: Data Read
       3: Data Write

6x86 can fetch code either from the Instruction Line Cache (ILC)
or, if the code is not in the ILC, from the unified cache.

If the cache is so overloaded that data writes cannot take place
immediately, then 6x86 writes to one of its 4 write buffers instead
and waits until the cache becomes accessible again.

In case of a conditional jump, the 6x86 will do a code fetch from
the not-predicted target during evaluation.

Memory writes that hit the ILC will invalidate the appropriate ILC
line. The 6x86 also checks for instructions that write to instructions
already within the pipeline; such a write flushes the pipeline and
updates the modified instructions properly. This mechanism works as
long as the instruction responsible for the write is not actually
paired with its target. Avoid using this mechanism, though; it
causes a stall of something like 30 cycles.

6x86 does Code Fetches aligned to 16-byte (128-bit) boundaries.
Try to avoid having jump targets within the last 4 or 5 bytes
of any 16-byte block. Else 6x86 will have to fetch twice, which
effectively wastes a cycle.
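
To check where a jump target falls inside its 16-byte fetch block, and
how much padding would move it to the next block, something like the
following Python sketch can be used. The names are made up, and the
"tail" of 5 bytes is just the upper end of the "4 or 5 bytes" above.

```python
FETCH_BLOCK = 16  # code fetches are aligned to 16-byte boundaries


def target_in_block_tail(address, tail=5):
    """True if the target lies within the last `tail` bytes of its 16-byte block."""
    return address % FETCH_BLOCK >= FETCH_BLOCK - tail


def padding_to_next_block(address):
    """Number of filler bytes needed to push the target to the next 16-byte boundary."""
    return -address % FETCH_BLOCK
```

For a target at offset 0x100B, target_in_block_tail reports a problem
and padding_to_next_block says 5 bytes of padding would align it.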

The fastest way to initialize memory is  REP STOSD
(134 MB/s on my system -- with a 133MHz 6x86 and "Force Cache Line fill
on Write Miss" feature enabled - see "Configuration Registers" below)

The fastest way to move a memory block is  REP MOVSD

Do not try to use FILD/FISTP for 64-bit moves -- 6x86 does
write-combining, which has the same effect and is faster.

A small anomaly : CMP [memory],value will be cached/not cached as if
it is actually doing a memory write. Avoid it if you are not absolutely
sure that the memory location is cached.

Flags:

It seems to me that the WB unit is responsible for maintaining the
flags - it collects the results of the instructions that have
been executed and determines the appropriate flag values.

It also seems to me that each of the pipelines maintains its own copy
of the flags in the EX unit.

If an instruction needs a flag to operate properly, this can give
some strange results.

For example, the sequence
                ADD AX,BX
                ADC CX,DX
will need 3 cycles to execute. These cycles go something like this:
    1:   Executes ADD AX,BX in the X pipeline.
         ADC CX,DX cannot be done yet.
    2:   The WB unit will update the general flags register
         ADC CX,DX can still not be executed
    3:   Flags are ready
         ADC CX,DX is now executed in the Y pipeline

The sequence    ADD AX,BX
                PUSH BX
                ADC CX,DX
on the other hand, will need only 2 cycles:
    1:   Executes ADD AX,BX in the X pipeline
         Executes PUSH BX in the Y pipeline
    2:   Local flags are maintained in the X pipeline
         ADC CX,DX can be executed in the X pipeline.

So: put exactly one instruction between the instruction that generates
flags and the instruction that requires them. Also, do not
pair the flag-generating instruction with another flag-generating
instruction.

Conditional jump instructions are not affected as severely by this
odd flag handling; 6x86 allows conditional jumps to be evaluated
in the WB unit.
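
The "one instruction in between" advice can be turned into a simple
check over an instruction list. This Python sketch is only a lint for
the back-to-back case; the tuple layout and function name are
assumptions, and conditional jumps should be marked as not reading flags
here since they are handled in the WB unit.

```python
def flag_hazards(instrs):
    """Return the indexes of instructions that read flags directly after
    an instruction that writes them.

    instrs: list of (text, writes_flags, reads_flags) tuples.
    """
    hazards = []
    for i in range(1, len(instrs)):
        _, prev_writes, _ = instrs[i - 1]
        _, _, reads = instrs[i]
        if prev_writes and reads:
            hazards.append(i)       # consumer sits right behind the producer
    return hazards
```

The ADD/ADC pair from the first example is reported as a hazard, and the
ADD/PUSH/ADC sequence from the second example is not.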

Speculative Execution:

Whenever the 6x86 executes a conditional jump or an FPU instruction,
it checkpoints its registers and increments its speculation level
(Level 4=Max, Level 1=No Speculative Execution available).
The level is decremented as soon as the 6x86 has evaluated the jump or
FPU instruction in question.

If the 6x86 speculation level was last incremented by a conditional
jump, then the 6x86 doesn't allow any instructions to change
the speculation level until the conditional jump has been
properly resolved. (Information known to be incomplete/guesswork)
This has at least 2 effects  :
 - 6x86 cannot do a conditional jump more often than once
   per 2 clock cycles.
 - 6x86 cannot process any FPU instructions for the first 2 cycles
   after a conditional jump.

ALL FPU instructions can be buffered using Speculative execution.
So, if you wish, you can do the sequence
                FYL2XP1
                FSINCOS
                FPATAN
                F2XM1
within 4 clock cycles and then spend the next 400 cycles doing
something else while the FPU actually does the calculations.

All memory writes during Speculative Execution are done to internal
write buffers and not committed to cache or main memory until
Speculative Execution ends. The write buffers can hold up to 4
writes.

These ...




Sun, 22 Aug 1999 03:00:00 GMT  

Quote:

>Here are all the pairing rules for the Cyrix 6x86 processor that I
>have found. Most of these have been obtained through extensive testing
>by myself. This information should cover most of the aspects of 6x86
>performance, although it may not be complete..hopes it helps, though..

This information ought to be made available on the net somewhere. If Jørn
Nystad doesn't have a home page, then I am sure somebody else can find a space
for it.

===============================================================================
Agner Fog, Ph.D.                            Pentium optimization manual at



Sun, 22 Aug 1999 03:00:00 GMT  



Quote:

>>Here are all the pairing rules for the Cyrix 6x86 processor that I
>>have found. Most of these have been obtained through extensive testing
>>by myself. This information should cover most of the aspects of 6x86
>>performance, although it may not be complete..hopes it helps, though..

>This information ought to be made available on the net somewhere. If Jørn
>Nystad doesn't have a home page, then I am sure somebody else can find a space
>for it.

It is available at ftp://ftp.cyrix.com/HARDWARE/6x86/


Mon, 23 Aug 1999 03:00:00 GMT  



Quote:

>>Here are all the pairing rules for the Cyrix 6x86 processor that I
>>have found. Most of these have been obtained through extensive testing
>>by myself. This information should cover most of the aspects of 6x86
>>performance, although it may not be complete..hopes it helps, though..

>This information ought to be made available on the net somewhere. If Jørn
>Nystad doesn't have a home page, then I am sure somebody else can find a space
>for it.

And I forgot this one. As far as I know, it's the same as the Cyrix one.

http://www.chips.ibm.com/products-nojs/x86-nojs/x86dev/l3devibm6x86.html



Mon, 23 Aug 1999 03:00:00 GMT  
 