Thoughts on a new instruction 

Having done a lot of assembler coding (for code that gets installed
in and runs as part of the VM operating system), I was struck by the
clumsiness of the CS instruction -- required in a little loop that
takes about four instructions, ending in a test and a conditional
branch, if you need to update (or reliably increment or decrement)
a memory location without fear that another processor might be
doing the same thing.  In 370, of course, to increment a word
(or whatever) of storage, you have to load the old value into a
register, increment the register, then store the value back.  This
is where the potential conflict comes in -- another processor
(or process) might be doing the same thing, and one update might
get lost.
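
For readers more at home in C than in assembler, the little loop described above might look roughly like this. It is only a sketch, assuming a C11 compiler with <stdatomic.h>; the name interlocked_increment is made up for illustration and does not come from VM.

```c
#include <stdatomic.h>

/* Hypothetical helper: interlocked increment built from compare-and-swap.
   Returns the value the counter held right after our update. */
int interlocked_increment(atomic_int *counter)
{
    int old = atomic_load(counter);      /* fetch the old value            */
    int new_value;
    do {
        new_value = old + 1;             /* compute the replacement value  */
        /* Store new_value only if *counter still equals old. On failure,
           old is refreshed with the current contents and we go round again. */
    } while (!atomic_compare_exchange_weak(counter, &old, new_value));
    return new_value;
}
```

The retry is what keeps an update from being lost: if another processor got in between the load and the swap, the swap fails and the addition is redone on the fresh value.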

I've always thought that the language should have an INCREMENT
instruction and a DECREMENT instruction.  This would remove the
need to code the multi-line idiom to increment or decrement
critical counters.  Of course, the only parm required would be
the target address.  Decrement could set the CC if the counter
goes to zero, or went negative.  Increment could set the CC
for something.  

Also, the same technique with CS is required to get a lock
(as in HCPLCK), and to reliably add a block of some kind to the
end of a chain (or remove it).  Perhaps we need a single,
hardware-interlocked Load and Store instruction to do this.
I haven't thought this one out yet, but I am sure that the 390
language would be better with INC and DEC.  Any comments?
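
A rough C sketch of the chain case, assuming C11 atomics. To keep it short it adds the block at the front of the chain rather than the end, and it does not attempt removal, which is harder to get right. The struct layout and the name chain_push are invented for the example.

```c
#include <stdatomic.h>
#include <stddef.h>

struct block {
    struct block *next;
    int payload;
};

/* Hypothetical helper: add blk at the front of the chain anchored at *anchor. */
void chain_push(struct block *_Atomic *anchor, struct block *blk)
{
    struct block *old_head = atomic_load(anchor);
    do {
        blk->next = old_head;   /* link the new block ahead of the current first */
        /* Swap the anchor only if nobody changed it since we looked. On
           failure old_head is reloaded and the link is rebuilt. */
    } while (!atomic_compare_exchange_weak(anchor, &old_head, blk));
}
```

The shape is the same as the counter loop: look, prepare the new value privately, then swap it in only if the anchor is unchanged.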

-- David Walker



Fri, 17 Apr 1998 03:00:00 GMT  

: A single instruction to atomically update a chain would cause severe
: interlocking of the very fast processor and the relatively slow shared
: main memory. This would have a serious performance hit similar to what
: we experienced in MP65 when the only serializing instruction was TS
: (Test and Set).

Why in the world would a single instruction to update a chain cause
any more of a performance hit than CS?  CS and CDS already require
processor serialization, and they update storage.  I don't understand
the difference here.

-- David Walker



Sat, 18 Apr 1998 03:00:00 GMT  

:  . . . . snap - the text was essentially if CS could be improved by having
: some INC or DEC instruction (non-interruptible). calling for comments.
: .....
: CS is used for much more than just incrementing or decrementing storage values.
: The function is : Replace storage value provided the current value is unchanged
: from a (previously saved) reference value.

Yes, CS is used for much more than interlocked increments and decrements,
but the fact remains, that in VM and I suspect in MVS, there are
thousands of calls a second (there's a guess for you) to code to
increment or decrement a fullword or halfword with interlock.  The
frequency of these calls was what prompted me to wish for an increment
and a decrement instruction.

: I agree that CS appears to be clumsy, but the function really calls for the use
: of two registers: One to hold the reference value and another to hold the
: replacement value. The function also must have a condition code to tell
: whether the operation was successful or not. And I believe that is exactly the CS
: instruction as simple as is possible?

Yes, for its more general uses.  For increment, you very often don't care
what the new (or old) value was!  The same is true for decrement, although
you might sometimes want to know whether it is now zero, and if so,
you could test the CC of a decrement instruction.  And, I am suggesting
that increment and decrement, of course, are always successful.  Again,
even for the other uses of CS (like adding to a chain), an instruction
could be implemented so as to be always successful -- obviating the need
to test for "interference" from another process or processor.
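
As one partial data point on the question below: C11 offers "always successful" operations of this kind at the language level. The sketch assumes <stdatomic.h>; the function names are invented. How the hardware carries them out varies by machine (a single interlocked instruction on some, a retry loop generated by the compiler on others).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bump a counter; the caller does not care about the old or new value. */
void count_event(atomic_uint *counter)
{
    atomic_fetch_add(counter, 1u);
}

/* Drop one reference and report whether the count has now reached zero,
   which is the job suggested above for the condition code. */
bool release_reference(atomic_uint *refs)
{
    return atomic_fetch_sub(refs, 1u) == 1u;   /* old value 1 means now 0 */
}
```

The caller never sees a failure or writes a loop; whatever retrying is needed happens below this interface.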

What do other machine languages do for these requirements?

-- David

: regards Sven



Sat, 18 Apr 1998 03:00:00 GMT  

:>
:>
:>: A single instruction to atomically update a chain would cause severe
:>: interlocking of the very fast processor and the relatively slow shared
:>: main memory. This would have a serious performance hit similar to what
:>: we experienced in MP65 when the only serializing instruction was TS
:>: (Test and Set).
:>
:>Why in the world would a single instruction to update a chain cause
:>any more of a performance hit than CS?  CS and CDS already require
:>processor serialization, and they update storage.  I don't understand
:>the difference here.
:>
:>-- David Walker
:>

Perhaps because programmers would use a single instruction indiscriminately,
causing unnecessary serialization, as apparently happened with TS. One major
advantage of CS and CDS in this regard is that they are not likely to be used
unless serialization is required.

While it may be academically interesting to speculate about such a change to
the architecture, whether it takes one instruction or several to
accomplish the desired function doesn't seem important to me unless you're
talking about millions of them per second. If it's ease of coding you desire,
write a macro.

Romney White
ParaSoft, Inc.



Sat, 18 Apr 1998 03:00:00 GMT  

Quote:
>In theory, you may never be done.  In practice, a lot of the time,
>all you want to do is say "hey hardware -- go stroke the counter in
>location such-and-such, while I continue with the other stuff I
>need to do.  Don't bother me with any details such as its current
>value -- I don't need it loaded into a register".  Now, isn't
>that conceptually a WHOLE LOT simpler?  For increment and decrement,
>anyway.  

The key is that CS does NOT cause processor serialization. It does not halt either
processor in any way. Your INC or DEC instruction would require that the other
processors in the complex (not just the other processor, since you can have up
to ten) be halted so that they would not reference the storage location, then
the controlling processor could increment (decrement) the location, and then
the other processors could be released.

Why halt the work of nine so that one can increment a storage location. Isn't it
better to let the one spin while trying to get control of the location while the
other nine continue to perform useful work? I think that this may be the reasoning
behind not having your "simple" instructions.



Sun, 19 Apr 1998 03:00:00 GMT  

    Yes, CS is used for much more than interlocked increments and decrements,
    but the fact remains, that in VM and I suspect in MVS, there are
    thousands of calls a second (there's a guess for you) to code to
    increment or decrement a fullword or halfword with interlock.

Well, I think that thousands of atomic increments/decrements per second is a
bit on the high side...

I can't speak for MVS, but VM does remarkably few increments and/or
decrements as a percentage of total use of CS/CDS.  And, for those cases
where it *does* simply want to increment or decrement a word, there is a
macro which can be used to cover up the idiomatic use of CS.

Note, though, that the System/390 architecture is guided as much by
processor performance as by programmer convenience, especially since the IBM
system programmers mostly use PL/X and everybody else mostly uses COBOL,
FORTRAN, PL/I, or C.  The biggest "problem" with an atomic INC instruction
would be that programmers would tend to use it indiscriminately whenever
they wanted to increment a word, regardless of whether they needed atomicity
or not.  Note that, performance-wise, there is a big difference between:

                LA    1,1          R1 = 1
                A     1,FOO        R1 = R1 + FOO
                ST    1,FOO        FOO = R1 (not interlocked)

and:

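                L     1,FOO        R1 = old value of FOO
        LOOP    LA    2,1          R2 = 1
                AR    2,1          R2 = old value + 1
                CS    1,2,FOO      if FOO = R1, store R2; else reload R1
                BC    7,LOOP       CC nonzero: FOO changed, retry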

and the second should *only* be used when you need the serialization.
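
For anyone who does not read assembler, here is a hedged C rendering of the second sequence. The function cs is a made-up model of the CS instruction as described earlier in the thread (it returns only condition codes 0 and 1), built on C11's atomic_compare_exchange_strong; it is a sketch, not a statement of how any compiler or machine implements it.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of CS r1,r3,storage. Returns the condition code:
   0 = values matched and r3 was stored; 1 = mismatch, *r1 was reloaded. */
int cs(int32_t *r1, int32_t r3, _Atomic int32_t *storage)
{
    return atomic_compare_exchange_strong(storage, r1, r3) ? 0 : 1;
}

/* The second sequence, one statement per instruction. */
void increment_with_cs(_Atomic int32_t *foo)
{
    int32_t r1 = atomic_load(foo);      /*      L   1,FOO              */
    int32_t r2;
    do {
        r2 = 1;                         /* LOOP LA  2,1                */
        r2 += r1;                       /*      AR  2,1                */
    } while (cs(&r1, r2, foo) != 0);    /*      CS  1,2,FOO; BC 7,LOOP */
}
```

Note that a failed cs leaves the current contents of foo in r1, which is why the loop branches back to the LA and not to the L.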

Another thing to keep in mind is that, contrary to popular opinion, the
System/390 is basically a (heresy!) RISC engine with various useful CISCish
things bolted in on top.  E.g., other than the decimal instructions, note
the lack of memory-to-memory arithmetic: the architecture is very register
centric rather than memory centric.  Your proposed INC and DEC instructions
would be completely out of line with this overall design.


      IBM Research, Yorktown Heights



Sun, 19 Apr 1998 03:00:00 GMT  

Quote:

>need to do.  Don't bother me with any details such as its current
>value -- I don't need it loaded into a register".  Now, isn't
>that conceptually a WHOLE LOT simpler?  For increment and decrement,
>anyway.  

For whatever you are doing, a single instruction that does that thing,
seems more efficient.  But programming doesn't work that way.  
A processor architecture is supposed to provide primitives out of
which programs can be built.  For multiprocessing programs CS is a
necessary and sufficient primitive - you can build any other primitive
(including atomic INC and DEC, atomic bit update, test and set etc.)
out of it _but not vice versa_.
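
To make the "you can build the others out of it" direction concrete, here is a sketch in C11 of two of the primitives named above, each written using nothing but compare-and-swap. The function names are invented, and "set" is simplified to "nonzero".

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomic bit update: OR mask into *word; returns the previous contents. */
unsigned atomic_or_via_cs(atomic_uint *word, unsigned mask)
{
    unsigned old = atomic_load(word);
    while (!atomic_compare_exchange_weak(word, &old, old | mask)) {
        /* old now holds the current contents; retry with it */
    }
    return old;
}

/* Test-and-set: returns true if the flag was already set. */
bool test_and_set_via_cs(atomic_uint *flag)
{
    unsigned expected = 0u;
    /* The swap succeeds only if the flag was 0; a failure means it was set. */
    return !atomic_compare_exchange_strong(flag, &expected, 1u);
}
```

Going the other way does not work in general, because test-and-set reports only one bit about the old value and so cannot tell you whether an arbitrary word is unchanged.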

Quote:
>A hardware INC ought to be REALLY efficient.  I have to admit, I
>have no idea how many times per second it would be used.

Well suppose it's an incredible 1 BILLION times faster than CS,
and you use it once every 1000 instructions.  Your program speeds
up by about 0.1% - assuming the cost of all that extra logic
in the storage interface hasn't made the machine run slower overall.
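
The arithmetic can be checked with a one-line function. This is a sketch under a loud assumption that is not in the post: every instruction, CS included, costs the same, so "once every 1000 instructions" means 0.001 of the run time. The name time_saved is invented.

```c
/* Fraction of total run time saved when an operation that accounts for
   `fraction` of the time becomes `speedup` times faster.
   Assumes (for illustration only) that all instructions cost the same. */
double time_saved(double fraction, double speedup)
{
    return fraction * (1.0 - 1.0 / speedup);
}
```

With fraction = 0.001 the saving can never exceed 0.001 of the run time, that is 0.1%, no matter how large the speedup is. If CS really costs several ordinary instructions the ceiling rises accordingly, but it stays small.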


Mon, 20 Apr 1998 03:00:00 GMT  
    The key is that CS does NOT cause processor serialization.
    [...]

From "ESA/390 Principles of Operation", SA22-7201-02, under "Compare and
Swap" on page 7-20:

        "A serialization function is performed before the operand is fetched
        and again after the operation is completed."

Of course, what PrincOps means by "serialization" is probably different from
what you intended by the use of the word; specifically, only the processor
executing the instruction is necessarily affected, and the serialization
function consists of discarding any early FETCHes and pushing out any late
STOREs.  However, this is not the issue...

It would be "easy" to add an atomic INCREMENT operation to the architecture;
it would simply be implemented "under the covers" using the existing support
that is there for compare and swap: the processor issuing the instruction
acquires exclusive access to the cache line containing the target storage
location, and then performs the update operation "locally".  No big deal.
The other CPUs are unaffected in either case.  


      IBM Research, Yorktown Heights



Mon, 20 Apr 1998 03:00:00 GMT  
Quote:


>>In theory, you may never be done.  In practice, a lot of the time,
>>all you want to do is say "hey hardware -- go stroke the counter in
>>location such-and-such, while I continue with the other stuff I
>>need to do.  Don't bother me with any details such as its current
>>value -- I don't need it loaded into a register".  Now, isn't
>>that conceptually a WHOLE LOT simpler?  For increment and decrement,
>>anyway.  

>The key is that CS does NOT cause processor serialization. It does not halt either
>processor in any way. Your INC or DEC instruction would require that the other
>processors in the complex (not just the other processor, since you can have up
>to ten) be halted so that they would not reference the storage location, then
>the controlling processor could increment (decrement) the location, and then
>the other processors could be released.

Oops, time out for a manual check here.  In my copy of 370/XA POO (Principles
of Operation), it states specifically "A serialization function is performed before the
operand is fetched and again after the operation is completed.".

The whole point of CS is its atomicity (one instruction, logically interruptible either
before or after execution, but not during), and the fact that the storage update
(via serialization) is consistent when viewed by multiple separate CPUs.

Quote:
>Why halt the work of nine so that one can increment a storage location. Isn't it
>better to let the one spin while trying to get control of the location while the
>other nine continue to perform useful work? I think that this may be the reasoning
>behind not having your "simple" instructions.

Consider the case where you have 2 (or 3, or 4, or ...) spinning on the same
instruction ...

Also, in the Intel world, if you have an MP system, I believe you have to issue the
instruction prefix 'lock' (which forces a LOCK# signal on the memory bus), to guarantee
coherent updates by multiple CPUs to a single memory location.



Tue, 21 Apr 1998 03:00:00 GMT  

Quote:
>Note, though, that the System/390 architecture is guided as much by
>processor performance as by programmer convenience, especially since the IBM
>system programmers mostly use PL/X and everybody else mostly uses COBOL,
>FORTRAN, PL/I, or C.  The biggest "problem" with an atomic INC instruction

...

Quote:
>Another thing to keep in mind is that, contrary to popular opinion, the
>System/390 is basically a (heresy!) RISC engine with various useful CISCish
>things bolted in on top.  E.g., other than the decimal instructions, note
>the lack of memory-to-memory arithmetic: the architecture is very register
>centric rather than memory centric.  Your proposed INC and DEC instructions
>would be completely out of line with this overall design.

Agreed. My hardware monitor measurements of large S/370s running hundreds
of concurrent users doing commercial and scientific work showed that
the RISC core: L A LA TM CLI BALR BR BC and a few others,
constituted about 85% of the instruction mix over both short and long
time intervals [up to a week]. SS ops such as MVC were WAY down
the list, and floating point was much less than 1%. The system
was mostly assembler code which tried to exploit SS ops whenever
it could, too! [This was in the days when we thought exploiting
SS ops was a good idea.]

The point to be made is the same as that made by Hennessy & Patterson
and the RISC community: Fancier instructions may or may not run
faster than simple ones. HOWEVER, if they may result in additional
gate delays in the pipeline [thereby limiting maximum clock speed],
or if they may stall the pipeline [Ooops, have to wait to see
if we can page fault somewhere in this MVCL. Oops, this MVC might
stomp on the next instruction...], then you want to look at
them VERY carefully.

Bob



Tue, 21 Apr 1998 03:00:00 GMT  

: Well, I think that thousands of atomic increments/decrements per second is a
: bit on the high side...

Well, I was guessing, and assuming a VERY FAST processor  ;-)  .

: I can't speak for MVS, but VM does remarkably few increments and/or
: decrements as a percentage of total use of CS/CDS.  

Well, you should be in a position to know, so I'll take your word for it.
It just *seemed* inefficient.  Maybe not.

I wasn't worried about having to code the CS loop; I did use the HCPCOUNT
macro (of course) when I needed to count something (if for no other reason,
to be able to cross-reference where things are counted).  And I only used
the INTERLOCK parameter when it was necessary.  Programmer convenience
wasn't the issue; it was the use of CS itself (and using the macro didn't
make me forget that CS would be used...).

Thanks for the info.  

-- David Walker
:    

:       IBM Research, Yorktown Heights



Wed, 22 Apr 1998 03:00:00 GMT  


        >'
        >>Note, though, that the System/390 architecture is guided as much by
        >>processor performance as by programmer convenience, especially since the IBM
        >>system programmers mostly use PL/X and everybody else mostly uses COBOL,
        >>FORTRAN, PL/I, or C.  The biggest "problem" with an atomic INC instruction

        >...

        >>Another thing to keep in mind is that, contrary to popular opinion, the
        >>System/390 is basically a (heresy!) RISC engine with various useful CISCish
        >>things bolted in on top.  E.g., other than the decimal instructions, note
        >>the lack of memory-to-memory arithmetic: the architecture is very register
        >>centric rather than memory centric.  Your proposed INC and DEC instructions
        >>would be completely out of line with this overall design.

        >Agreed. My hardware monitor measurements of large S/370s running hundreds
        >of concurrent users doing commercial and scientific work showed that
        >the RISC core: L A LA TM CLI BALR BR BC and a few others,
        >constituted about 85% of the instruction mix over both short and long

---This is an expected result of doing such surveys.

   It depends wholly on the code produced by the compiler, which
may not be up to much.  When the code is written by hand, it
depends on the skills of the programmer.  Much of the speed
of the PL/C compiler is credited with the fact that they used
the TRT instruction extensively, for example.

   A simple count of instruction executions always shows up
CISC instructions poorly, simply because you don't need to execute
a lot of them to do useful work.

   A better measure of usage is to measure the instruction *t*i*m*e*s
taken, and that way, CISC-type instructions rank much higher.

   If you compare a move (move 256 bytes from A to B), it requires
one MVC instruction.  To do the same job with a loop requires
4 instructions (load, load, store, decrement & test).  The number
of executed instructions is 64x3+1 = 193.  Conclusion: Don't
need MVC because it's not executed often.  Need Load, store,
decrement & test instead, after all, each is executed 64 times!

   In other words, simple instruction counts cause CISC-type
instructions to be "swamped" by RISC-type instruction counts.
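
The count in the MVC example can be reproduced with a small C sketch. It assumes what the arithmetic above implies: one instruction to load the word count, then three per word (load, store, decrement & test), with address stepping not counted separately. The function name copy_by_words is invented.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies `bytes` (a multiple of 4) one word at a time and returns the
   number of simple instructions a load/store loop would have executed. */
size_t copy_by_words(void *dst, const void *src, size_t bytes)
{
    size_t executed = 0;
    size_t remaining = bytes / 4;       /* load the word count          */
    executed += 1;
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (remaining > 0) {
        uint32_t word;
        memcpy(&word, s, 4);            /* load one word                */
        memcpy(d, &word, 4);            /* store it                     */
        s += 4;
        d += 4;
        remaining -= 1;                 /* decrement & test (and branch) */
        executed += 3;
    }
    return executed;
}
```

For 256 bytes this tallies 1 + 64 x 3 = 193 executed instructions against a single MVC, so a plain execution count credits the loop 193 times and the MVC once for the same work.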



Fri, 01 May 1998 03:00:00 GMT  

:    A simple count of instruction executions always shows up
: CISC instructions poorly, simply because you don't need to execute
: a lot of them to do useful work.

:    In other words, simple instruction counts cause CISC-type
: instructions to be "swamped" by RISC-type instruction counts.

Very perceptive!  I hadn't thought of that, but I think you're
absolutely right.

-- David Walker



Sat, 02 May 1998 03:00:00 GMT  


   :    A simple count of instruction executions always shows up
   : CISC instructions poorly, simply because you don't need to execute
   : a lot of them to do useful work.

   :    In other words, simple instruction counts cause CISC-type
   : instructions to be "swamped" by RISC-type instruction counts.

   Very perceptive!  I hadn't thought of that, but I think you're
   absolutely right.

Who cares about instruction counts?  Measure *time* spent in the
instructions.

I think you'll find that by this metric, MVCL, for example, isn't
worth having (especially since it has some irritating limitations,
such as 16MB operand length): an MVC loop does the trick nicely.  (On
the other hand, moving characters with a load-store loop is typically
too slow (especially on RISC machines with register interlock), so a
simple MVC command is a win.)

A lot of CISC instructions weren't well thought-out.  And sometimes
RISC goes too far.  Good engineering is finding the happy medium
between the extremes.

--

ExperNet                        phone: +1.415.949.8691
Systems Architect               fax:   +1.415.949.1395
KNS Project Leader              3945 Freedom Circle,
                                Suite 240, Santa Clara CA 95054



Sat, 02 May 1998 03:00:00 GMT  
 