MVCL, the Mighty 
 MVCL, the Mighty

Quote:




>---No, I haven't forgotten.  But there's one *mighty* important
>difference: the instructions to set up such a CISC move/compare
>are executed just *once* and only once, not for *every* byte moved.
>However, the same instructions are needed to set up RISC (load
>addresses of source/destination/length of transfer).
>   The move/compare itself requires just one instruction to be
>held in control -- not a loop.

Ah, the mighty MVCL instruction on the S/370, eh? This instruction
is of the type you describe. You set up FOUR registers with appropriate
contents and it'll go to town.
Of course they must be TWO pairs of even/odd registers, so your
register allocator may produce crummy code for the remainder of
the subroutine in order to have the register resources needed to
use MVCL.

The Mighty MVCL will move up to 16meg of data at a crack.
Of course, if you want to move
more than that [as in tinkering with a large piece of weather data
or something], then you have to write a loop.

The Mighty MVCL will pad your data with a specified character
such as zero or blank.
Of course, if you want to pad with something other than a single character,
such as "---this page intentionally left blank---", or a floating point
1.0d0 or a Unicode character, then you can't use it.

The Mighty MVCL also runs about half the speed of an MVC loop, at least
on any machine I've done timings on.

Oh, and the setup is about the same as the setup for an MVC loop.

The ONLY thing MVCL does right is: It'll move zero bytes! [MVC
insists on moving at least 1 byte -- false economy time...]

Now, what were you saying about how swell CISC instructions are?

Bob



Sun, 19 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty


: >---No, I haven't forgotten.  But there's one *mighty* important
: >difference: the instructions to set up such a CISC move/compare
: >are executed just *once* and only once, not for *every* byte moved.
: >However, the same instructions are needed to set up RISC (load
: >addresses of source/destination/length of transfer).
: >   The move/compare itself requires just one instruction to be
: >held in control -- not a loop.

: Ah, the mighty MVCL instruction on the S/370, eh? This instruction
: is of the type you describe. You set up FOUR registers with appropriate
: contents and it'll go to town.
: Of course they must be TWO pairs of even/odd registers, so your
: register allocator may produce crummy code for the remainder of
: the subroutine in order to have the register resources needed to
: use MVCL.

  I don't think 1 stm and 2 lm instructions is too much to pay for
   avoiding loops of 256 byte moves.

: The Mighty MVCL will move up to 16meg of data at a crack.
: Of course, if you want to move
: more than that [as in tinkering with a large piece of weather data
: or something], then you have to write a loop.

: The Mighty MVCL will pad your data with a specified character
: such as zero or blank.
: Of course, if you want to pad with something other than a single character,
: such as "---this page intentionally left blank---", or a floating point
: 1.0d0 or a Unicode character, then you can't use it.

: The Mighty MVCL also runs about half the speed of an MVC loop, at least
: on any machine I've done timings on.

   Suggest you review your timing routines.  There is NO address
   setup cycles for R+disp as in the mvc inst.  
   Or are you doing mvcl with a lot of pad chars and including that
   in your timing loop?

: Oh, and the setup is about the same as the setup for an MVC loop.

: The ONLY thing MVCL does right is: It'll move zero bytes! [MVC
: insists on moving at least 1 byte -- false economy time...]

: Now, what were you saying about how swell CISC instructions are?

: Bob
--
From the keyboard of Don Eakin --- Conversion Associates



Mon, 20 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty

Quote:




>: Of course they must be TWO pairs of even/odd registers, so your
>: register allocator may produce crummy code for the remainder of
>: the subroutine in order to have the register resources needed to
>: use MVCL.

>  I don't think 1 stm and 2 lm instructions is too much to pay for
>   avoiding loops of 256 byte moves.

I'm suggesting that register allocators in compilers may not see things
the same way, and that getting 2 pairs of even/odd registers is
harder than getting 4 independent registers. That may cause
poor allocation of registers for other parts of the program, and
they, in turn, may end up running slower, just so your MVCL can run
at all.

Quote:

>: The Mighty MVCL also runs about half the speed of an MVC loop, at least
>: on any machine I've done timings on.

>   Suggest you review your timing routines.  There is NO address
>   setup cycles for R+disp as in the mvc inst.  

No, but there are substantial costs having to do with nailing down
all the pages involved, updating registers [4 of them -- counts
and addresses], checking for potential interrupts, etc.

I did these timings on a /3081d and /3081k. It would be nice to
see timings on a current system.

Quote:
>   Or are you doing mvcl with a lot of pad chars and including that
>   in your timing loop?

No padding at all. Pure data movement: Copy this here array to
that there array. Plot this for N bytes, 0<=N<=100000 or so.
I think the mvc loop looks something like:
 l0    mvc 0(256,r1),0(r4)
       ar  r1,r2        
       bxle r4,r2,l0

Now, this one makes the register allocator a wee bit crazy as
well, as you need evenreg+0 1 2 for the bxle. In our system,
there was pretty well always a set reserved for such use.
If you don't like that, replace the bxle with la/ar, cr, bc.

The above moves a multiple of 256 bytes. There is some trailing code
that uses EXECUTE to move the tail fragment.
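
A minimal sketch of how the loop and the EXECUTE tail might fit together. This is an illustration only, not the code that was timed: the labels, the extra registers, and the entry conditions are all assumptions. It assumes the usual R0-R15 equates, an established base register (the literal needs addressability), and a non-negative byte count.

```asm
*  Hypothetical entry conditions (assumed, not from the original):
*    R1 = destination address    R4 = source address
*    R5 = byte count, N >= 0     R2,R3,R6 = scratch (R2/R3 even/odd)
COPY     LR    R6,R5              R6 = N
         SRL   R6,8               R6 = number of full 256-byte blocks
         LTR   R6,R6              any full blocks at all?
         BZ    TAIL               no: only the tail fragment remains
         LA    R2,256             R2 = BXLE increment
         LR    R3,R6
         BCTR  R3,0               R3 = blocks - 1
         SLL   R3,8               R3 = (blocks - 1) * 256
         AR    R3,R4              R3 = address of last full source block
L0       MVC   0(256,R1),0(R4)    move one full block
         AR    R1,R2              advance the destination
         BXLE  R4,R2,L0           advance the source; loop while <= R3
TAIL     N     R5,=F'255'         R5 = N mod 256 = tail length
         BZ    DONE               nothing left over, skip the EX
         BCTR  R5,0               MVC length code is length - 1
         EX    R5,TAILMVC         move the tail fragment
DONE     BR    R14                return to the caller
TAILMVC  MVC   0(0,R1),0(R4)      target of EX only, never run inline
```

The explicit BZ around the EX is there because MVC cannot express a zero-length move: a length code of 0 still moves one byte.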

BOb



Tue, 21 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty




: >: Of course they must be TWO pairs of even/odd registers, so your
: >: register allocator may produce crummy code for the remainder of
: >: the subroutine in order to have the register resources needed to
: >: use MVCL.
: >
: >  I don't think 1 stm and 2 lm instructions is too much to pay for
: >   avoiding loops of 256 byte moves.

: I'm suggesting that register allocators in compilers may not see things
: the same way, and that getting 2 pairs of even/odd registers is
: harder than getting 4 independent registers. That may cause
: poor allocation of registers for other parts of the program, and
: they, in turn may end up running slower, so your MVCL can run
: at all.

I think the point made was that no matter what your current allocation
of registers, it can be taken care of with one STM to save the regs, then
use MVCL, then (only one) LM to reload the saved values.  This would *not*
affect the overall allocation of registers, so the rest of the program would
not be bothered at all.  Getting 2 pairs of even/odd registers is not hard
at all!

-- David Walker

: >
: >: The Mighty MVCL also runs about half the speed of an MVC loop, at least
: >: on any machine I've done timings on.
: >
: >   Suggest you review your timing routines.  There is NO address
: >   setup cycles for R+disp as in the mvc inst.  

: No, but there are substantial costs having to do with nailing down
: all the pages involved, updating registers [4 of them -- counts
: and addresses], checking for potential interrupts, etc.

: I did these timings on a /3081d and /3081k. It would be nice to
: see timings on a current system.

: >   Or are you doing mvcl with a lot of pad chars and including that
: >   in your timing loop?

: No padding at all. Pure data movement: Copy this here array to
: that there array. Plot this for N bytes, 0<=N<=100000 or so.
: I think the mvc loop looks something like:
:  l0    mvc 0(256,r1),0(r4)
:        ar  r1,r2        
:        bxle r4,r2,l0

: Now, this one makes the register allocator a wee bit crazy as
: well, as you need evenreg+0 1 2 for the bxle. In our system,
: there was pretty well always a set reserved for such use.
: If you don't like that, replace the bxle with la/ar, cr, bc.

: The above moves a multiple of 256 bytes. There is some trailing
: that uses execute to move the tail fragment.

: BOb



Tue, 21 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty

Quote:





>>---No, I haven't forgotten.  But there's one *mighty* important
>>difference: the instructions to set up such a CISC move/compare
>>are executed just *once* and only once, not for *every* byte moved.
>>However, the same instructions are needed to set up RISC (load
>>addresses of source/destination/length of transfer).
>>   The move/compare itself requires just one instruction to be
>>held in control -- not a loop.

>Ah, the mighty MVCL instruction on the S/370, eh?
>This instruction is of the type you describe.
>You set up FOUR registers with appropriate contents and it'll go to town.

>Of course they must be TWO pairs of even/odd registers,
>so your register allocator may produce crummy code for the remainder of
>the subroutine in order to have the register resources needed to use MVCL.

Simple solution to the "lack" of registers:
use (R14,R15) and (R0,R1) as the two register-pairs.

These registers are always "available",
since you can't be doing the sequence:

        LA   R1,=A(Arg1,Arg2,...,ArgN)
        L    R15,=V(external_subroutine)
        BALR R14,R15    CALL EXTERNAL_SUBROUTINE(Arg1,Arg2,...,ArgN)
        LTR  R0,R0      Test return-code

(which sets/uses all 4 of those registers)
at precisely the same time as you're setting-up for the 'MVCL'.
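
A minimal sketch of that setup, combined with the single LM suggested elsewhere in this thread. LM wraps from R15 around to R0, so one LM R14,R1 fills both even/odd pairs. The field names, the sizes, and the OVERLAP label are made up for illustration. It assumes R0-R15 equates, an established base register, and that the return address in R14 has already been saved if it is still needed.

```asm
         LM    R14,R1,MOVEPARM    load R14,R15,R0,R1 (LM wraps 15 -> 0)
         MVCL  R14,R0             move SOURCE to TARGET, blank-padded
         BO    OVERLAP            CC=3: destructive overlap, no move
*        On completion R15 and the low 24 bits of R1 are zero, and
*        R14 and R0 have been advanced past the data processed.
*        OVERLAP is a hypothetical error label defined elsewhere.
*
*        Data areas, kept out of the instruction stream:
TARGET   DS    CL4096             hypothetical 4K destination
SOURCE   DS    CL1000             hypothetical shorter source
MOVEPARM DC    A(TARGET)          -> R14: destination address
         DC    A(L'TARGET)        -> R15: destination length
         DC    A(SOURCE)          -> R0:  source address
         DC    X'40',AL3(L'SOURCE) -> R1: pad byte (blank), src length
```

Because the lengths live in the low 24 bits of R15 and R1, this is also where the 16 meg limit per MVCL comes from.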



Wed, 22 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty





: : >: Of course they must be TWO pairs of even/odd registers, so your
: : >: register allocator may produce crummy code for the remainder of
: : >: the subroutine in order to have the register resources needed to
: : >: use MVCL.
: : >
: : >  I don't think 1 stm and 2 lm instructions is too much to pay for
: : >   avoiding loops of 256 byte moves.

: : I'm suggesting that register allocators in compilers may not see things
: : the same way, and that getting 2 pairs of even/odd registers is
: : harder than getting 4 independent registers. That may cause
: : poor allocation of registers for other parts of the program, and
: : they, in turn may end up running slower, so your MVCL can run
: : at all.

: I think the point made was that no matter what your current allocation
: of registers, it can be taken care of with one STM to save the regs, then
: use MVCL, then (only one) LM to reload the saved values.  This would *not*
: affect the overall allocation of registers, so the rest of the program would
: not be bothered at all.  Getting 2 pairs of even/odd registers is not hard
: at all!

: -- David Walker

How about 1 LM to set parms for the move?

Don Eakin

: : >
: : >: The Mighty MVCL also runs about half the speed of an MVC loop, at least
: : >: on any machine I've done timings on.
: : >
: : >   Suggest you review your timing routines.  There is NO address
: : >   setup cycles for R+disp as in the mvc inst.  

: : No, but there are substantial costs having to do with nailing down
: : all the pages involved, updating registers [4 of them -- counts
: : and addresses], checking for potential interrupts, etc.

: : I did these timings on a /3081d and /3081k. It would be nice to
: : see timings on a current system.

: : >   Or are you doing mvcl with a lot of pad chars and including that
: : >   in your timing loop?

: : No padding at all. Pure data movement: Copy this here array to
: : that there array. Plot this for N bytes, 0<=N<=100000 or so.
: : I think the mvc loop looks something like:
: :  l0    mvc 0(256,r1),0(r4)
: :        ar  r1,r2        
: :        bxle r4,r2,l0

: : Now, this one makes the register allocator a wee bit crazy as
: : well, as you need evenreg+0 1 2 for the bxle. In our system,
: : there was pretty well always a set reserved for such use.
: : If you don't like that, replace the bxle with la/ar, cr, bc.

: : The above moves a multiple of 256 bytes. There is some trailing
: : that uses execute to move the tail fragment.

: : BOb

--
From the keyboard of Don Eakin --- Conversion Associates



Thu, 23 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty




: >: Of course they must be TWO pairs of even/odd registers, so your
: >: register allocator may produce crummy code for the remainder of
: >: the subroutine in order to have the register resources needed to
: >: use MVCL.
: >
: >  I don't think 1 stm and 2 lm instructions is too much to pay for
: >   avoiding loops of 256 byte moves.

: I'm suggesting that register allocators in compilers may not see things
: the same way, and that getting 2 pairs of even/odd registers is
: harder than getting 4 independent registers. That may cause
: poor allocation of registers for other parts of the program, and
: they, in turn may end up running slower, so your MVCL can run
: at all.
: >
: >: The Mighty MVCL also runs about half the speed of an MVC loop, at least
: >: on any machine I've done timings on.
: >
: >   Suggest you review your timing routines.  There is NO address
: >   setup cycles for R+disp as in the mvc inst.  

: No, but there are substantial costs having to do with nailing down
: all the pages involved, updating registers [4 of them -- counts
: and addresses], checking for potential interrupts, etc.

: I did these timings on a /3081d and /3081k. It would be nice to
: see timings on a current system.

Not sure what you mean by nailing down pages, updating the 4 regs.,
etc., but tomorrow I will test moves on a 3121/440 with an array of 1
meg and post results tomorrow or the next day.

: >   Or are you doing mvcl with a lot of pad chars and including that
: >   in your timing loop?

: No padding at all. Pure data movement: Copy this here array to
: that there array. Plot this for N bytes, 0<=N<=100000 or so.
: I think the mvc loop looks something like:
:  l0    mvc 0(256,r1),0(r4)
:        ar  r1,r2        
:        bxle r4,r2,l0

: Now, this one makes the register allocator a wee bit crazy as
: well, as you need evenreg+0 1 2 for the bxle. In our system,
: there was pretty well always a set reserved for such use.
: If you don't like that, replace the bxle with la/ar, cr, bc.

: The above moves a multiple of 256 bytes. There is some trailing
: that uses execute to move the tail fragment.

: BOb

--
From the keyboard of Don Eakin --- Conversion Associates



Thu, 23 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty

Quote:

> I did these timings on a /3081d and /3081k. It would be nice to
> see timings on a current system.

It would be interesting.  The last reference I ever saw was Ed Stewart's
paper in the CME Newsletter #88, in which he did software timing of
instructions on a "slightly loaded 3084".

If I interpret his results correctly, Ed says a 4K move via MVCL operated
at approximately .175 MIPS.  A simple one-word MVC operated at 13.7 MIPS.

On a related note, Ed Stewart's instruction timing tables also say that
the following sequence:
  L     Rx,WORD
  LTR   Rx,Rx         (Load doesn't set CC)
is roughly twice as fast as:
  ICM   Rx,15,WORD    (But ICM does)
I coded the latter for years until I realized that its implementation
was shoddy -- switched to the former for awhile, but eventually got a
bellyful of saving nanoseconds.  I code for clarity these days, when
I do any code at all.  Must be getting old.

--
David Andrews



Thu, 23 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty



        >>  I don't think 1 stm and 2 lm instructions is too much to pay for
        >>   avoiding loops of 256 byte moves.

---MVCL can be set up with as little as one LM instruction.

        >I'm suggesting that register allocators in compilers may not see things
        >the same way, and that getting 2 pairs of even/odd registers is
        >harder than getting 4 independent registers. That may cause
        >poor allocation of registers for other parts of the program, and
        >they, in turn may end up running slower, so your MVCL can run
        >at all.

        >>: The Mighty MVCL also runs about half the speed of an MVC loop, at least
        >>: on any machine I've done timings on.

        >>   Suggest you review your timing routines.  There is NO address
        >>   setup cycles for R+disp as in the mvc inst.  

        >No, but there are substantial costs having to do with nailing down
        >all the pages involved, updating registers [4 of them -- counts
        >and addresses], checking for potential interrupts, etc.

---The "costs" are not more than doing the same thing on a RISC
machine.

   On a RISC, we need to increment address registers (2 off),
decrement length registers (2 off), test for zero (2 off)
for each byte moved.  The test for pending interrupt is done ONCE
for each instruction on RISC.

   The fact that some of these operations can be done in parallel
on CISC means that the cost for CISC is less than for RISC.

   For example, the tests for zero can be done simultaneously
and the results ORed simultaneously; the increments and decrements
can be done simultaneously on CISC.  The test for pending interrupts
can be done ONCE per BYTE moved (once for each instruction
executed on RISC--there may well be 8 of them per byte moved).

   "Nailing down" the pages costs the same on any system --
that's unrelated to RISC or CISC.  It's performed by
independent hardware.

   And the programming cost and debugging cost?-- two (perhaps 4)
instructions for MVCL, and perhaps 30 for doing it on RISC.

        >I did these timings on a /3081d and /3081k. It would be nice to
        >see timings on a current system.

        >>   Or are you doing mvcl with a lot of pad chars and including that
        >>   in your timing loop?

        >No padding at all. Pure data movement: Copy this here array to
        >that there array. Plot this for N bytes, 0<=N<=100000 or so.
        >I think the mvc loop looks something like:
        > l0    mvc 0(256,r1),0(r4)
        >       ar  r1,r2        
        >       bxle r4,r2,l0

        >Now, this one makes the register allocator a wee bit crazy as
        >well, as you need evenreg+0 1 2 for the bxle. In our system,
        >there was pretty well always a set reserved for such use.
        >If you don't like that, replace the bxle with la/ar, cr, bc.

        >The above moves a multiple of 256 bytes. There is some trailing
        >that uses execute to move the tail fragment.

        >BOb



Fri, 24 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty

|> >
|> > I did these timings on a /3081d and /3081k. It would be nice to
|> > see timings on a current system.
|>
|> It would be interesting.  The last reference I ever saw was Ed Stewart's
|> paper in the CME Newsletter #88, in which he did software timing of
|> instructions on a "slightly loaded 3084".
|>
|> If I interpret his results correctly, Ed says a 4K move via MVCL operated
|> at approximately .175 MIPS.  A simple one-word MVC operated at 13.7 MIPS.
|>
|> On a related note, Ed Stewart's instruction timing tables also say that
|> the following sequence:
|>   L     Rx,WORD
|>   LTR   Rx,Rx         (Load doesn't set CC)
|> is roughly twice as fast as:
|>   ICM   Rx,15,WORD    (But ICM does)
|> I coded the latter for years until I realized that its implementation
|> was shoddy -- switched to the former for awhile, but eventually got a
|> bellyful of saving nanoseconds.  I code for clarity these days, when
|> I do any code at all.  Must be getting old.
|>
|> --
|> David Andrews

Coding for clarity is the best choice unless the section of code requires
performance tuning.  Many times, a manufacturer will implement new
instructions in microcode in order to support them quickly.  For example,
program call and return on our Hitachi machines were unbelievably bad
until they got around to implementing them with hardware - or, at least,
hardware support.  At least you can make this choice with assembler and
tune later.  Trying to make a C or PL/I routine "behave" and generate
decent code when a performance problem turns up is no fun.  Many times,
the high level code is impossible to understand.


Fri, 24 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty

Quote:

>On a related note, Ed Stewart's instruction timing tables also say that
>the following sequence:
>  L     Rx,WORD
>  LTR   Rx,Rx         (Load doesn't set CC)
>is roughly twice as fast as:
>  ICM   Rx,15,WORD    (But ICM does)

My timings on a 3084 showed that ICM was 100ns while L/LTR was
about 74ns.  There is so much reordering within the CPU that it
probably depends on the surrounding half-dozen instructions.

Are BXLE/BXH ever worth using?
   LTR   R15,R15
   BNZ   JUMP
is about 100ns faster than
   BXH   R15,R15,JUMP
(another favourite trick) when R15 is zero and about 50ns faster
when it's non-zero.

My personal pet hate:
   TIME  TU
is an incredible 50 times slower than
   STCK
even when all clocks are working (as they usually are) - not significant
if you're just printing the date on some output, but significant if
you're religiously time-stamping all your control blocks for debugging
purposes.



Fri, 24 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty

Quote:


>My timings on a 3084 showed that ICM was 100ns while L/LTR was
>about 74ns.  There is so much reordering within the CPU that it
>probably depends on the surrounding half-dozen instructions.

>Are BXLE/BXH ever worth using?
>   LTR   R15,R15
>   BNZ   JUMP
>is about 100ns faster than
>   BXH   R15,R15,JUMP
>(another favourite trick) when R15 is zero and about 50ns faster
>when it's non-zero.

bxle is certainly a nice way to walk through a matrix control block
row by row:
   lm r0,r2,laf
   using ctlblock,r2
lp     ...
   bxle r2,r0,lp

   laf[0] is the length of the control block in bytes
   laf[1] is the address of the LAST row of the control block
          (not the address of the end of the last row)
   laf[2] is the address of the first row of the control block.

Now, if you adhere to a few simple design rules:
    a. Never code the length of the control block into an instruction.
       Always get it indirectly via l r0,laf[0]
    b. laf[0] should be initialized from an equate at the end of
       the dsect describing the layout of the control block.
    c. Never rearrange the control block, delete items from the
       middle, or add entries into the middle. Add items at the
       end only.
    d. Always use the laf bxle triple to find the control block,
       its length, etc.

Doing this lets you add new stuff to the control block by the simple expedient
of altering the dsect, reassembling the one module that initializes laf (if
there's more than one, you have a problem), and modifying the module that
allocates storage for the control block [reassembly SHOULD be all that's
required if you do it right].

You don't have to reassemble the world this way. Often much simpler
and less error prone.
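
A minimal sketch of the laf triple and the walk, to make the rules above concrete. Everything named here (the fields, NROWS, the DEMO module, the SUMROWS routine) is hypothetical, and the table is static for brevity, whereas the scheme described above initializes laf from a module that allocates the storage. It assumes R0-R15 equates and an established base register.

```asm
CTLBLOCK DSECT ,                  one row (field names are made up)
CBNAME   DS    CL8
CBCOUNT  DS    F
CBLEN    EQU   *-CTLBLOCK         row length, taken from the dsect
NROWS    EQU   4                  hypothetical number of rows
*
DEMO     CSECT ,                  hypothetical module name
*        Sum CBCOUNT over every row; base register assumed set up
SUMROWS  LM    R0,R2,LAF          R0=incr, R1=last, R2=first
         USING CTLBLOCK,R2
         SR    R3,R3              R3 = running total
LP       A     R3,CBCOUNT         work on the row addressed by R2
         BXLE  R2,R0,LP           R2 += R0; loop while R2 <= R1
         DROP  R2
         BR    R14
*
LAF      DC    A(CBLEN)           laf[0]: row length (the increment)
         DC    A(TABLE+(NROWS-1)*CBLEN) laf[1]: address of LAST row
         DC    A(TABLE)           laf[2]: address of first row
TABLE    DS    (NROWS)XL(CBLEN)   storage for the rows
```

Adding a field at the end of the dsect changes CBLEN, and with it laf[0], laf[1], and the size of TABLE, without touching the loop itself.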

I'm not sure who came up with this scheme, but it was probably
Roger D. Moore or Larry Breed. It's used heavily in apl\360
and its derivatives  and it works great.

Don't know about its performance compared to other techniques, though.

Bob



Fri, 24 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty
So-called instruction timing tables are *all* suspect (at least for any IBM
mainframe built within the last 20 years), since there are simply too many
things going on in the CPU to be able to be captured by a single number.
E.g., things like pipeline stalls, caching, TLB hits/misses, the
effectiveness of branch prediction, out-of-order execution, etc.

AND, unless you are still using a 308X processor, timings which were made
for the 308X should be discarded, as that machine had an internal
organization which bore very little resemblance to most of the rest of the
high-performance processors that Poughkeepsie has cranked out.  If you want
to pretend to measure instruction timings, at least measure a normal
processor like the 303X or one of the 3090s.


      IBM Research, Yorktown Heights

P.S.:  I don't know about the rest of you folks, but I have always found
general registers 0,1 and 14,15 to be convenient for use with CDS and, I
suppose, MVCL (not that I ever had the need for MVCL).



Sat, 25 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty

|>On a related note, Ed Stewart's instruction timing tables also say that
|>the following sequence:
|>  L     Rx,WORD
|>  LTR   Rx,Rx         (Load doesn't set CC)
|>is roughly twice as fast as:
|>  ICM   Rx,15,WORD    (But ICM does)
|
|My timings on a 3084 showed that ICM was 100ns while L/LTR was
|about 74ns.  There is so much reordering within the CPU that it
|probably depends on the surrounding half-dozen instructions.
|
|Are BXLE/BXH ever worth using?
|   LTR   R15,R15
|   BNZ   JUMP
|is about 100ns faster than
|   BXH   R15,R15,JUMP
|(another favourite trick) when R15 is zero and about 50ns faster
|when it's non-zero.
|
|My personal pet hate:
|   TIME  TU
|is an incredible 50 times slower than
|   STCK
|even when all clocks are working (as they usually are) - not significant
|if you're just printing the date on some output, but significant if
|you're religiously time-stamping all your control blocks for debugging
|purposes.
However, all of these instruction comparisons are extremely model dependent.
So unless a particular instruction combination is always faster than a
particular single instruction that the sequence is replacing, how do you
know if you are going to win or lose, and by how much?
--Rostyk


Sat, 25 Apr 1998 03:00:00 GMT  
 MVCL, the Mighty

Quote:



>|>the following sequence:
>However, all of these instruction comparisons are extremely model dependent.
>So unless a particular instruction combination is always faster than a
>particular single instruction that the sequence is replacing, how do you
>know if you are going to win or lose, and by how much?

a. You don't know. Writing model-dependent code is almost always
   a mistake. However, this thread arose from RISC/CISC discussions,
   and the section I deleted above on L/LTR vs ICM is a typical case
   of where RISC is a winner on any current machine -- You design
   orthogonal functionality and use it like an erector set to obtain
   desired functions. L/LTR has the added advantage that a code scheduler
   can often move the L backwards in time (by moving it earlier in the
   code sequence), thereby allowing the LTR/BC to proceed without
   pipeline stalls. Since ICM sets the cc, it MAY stall the pipeline
   [processor-dependent] if the data isn't there just yet.

   I think the only point worth keeping from this thread is that
   if you have a choice, 2 simpler instructions are probably better
   than one more complicated one, but unless you're writing a compiler,
   where many people will be using your code fragments [the ones generated,
   not the ones in the compiler], it probably doesn't make a big
   difference in most applications.

b. Sometimes, you are writing a very specific loop kernel for a very
   specific application which is going to eat a bazillion cpu cycles.
   In such cases, a factor of 2 in performance is worthwhile,
   and you might be willing to put time into doing careful analysis of
   it on a processor by processor basis. Fairly rare, I admit.

c. Sometimes, a specific processor has a perfectly ROTTEN implementation
   of some instruction (The Amdahl V8 SSK comes to mind -- it took
   MILLISECONDS to execute!), and you get nailed by it if it appears
   in your code. In the above V8 example, we used SSK as a way to
   isolate tasks in a multi-user APL system, and would, due to the
   limited number of storage keys available on the S/370, set the
   task-to-be-dispatched storage to an available key, load a psw with
   the same key, and send the task off to work. Interrupts that
   task switched would set the keys back to the normal value.
   This worked fine until the V8 came in. Then, in the course of
   benchmarking some matrix code, I noticed cpu spikes that were
   big enough to perform thousands of iterations of my inner loop.
   We found out it was SSK. Had to cease use of SSK and get
   creative. Times are rough.
   To be fair, the Amdahl designers thought SSK was something that
   gets used only at page fault/address space startup time. With
   performance like that, they guaranteed it.

Bob



Sun, 26 Apr 1998 03:00:00 GMT  
 