Instruction speeds 

Is there a list of machine instruction speeds available for the IBM 370
somewhere?  A URL would be the preferable resource.
In particular, what would be faster for moving a variable number (< 256)
of bytes (ignoring setup time), MVC with EX, or MVCL, or a loop where
aligned fullwords are loaded and stored, or an unrolled loop of the
same, or some other method I haven't thought of yet?  What about moving
more than 256 bytes?

Has anyone done an optimization study on this?  Could you even get
meaningful timing results on a multitasking system?


Thu, 12 Apr 2001 02:00:00 GMT  
Theta,

For your first question: No, there is no longer any published list
of instructions' speeds. Suffice it to say that each instruction,
given that the pipe is set up, will execute on the order of one
or, at most, two clock cycles. The 370 (390-class) engine has been
around for so long now that I suspect that there is very little
that can be done from a microcode standpoint to optimize any of
the regular instructions any more than that already accomplished.

(Exceptions to this are worst case multiply and divide.)

For your second example, ANY form of MVC is faster than an MVCL.
Even an MVC invoked with an EX. There is fastpath code in *most*
of the above mentioned engines which recognizes the field
propagation of moving a string of bytes from loc A to loc A+1.
This is assuming, of course that you intend to "clear" a field
to a known set of byte values. Even if it is not, the MVC still
wins.
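
For anyone who has not met the propagation idiom mentioned above: MVC is architected to behave as if it moves one byte at a time, left to right, so when the destination starts one byte past the source, the first byte gets copied through the whole field. A minimal C sketch of that behaviour follows. The names mvc_model and fill_field are made up for illustration, and the loop only models the architected result, not how any particular engine implements it.

```c
#include <stddef.h>

/* Model of MVC semantics: copy len bytes strictly left to right,
   one byte at a time, so overlapping operands see earlier stores. */
void mvc_model(unsigned char *dst, const unsigned char *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}

/* Fill field[0..len-1] with value using the A -> A+1 overlap trick:
   store the first byte, then "move" the field onto itself shifted by one. */
void fill_field(unsigned char *field, size_t len, unsigned char value)
{
    if (len == 0)
        return;
    field[0] = value;
    mvc_model(field + 1, field, len - 1);
}
```

Calling fill_field(buf, 8, 0x40) on an 8-byte buffer leaves every byte set to 0x40, which is the "clear a field to a known byte value" case described in the post.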

If you have to go over the 256 boundary, AND if the quantity is
a known value, then multiple MVC's are still preferred. IF the
value to be moved is an EXECUTE time value, that is the amount
to move is not known at assembly time, then the MVCL may be the
easiest to implement. However, I caution against the use of
MVCL especially if used frequently within what would be called
mainline code. That is, within the main program loop.

Again, if you're trying to clear something more than 256, and
you don't know how much at assembly time, don't worry. More
fastpath microcode can be taken advantage of with the MVCL.
The biggest penalty in MVCL will be the number of page
boundaries the instruction crosses...

Good luck...




Thu, 12 Apr 2001 02:00:00 GMT  

Quote:

> Theta,

> For your first question: No, there is no longer any published list
> of instructions' speeds. Suffice it to say that each instruction,
> given that the pipe is set up, will execute on the order of one
> or, at most, two clock cycles.

You should know better; after all, you mention MVCL a few lines later.
An MVCL with a large count will take a lot more than 2 cycles.

Quote:
> For your second example, ANY form of MVC is faster than an MVCL.

Aren't there optimizations for, e.g., clearing storage?

Quote:
> The biggest penalty in MVCL will be the number of page
> boundaries the instruction crosses...

How does that differ from MVC? If anything, there should be less of a
problem with MVCL, since the instruction is shorter than an MVC loop.

--

Shmuel (Seymour J.) Metz
Reply to host nsf (dot) gov, user smetz



Fri, 13 Apr 2001 03:00:00 GMT  
In a message dated 10-26-98, Shmuel (Seymour J.) Metz said to All about
"Instruction speeds"

Quote:

> Theta,

> For your first question: No, there is no longer any published list
> of instructions' speeds. Suffice it to say that each instruction,
> given that the pipe is set up, will execute on the order of one
> or, at most, two clock cycles.

SM>You should know better; after all, you mention MVCL a few lines later.
SM>An MVCL with a large count will take a lot more than 2 cycles.

And we have completely omitted floating point and other SS instructions,
including the ever speedy packed decimal. Few of these go in 1 or 2 clock
cycles, even though most integer RR and RX instructions do.

Quote:
> For your second example, ANY form of MVC is faster than an MVCL.

SM>Aren't there optimizations for, e.g., clearing storage?

Yes, for MVC, XC and MVCL. Possibly others, but these 3 at least have been
documented for some years as optimized.

Quote:
> The biggest penalty in MVCL will be the number of page
> boundaries the instruction crosses...

SM>How does that differ from MVC? If anything, there should be less of a
SM>problem with MVCL, since the instruction is shorter than an MVC loop.

Moreover, when MVCL page faults, the first page fault can be used to cause a
pre-emptive page in of all the subsequent page frames, because the extent of
memory required is known by then. This should mean that a fair sized MVCL
should suffer only 1 page fault, even when processing several page frames of
storage. Whether or not IBM implements VSM that way is open to question, but
the reference pattern should be easy to establish.

Regards

Dave
<Team PL/I>
___
 * MR/2 2.25 #353 * "Maytag" is my middle name; I'm an agitator.



Fri, 13 Apr 2001 03:00:00 GMT  

Quote:

>Theta,

>For your second example, ANY form of MVC is faster than an MVCL.
>Even an MVC invoked with an EX. There is fastpath code in *most*
>of the above mentioned engines which recognizes the field
>propagation of moving a string of bytes from loc A to loc A+1.
>This is assuming, of course that you intend to "clear" a field
>to a known set of byte values. Even if it is not, the MVC still
>wins.

>If you have to go over the 256 boundary, AND if the quantity is
>a known value, then multiple MVC's are still preferred. IF the
>value to be moved is an EXECUTE time value, that is the amount
>to move is not known at assembly time, then the MVCL may be the
>easiest to implement. However, I caution against the use of
>MVCL especially if used frequently within what would be called
>mainline code. That is, within the main program loop.

A former manager of mine experimented with MVC vs MVCL in his
previous shop.  He found that MVCL was fastest _only_ when
started on a 256-byte boundary.  They'd calculate the displacement
from the beginning of the source storage area to the next 256-byte
boundary and move that part with MVC, then use MVCL for the
rest of the area.  This was on S/370's (145's, I think), so
it may not be relevant on S/390's.
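
The split described above comes down to one piece of address arithmetic: how many bytes lie between the source address and the next 256-byte boundary. Here is a rough C sketch of it. The function names are hypothetical, and memcpy merely stands in for the MVC (head) and the MVCL (rest); nothing here says whether the split pays off on a given machine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes from addr up to the next 256-byte boundary (0 if already aligned). */
size_t bytes_to_boundary(uintptr_t addr)
{
    return (size_t)((256u - (addr & 0xFFu)) & 0xFFu);
}

/* Copy len bytes: a short head move up to the source's next 256-byte
   boundary (the MVC in the post), then one long move for the rest
   (the MVCL in the post). memcpy stands in for both instructions. */
void copy_aligned_tail(unsigned char *dst, const unsigned char *src, size_t len)
{
    size_t head = bytes_to_boundary((uintptr_t)src);
    if (head > len)
        head = len;          /* whole move ends before the boundary */
    memcpy(dst, src, head);
    memcpy(dst + head, src + head, len - head);
}
```

For a source at address X'10F0' the head is 16 bytes, so the long move starts at X'1100'.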

--Jim



Fri, 13 Apr 2001 03:00:00 GMT  

Quote:

> In particular, what would be faster for moving a variable number (< 256)
> of bytes (ignoring setup time), MVC with EX, or MVCL, or a loop where
> aligned fullwords are loaded and stored, or an unrolled loop of the
> same, or some other method I haven't thought of yet?  What about moving
> > 256 bytes?

A long time ago (> 20 years) I saw where changing MVCLs to loops of
MVCs with an EX for the left-over bytes sped up a program by about 10%
(it did a lot of string moving).
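
In case the shape of that replacement is not obvious, here is a minimal C sketch of it: full 256-byte moves in a loop, then one variable-length move for whatever is left. chunked_copy and MVC_MAX are names invented for this sketch, and memcpy stands in for MVC, so this shows only the control flow, not the original program's code.

```c
#include <stddef.h>
#include <string.h>

#define MVC_MAX 256  /* largest length a single MVC can move */

/* Copy len bytes as a loop of full 256-byte moves followed by one
   variable-length move for the left-over bytes (the EX'd MVC).
   Returns the number of moves issued. */
size_t chunked_copy(unsigned char *dst, const unsigned char *src, size_t len)
{
    size_t moves = 0;
    while (len >= MVC_MAX) {
        memcpy(dst, src, MVC_MAX);
        dst += MVC_MAX;
        src += MVC_MAX;
        len -= MVC_MAX;
        moves++;
    }
    if (len > 0) {           /* EX would supply len - 1 as the length code */
        memcpy(dst, src, len);
        moves++;
    }
    return moves;
}
```

A 600-byte copy takes three moves this way: two of 256 bytes and one of 88.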

Some years later, I observed that the PL/I "Optimizing" compiler had a
subroutine in its library that duplicated MVCL functionality, using
MVC loops.  When I replaced the subroutine with an MVCL, there was no
significant change in speed.

Given that simple instructions are hard-wired and complex instructions
are done in microcode, I'd be very surprised if the situation has
changed much.  Chances are, the MVCL is implemented in microcode just
like a loop of MVCs followed by an EX MVC.

[This kind of stuff ought to go into a FAQ somewhere ...]

BTW, if you're worrying about these kinds of micro-optimizations,
you're probably worrying about the wrong things.

--
Peter Ludemann



Fri, 13 Apr 2001 03:00:00 GMT  
. . .

Quote:

>Moreover, when MVCL page faults, the first page fault can be used to cause a
>pre-emptive page in of all the subsequent page frames, because the extent of
>memory required is known by then. This should mean that a fair sized MVCL
>should suffer only 1 page fault, even when processing several page frames of
>storage. Whether or not IBM implements VSM that way is open to question, but
>the reference pattern should be easy to establish.

Based on the observation of an MVCL causing repeated page faults with
ascending TEAs (as sometimes seen in the trace table), MVS does not do
this.

Andy Wood



Sat, 14 Apr 2001 03:00:00 GMT  
Oh this is an old chestnut...

Quote:

>A long time ago (> 20 years) I saw where changing MVCLs to loops of
>MVCs with an EX for the left-over bytes sped up a program by about 10%
>(it did a lot of string moving).

>Some years later, I observed that the PL/I "Optimizing" compiler had a
>subroutine in its library that duplicated MVCL functionality, using
>MVC loops.  When I replaced the subroutine with an MVCL, there was no
>significant change in speed.

>Given that simple instructions are hard-wired and complex instructions
>are done in microcode, I'd be very surprised if the situation has
>changed much.  Chances are, the MVCL is implemented in microcode just
>like a loop of MVCs followed by an EX MVC.

I love these arguments over instruction pathlengths. They can be interesting
from an abstract architectural perspective, but ultimately they don't
matter. The best practice is to use MVC where the length is known at
assembly time and it's 256 bytes or fewer, and use MVCL for more than 256.

The more interesting question is what to do when the length is NOT known at
assembly time. It now seems folkloric that MVCL <=> SLOW, but on modern
processors that doesn't seem to be the case. Consider the implementation...

MVC and MVCL are almost certainly BOTH at least partially implemented in
uCode on the CMOS machines. In both cases, there are storage operand
consistency issues to wrestle with (read POPs), but the instruction is just
another piece of work in the processor pipeline.

However, when you code the MVC/EX combination, you have...

(a) extra logic to break the storage into chunks.

(b) the EX instruction has to fetch a target instruction from a location
usually somewhat removed from the EX itself. That can cause cache spill/fill
etc. Then it has to feed it into the decode/execute stage with potential
disruption to the pipeline.

(c) you need to branch backwards to implement the loop. Branches always
stall the pipeline until sufficient instructions have committed for the h/w
to "know" which way it will go - although some machines may do speculative
execution.

So, based on the current state of microprocessor designs, I would guess that
a microcoded implementation would beat a hand coded implementation hands down
most of the time. There might be some pathological cases, but in general I
would say you should use MVCL when the length is NOT known at assembly time.

Chris.



Sat, 14 Apr 2001 03:00:00 GMT  

Quote:

> MVC and MVCL are almost certainly BOTH at least partially implemented in
> uCode on the CMOS machines. In both cases, there are storage operand
> consistency issues to wrestle with (read POPs), but the instruction is just
> another piece of work in the processor pipeline.

Chris, I'm not so sure about MVC having uCode. From a frequency of occurrence
point of view, MVC is certainly one of the top-ten hitters in assembled
or compiled code. It is unlikely that there would be any need for any
assistance in this arena...

Later,
BillB



Sat, 14 Apr 2001 03:00:00 GMT  

Quote:

>I love these arguments over instruction pathlengths. They can be interesting
>from an abstract architectural perspective, but ultimately they don't
>matter. The best practice is to use MVC where the length is known at
>assembly time and it's 256 bytes or fewer, and use MVCL for more than 256.

Other questions, such as whether you have the spare registers
required by MVCL should also enter into the decision.

If your program is doing so much copying that the difference in
speeds matters, you ought to rethink the program design and see if
you can find a way of avoiding much of that copying.  Failing that,
try coding it both ways and do some timings.
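
As a rough idea of what "code it both ways and do some timings" could look like, here is a small C harness. Everything in it is an assumption made for illustration: memcpy stands in for MVCL and for the MVC loop, the names are invented, and clock() is used because it reports CPU time charged to the process, which is less sensitive to other tasks than wall-clock time. An optimizing compiler may remove repeated copies whose results are never used, so treat any numbers as indicative only.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(unsigned char *, const unsigned char *, size_t);

/* Candidate 1: one long move (stand-in for MVCL). */
void copy_whole(unsigned char *d, const unsigned char *s, size_t n)
{
    memcpy(d, s, n);
}

/* Candidate 2: 256-byte pieces plus remainder (stand-in for an MVC loop). */
void copy_pieces(unsigned char *d, const unsigned char *s, size_t n)
{
    for (; n >= 256; d += 256, s += 256, n -= 256)
        memcpy(d, s, 256);
    memcpy(d, s, n);
}

/* CPU seconds used by reps calls of f; clock() measures processor time
   charged to this process rather than elapsed wall-clock time. */
double time_copy(copy_fn f, unsigned char *d, const unsigned char *s,
                 size_t n, long reps)
{
    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        f(d, s, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Run each candidate over the same buffers with a large repetition count, several times, and compare the spread of results before trusting any difference.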



Sat, 14 Apr 2001 03:00:00 GMT  
I agree with Neil about copying.

Some years ago I wrote a simple minded data set copy program that
just did

LOOP READ
     CHECK
     WRITE
     CHECK
     B    LOOP

out of a single buffer.  It used hardly any CPU time at all compared to

LOOP GET  (locate)
     PUT  (move from buffer read by GET)
     B    LOOP

The PUT macro in the second program fragment does a storage copy.

For readers who are MVS challenged: READ and WRITE are macros that start an
I/O event, sort of like the C language fread and fwrite functions.
CHECK is a macro that waits for an I/O event to complete, and then verifies
it completed OK.  By separating the event initiation and the wait for
completion, the program can do other work, which you cannot do with
standard C I/O.

GET and PUT are sort of like fgets and fputs in that they read a
complete logical record, with the underlying I/O hidden from the
program.

In my coding, I use MVCL a lot more than EX -> MVC.  A lot of times
I am doing the copying to move message segments to a completed message,
and the fact that MVCL is updating the output message pointer is
very convenient.  I have a protocol I usually follow to use registers
14 through 1 for the MVCL.  The 14-15 pair is the output side, and
the 0-1 pair is the input side.  It is a good way to use reg 0 for
an address, which it is normally not good for, and 14 through 1 are
usually safe from being altered by macros because, most of the time,
there are no macros in the message build path.

Sadly, though, most of the time you have to do the copying.  The first
program fragment is very, very fast in CPU usage terms.  However, it lacks
any functionality, such as record reblocking, which is a minimal
requirement for a real world data set copy program.


Quote:

>If your program is doing so much copying that the difference in
>speeds matters, you ought to rethink the program design and see if
>you can find a way of avoiding much of that copying.  Failing that,
>try coding it both ways and do some timings.

-- Steve Myers




Sun, 15 Apr 2001 03:00:00 GMT  

Quote:

> Did you also use a locate mode PUT?

> I imagine that:

> LOOP PUT   (also locate)
>      GET   (locate)
>      B     LOOP

> would perform nicely...so long as you used the same buffer

Nope. Won't work efficiently at all. Steve's first example using
GET-locate and PUT-move is the most efficient. (Other than track
to track copies using some home grown EXCP or something similar)

The only problem with QSAM locate from an efficiency standpoint
is not having QSAM completely reconstructing VBS records.



Sun, 15 Apr 2001 03:00:00 GMT  
 