My professor said.....

This is not an assignment for school, and the nature of this post will
portray that well. Just wanted to cover that base first.
:)

My 370 Assembler prof said this piece of code is optimized "to its maximum",
meaning computational efficiency. I always like to throw him curve balls if I
can, because he can be a {*filter*}. I like the guy; I wish there was another
way to put it. Anyway.

The thing is to represent the following expression in code.

E = ( K - T + R * P) / (the absolute value of ( Y - 5))

These are code snippets not entire listings.
Please excuse alignment.

L     5,R            load r5 with the value at R
M     4,P            multiply r5 by the value at P; product in r4-r5
L     9,K            load r9 with the value at K
S     9,T            r9 now contains the difference (K - T)
AR    5,9            add it to the product, sum in r5

LR    4,5            copy the sum to r4
L     8,Y            load r8 with the value at Y

S     8,=F'5'        difference (Y - 5) in r8
LPR   8,8            absolute value of r8
SRDA  4,32           sign-extend r4 into a 64-bit dividend in r4-r5
DR    4,8            divide r4-r5 by r8; quotient in r5, remainder in r4
ST    5,E            store the quotient from r5 in E

This does it, and I would like to see if it can be done with much less
code. I haven't managed it so far, but am still playing with it. It's pretty
explicit and goes step by step.

Henry Williams

Sun, 15 Apr 2001 03:00:00 GMT

Quote:

>E = ( K - T + R * P) / (the absolute value of ( Y - 5))
>These are code snippets not entire listings.
>Please excuse alignment.
>L        5,R            load r5 with value at r
>M       4,P            multiply r5 by whats at location p
>L        9,K            load r9 with value located at k
>S        9,T            r9 now contains the difference of (k-t)
>AR     5,9            sum it and stick it in r5
>LR        4,5
>L            8,Y
>S            8,=F'5'            diff in r8
>LPR        8,8                abs value of r8
>SRDA        4,32            make 64bit num for division
>DR            4,8                divide 4 and 5 by 8
>ST            5,E                put answer in 5
>This does it, and I would like to see iff it can be done with much less
>code.

I was only able to shave the instruction count by 1, by virtue of a
slightly better use of registers.

L     5,R               Load R
L     8,K               Load K
L     9,Y               Load Y
M     4,P               Compute R * P, result in reg 5
S     8,T               Compute K - T
S     9,=F'5'           Compute Y - 5
AR    8,5               K - T + R*P
LPR   4,9               ABS(Y-5)
SRDA  8,32
DR    8,4               (K-T+R*P)/(ABS(Y-5))
ST    9,E

The biggest difference in my code is that I interleave the three parts
of the computation.  That allows for more instruction overlap in a
pipelined processor and should give slightly higher performance.
But a discussion of that is probably too confusing to be taught in a
typical assembler class.  In any case, the difference in speed is
probably too small to matter unless this is inside the innermost loop
of a program.  The relatively slow DR instruction might use as much
as half of the computation time.

Sun, 15 Apr 2001 03:00:00 GMT
Reg 4 should be set to 0s before the first multiply.


-- Steve Myers

The E-mail addresses in this message are private property.  Any use of them
to  send  unsolicited  E-mail  messages  of  a  commerical  nature  will be
considered trespassing,  and the originator of the message will be  sued in
small claims court in Camden County,  New Jersey,  for the  maximum penalty
allowed by law.

Sun, 15 Apr 2001 03:00:00 GMT

Quote:

> Reg 4 should be set to 0s before the first multiply.

I'll admit to having looked this up to verify, but it is not
necessary to clear R4. The first paragraph of the multiply
instruction in the Book of Doom says, "The second word of the first
operand (multiplicand) is multiplied by the second operand
(multiplier) and the doubleword product is placed at the
first operand location."
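
A quick way to convince yourself is a sketch like the one below. It is untested, and the literal values are made up purely for illustration: the even register is deliberately loaded with junk before the multiply.

```asm
         L     4,=F'-1'          Junk in the even register
         L     5,=F'6'           Multiplicand goes in the odd register
         M     4,=F'7'           Regs 4-5 = 6 * 7; old reg 4 is ignored
*                                Afterwards reg 4 = 0 and reg 5 = 42
```

The doubleword product replaces both registers, so whatever was in reg 4 beforehand never takes part in the result.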

I know, Steve, you were just checking on us. Keeping us honest,
right?

Sun, 15 Apr 2001 03:00:00 GMT
I was actually thinking of overflow and sign generation issues.

However, as the Good Book says, you can't overflow, and the even
register is ignored on input.  I had to look it up, too, says the
Assembler man doing it for 30 years, sheepishly.

Quote:

>> Reg 4 should be set to 0s before the first multiply.

>I'll admit to having looked this to verify, but it is not
>necessary to clear R4. The first paragraph of the multiply
>instruction in Book of Doom says, "The second word of the first
>operand (multiplicand) is multiplied by the second operand
>(multiplier) and the doubleword product is placed at the
>first operand location."

>I know Steve, you were just checking on us. Keeping us honest,
>right?

-- Steve Myers

Mon, 16 Apr 2001 03:00:00 GMT

Quote:

>  I had to look it up, too, says the
> Assembler man doing it for 30 years, sheepishly.

If you've been doing it for 30 years, you shouldn't be sheepish about
looking things up anymore... anybody that doesn't look things up won't
last 30 years<g>  -steve

Mon, 16 Apr 2001 03:00:00 GMT
On Thu, 29 Oct 1998 14:08:20 -0800

Commented

I have been told by a pipeline guru that Load Multiple and Store
Multiple work faster when both of the following are true:

The first register in the range is even and the second register in the
range is odd.
The storage area is double word aligned.

<<

I wonder if there could be further comment on this.  The trick seems to be to
have the second register odd (regardless of first register), and to have the
address range end at the end of a double word. That is different.

In other words you are not compelled to use an even number of registers to
optimize LM/STM.
You are just dealing with a 64 bit wide transfer somewhere in the data path.

If you do have an even number of registers, using R-even,R-odd sets you up
for possible maximum speed, but only if you align on a 64-bit (doubleword)
boundary.

But of course you don't always want to move an even number of registers in this
way.

If you have an odd number of registers, there would seem to be two possible
optimizations:
1) use R-even,R-even and align the beginning of the area on the front
of a doubleword boundary, or
2) use R-odd,R-odd and align the end of the area on the back of a
doubleword boundary.
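
The two layouts for an odd register count could be set up roughly as in this sketch. The labels EVNSAVE and ODDSAVE and the register choices are made up for illustration, nothing here is measured, and whether either form is actually faster will depend on the processor model.

```asm
*        Option 1: R-even,R-even, area begins on a doubleword boundary
         STM   2,4,EVNSAVE
*        Option 2: R-odd,R-odd, area ends on a doubleword boundary
         STM   3,5,ODDSAVE
*
*        Storage definitions, placed outside the instruction stream
EVNSAVE  DS    0D                Force doubleword alignment
         DS    3F                Regs 2-4 start on the boundary
         DS    0D                Realign to the next doubleword
         DS    F                 One pad word
ODDSAVE  DS    3F                Regs 3-5 end on a doubleword boundary
```

In both cases only one fullword of the transfer falls outside an aligned doubleword; the options differ in whether that odd word comes first or last.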

Isn't there some counter-intuitive reason that the hardware actually optimizes
alternative 2 more effectively? Are the LM/STM instructions actually performed
in reverse order, 64 bits at a time?

Robert Rayhawk

Tue, 17 Apr 2001 03:00:00 GMT

Quote:

> On Thu, 29 Oct 1998 14:08:20 -0800

> Commented

> I have been told by a pipeline guru that Load Multiple and Store
> Multiple work faster when both of the following are true:

> The first register in the range is even and the second register in the
> range is odd.
> The storage area is double word aligned.

> <<

> I wonder if there could be further comment on this.  The trick seems to be to
> have the second register odd (regardless of first register), and to have the
> address range end at the end of a double word. That is different.

Did you ever notice that in the prolog code for PL/I, that it does a STM
R14,R11?

Since R12 is the TCA pointer, which never changes, you can get away with
this. I always wondered if they avoided the storing of R12 to save a
cycle.

I haven't looked at PL/I generated code since before LE/370, so things
may have changed recently.
--

Beyond Software, Inc.      http://www.beyond-software.com
"Transforming Legacy Applications"

Tue, 17 Apr 2001 03:00:00 GMT

Quote:

>Did you ever notice that in the prolog code for PL/I, that it does a STM
>R14,R11? Since R12 is the TCA pointer, which never changes, you can get
>away with this. I always wondered if they avoided the storing of R12 to
>save a cycle. I haven't looked at PL/I generated code since before LE/370,
>so things may have changed recently.

Quite a few compilers do the same mindless thing. The SAS C and C++
compilers 'frinstance. They had quite a lot of commonality with PL/I in the
distant past, so it's probably not too surprising. The crazy thing is that
STM 14,12,12(13) is such a staggeringly common statement that most
processors recognize it and optimize it. The 14,11 thing breaks that
optimization and almost certainly executes no faster.

Then there are the other generated code quirks. Some of the code fragments
emitted by the SAS compilers can be absolutely baffling. Sure they work, but
I would never have come up with them by hand. I don't have a handy listing,
but some of them are unbelievably obtuse.

Chris.

The standard LE prolog code stores 14,12 just like the book says it oughta.

Tue, 17 Apr 2001 03:00:00 GMT
[ posted and emailed ]

One optimisation I haven't seen mentioned yet:

Quote:
>L            8,Y
>S            8,=F'5'            diff in r8
>LPR        8,8                abs value of r8

could be replaced by

LA       8,5
S        8,Y
LPR      8,8

eliminating one memory reference and a 4-byte literal.
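
Folded into the interleaved sequence posted earlier in this thread, the change might look like the sketch below. It is untested and assumes the same fullword fields R, P, K, T, Y and E. It computes 5 - Y rather than Y - 5, which is harmless because only the absolute value is used. Like the earlier versions, it uses only the low-order word of the product.

```asm
         L     5,R               Load R
         L     8,K               Load K
         LA    9,5               Constant 5, no literal needed
         M     4,P               R * P, low-order word in reg 5
         S     8,T               K - T
         S     9,Y               5 - Y
         AR    8,5               K - T + R*P
         LPR   4,9               ABS(5-Y) = ABS(Y-5)
         SRDA  8,32              Sign-extend dividend into regs 8-9
         DR    8,4               Quotient in reg 9, remainder in reg 8
         ST    9,E               Store quotient in E
```

The instruction count stays at 11; the saving is one storage operand and the 4-byte literal.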

--

Sat, 21 Apr 2001 03:00:00 GMT

Quote:

>> On Thu, 29 Oct 1998 14:08:20 -0800

>> Commented

>Did you ever notice that in the prolog code for PL/I, that it does a STM
>R14,R11?
>Since R12 is the TCA pointer, which never changes, you can get away with
>this. I always wondered if they avoided the storing of R12 to save a
>cycle.
>I haven't looked at PL/I generated code since before LE/370, so things
>may have changed recently.
>--

I think PL/I F did the same thing, though in those days machines with
64 bit data path were rare.

-- glen

Sat, 21 Apr 2001 03:00:00 GMT

Quote:
>Did you ever notice that in the prolog code for PL/I, that it does a
>STM R14,R11?

>Since R12 is the TCA pointer, which never changes, you can get away with
>this. I always wondered if they avoided the storing of R12 to save a
>cycle.

The C/370 compiler went further and saved only the registers that a
function modified.  I seem to recall something like 1 cycle per
register was the cost (and a similar cost for the matching LM).  That
is, STM R14,R11,12(R13) is equivalent in cost to ST R14,12(,R13);
ST R15,16(,R13); ST R0,20(,R13); ...; ST R11,64(,R13) [except for
cache costs for the larger code size].
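
Written out in full, the two forms being compared would look like the sketch below. It assumes the standard save-area layout (R14 at offset 12, R15 at 16, then R0 through R12 at offsets 20 through 68) and the usual R0-R15 register equates; the cycle counts themselves are only as recalled above.

```asm
*        One instruction, fourteen registers
         STM   R14,R11,12(R13)
*
*        The same stores issued one register at a time
         ST    R14,12(,R13)
         ST    R15,16(,R13)
         ST    R0,20(,R13)
         ST    R1,24(,R13)
         ST    R2,28(,R13)
         ST    R3,32(,R13)
         ST    R4,36(,R13)
         ST    R5,40(,R13)
         ST    R6,44(,R13)
         ST    R7,48(,R13)
         ST    R8,52(,R13)
         ST    R9,56(,R13)
         ST    R10,60(,R13)
         ST    R11,64(,R13)
```

A compiler that saves only what a function modifies would emit just the relevant subset, for example a single STM R2,R4,28(R13) for a function that changes only R2 through R4.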

- peter ludemann

Sat, 21 Apr 2001 03:00:00 GMT
