IBM Mainframes and APL performance 
 IBM Mainframes and APL performance

Does anyone out there have any inside information as to which of the
current IBM mainframe processors work best with APL ???

The company I now work for (no names, no P45) recently "upgraded" from
a 9672-R75 G4 Enterprise Server to a 9672-RD6 G5 Enterprise Server.
Although the overall performance was quoted by IBM as being 30%
better, in fact the APL performance got 30% WORSE !!  They rapidly
backed out the upgrade, and went instead to a 9672-R85 G4.

Given that in an earlier job I worked on the IBM configurators, I
wasn't too surprised at what happened - only surprised that the IBM
sales rep hadn't checked out the performance figures properly.

IIRC, the G4 Enterprise Servers were optimised for business /
scientific purposes, and thus give better performance on large APL
applications than the later G5 and G6 models.

Anyone have any comments ?

John Warden



Mon, 03 May 2004 23:42:10 GMT  
 IBM Mainframes and APL performance
 Unfortunately your question has no simple answer.

 For example, if you asked "which IBM processor computes xxxx in SHARP APL most
quickly" then this could be answered for a given xxxx (although I don't claim to
have such answers readily available). More general questions can't.

 The reason is the complexity both of the APL implementations and the IBM
processors. IBM rates processors in MIPS because they have to use some measure,
but "IBM MIPS" does *not* normally mean "million instructions per second"
exactly. (Even if it did, you would need to know *which* instructions. A 5
megabyte MVCL (memory-to-memory move) is going to be quite a lot slower than a
simple LR, which only shuttles 32 bits between two registers.) Instead, it is a
measure of how quickly certain benchmarks run on the processors. This in turn is
related to things like clock speed, memory speed, cache organization,
inter-engine contention, and, very importantly, the interaction between all of
the preceding and the workload profile. So, for instance, a workload that spends
a lot of time in critical sections will perform better on a fast 4-way than on
an 8-way with a higher MIPS rating but slower engines. But a workload without
such severe critical sections may vastly outperform it on the 8-way.
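
 To make that concrete, here is a minimal Amdahl-style sketch in C (my own
illustration: the engine speeds and the 0.30 critical-section fraction are
invented numbers, not IBM figures):

    #include <stdio.h>

    /* Relative throughput of an n-way box whose engines each run at "speed",
       when a fraction "crit" of the work serializes in critical sections. */
    static double throughput(int engines, double speed, double crit)
    {
        double serial   = crit / speed;             /* one engine at a time */
        double parallel = (1.0 - crit) / (engines * speed);
        return 1.0 / (serial + parallel);
    }

    int main(void)
    {
        /* 30% critical sections: the fast 4-way wins even though the 8-way
           has the higher aggregate rating (8 x 0.7 = 5.6 vs 4 x 1.0 = 4.0). */
        printf("fast 4-way: %.2f\n", throughput(4, 1.0, 0.30));   /* ~2.11 */
        printf("slow 8-way: %.2f\n", throughput(8, 0.7, 0.30));   /* ~1.81 */

        /* No critical sections: the 8-way pulls ahead, as described above. */
        printf("fast 4-way: %.2f\n", throughput(4, 1.0, 0.00));   /* 4.00 */
        printf("slow 8-way: %.2f\n", throughput(8, 0.7, 0.00));   /* 5.60 */
        return 0;
    }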

 This rather simple observation applies to less obvious bottlenecks throughout
the very complex set of devices which is a modern IBM mainframe. And there are
other less obvious considerations as well. As one example, one time when we
looked into this we concluded that SHARP APL *at the time* outperformed most IBM
benchmarks on even their UP machines because the interpreter had good locality
of reference. Thus we were getting more benefit from the memory cache and the
TLB (a cache used in resolving virtual addresses) than their benchmarks.
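
 To illustrate the locality point with a generic C sketch (nothing drawn from
the interpreter itself), traversal order alone changes how hard the cache and
TLB have to work:

    #include <stdio.h>

    #define N 2048                       /* 2048 x 2048 doubles = 32 MB */
    static double M[N][N];

    /* Sequential walk: consecutive addresses, so each cache line and TLB
       entry is used to the full before it is evicted. */
    double sum_rows(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += M[i][j];
        return s;
    }

    /* Strided walk: successive accesses land a whole row (16 KB) apart,
       defeating both the cache and the TLB. */
    double sum_cols(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += M[i][j];
        return s;
    }

    int main(void)
    {
        /* Same answer, very different run times on most machines. */
        printf("%g %g\n", sum_rows(), sum_cols());
        return 0;
    }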

 So the bottom line is this: to predict how well a new machine will run an APL
workload given only the machine specifications, you would need a measure of how
that APL workload's use of the machine facilities compares with the benchmarks
IBM uses to prepare the specifications. This is not, in general, a measure that
is easy to acquire. And of course, because of the complexity of both workloads,
it would not necessarily be very useful in the next situation even if one could
manage to work it out in the present one.

 The real bottom line is that it is dangerous to make performance predictions,
but if you absolutely must, base them on *relevant* benchmarks: build a
benchmark suite that reflects as accurately as possible the demands you plan to
make upon the proposed processor, and run your own measurements of how well it
performs.
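
 As a trivial sketch of the shape such a measurement takes (the workload()
body here is a placeholder; you would substitute kernels lifted from your real
application's hot paths):

    #include <stdio.h>
    #include <time.h>

    /* Placeholder workload -- replace with code that exercises the machine
       the way your real APL application does. */
    static double workload(void)
    {
        double sum = 0.0;
        for (long i = 0; i < 10000000L; i++)
            sum += (double)i * 1.000001;
        return sum;
    }

    int main(void)
    {
        clock_t t0 = clock();
        double r = workload();
        clock_t t1 = clock();
        printf("result %g, CPU seconds %.3f\n",
               r, (double)(t1 - t0) / CLOCKS_PER_SEC);
        return 0;
    }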

 Of course this is quite a lot of trouble and usually sites don't bother. If you
need 30% more capacity you buy 40% and if you don't *get* 40% you can
demonstrate it with your real workload and perhaps beat IBM over the head with
the numbers. Discounts have been known to happen in such situations, and then
everyone is happy (Especially IBM ;-). If you need 8% and buy 8% but get -5%,
then you're out of luck.

 Sorry this message isn't more helpful, but I hope I've at least explained the
problems reasonably clearly.

 ../Leigh



Thu, 06 May 2004 03:33:27 GMT  
 IBM Mainframes and APL performance

Quote:

> Hi John..
> are you running IBM APL ?
> I once did a simple benchmark using Sharp APL as follows:

> A) IBM S/390 (MVS) G4 processors using SAM
> B) Pentium III-450 using SAX

> A simple benchmark was compared. I found that the integer performance
> was about equal, and that the FP performance was 3 times faster on the
> mainframe*. File I/O was about the same on reading, and somewhat faster
> on the mainframe on writing.
> Obviously, the architecture of the MF is superior for large multiple tasks -
> but the result was somewhat of a shock - considering the price difference.

> There are new AMD processors on the market - Athlon MP, and Hammer coming in
> the second half of 2002. The SPECint on the Hammer was estimated to be about 1400 !

Not exactly a fair test, since nobody in their right mind [let's avoid
discussing right vs wrong minds and CIOs here... 8^} ] would buy
a mainframe to do PC work. Try your benchmark again this way:

        - build a terabyte database that gets updated continuously through
                the day
        - have users hitting the database with arbitrary requests at 1000 Hz,
                e.g., "What was the average trading price of semiconductor
                base materials over the past 10 months?"

This will reveal why mainframes still cost a gazillion pazoozas -- they have
LOTS of I/O and memory bandwidth. They also Do Not Break [any more].

Quote:
> Ever heard of Linux clustering with regard to decentralisation of workload ?
> The SAX (Sharp APL for Unix) also has a Java interface which is really
> nice ! Also, the Sharp interpreter is considered to be probably the fastest on
> the market (years of optimisation) and certainly the most reliable !

> * Soliton was surprised - as this test was some 2 years ago, I would
> expect the results on the P3 to be better now ( but would expect you
> to try it on a Dual-Athlon MP CPU for example )!

I doubt if anyone's PC APL interpreter supports MPs very well yet, except in
a multi-user environment. That is, I don't think your matrix product or
dynamic programming problem is going to run twice as fast as it does on a UP
box.

The APEX APL compiler generates parallel code automatically via SISAL, and
got excellent speedups on CRAY Y-MPs, but I haven't heard much else about
automatic parallel work that's available to the public. There was work at
IBM Research in this area, but I don't think it was made available under
GPL or anything like that.

I am still trying to find time to rework the back end of APEX to crank out
usable Linux code, at which time I'll release the whole shebang under GPL.
I just need about 3 days a week over and above the 7 that are there now.

Bob

--
Robert Bernecky                  Snake Island Research Inc.

+1 416 203 0854                  Toronto, Ontario M5J 2B9 Canada
http://www.snakeisland.com



Fri, 07 May 2004 00:11:31 GMT  
 IBM Mainframes and APL performance

Quote:

> Does anyone out there have any inside information as to which of the
> current IBM mainframe processors work best with APL ???

I've seen a similar thing happen at an APL mainframe site.
The problem stems from the fact that different pieces of
hardware have performance characteristics that vary greatly
depending on workload. This is one reason why clock rate
[the Megahertz Wars] is often an inadequate measure of application
performance. Here are four possible measures of performance [not
exhaustive]:
        - main memory bandwidth
        - interprocessor interference on an MP
        - clock rate, pipeline depth, impact of pipeline stall/invalidation
                [effectively, this is a measure of integer performance]
        - floating point performance

A common misconception among many sales types (and APLers, too)
is that because APL is array-oriented, it:
        - makes highly effective use of array operations
        - uses a lot of floating point

For the vast majority of interpretive APL applications, neither of
these is true in practice.
        a. Measurements taken on a variety of APL interpreters
        on a variety of APL applications show that the mean
        array size over the course of applications is <2 elements.
        About 85% of all operations are on arrays of <32 elements.
        This means that interpretive overhead dominates the
        computation time, unless you're doing a lot of N^3
        operations such as inner products (in which case
        you may be using a poor algorithm for doing string search).

        Yes, it is possible to point to specific applications that
        may do better, but I have yet to see system-wide performance
        figures that do better than the above.
        This is why APL has a reputation for poor performance in an
        interpreted environment; it's also why the CDC STAR-100
        supercomputer was a complete flop -- arrays are rarely long
        enough.

        With regard to IBM mainframe performance, this suggests that
        interpreted APL execution time is dominated by RISC-like
        operations -- load, store, compare, branch, load address, add,
        and a few others. Hardware monitor statistics bear out this
        assertion.

        b. Floating point usage in any interpreted APL environment
        is generally so low as to be treated as noise -- a few percent
        maximum. Again, hardware monitor measurements confirm this.
        Yes, inner product +.× can drive this up to about 50%,
        but I'm talking about aggregate application performance.
        Assume you have an application that executes 10 primitives
        for each floating point primitive that executes. By
        floating point primitive, I mean one of the elementary or
        arithmetic functions, such as plus, times, divide, power, log.
        Structural and selection primitives [reshape, indexing,
        rotate, reversal, transpose...] make little or no use of floating
        point hardware. Their performance is limited, in general,
        by main memory bandwidth, as the sketch below suggests.
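
        A reversal primitive's inner loop, rendered as generic C (not any
        particular interpreter's code), shows why:

        #include <stdio.h>

        /* No arithmetic at all: each iteration is one load and one store,
           so main memory bandwidth sets the speed limit. */
        void reverse(double *z, const double *a, long n)
        {
            for (long i = 0; i < n; i++)
                z[i] = a[n - 1 - i];
        }

        int main(void)
        {
            double a[5] = { 1, 2, 3, 4, 5 }, z[5];
            reverse(z, a, 5);
            printf("%g %g %g %g %g\n", z[0], z[1], z[2], z[3], z[4]);
            return 0;
        }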

        Oops -- back to my example. Each primitive requires probably 100 to
        1000 instructions to get through the syntax analyzer and validate
        types, rank, shape and any conformability rules before it finally
        gets to do any actual work on the data itself. Let's assume 500
        instructions. So, my 10 operations take 5000 instructions before
        they do any useful work.

        Now, we hit the floating point thingy. If array sizes are 10 elements,
        then we're doing about 10 floating point ops for A+B, combined with
        5000 setup instructions [plus whatever non-floating ops were done
        in the other 9 primitives]. It doesn't matter how fast the floating
        point box is here -- it's negligible in terms of application execution
        time. 100-element arrays are no better. Well, let's try 10000-element
        arrays. We now have 10000 add instructions, with 5000 setup
        instructions. That gives us 2/3 floating point, so faster floating
        point should Really Help.

        Well, not really. That presumes the other 9 primitives don't mess with
        the 10000-element arrays, and that we're doing adds only.
        Unfortunately, the IBM mainframe needs an op sequence like this for
        A+B, where they're both arrays:
           lp:  LD      D0,A(index)       load A[i] into FPR0
                LD      D2,B(index)       load B[i] into FPR2
                ADR     D0,D2             FPR0 := FPR0 + FPR2
                STD     D0,Z(index)       store the sum into Z[i]
                BXLE    index,incr,lp     bump index, branch back while in range
        The LD/STD are floating load/store ops that are memory-limited in
        speed. The BXLE is a loop increment and closure instruction, one of my
        favorites. The ADR is the floating add and is the only floating op in
        the loop. The other 4 ops are "integer".

        Now, IBM has done some Very Nice Work in overlapping the execution of
        the integer ops with the floating point op, but you should see that
        it is Very Hard to push the floating point unit to saturation in
        an interpreted environment.
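
        Putting that arithmetic in one runnable place (the 500-instruction
        setup cost and the 10:1 primitive mix are the assumptions above, not
        measurements):

        #include <stdio.h>

        int main(void)
        {
            const double setup_per_prim = 500.0; /* overhead instructions  */
            const double prims_per_fp   = 10.0;  /* primitives per FP prim */
            long sizes[] = { 10, 100, 10000 };

            for (int i = 0; i < 3; i++) {
                double n  = (double)sizes[i];
                double fp = n / (n + setup_per_prim * prims_per_fp);
                printf("%6ld elements: %5.1f%% floating point\n",
                       sizes[i], 100.0 * fp);
            }
            return 0;   /* 10 -> 0.2%, 100 -> 2.0%, 10000 -> 66.7% */
        }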

        Compilers, of course, can do better, by methods such as loop fusion and
        array contraction, but you're presumably using an interpreter.
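
        As a sketch of what that means, with C loops standing in for
        compiler-generated code:

        #define N 10000
        double A[N], B[N], C[N], Z[N];

        /* Interpreter-style: one loop per primitive for Z := (A + B) * C,
           with a full-size temporary written and re-read through memory. */
        void unfused(void)
        {
            static double T[N];
            for (int i = 0; i < N; i++) T[i] = A[i] + B[i];   /* plus  */
            for (int i = 0; i < N; i++) Z[i] = T[i] * C[i];   /* times */
        }

        /* Fused: one pass over the data, and the temporary array
           contracts to a single register. */
        void fused(void)
        {
            for (int i = 0; i < N; i++) {
                double t = A[i] + B[i];
                Z[i] = t * C[i];
            }
        }

        int main(void) { unfused(); fused(); return 0; }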

If you go to the IBM web site and look around, I think you'll see some
performance statistics for their mainframes. They give a vector of
performance figures for each processor for pseudo-applications such as
[I'm doing this from memory...] "scientific", "web server", "business"...
These figures vary GREATLY from machine to machine, and they do NOT track
each other very well, which is what I was saying at the top of this
message. Your application's performance is probably a weighted average of
those figures, where the weights are application-dependent.
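
As a toy illustration of that weighted average (every number here is
invented, not IBM's published figures):

    #include <stdio.h>

    int main(void)
    {
        double rating[] = { 1.30, 0.95, 1.10 }; /* "scientific", "web",
                                                   "business" ratings     */
        double weight[] = { 0.05, 0.15, 0.80 }; /* this app's profile     */
        double perf = 0.0;

        for (int i = 0; i < 3; i++)
            perf += weight[i] * rating[i];
        printf("expected relative performance: %.2f\n", perf);  /* ~1.09 */
        return 0;   /* nothing like the headline "scientific" figure */
    }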

I'll bet the ratio of new/old performance for floating point was
a lot higher than the ratio of business/server/integer performance.

Bottom line: A Ferrari might go like stink, but if you want to move
a 40-person Squamish team across the continent, get an intercity bus
instead.

Bob

Quote:
> The company I now work for (no names, no P45) recently "upgraded" from
> a 9672-R75 G4 Enterprise Server to a 9672-RD6 G5 Enterprise Server.
> Although the overall performance was quoted by IBM as being 30%
> better, in fact the APL performance got 30% WORSE !!  They rapidly
> backed out the upgrade, and went instead to a 9672-R85 G4.

> Given that in an earlier job I worked on the IBM configurators, I
> wasn't too surprised at what happened - only surprised that the IBM
> sales rep hadn't checked out the performance figures properly.

> IIRC, the G4 Enterprise Servers were optimised for business /
> scientific purposes, and thus give better performance on large APL
> applications than the later G5 and G6 models.

> Anyone have any comments ?

> John Warden

--
Robert Bernecky                  Snake Island Research Inc.

+1 416 203 0854                  Toronto, Ontario M5J 2B9 Canada
http://www.snakeisland.com


Fri, 07 May 2004 00:11:32 GMT  
 