Faster than assembler (Was: Hang on, isn't Forth out of date now?)


: We already have a memory-access-time limitation on execution speed.
: Look at any program that does number-crunching on a large data set:
: with fast co-processors, the arithmetic execution time (except for
: things like transcendental functions) is swamped by the fetch/store
: time. Caching doesn't help much because caches are finite in size.
: Thus the time eventually asymptotes to the DRAM access time. There
: was an article in DDJ some time ago timing fortran compilers running
: on Pentium machines. There was a substantial hit in Mflops going
: from small- to large-matrix problems, using otherwise-identical
: software. The "problem" was eventually traced to exceeding the size
: of pipelines and/or cache memory.

OK, this ain't "comp.lang.fortran", but ... :-)

It has been possible to write FORTRAN compilers for vector, parallel and
pipelined architectures, including cached machines such as the Pentium,
Pentium Pro and Intel i860 for about 20 years now!  I used to work for a
company, Floating Point Systems, that pioneered one such compiler.  The
people who built this compiler now call themselves The Portland Group
and have extended their expertise to other languages and other
architectures.  These compilers optimize for the architecture, including
pipelines and caches, and are capable of gaining as much of an advanced
architecture's performance as a conventional compiler does for a
conventional architecture.

It is also possible to rewrite one's FORTRAN code so that the memory
hierarchy is exploited rather than becoming a bottleneck.  Large matrix
operations are particularly easy to deal with, especially when the
matrices are dense.  Check into old issues (1980s) of the SIAM Journal
on Scientific and Statistical Computing for the works of Jack Dongarra,
Danny Sorensen, Charles Van Loan and Christian Bischof.  The trick is to
keep numbers in the cache and re-use them over and over and over again!
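
The flavor of the trick, in a few lines of Fortran 90 (my own sketch,
not lifted from those papers -- the routine name and the block size NB
are mine, and NB should be picked so that three NB x NB blocks fit in
cache at once):

   subroutine blkmm(n, nb, a, b, c)
   ! Blocked C = C + A*B.  Each NB x NB block of A and B is
   ! fetched into cache once per block-pass and then re-used
   ! NB times, instead of once per use.
   integer, intent(in) :: n, nb
   double precision, intent(in)    :: a(n,n), b(n,n)
   double precision, intent(inout) :: c(n,n)
   integer :: i, j, k, ii, jj, kk
   do jj = 1, n, nb
      do kk = 1, n, nb
         do ii = 1, n, nb
            ! all three blocks touched below are cache-resident
            do j = jj, min(jj+nb-1, n)
               do k = kk, min(kk+nb-1, n)
                  do i = ii, min(ii+nb-1, n)
                     c(i,j) = c(i,j) + a(i,k)*b(k,j)
                  end do
               end do
            end do
         end do
      end do
   end do
   end subroutine blkmm

The naive triple loop streams on the order of N^3 operands past the
cache; the blocked version moves only on the order of N^3/NB, and that
factor of NB is where the Mflops come back.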

In summary, there is no excuse for poor performance on an advanced
architecture for dense matrix problems.  You can't blame it on the
architecture, you can't blame it on lousy compilers because there are
good ones, and you can't blame it on the language because the techniques
are there whether you program in assembler, FORTRAN, "C" or Forth.

I think I'll crosspost this to "comp.lang.fortran" where my old FPS
buddies hang out :-).

--

Yesterday, I walked by a BMW.  I saw scratches on the bumper, and I stooped
down for a closer look.  It was engraving: "My Other Car is a '73 Pinto!"



Tue, 09 Feb 1999 03:00:00 GMT  
 Faster than assembler (Was: Hang on, isn't Forth out of date now?)

:In summary, there is no excuse for poor performance on an advanced
:architecture for dense matrix problems.  You can't blame it on the
:architecture, you can't blame it on lousy compilers because there are
:good ones, and you can't blame it on the language because the techniques
:are there whether you program in assembler, FORTRAN, "C" or Forth.

    OK, how do I write good code for dense matrix problems in
Forth for parallel or other advanced architectures?

--



Thu, 11 Feb 1999 03:00:00 GMT  
 Faster than assembler (Was: Hang on, isn't Forth out of date now?)

        [ deleted ]

Quote:
> OK, this ain't "comp.lang.fortran", but ... :-)

> It has been possible to write FORTRAN compilers for vector, parallel and
> pipelined architectures, including cached machines such as the Pentium,
> Pentium Pro and Intel i860 for about 20 years now!  I used to work for a
> company, Floating Point Systems, that pioneered one such compiler.  The
> people who built this compiler now call themselves The Portland Group
> and have extended their expertise to other languages and other
> architectures.  These compilers optimize for the architecture, including
> pipelines and caches, and are capable of gaining as much of an advanced
> architecture's performance as a conventional compiler does for a
> conventional architecture.

And so?

Quote:
> It is also possible to rewrite one's FORTRAN code so that the memory
> hierarchy is exploited rather than becoming a bottleneck.  Large matrix
> operations are particularly easy to deal with, especially when the
> matrices are dense.  Check into old issues (1980s) of the SIAM Journal
> on Scientific and Statistical Computing for the works of Jack Dongarra,
> Danny Sorensen, Charles Van Loan and Christian Bischof.  The trick is to
> keep numbers in the cache and re-use them over and over and over again!

        [ more deleted ]

I would be interested to know how, when implementing something like
the Gauss-Jordan method for solving linear equations, or the LU decom-
position with partial pivoting, one can _predict_ where the
appropriate pivot row will be, on a dense matrix too large to fit into
a given memory segment. I do not believe it can be done efficiently,
on a sequential machine. Or, to put it another way, there are many
cases where you simply _cannot_ keep the same numbers in cache and
"...reuse them over and over and over again!" I am sure the gentlemen
referred to above are all dynamite programmers, but these are not new
problems and we had to know all about this stuff back in the 1960s,
when CPUs were slow, core memory was small and bulk memory was
sequential (i.e. tape).

Algorithms for (parallel) array processors are, I admit, a different
kettle of fish, and somewhat off the subject I was discussing.

If you can provide a precise reference to a method for getting around
what seems to me a quite fundamental limitation:

        the time to process a _large_ amount of data is limited by
        the access speed of the bulk memory store (assuming that is
        the slowest part of the system),

I will be glad to read it and comment on it. On issues like this
I am from Missouri, since I believe you are speaking of a violation
of information theory, or a way to implement time travel.

--
Julian V. Noble



Thu, 11 Feb 1999 03:00:00 GMT  
 Faster than assembler (Was: Hang on, isn't Forth out of date now?)

Quote:

> [...]
> Irrelevant. You don't a priori know where the pivot row is. Once you find it,
> sure, you can keep it cached. But you have to find it--that's where the delay
> comes from.

Well, you look for and re-cache the pivot N times.  The search for
the correct pivot takes O(N^2) comparisons.  The actual reduction
of the rows takes O(N^3) multiply-add operations.  During this last,
and most significant, group of operations, the pivot data is cached.  
Even the non-cached non-pivot rows benefit if the cache used does
read-ahead.

The O(N^3) operations dominate unless the cache miss penalty is
*enormous* and the arrays are small.  For large arrays, the O(N^3)
still dominates.  Of course, if the rows are large enough that they
don't fit into the cache, you have to do some loop unrolling to get
the best speed.  But still, the time needed to select the pivot is
not significant.
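
To see where those counts come from, here's a bare sketch in Fortran
(mine, unblocked, and assuming a nonsingular matrix -- a production
code would block the update loop the way the library routines do):

   subroutine lupp(n, a)
   ! In-place LU factorization with partial pivoting.
   integer, intent(in) :: n
   double precision, intent(inout) :: a(n,n)
   double precision :: row(n)
   integer :: i, j, k, p
   do k = 1, n-1
      p = k                          ! O(N) pivot search, done N
      do i = k+1, n                  ! times over: O(N^2) in all
         if (abs(a(i,k)) > abs(a(p,k))) p = i
      end do
      if (p /= k) then               ! swap rows k and p
         row = a(k,:)
         a(k,:) = a(p,:)
         a(p,:) = row
      end if
      do i = k+1, n                  ! O(N^2) reduction, done N
         a(i,k) = a(i,k)/a(k,k)      ! times over: O(N^3) in all;
         do j = k+1, n               ! pivot row k stays hot in
            a(i,j) = a(i,j) - a(i,k)*a(k,j)   ! cache throughout
         end do
      end do
   end do
   end subroutine lupp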

J. Giles
Ricercar Software



Thu, 11 Feb 1999 03:00:00 GMT  
 Faster than assembler (Was: Hang on, isn't Forth out of date now?)


        >I would be interested to know how, when implementing something like
        >the Gauss-Jordan method for solving linear equations, or the LU decom-
        >position with partial pivoting, one can _predict_ where the
        >appropriate pivot row will be, on a dense matrix too large to fit into
        >a given memory segment. I do not believe it can be done efficiently,
        >on a sequential machine. Or, to put it another way, there are many
        >cases where you simply _cannot_ keep the same numbers in cache and
        >"...reuse them over and over and over again!"

---Oh, I don't know . . .  In Gauss-Jordan elimination,
a repeated operation is that a multiple of the pivot equation
is subtracted in turn from each of the other equations.
In other words, the coefficients of each pivot equation
are re-used over and over again.



Fri, 12 Feb 1999 03:00:00 GMT  
 Faster than assembler (Was: Hang on, isn't Forth out of date now?)

Quote:


>    >I would be interested to know how, when implementing something like
>    >the Gauss-Jordan method for solving linear equations, or the LU decom-
>    >position with partial pivoting, one can _predict_ where the
>    >appropriate pivot row will be, on a dense matrix too large to fit into
>    >a given memory segment. I do not believe it can be done efficiently,
>    >on a sequential machine. Or, to put it another way, there are many
>    >cases where you simply _cannot_ keep the same numbers in cache and
>    >"...reuse them over and over and over again!"

> ---Oh, I don't know . . .  In Gauss-Jordan elimination,
> a repeated operation is that a multiple of the pivot equation
> is subtracted in turn from each of the other equations.
> In other words, the coefficients of each pivot equation
> are re-used over and over again.

Irrelevant. You don't a priori know where the pivot row is. Once you find it,
sure, you can keep it cached. But you have to find it--that's where the delay
comes from.

--
Julian V. Noble



Fri, 12 Feb 1999 03:00:00 GMT  
 Faster than assembler (Was: Hang on, isn't Forth out of date now?)


Quote:

>> [...]
>> Irrelevant. You don't a priori know where the pivot row is. Once you find it,
>> sure, you can keep it cached. But you have to find it--that's where the delay
>> comes from.
>Well, you look for and re-cache the pivot N times.  The search for
>the correct pivot takes O(N^2) comparisons.  The actual reduction
>of the rows takes O(N^3) multiply-add operations.  During this last,
>and most significant, group of operations, the pivot data is cached.  
>Even the non-cached non-pivot rows benefit if the cache used does
>read-ahead.
>The O(N^3) operations dominate unless the cache miss penalty is
>*enormous* and the arrays are small.  For large arrays, the O(N^3)
>still dominates.  Of course, if the rows are large enough that they
>don't fit into the cache, you have to do some loop unrolling to get
>the best speed.  But still, the time needed to select the pivot is
>not significant.
>J. Giles
>Ricercar Software

Yup!  By the way, most of this wisdom for dense linear algebra is
embodied (in FORTRAN) in Dongarra et al.'s LAPACK and its components,
the level 2 and level 3 BLAS.  If you're FORTRAN literate or just
curious, send e-mail to


and in the body of the message put

   send index

A daemon will send you a long reply with further instructions on how to
get software from the Netlib archives.  Look for LAPACK, BLAS2 and
BLAS3.  If someone would supply me with a Pentium, an ANS Forth with
hardware floating point for it and a few months of paid time off from my
current job, I'd port this stuff for the Forth Scientific Library.  Then
again, maybe Julian Noble and his students would take on the challenge,
since he doesn't think it can be done :-).
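
(If memory serves, the same daemon also takes requests of the form

   send index from lapack

to get just one library's index -- treat that syntax as my
recollection rather than gospel.)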



Sat, 13 Feb 1999 03:00:00 GMT  
 Faster than assembler (Was: Hang on, isn't Forth out of date now?)

Quote:



> >> [...]
> >> Irrelevant. You don't a priori know where the pivot row is. Once you find it,
> >> sure, you can keep it cached. But you have to find it--that's where the delay
> >> comes from.

> >Well, you look for and re-cache the pivot N times.  The search for
> >the correct pivot takes O(N^2) comparisons.  The actual reduction
> >of the rows takes O(N^3) multiply-add operations.  During this last,
> >and most significant, group of operations, the pivot data is cached.  
> >Even the non-cached non-pivot rows benefit if the cache used does
> >read-ahead.

> >The O(N^3) operations dominate unless the cache miss penalty is
> >*enormous* and the arrays are small.  For large arrays, the O(N^3)
> >still dominates.  Of course, if the rows are large enough that they
> >don't fit into the cache, you have to do some loop unrolling to get
> >the best speed.  But still, the time needed to select the pivot is
> >not significant.

> >J. Giles
> >Ricercar Software

> Yup!  By the way, most of this wisdom for dense linear algebra is
> embodied (in FORTRAN) in Dongarra et al.'s LAPACK and its components,
> the level 2 and level 3 BLAS.  If you're FORTRAN literate or just
> curious, send e-mail to


> and in the body of the message put

>    send index

> A daemon will send you a long reply with further instructions on how to
> get software from the Netlib archives.  Look for LAPACK, BLAS2 and
> BLAS3.  If someone would supply me with a Pentium, an ANS Forth with
> hardware floating point for it and a few months of paid time off from my
> current job, I'd port this stuff for the Forth Scientific Library.  Then
> again, maybe Julian Noble and his students would take on the challenge,
> since he doesn't think it can be done :-).

I'll look at it, despite not having seen references to the actual
SIAM articles, and see if I can comment on what they are doing and
what the relevance might be to my original posting. It is always
possible that I have/am overlook(ed/ing) the obvious. But I think
it more likely we are speaking of different things.

I picture the situation like a highway with cars moving at some
average speed (possibly the info superhighway?). At some point
they must pass a toll collection point. If there are many collectors
(fan out) then the parallelism might allow cars to pass through faster
than they arrive, in which case the queues are mostly empty and the
bottleneck is the highway speed. Conversely, if there is but one
toll collector (fan in), then the toll station becomes the bottleneck.

In matrix triangularization, every element must be fetched and stored
an average of N/3 times. (This is why it takes N^3/3 time.) There are
also N/3 multiplications and subtractions per element.

If there is a cache, then data must be fetched/stored to/from it and
slower memory. The claim is that by cleverly reordering the calculation
and by exploiting parallelism, the moves to/from slow memory can be
overlapped with the processing of the data in the cache, so that the
time needed for cache<->memory moves becomes irrelevant. I claim that
this can only be true if the processing time for one cache-full exceeds
the time to fetch/store a cache-full.
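
To put numbers on that condition (my own arithmetic, not a quote from
the papers): a cache holding a B x B block, i.e. B^2 numbers, can
support on the order of B^3 multiply-adds before fresh data is needed,
against a fetch cost proportional to B^2, so the ratio of processing
time to fetch time grows like B. That is the effect the blocked
algorithms are engineered around.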

But I posited an instantaneous processor, in which case ALL the time is
memory time. Since N^2 items must be moved _somewhere_ N/3 times, it
seems to me the memory access time dominates for sufficiently large
sets of data.

I have no doubt that cleverness pays. But I fail to see how the
cleverest use of (finite) cache memory vitiates my original remark.

--
Julian V. Noble



Mon, 15 Feb 1999 03:00:00 GMT  
 
