Faster than assembler (Was: Hang on, isn't Forth out of date now?)
Author 
Message 
zn.. #1 / 8

Faster than assembler (Was: Hang on, isn't Forth out of date now?)
: We already have a memory-access-time limitation on execution speed.
: Look at any program that does number-crunching on a large data set:
: with fast coprocessors, the arithmetic execution time (except for
: things like transcendental functions) is swamped by the fetch/store
: time. Caching doesn't help much because caches are finite in size.
: Thus the time eventually asymptotes to the DRAM access time. There
: was an article in DDJ some time ago timing FORTRAN compilers running
: on Pentium machines. There was a substantial hit in Mflops going
: from small- to large-matrix problems, using otherwise-identical
: software. The "problem" was eventually traced to exceeding the size
: of pipelines and/or cache memory.

OK, this ain't "comp.lang.fortran", but ... :)

It has been possible to write FORTRAN compilers for vector, parallel and pipelined architectures, including cached machines such as the Pentium, Pentium Pro and Intel i860, for about 20 years now! I used to work for a company, Floating Point Systems, that pioneered one such compiler. The people who built this compiler now call themselves The Portland Group and have extended their expertise to other languages and other architectures. These compilers optimize for the architecture, including pipelines and caches, and are capable of gaining as much of an advanced architecture's performance as a conventional compiler does for a conventional architecture.

It is also possible to rewrite one's FORTRAN code so that memory hierarchies are exploited rather than a bottleneck. Large matrix operations are particularly easy to deal with, especially when the matrices are dense. Check into old issues (1980s) of the SIAM Journal on Scientific and Statistical Computing for the works of Jack Dongarra, Danny Sorensen, Charles Van Loan and Christian Bischof. The trick is to keep numbers in the cache and reuse them over and over and over again!
In summary, there is no excuse for poor performance on an advanced architecture for dense matrix problems. You can't blame it on the architecture, you can't blame it on lousy compilers because there are good ones, and you can't blame it on the language because the techniques are there whether you program in assembler, FORTRAN, "C" or Forth. I think I'll crosspost this to "comp.lang.fortran" where my old FPS buddies hang out :). 
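The "keep numbers in the cache and reuse them" trick the post describes is usually called blocking (or tiling). Here is a minimal illustrative sketch in Python rather than FORTRAN or Forth; `blocked_matmul` and the tile size `B` are my own hypothetical names, and `B` stands in for whatever tile size matches a real machine's cache:

```python
def blocked_matmul(a, b, n, B=32):
    """Multiply two n x n matrices (lists of lists) in B x B tiles.

    Each pair of tiles is small enough to stay resident in cache, so
    every loaded number is reused across an entire inner loop nest
    instead of being fetched from slow memory once per use.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for kk in range(0, n, B):
            for jj in range(0, n, B):
                # Work on one tile-product: the touched pieces of
                # a, b and c all fit in cache simultaneously.
                for i in range(ii, min(ii + B, n)):
                    for k in range(kk, min(kk + B, n)):
                        aik = a[i][k]  # reused for the whole j loop
                        for j in range(jj, min(jj + B, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

The loop order is the only change from the textbook triple loop; the arithmetic performed is identical, which is why a compiler (or a careful programmer in any language) can apply it without altering results.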
Yesterday, I walked by a BMW. I saw scratches on the bumper, and I stooped down for a closer look. It was engraving: "My Other Car is a '73 Pinto!"

Tue, 09 Feb 1999 03:00:00 GMT 


Mike Coughl #2 / 8

Faster than assembler (Was: Hang on, isn't Forth out of date now?)
:In summary, there is no excuse for poor performance on an advanced
:architecture for dense matrix problems. You can't blame it on the
:architecture, you can't blame it on lousy compilers because there are
:good ones, and you can't blame it on the language because the techniques
:are there whether you program in assembler, FORTRAN, "C" or Forth.

OK, how do I write good code for dense matrix problems in Forth for parallel or other advanced architectures?

Thu, 11 Feb 1999 03:00:00 GMT 


Julian V. Nob #3 / 8

Faster than assembler (Was: Hang on, isn't Forth out of date now?)
[ deleted ]

Quote:
> OK, this ain't "comp.lang.fortran", but ... :)
> It has been possible to write FORTRAN compilers for vector, parallel and
> pipelined architectures, including cached machines such as the Pentium,
> Pentium Pro and Intel i860, for about 20 years now! I used to work for a
> company, Floating Point Systems, that pioneered one such compiler. The
> people who built this compiler now call themselves The Portland Group
> and have extended their expertise to other languages and other
> architectures. These compilers optimize for the architecture, including
> pipelines and caches, and are capable of gaining as much of an advanced
> architecture's performance as a conventional compiler does for a
> conventional architecture.
And so?

Quote:
> It is also possible to rewrite one's FORTRAN code so that memory
> hierarchies are exploited rather than a bottleneck. Large matrix
> operations are particularly easy to deal with, especially when the
> matrices are dense. Check into old issues (1980s) of the SIAM Journal
> on Scientific and Statistical Computing for the works of Jack Dongarra,
> Danny Sorensen, Charles Van Loan and Christian Bischof. The trick is to
> keep numbers in the cache and reuse them over and over and over again!
[ more deleted ]

I would be interested to know how, when implementing something like the Gauss-Jordan method for solving linear equations, or the LU decomposition with partial pivoting, one can _predict_ where the appropriate pivot row will be, on a dense matrix too large to fit into a given memory segment. I do not believe it can be done efficiently on a sequential machine. Or, to put it another way, there are many cases where you simply _cannot_ keep the same numbers in cache and "...reuse them over and over and over again!"

I am sure the gentlemen referred to above are all dynamite programmers, but these are not new problems, and we had to know all about this stuff back in the 1960s, when CPUs were slow, core memory was small and bulk memory was sequential (i.e. tape). Algorithms for (parallel) array processors are, I admit, a different kettle of fish, and somewhat off the subject I was discussing.

If you can provide a precise reference to a method for getting around what seems to me a quite fundamental limitation -- the time to process a _large_ amount of data is limited by the access speed of the bulk memory store (assuming that is the slowest part of the system) -- I will be glad to read it and comment on it. On issues like this I am from Missouri, since I believe you are speaking of a violation of information theory, or a way to implement time travel.

-- Julian V. Noble

Thu, 11 Feb 1999 03:00:00 GMT 


James Gile #4 / 8

Faster than assembler (Was: Hang on, isn't Forth out of date now?)
Quote:
> [...]
> Irrelevant. You don't a priori know where the pivot row is. Once you find it,
> sure, you can keep it cached. But you have to find it; that's where the delay
> comes from.
Well, you look for and re-cache the pivot N times. The search for the correct pivot takes O(N^2) comparisons. The actual reduction of the rows takes O(N^3) multiply-add operations. During this last, and most significant, group of operations, the pivot data is cached. Even the non-cached non-pivot rows benefit if the cache used does read-ahead.

The O(N^3) operations dominate unless the cache miss penalty is *enormous* and the arrays are small. For large arrays, the O(N^3) still dominates. Of course, if the rows are large enough that they don't fit into the cache, you have to do some loop unrolling to get the best speed. But still, the time needed to select the pivot is not significant.

J. Giles
Ricercar Software
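The operation counts in this argument are easy to verify by instrumenting the elimination itself. A rough Python sketch (my own illustrative code, not anything from the thread; the function and counter names are invented) that performs LU with partial pivoting and tallies both kinds of work:

```python
def lu_partial_pivot_counts(a):
    """In-place LU factorization with partial pivoting on an n x n
    matrix given as lists of lists.

    Returns (comparisons, multiply_adds): the pivot searches cost
    N(N-1)/2 = O(N^2) comparisons in total, while the row reduction
    costs sum of m^2 for m < N, i.e. O(N^3)/3 multiply-adds.
    """
    n = len(a)
    comparisons = 0
    multiply_adds = 0
    for k in range(n):
        # Pivot search: scan column k below the diagonal for the
        # largest magnitude entry (n-k-1 comparisons).
        p = k
        for i in range(k + 1, n):
            comparisons += 1
            if abs(a[i][k]) > abs(a[p][k]):
                p = i
        a[k], a[p] = a[p], a[k]  # swap pivot row into place
        # Row reduction: (n-k-1)^2 multiply-add operations, all of
        # which reuse the (now cached) pivot row a[k].
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            a[i][k] = m  # store the multiplier in the L part
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
                multiply_adds += 1
    return comparisons, multiply_adds
```

For N = 100 this gives 4,950 comparisons against roughly 328,000 multiply-adds, which is the ratio behind the "pivot selection is not significant" claim.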

Thu, 11 Feb 1999 03:00:00 GMT 


++ rob #5 / 8

Faster than assembler (Was: Hang on, isn't Forth out of date now?)
>I would be interested to know how, when implementing something like
>the Gauss-Jordan method for solving linear equations, or the LU
>decomposition with partial pivoting, one can _predict_ where the
>appropriate pivot row will be, on a dense matrix too large to fit into
>a given memory segment. I do not believe it can be done efficiently
>on a sequential machine. Or, to put it another way, there are many
>cases where you simply _cannot_ keep the same numbers in cache and
>"...reuse them over and over and over again!"

Oh, I don't know . . . In Gauss-Jordan elimination, a repeated operation is that a multiple of each equation is subtracted in turn from the pivot equation. In other words, the coefficients of each pivot equation are reused over and over again.

Fri, 12 Feb 1999 03:00:00 GMT 


Julian V. Nob #6 / 8

Faster than assembler (Was: Hang on, isn't Forth out of date now?)
Quote:
> >I would be interested to know how, when implementing something like
> >the Gauss-Jordan method for solving linear equations, or the LU
> >decomposition with partial pivoting, one can _predict_ where the
> >appropriate pivot row will be, on a dense matrix too large to fit into
> >a given memory segment. I do not believe it can be done efficiently
> >on a sequential machine. Or, to put it another way, there are many
> >cases where you simply _cannot_ keep the same numbers in cache and
> >"...reuse them over and over and over again!"
> Oh, I don't know . . . In Gauss-Jordan elimination,
> a repeated operation is that a multiple of each equation
> is subtracted in turn from the pivot equation.
> In other words, the coefficients of each pivot equation
> are reused over and over again.
Irrelevant. You don't a priori know where the pivot row is. Once you find it, sure, you can keep it cached. But you have to find it; that's where the delay comes from.

-- Julian V. Noble

Fri, 12 Feb 1999 03:00:00 GMT 


Ed Boras #7 / 8

Faster than assembler (Was: Hang on, isn't Forth out of date now?)
Quote:
>> [...]
>> Irrelevant. You don't a priori know where the pivot row is. Once you find it,
>> sure, you can keep it cached. But you have to find it; that's where the delay
>> comes from.

>Well, you look for and re-cache the pivot N times. The search for
>the correct pivot takes O(N^2) comparisons. The actual reduction
>of the rows takes O(N^3) multiply-add operations. During this last,
>and most significant, group of operations, the pivot data is cached.
>Even the non-cached non-pivot rows benefit if the cache used does
>read-ahead.
>The O(N^3) operations dominate unless the cache miss penalty is
>*enormous* and the arrays are small. For large arrays, the O(N^3)
>still dominates. Of course, if the rows are large enough that they
>don't fit into the cache, you have to do some loop unrolling to get
>the best speed. But still, the time needed to select the pivot is
>not significant.
>J. Giles
>Ricercar Software
Yup! By the way, most of this wisdom for dense linear algebra is embodied (in FORTRAN) in Dongarra et al.'s LAPACK and its components, the level-2 and level-3 BLAS. If you're FORTRAN-literate or just curious, send email to
and in the body of the message put

send index

A daemon will send you a long reply with further instructions on how to get software from the Netlib archives. Look for LAPACK, BLAS2 and BLAS3.

If someone would supply me with a Pentium, an ANS Forth with hardware floating point for it, and a few months of paid time off from my current job, I'd port this stuff for the Forth Scientific Library. Then again, maybe Julian Noble and his students would take on the challenge, since he doesn't think it can be done :).

Sat, 13 Feb 1999 03:00:00 GMT 


Julian V. Nob #8 / 8

Faster than assembler (Was: Hang on, isn't Forth out of date now?)
Quote:
> >> [...]
> >> Irrelevant. You don't a priori know where the pivot row is. Once you find it,
> >> sure, you can keep it cached. But you have to find it; that's where the delay
> >> comes from.
> >Well, you look for and re-cache the pivot N times. The search for
> >the correct pivot takes O(N^2) comparisons. The actual reduction
> >of the rows takes O(N^3) multiply-add operations. During this last,
> >and most significant, group of operations, the pivot data is cached.
> >Even the non-cached non-pivot rows benefit if the cache used does
> >read-ahead.
> >The O(N^3) operations dominate unless the cache miss penalty is
> >*enormous* and the arrays are small. For large arrays, the O(N^3)
> >still dominates. Of course, if the rows are large enough that they
> >don't fit into the cache, you have to do some loop unrolling to get
> >the best speed. But still, the time needed to select the pivot is
> >not significant.
> >J. Giles
> >Ricercar Software
> Yup! By the way, most of this wisdom for dense linear algebra is
> embodied (in FORTRAN) in Dongarra et al.'s LAPACK and its components,
> the level-2 and level-3 BLAS. If you're FORTRAN-literate or just
> curious, send email to
> and in the body of the message put
> send index
> A daemon will send you a long reply with further instructions on how to
> get software from the Netlib archives. Look for LAPACK, BLAS2 and
> BLAS3. If someone would supply me with a Pentium, an ANS Forth with
> hardware floating point for it, and a few months of paid time off from my
> current job, I'd port this stuff for the Forth Scientific Library. Then
> again, maybe Julian Noble and his students would take on the challenge,
> since he doesn't think it can be done :).
I'll look at it, despite not having seen references to the actual SIAM articles, and see if I can comment on what they are doing and what the relevance might be to my original posting. It is always possible that I have/am overlook(ed/ing) the obvious. But I think it more likely we are speaking of different things.

I picture the situation like a highway with cars moving at some average speed (possibly the info superhighway?). At some point they must pass a toll collection point. If there are many collectors (fan-out), then the parallelism might allow cars to pass through faster than they arrive, in which case the queues are mostly empty and the bottleneck is the highway speed. Conversely, if there is but one toll collector (fan-in), then the toll station becomes the bottleneck.

In matrix triangularization, every element must be fetched and stored an average of N/3 times. (This is why it takes N^3/3 time.) There are also N/3 multiplications and subtractions per element. If there is a cache, then data must be fetched/stored to/from it and slower memory. The claim is that by cleverly reordering the calculation and by exploiting parallelism, the moves to/from slow memory can be overlapped with the processing of the data in the cache, so that the time needed for cache <-> memory moves becomes irrelevant.

I claim that this can only be true if the processing time for one cache-full exceeds the time to fetch/store a cache-full. But I posited an instantaneous processor, in which case ALL the time is memory time. Since N^2 items must be moved _somewhere_ N/3 times, it seems to me the memory access time dominates for sufficiently large sets of data. I have no doubt that cleverness pays. But I fail to see how the cleverest use of (finite) cache memory vitiates my original remark.

-- Julian V. Noble
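The two positions in this exchange can be put side by side with a back-of-the-envelope traffic count. This is a toy model I am introducing for illustration, not anything from the SIAM papers or LAPACK, using dense matrix multiply as the stand-in problem; all names are invented:

```python
import math

def naive_traffic(n):
    """Words crossing the slow-memory boundary for an n x n matrix
    multiply if every operand is fetched on each use (no cache reuse):
    each of the n^2 dot products streams a length-n row and column."""
    return 2 * n ** 3 + n ** 2  # operand loads plus result stores

def blocked_traffic(n, cache_words):
    """Same job computed tile-by-tile, with three b x b tiles resident
    in a cache holding cache_words numbers."""
    b = max(1, int(math.sqrt(cache_words / 3)))  # tile edge: 3 tiles fit
    t = math.ceil(n / b)                          # tiles per matrix edge
    # Each of the t^3 tile-products loads two b x b tiles from slow
    # memory; the n^2 results are stored once.
    return 2 * (t ** 3) * b * b + n ** 2
```

The model supports both sides of the argument: blocking cuts slow-memory traffic by a factor of roughly b, i.e. proportional to the square root of the cache size, so cache misses need not dominate for realistic caches; yet the traffic never falls below a multiple of n^2, so with a hypothetical instantaneous processor and a fixed finite cache, memory time still grows without bound, as the post argues.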

Mon, 15 Feb 1999 03:00:00 GMT 


