ANNOUNCE: Optimized Math library for C67x 
 ANNOUNCE: Optimized Math library for C67x

Dear colleagues,

I would like to inform you that an Optimized Math library for the
TMS320C67x DSP is immediately available. For further information,
please contact me directly.

Brief description of the library:

Optimized Math Library for TMS320C67x is a collection of functions
and macros for computing elementary algebraic and transcendental
functions. The library consists of the following sections:

   Algebraic and Utility functions
   Trigonometric functions
   Inverse trigonometric functions
   Hyperbolic trigonometric functions
   Inverse hyperbolic trigonometric functions

The library functions have been hand-coded and assembly-optimized
to enhance throughput; all of the functions give the maximum accuracy
possible in single-precision arithmetic (32-bit IEEE-754 floating-
point numbers). Users benefit from the improved performance.

Most of the functions compute a result that is accurate in all
decimal places; that is, the relative error is less than 1 ULP.

Thanks for your attention and my apologies for any inconvenience
this post may incur.

--
Andrew

-------------------------------------------------------
Andrew V. Nesterov,    


phone: +7-812-247-9356
fax:   +7-812-247-1017

Holography and Optoelectronics Laboratory
Ioffe Physical Technical Institute,
26 Politechnicheskaya Street,
194021 St.Petersburg, RUSSIA
-------------------------------------------------------



Sun, 30 Mar 2003 03:00:00 GMT  
 ANNOUNCE: Optimized Math library for C67x


...

Quote:
> Thanks for your attention and my apologies for any inconvenience
> this post may incur.

Yeah, your post is very likely to cause some major inconveniences for
TI's application programmers ;-)!

----

I have read (http://www.bdti.com/articles/evolution.pdf) that VLIW
processors are easy compiler targets because of their multiple
computation units. I don't understand that - optimizing for
multiple-issue instruction words seems to me a complicated process,
and an optimal solution is likely to be found (if at all) only
through an iterative procedure.

There is a way to specify an optimal coding for any closed algorithm
(this does not take into account whether the algorithm itself is optimal):

1. Count the number of (simple) instructions (A = number of adds, M =
number of multiplies, B = number of shifts). Divide (A, M, B) by the
number of respective computation units available in the processor (I
think the C67x has 4 multipliers, 2 adders and 2 shifters?); name the
unit counts Na, Nm, Nb.

Let N1=max(A/Na, M/Nm, B/Nb).

2. Count the number of memory accesses (reads and writes) =Ma, and the
number of memory busses on the chip =Ns.

Let N2=Ma/Ns.

(3. One could also account for branching and looping, but that's not
interesting for this comparison.)

Now take N_opt=max(N1,N2); this is a lower bound on the number of
instruction words it takes to encode the algorithm on that processor.
Often it is not possible to code an algorithm in this number of
instructions, but it can be used as a measure.
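As a sketch, the bound can be computed mechanically. The divisions are
rounded up here, since a fraction of an instruction word still costs a
whole one; the unit counts passed in are whatever one believes about
the target (the figures in the tests are made up for illustration, not
actual C67x numbers):

```c
/* Lower bound on VLIW instruction words for a straight-line kernel,
 * per the scheme described above. */
static int ceil_div(int x, int y) { return (x + y - 1) / y; }

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* A, M, B: counts of adds, multiplies, shifts; Ma: memory accesses.
 * Na, Nm, Nb: adders, multipliers, shifters; Ns: memory busses. */
int n_opt(int A, int M, int B, int Ma,
          int Na, int Nm, int Nb, int Ns)
{
    int n1 = max3(ceil_div(A, Na), ceil_div(M, Nm), ceil_div(B, Nb));
    int n2 = ceil_div(Ma, Ns);
    return n1 > n2 ? n1 : n2;
}
```

For instance, a kernel with 16 adds, 16 multiplies and 32 memory
accesses on a machine with 2 of each unit and 2 busses is
memory-bound: the bound is 16 words.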

How do you think the VLIW C compiler fares compared to the older
C compilers, when looking at the ratio N_opt / N_real?

Regards,
Andor

Sent via Deja.com http://www.deja.com/
Before you buy.



Mon, 31 Mar 2003 03:00:00 GMT  
 ANNOUNCE: Optimized Math library for C67x

Quote:

> I have read (http://www.bdti.com/articles/evolution.pdf) that VLIW
> processors are easy compiler targets because of their multiple
> computation units. I don't understand that - optimizing for multiple
> issue instruction words seems to me a complicated process, and an
> optimal solution is very likely only to be found (if at all) in an
> iterative procedure.

"Easier" compared to what?

The article compares VLIW DSP processors with relatively large and uniform
register sets with more primitive DSP processors with small and irregular
register sets - VLIW is easier, but not because it is VLIW, quite the
opposite.

It also compares VLIW DSP processors with superscalar processors. With
superscalar processors, exact execution times are notoriously hard to
predict, and it is quite hard to find the fastest possible code if you
cannot even calculate HOW fast some code is. So here a simple VLIW
processor makes it easier to produce optimal code. However, it will often
be easier to produce code that is quite close to optimal with a
superscalar processor.



Mon, 31 Mar 2003 03:00:00 GMT  
 ANNOUNCE: Optimized Math library for C67x


...

Quote:
> The article compares VLIW DSP processors with relatively large and
> uniform register sets with more primitive DSP processors with small
> and irregular register sets - VLIW is easier, but not because it is
> VLIW, quite the opposite.

It seems you agree with me that optimized (with respect to processor
resource utilisation) compilation for a VLIW is problematic. For
example, the 2106x SHARC DSP from ADI has a parallel multiplier and
ALU, but the C compiler rarely generates instructions using parallel
load-store, let alone both computation units. It looks like
multi-issue instructions are already a problem for this (relatively)
simple processor.

So I am wondering exactly how well the VLIW C-compiler has been
implemented.

Quote:
> It also compares VLIW DSP processors with superscalar processors. With
> superscalar processors, exact execution times are notoriously hard to
> predict, and it is quite hard to find the fastest possible code if you
> cannot even calculate HOW fast some code is. So here a simple VLIW
> processor makes it easier to produce optimal code.

Does it really? The only difference between the VLIW and the
superscalar processor is that the latter tries to optimize the code on
the fly, but if they have the same number of computation units and
equal memory access bandwidth, the optimal code will be the same for
both architectures. The superscalar processor will try to optimize, but
end up executing the code sequentially because it already is optimal.

Quote:
> However, it will often
> be easier to produce code that is quite close to optimal with a
> superscalar processor

That's because virtually all code is close to optimal for a superscalar
processor. I guess they are every compiler maker's dream...

Regards,
Andor

Sent via Deja.com http://www.deja.com/
Before you buy.



Mon, 31 Mar 2003 03:00:00 GMT  
 ANNOUNCE: Optimized Math library for C67x

Quote:

>... For example the 2106x SHARC DSP from ADI have a parallel
>multiplier and ALU, but the C-compiler rarely generates instructions
>using parallel load-store, let alone both computation units. It looks
>like multi-issue instructions are already a problem for this
>(relatively) simple processor.

To keep this relevant to comp.lang.c (one of the three groups to
which this is posted): one big problem is that C is often ill-matched
to multi-issue CPUs.  This is not specific to VLIW either; it is
just that VLIWs (a) expose the parallelism to the compiler and (b)
tend to have quite a few more functional units than your typical
"out-of-order multi-issue evolved-from-scalar-system" CPU.  (E.g.,
UltraSPARC has two IEUs, one FPU, and one load/branch handler, so
it can do "add"+"or"+"fmul"+"load" in a single cycle, but that is
just four things, and never three ALU-things.  If you are not doing
floating point, it can only do a maximum of three "things" per cycle.)

The problem comes about because C programmers, and C programs, tend
to use "possibly aliased" pointers with wild abandon.  Something
simple like:

        /* calculate vector sum: res[i] = a[i]+b[i], 0<=i<n. */
        void vector_sum(size_t n, double *res, double *a, double *b) {
                size_t i;

                for (i = 0; i < n; i++)
                        res[i] = a[i] + b[i];
        }

"ought to" "parallelize" nicely, but what if the caller does, e.g.:

        vector_sum(30, &arr[5], &arr[6], &arr[7]);

?  Now, for some values of i,j,k in 0..29, res[i], a[j], and b[k]
are all the same element -- so writing on res[i] changes a[j] and
b[k], for those i,j,k.

There are plenty of solutions, from the Fortran-esque one in C99
("restrict"-qualified pointers) to clever alias analysis and inlining
to multiple code generation.  But unless a compiler implements at
least one of those, generating fast "parallelized" (vector, VLIW,
etc.) code can be difficult.
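For concreteness, the C99 variant would look roughly like this.
"restrict" is a promise from the caller that the arrays do not
overlap, so the compiler may schedule the loads, adds and stores in
parallel; breaking the promise (as in the overlapping call above)
makes the behavior undefined:

```c
#include <stddef.h>

/* calculate vector sum: res[i] = a[i]+b[i], 0<=i<n.
 * The restrict qualifiers assert that res, a, and b never alias,
 * which is what frees the compiler to software-pipeline the loop. */
void vector_sum_r(size_t n, double * restrict res,
                  const double * restrict a,
                  const double * restrict b)
{
    for (size_t i = 0; i < n; i++)
        res[i] = a[i] + b[i];
}
```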

Quote:
>Does it really? The only difference between the VLIW and the
>superscalar processor is that the latter tries to optimize the code on
>the fly, but if they have the same number of computation units and
>equal memory access bandwidth, the optimal code will be the same for
>both architectures. ...

Transistor budgets are large these days, but transistors are not
free (and physics, e.g. wire delays, would still be a problem
even if they were).  So, in practice, VLIW and superscalar processors
*never* have the same number of computation units, much less equal
memory access bandwidth.

The short version of the argument is: the difference between VLIW
and superscalar CPUs is that VLIWs match up functional units in
the code (automatically in the compiler, or manually by an assembly
programmer), while superscalars match them up in hardware.  Obviously
it takes extra hardware to do that, so the VLIW guys can have
smaller+faster chips, or more units.  On the other hand, experience
seems to show that "hardware is soft, but software is hard." :-)
--
In-Real-Life: Chris Torek, Berkeley Software Design Inc




Mon, 31 Mar 2003 03:00:00 GMT  
 ANNOUNCE: Optimized Math library for C67x

Quote:

> The problem comes about because C programmers, and C programs,
> tend to use "possibly aliased" pointers with wild abandon.
> Something simple like:

>         /* calculate vector sum: res[i] = a[i] + b[i], 0 <= i < n. */
>         void vector_sum(size_t n, double *res, double *a, double *b) {
>                 size_t i;

>                 for (i = 0; i < n; i++)
>                         res[i] = a[i] + b[i];
>                }

> "ought to" "parallelize" nicely, but what if the caller does, e.g.:

>         vector_sum(30, &arr[5], &arr[6], &arr[7]);

> ?  Now, for some values of i, j and k in 0..29, res[i], a[j], and b[k]
> are all the same element --
> so writing on res[i] changes a[j] and b[k], for those i, j and k.

> There are plenty of solutions, from the Fortran-esque one in C99
> ("restrict"-qualified pointers) to clever alias analysis
> and inlining to multiple code generation.
> But unless a compiler implements at least one of those,
> generating fast "parallelized" (vector, VLIW, etc.) code can be difficult.

Perhaps you just picked a poor example.
Functions like your vector_sum are so common that
they should be included in a highly optimized library
implemented in assembler
if you can't get your optimizing compiler to cooperate.
A call like

    vector_sum(30, &arr[5], &arr[6], &arr[7]);

is considered to be a programming error
on the part of the application programmer
and the result is undefined.
The library developer might provide
a version of function vector_sum
that could be used to detect this error
during application development and testing.
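A hypothetical sketch of such a checking version follows. Comparing
pointers into different objects is not strictly conforming C, but on
the flat address spaces typical of DSP targets it is a serviceable
development-time trap; the names and the wrapper itself are my own
illustration, not an actual VSIPL interface:

```c
#include <assert.h>
#include <stddef.h>

/* Do [p, p+n) and [q, q+n) share any element?  Not strictly portable
 * (pointer comparison across objects), but fine on flat memory. */
static int overlaps(const double *p, const double *q, size_t n)
{
    return p < q + n && q < p + n;
}

/* Debug build of vector_sum: trap overlapping arguments during
 * development, then compute the sum as usual. */
void vector_sum_dbg(size_t n, double *res,
                    const double *a, const double *b)
{
    assert(!overlaps(res, a, n) && !overlaps(res, b, n));
    for (size_t i = 0; i < n; i++)
        res[i] = a[i] + b[i];
}
```

A release build would swap this for the unchecked, hand-optimized
routine behind the same prototype.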

Take a look at
the Vector, Signal and Image Processing Library

    http://www.vsipl.org/



Mon, 31 Mar 2003 03:00:00 GMT  
 ANNOUNCE: Optimized Math library for C67x


Quote:

> > > Functions like your vector_sum are so common that they should be
> > > included in a highly optimized library implemented in assembler
> > > if you can't get your optimizing compiler to cooperate.

> > If some particular C compiler cannot automatically optimize
> > something as simple as vector_sum(. . .),
> > what hope has it for more complex examples?
> > Shall we write them all in assembly?

> > This is why "there are plenty of solutions"
> > (including C99's "restrict" qualifier).
> > People want, for some reason, :-)
> > to write the code in C, yet still get decent optimization.
> > If that means that they have to use "restrict"
> > or set special non-conformant modes,
> > perhaps that is what they will do -- but until they do that,
> > standard C89 programs are likely to compile to relatively poor code
> > on vector and multiple-issue machines.

> That will be true as long as application programmers
> insist upon cooking up their own numerical recipes from scratch.

What if they just have some code on a vector machine that they have
written?  Are they allowed to continue writing code for this
environment, or is it all over with now?

Quote:
> But you don't really need a very good optimizing compiler
> to build high performance application programs by calling
> functions from an optimized high performance library.

Chips change.  The optimizing library of 5 years ago is old hat.  What now?

Quote:
> The cost of developing and maintaining an optimized library
> can be amortized over several application programs
> so it really doesn't matter whether you must implement it
> in assembler or not.

From what perspective?  It certainly matters from a portability standpoint.
It certainly matters from a maintainability standpoint.

Quote:
> But the fact is that the latest C and C++ compilers
> for high end DSP chips are actually pretty good.

What about aliasing problems?  Are they "pretty good" at that even without the
restrict qualifier?

Quote:
> They can be used to implement high performance libraries
> requiring only a small amount of hand coded assembler
> to work around deficiencies in the optimizing compiler.

Hand coded assembly is always an option, but you did not address the questions
Mr. Torek raised about C89 compiler limitations on vector machines.  Or is the
answer simply, "Don't bother -- just code it in assembly"?
--
C-FAQ: http://www.eskimo.com/~scs/C-faq/top.html
 "The C-FAQ Book" ISBN 0-201-84519-9
C.A.P. FAQ: ftp://38.168.214.175/pub/Chess%20Analysis%20Project%20FAQ.htm


Mon, 31 Mar 2003 03:00:00 GMT  
 ANNOUNCE: Optimized Math library for C67x
Quote:

>> The problem comes about because C programmers, and C programs,
>> tend to use "possibly aliased" pointers with wild abandon.
>> Something simple like:

>>         /* calculate vector sum: res[i] = a[i] + b[i], 0 <= i < n. */
>>         void vector_sum(size_t n, double *res, double *a, double *b) {
[snip]
>> "ought to" "parallelize" nicely, but [straightforward interpretation
>> of C pointer rules forbids this].

Quote:
>> There are plenty of solutions ...



Quote:
>Perhaps you just picked a poor example.

No, I picked a *simple* example.  (A realistic example would have
all kinds of irrelevant stuff in it.)

Quote:
>Functions like your vector_sum are so common that they should be
>included in a highly optimized library implemented in assembler
>if you can't get your optimizing compiler to cooperate.

Indeed.  But if some particular C compiler cannot automatically
optimize something as simple as vector_sum(), what hope has it for
more complex examples?  Shall we write them all in assembly?

This is why "there are plenty of solutions" (including C99's
"restrict" qualifier).  People want, for some reason, :-) to write
the code in C, yet still get decent optimization.  If that means
they have to use "restrict", or set special non-conformant modes,
perhaps that is what they will do -- but until they do that, standard
C89 programs are likely to compile to relatively poor code on vector
and multiple-issue machines.
--
In-Real-Life: Chris Torek, Berkeley Software Design Inc




Tue, 01 Apr 2003 10:57:06 GMT  
 ANNOUNCE: Optimized Math library for C67x

Quote:

> > Functions like your vector_sum are so common that they should be
> > included in a highly optimized library implemented in assembler
> > if you can't get your optimizing compiler to cooperate.

> If some particular C compiler cannot automatically optimize
> something as simple as vector_sum(. . .),
> what hope has it for more complex examples?
> Shall we write them all in assembly?

> This is why "there are plenty of solutions"
> (including C99's "restrict" qualifier).
> People want, for some reason, :-)
> to write the code in C, yet still get decent optimization.
> If that means that they have to use "restrict"
> or set special non-conformant modes,
> perhaps that is what they will do -- but until they do that,
> standard C89 programs are likely to compile to relatively poor code
> on vector and multiple-issue machines.

That will be true as long as application programmers
insist upon cooking up their own numerical recipes from scratch.

But you don't really need a very good optimizing compiler
to build high performance application programs by calling
functions from an optimized high performance library.
The cost of developing and maintaining an optimized library
can be amortized over several application programs,
so it really doesn't matter whether you must implement it
in assembler or not.
But the fact is that the latest C and C++ compilers
for high end DSP chips are actually pretty good.
They can be used to implement high performance libraries
requiring only a small amount of hand coded assembler
to work around deficiencies in the optimizing compiler.



Tue, 01 Apr 2003 10:58:27 GMT  
 ANNOUNCE: Optimized Math library for C67x

Quote:


> > That will be true as long as application programmers
> > insist upon cooking up their own numerical recipes from scratch.

> What if they just have some code
> on a vector machine that they have written?
> Are they allowed to continue writing code for this environment
> or is it all over with now?

> > But you don't really need a very good optimizing compiler
> > to build high performance application programs by calling
> functions from an optimized high performance library.

> Chips change.
> The optimizing library of 5 years ago is old hat.
> What now?

> > The cost of developing and maintaining an optimized library
> can be amortized over several application programs
> > so it really doesn't matter whether you must implement it
> > in assembler or not.

> From what perspective?
> It certainly matters from a portability standpoint.
> It certainly matters from a maintainability standpoint.

> > But the fact is that the latest C and C++ compilers
> > for high end DSP chips are actually pretty good.

> What about aliasing problems?
> Are they "pretty good" at that even without the restrict qualifier?

> They can be used to implement high performance libraries
> > requiring only a small amount of hand coded assembler
> > to work around deficiencies in the optimizing compiler.

> Hand coded assembly is always an option
> but you did not address the questions Mr. Torek raised
> about C89 compiler limitations on vector machines.
> Or is the answer simply,
> "Don't bother -- just code it in assembly."

What I'm saying is that DSP application programmers
don't need to write loops over subscripted variables.
They can call optimized DSP library routines to do that.
DSP library developers can write optimized loops
over subscripted variables in assembler
if they can't get their optimizing compiler to cooperate.
In fact, many DSP library developers prefer to write
their optimized code in assembler even when it is possible
to get an optimizing C compiler to do it for them.

If DSP application programmers use a standard API,
their application source code will port to any platform
that hosts a high performance library which implements
the standard API.  The library itself need not be portable.

My remarks simply reflect conventional wisdom
which asserts that it is easier and cheaper
to develop and maintain a high performance DSP library
than it is to develop and maintain reliable DSP applications
with platform dependent optimizations embedded
in the application source code.

But, please, don't take my word for all of this.
Visit the Vector, Signal and Image Processing Library web site

    http://www.vsipl.org/

A lot of really smart people in both academia and industry
have been working on this for over four years now.



Tue, 01 Apr 2003 12:37:22 GMT  
 ANNOUNCE: Optimized Math library for C67x

Quote:

> Dear colleagues,

> I would like to inform you that an Optimized Math library for
> TNS320C67x DSP is immediately available. For further information


Hi Andrew,

Could you briefly describe the licensing terms of your library?
Specifically, is it free/open (and if so, under which license), or is it
a commercial product?

legal-ly y'rs,

=g2
--
_____________________________________________________________________


Publisher of dspGuru                           http://www.dspguru.com
Iowegian International Corporation            http://www.iowegian.com



Tue, 01 Apr 2003 03:00:00 GMT  
 ANNOUNCE: Optimized Math library for C67x

Quote:

> Hi Andrew,

> Could you briefly describe the licensing terms of your library?
> Specifically, is it free/open (and if so, under which license), or is it
> a commercial product?

Hi Grant,

It is a commercial product. As such, I have apologised for taking
a bit of bandwidth from this (and a couple of other) groups. In case
you need more info, please contact me directly.

Best regards,
--
Andrew

Quote:
> legal-ly y'rs,

> =g2
> --
> _____________________________________________________________________


> Publisher of dspGuru                           http://www.dspguru.com
> Iowegian International Corporation       http://www.iowegian.com



Tue, 01 Apr 2003 03:00:00 GMT  
 