is there any burning need for this proposed builtin "MP"? 

Have I stumbled into a suggestion for a useful [g]awk built-in, MP the
Matching Part of the current regexp:

awk '/[zb]|ee/{a[MP]++};END{for(i in a)print i, a[i]}' <<!
zzzz
bbbb
bbbb
eeeee
!
z 1
b 2
ee 1

[hmmm, zbbb bzzz eezb probably just would add to z's count... not my
main point here, though]
--
http://www.*-*-*.com/ +886-4-25854780 e-mail:restore .com.



Wed, 31 Dec 2003 09:53:15 GMT  

Quote:

> Have I stumbled into a suggestion for a useful [g]awk built-in, MP the
> Matching Part of the current regexp:

> awk '/[zb]|ee/{a[MP]++};END{for(i in a)print i, a[i]}' <<!
> zzzz
> bbbb
> bbbb
> eeeee
> !
> z 1
> b 2
> ee 1

MP would be akin to Perl's built-in match variable

    $& ($MATCH)             the string matched by the last successful
                            pattern match.

Then we would want awk to also have built-in variables corresponding
to Perl's

    $` ($PREMATCH)          the string preceding what was matched by the
                            last successful pattern match;
    $' ($POSTMATCH)         the string following what was matched by the
                            last successful pattern match;
    $+ ($LAST_PAREN_MATCH)  the string matched by the last parenthesized
                            subexpression in the last successful pattern
                            match;

and, of course, most importantly (by dint of their generality)

    $1...$9...              the strings matched by the corresponding
                            parenthesized subexpressions in the last
                            successful pattern match.

In general, awk lacks regular expression pattern match memory. This
is its greatest weakness as a text processing language. The sub()
and gsub() functions have the & (ampersand) token to restore the
whole substring matched by the regular expression pattern to the
replacement string, but I've hardly ever found this feature useful.
Some implementations of awk do support more generalized match memory
in their substitution functions (gensub() in GNU awk, sub() and
gsub() in MKS awk), but these are incompatible extensions to the
base language, and the match memory is limited to substitution
operations. The very latest version of GNU awk (version 3.1.0) has
an extended match() function that takes an optional third argument,
an array, to hold the text matched by parenthesized subexpressions.
While this is certainly a laudable--and long overdue--enhancement,
it's nonetheless an implementation-specific extension to a language
whose recent development has splintered into multiple incompatible
implementations.

Your suggestion is a good one. In fact, a guy named Larry added
this and several other nice features to awk back in the late 80s,
at which point he gave the enhanced language a new name.

Personally, I'm opposed to any major changes to the awk programming
language. It fits its niche perfectly and can only be made worse
by trying to retrofit it with snazzy parts that will, at best, only
ever turn it into a clone of another already mature programming
language. And then there won't be good ol' simple, small, ubiquitous,
quickly-learned, easy-to-use awk anymore. :-(

--
Jim Monty

Tempe, Arizona USA



Wed, 31 Dec 2003 14:35:24 GMT  
Gawk 3.1.0 does in fact introduce this capability.

Check the manpage and look for match().  Awka 0.7.5 has it as well.

cheers,
Andrew



Wed, 31 Dec 2003 15:27:59 GMT  


Quote:

>Personally, I'm opposed to any major changes to the awk programming
>language. It fits its niche perfectly and can only be made worse
>by trying to retrofit it with snazzy parts that will, at best, only
>ever turn it into a clone of another already mature programming
>language. And then there won't be good ol' simple, small, ubiquitous,
>quickly-learned, easy-to-use awk anymore. :-(

>--
>Jim Monty

>Tempe, Arizona USA

        Linux, and Unix in general, is a journey, not a destination.
        If that were not the case, Larry could not have evolved his
        write-only language.

        Awk's gentle and considered evolution is a joy to behold. The
        recent extensions ease our labour, and shorten our scripts,
        while retaining a readable syntax.

        The vim|vi|nvi|vile|elvis|... community thrives in an environment
        of general compatibility, and tolerance. (If you tolerate no
        deviation, then maybe emacs is also for you. ;)

        For portable scripts we can gawk --traditional, or gawk --posix.

        The splitting done by match(s,r[,a]) is elegant and highly useful.
        I for one thank the team for revealing to the world this phase of
        awk's maturation. And for granting us choice at several levels.

Erik Christiansen
--
 ----
 If no-one sent .ppt files in emails, there'd be no need to touch MSW at all!



Fri, 02 Jan 2004 16:19:55 GMT  


% Have I stumbled into a suggestion for a useful [g]awk built-in, MP the
% Matching Part of the current regexp:

This doesn't provide any additional functionality. For instance, you can
use match, at a cost of a little typing:

 awk 'match($0, /[zb]|ee/) {a[substr($0, RSTART, RLENGTH)]++}
      END{for(i in a)print i, a[i]}' <<!
 zzzz
 bbbb
 bbbb
 eeeee
 !
 z 1
 ee 1
 b 2
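
The aside in the original post about lines such as zbbb, bzzz and eezb
can be checked the same way. This is a minimal sketch using only POSIX
awk features; the count_matches name and the pipe through sort (added
because "for (i in a)" order is unspecified) are illustrative choices.

```shell
# POSIX awk: match() sets RSTART and RLENGTH to the leftmost match.
count_matches() {
    awk 'match($0, /[zb]|ee/) { a[substr($0, RSTART, RLENGTH)]++ }
         END { for (i in a) print i, a[i] }' | sort
}

printf '%s\n' zbbb bzzz eezb | count_matches
```

This prints "b 1", "ee 1" and "z 1" on separate lines: each line is
counted under whatever matches first, so the three lines do not all
add to z's count.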

On the other hand, if ordinary matching using the ~ operator had to set
a variable, then it would degrade overall performance of the language
in two ways: it would introduce the cost of setting a variable every
time ~ is used, implicitly or explicitly, and it would force ~ matches
to do more work. For instance, any RE with Kleene closure can currently
exit the moment the closure is satisfied, but if MP had to be set,
the match would have to continue until the longest match was found.
Given /a[a-z]+z/, there are two matches in abzcdefghijklmnopqrstuvwxyzxy.
A simple boolean match can quit after the third character, but an MP
match must continue until all the letters have been read, and keep
track of the last location where the RE was satisfied.
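
The longest-match behaviour can be seen with match() itself, which
already has to report the extent of the match. This is a small sketch
in POSIX awk; the longest_match name is only for illustration.

```shell
# Print where the match starts, how long it is, and the matched text.
longest_match() {
    awk '{ if (match($0, /a[a-z]+z/))
               print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }'
}

printf 'abzcdefghijklmnopqrstuvwxyzxy\n' | longest_match
```

The output is "1 27 abzcdefghijklmnopqrstuvwxyz": match() reports the
27-character match ending at the last z, not the 3-character "abz"
that would be enough to answer a yes/no test.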

So, this doesn't strike me as a particularly good idea.
--

Patrick TJ McPhee
East York  Canada



Fri, 02 Jan 2004 21:53:56 GMT  


Quote:


>% Have I stumbled into a suggestion for a useful [g]awk built-in, MP the
>% Matching Part of the current regexp:

>This doesn't provide any additional functionality. For instance, you can
>use match, at a cost of a little typing:

> awk 'match($0, /[zb]|ee/) {a[substr($0, RSTART, RLENGTH)]++}
>      END{for(i in a)print i, a[i]}' <<!

As has been pointed out, GAWK now has this feature (TAWK has had it for
quite a while already), but, yes, in the old days, when the world was young
and men were men, we just parsed it out using RSTART and RLENGTH.

In both GAWK and TAWK, this feature (shoving the entire matched part into a
variable) is a natural outgrowth of the feature that allows you to get at
the sub-matches (parenthesized pieces of the reg exp).  I.e., the entire
string is (the analog of) the $0 where the sub-matches are (the analog of)
the $1, $2, etc.

BTW, GAWK gives you a single array whose contents are the actual strings
matched, whereas TAWK gives you the indices (start, length) into the
original string at which the strings can be found.  The latter is slightly
more powerful, at the cost of making you code the substr() yourself.  In
both cases (GAWK & TAWK), the indices of the array are numeric, with 0
corresponding to the entire matched string and 1, 2, ... corresponding to
the nth parenthesized piece.

HTH...



Fri, 02 Jan 2004 23:46:40 GMT  

Quote:
> This doesn't provide any additional functionality. For instance, you can
> use match, at a cost of a little typing:

>  awk 'match($0, /[zb]|ee/) {a[substr($0, RSTART, RLENGTH)]++}
>       END{for(i in a)print i, a[i]}' <<!
>  zzzz
>  bbbb
>  bbbb
>  eeeee
>  !
>  z 1
>  ee 1
>  b 2

For this example, you are correct.  But the new match() extension also
provides the sub-expressions that matched - new functionality.

Quote:
> On the other hand, if ordinary matching using the ~ operator had to set
> a variable, then it would degrade overall performance of the language
> in two ways: it would introduce the cost of setting a variable every
> time ~ is used, implicitly or explicitly, and it would force ~ matches
> to do more work. For instance, any RE with Kleene closure can currently
> exit the moment the closure is satisfied, but if MP had to be set,
> the match would have to continue until the longest match was found.
> Given /a[a-z]+z/, there are two matches in abzcdefghijklmnopqrstuvwxyzxy.
> A simple boolean match can quit after the third character, but an MP
> match must continue until all the letters have been read, and keep
> track of the last location where the RE was satisfied.

> So, this doesn't strike me as a particularly good idea.

Yes, it would be a bad idea if implemented as you describe, as it would
slow down all matches.  However, with the enhanced match() function, the
AWK implementation should be able to use fast methods of evaluating the
RE _except_ for when the third argument has been specified by the user.

cheers,
Andrew



Sat, 03 Jan 2004 12:17:56 GMT  

[I wrote]

% > This doesn't provide any additional functionality. For instance, you can
% > use match, at a cost of a little typing:

[...]

% For this example, you are correct.  But the new match() extension also
% provides the sub-expressions that matched - new functionality.

But I wasn't referring to extensions to match().  I was referring to
the suggestion to which I was replying.
--

Patrick TJ McPhee
East York  Canada



Sun, 04 Jan 2004 00:21:17 GMT  
...
Quote:
>In both GAWK and TAWK, this feature (shoving the entire matched part into a
>variable) is a natural outgrowth of the feature that allows you to get at
>the sub-matches (parenthesized pieces of the reg exp).  I.e., the entire
>string is (the analog of) the $0 where the sub-matches are (the analog of)
>the $1, $2, etc.

...

These semantics of parenthesized subexpressions are nice, but without Perl's
ability to use _either_ lazy or greedy closures (not to mention lookahead
assertions), it's a minor improvement.



Mon, 05 Jan 2004 08:31:29 GMT  
 