Author |
Message |
Barry #1 / 15
|
 How to: Create Regex which extracts N number of words before target word
Assume you have a target word, e.g. "cat", and you want to extract that word and a certain number of words before it. How is one to do this in a non-literal manner with a regular expression which will support any number of pre-words to be extracted?

Following does not work. It should replace target and previous two words, to wit: word1 word2 Z

#code
$_ = "word1 word2 word3 word4 cat";
s: \b.+\b{2}?cat :Z:xg;
print $_;
#end code
|
Fri, 17 Oct 2003 21:17:45 GMT |
|
 |
Garry Willia #2 / 15
|
 How to: Create Regex which extracts N number of words before target word
Quote:
> Assume you have a target word, e.g. "cat", and you want to extract
> that word and a certain number of words before it. How is one to do
> this in a non-literal manner with a regular expression which will
> support any number of pre-words to be extracted?
> Following does not work. It should replace target and previous two
> words, to wit: word1 word2 Z
> #code
> $_ = "word1 word2 word3 word4 cat";
> s: \b.+\b{2}?cat :Z:xg;

The quantifier ({2}) is quantifying the atom `\b'. I don't think that's what you meant. The `?' following the quantifier {2} makes no sense, since the {2} is not allowed any latitude -- it forces exactly two.

Here's one way to do what I think you want:

    $ perl -wle '$_="word1 word2 word3 word4 cat";' \
    > -e 's/\b(?:\w+ +){2}cat/Z/; print'
    word1 word2 Z
    $

Obligatory mention: see perlre.

--
Garry Williams
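To support any number of pre-words, the count inside the quantifier can come from a variable. What follows is a minimal sketch rather than part of the reply above: the sub name `replace_with_context` and its parameters are hypothetical, and it assumes words are runs of \w characters separated by spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $target and the $n words before it with $repl.
sub replace_with_context {
    my ($text, $target, $n, $repl) = @_;
    # \Q...\E quotes any regex metacharacters in the target word;
    # {$n} interpolates the word count into the quantifier.
    $text =~ s/\b(?:\w+ +){$n}\Q$target\E\b/$repl/;
    return $text;
}

print replace_with_context("word1 word2 word3 word4 cat", "cat", 2, "Z"), "\n";
```

The non-capturing group (?: ) keeps the quantifier attached to "one word plus its trailing spaces" without creating a capture that is never used.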
|
Fri, 17 Oct 2003 22:34:53 GMT |
|
 |
Barry #3 / 15
|
 How to: Create Regex which extracts N number of words before target word
Alas, this won't even compile on my machine.
|> |> Assume you have a target word, e.g. "cat", and you want to extract
|> |> that word and a certain number of words before it. How is one to do
|> |> this in a non-literal manner with a regular expression which will
|> |> support any number of pre-words to be extracted?
|> |>
|> |> Following does not work. It should replace target and previous two
|> |> words, to wit: word1 word2 Z
|> |>
|> |> #code
|> |> $_ = "word1 word2 word3 word4 cat";
|> |> s: \b.+\b{2}?cat :Z:xg;
|>
|> The quantifier ({2}) is quantifying the atom `\b'. I don't think
|> that's what you meant. The `?' following the quantifier {2} makes no
|> sense, since the {2} is not allowed any latitude -- it forces exactly
|> two.
|>
|> GW:
|> Here's one way to do what I think you want:
|>
|> $ perl -wle '$_="word1 word2 word3 word4 cat";' \
|> > -e 's/\b(?:\w+ +){2}cat/Z/; print'
|> word1 word2 Z
|> $
|>
|> Obligatory mention: see perlre.
|>
|> --
|> Garry Williams
|
Sat, 18 Oct 2003 19:36:04 GMT |
|
 |
Barry #4 / 15
|
 How to: Create Regex which extracts N number of words before target word
This works for two words:

$_ = "aaahhh dog whale cat male xxx yyy";
s|[A-Za-z0-9\.]+ [A-Za-z0-9\.]+ cat [A-Za-z0-9\.]+ [A-Za-z0-9\.]+ |X|;
print $_;

=======
Quote:
> |> Assume you have a target word, e.g. "cat", and you want to extract
> |> that word and a certain number of words before it. How is one to do
> |> this in a non-literal manner with a regular expression which will
> |> support any number of pre-words to be extracted?
|
Sat, 18 Oct 2003 20:27:22 GMT |
|
 |
Richard J. Rauenza #5 / 15
|
 How to: Create Regex which extracts N number of words before target word
<off topic>
Be aware that someone out there (tucows, to be specific) owns fake.com and has to handle (i.e., bounce) all of your spam/misdirected replies. If you want to use a fake email address, see the email munging faq.
</off topic>

Quote:
>Assume you have a target word, e.g. "cat", and you want to extract that word
>and a certain number of words before it. How is one to do this in a
>non-literal manner with a regular expression which will support any number
>of pre-words to be extracted?
>Following does not work. It should replace target and previous two words, to
>wit: word1 word2 Z
You would do yourself a great service by having perl give you warnings about your code:

    #!perl -w
    use strict;

Quote:
>#code
>$_ = "word1 word2 word3 word4 cat";
>s: \b.+\b{2}?cat :Z:xg;
>print $_;
>#end code

How about...

    #!/usr/bin/perl -w
    use strict;
    $_ = "word1 word2 word3 word4 cat";
    s/(\w+\W+){2}cat/Z/xg;
    print $_;

--
Technical Consultant | I speak for me, | 19055 Pruneridge Ave. Development Alliances Lab| *not* HP | MS 46TU2 ESPD / E-Serv. Partner Division +--------------+---- Cupertino, CA 95014
|
Sat, 18 Oct 2003 08:00:44 GMT |
|
 |
Barry #6 / 15
|
 How to: Create Regex which extracts N number of words before target word
That's quite brilliant. Most elegant.

I tried to extend this to also include 2 words after the word, but this mucks up the elegant code

===
$_ = "word1 word2 word3 word4 cat word5 word6 word7";
s/(\w+\W+){2}cat(\w+\W+){2}/Z/xg;
print $_;
===
Quote:
> How about...
> #!/usr/bin/perl -w
> use strict;
> $_ = "word1 word2 word3 word4 cat";
> s/(\w+\W+){2}cat/Z/xg;
> print $_;
|
Sat, 18 Oct 2003 21:40:59 GMT |
|
 |
Ren Maddo #7 / 15
|
 How to: Create Regex which extracts N number of words before target word
Quote:
> I tried to extend this to also include 2 words after the word, but
> this mucks up the elegant code
> ===
> $_ = "word1 word2 word3 word4 cat word5 word6 word7";
> s/(\w+\W+){2}cat(\w+\W+){2}/Z/xg;
> print $_;
The "\w+\W+" bit matches, in this case, a word and the space(s) after it. But for after "cat" you want to reverse that and match the space(s) before the word and then the word:

    s/(\w+\W+){2}cat(\W+\w+){2}/Z/g; # no x needed since no spaces in regex

--
Ren Maddox
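As a minimal sketch of the same idea with both counts made variable (the names $before, $after, $text and $out are assumptions, not from the post), the word-then-gap group goes before the target and the gap-then-word group after it:

```perl
use strict;
use warnings;

my ($before, $after) = (2, 2);   # assumed word counts on each side
my $text = "word1 word2 word3 word4 cat word5 word6 word7";

# Before the target: word followed by separator.
# After the target: separator followed by word.
(my $out = $text) =~ s/(?:\w+\W+){$before}cat(?:\W+\w+){$after}/Z/;

print "$out\n";
```

With the sample string this replaces "word3 word4 cat word5 word6" and leaves the words on either side untouched.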
|
Sat, 18 Oct 2003 22:48:46 GMT |
|
 |
Craig Ber #8 / 15
|
 How to: Create Regex which extracts N number of words before target word
: Assume you have a target word, e.g. "cat", and you want to extract that word
: and a certain number of words before it. How is one to do this in a
: non-literal manner with a regular expression which will support any number
: of pre-words to be extracted?
:
: Following does not work. It should replace target and previous two words, to
: wit: word1 word2 Z
:
: $_ = "word1 word2 word3 word4 cat";
: s: \b.+\b{2}?cat :Z:xg;
: print $_;

    $_ = 'word1 word2 word3 word4 cat';
    s/(\w+\s+){2}cat/Z/g;
    print;

Output is 'word1 word2 Z'.

--
   |   Craig Berry - http://www.cinenet.net/~cberry/
 --*--  "When the going gets weird, the weird turn pro."
   |               - Hunter S. Thompson
|
Sun, 19 Oct 2003 04:20:56 GMT |
|
 |
Richard J. Rauenza #9 / 15
|
 How to: Create Regex which extracts N number of words before target word
Quote:
>That's quite brilliant. Most elegant.
Thanks -- and thanks for updating your munging! One last request for now -- give up the habit of top posting. It really annoys the locals, and you're likely to get more help when you don't annoy the locals. For a quoting guide: http://www.netmeister.org/news/learn2quote2.html

Quote:
>I tried to extend this to also include 2 words after the word, but this
>mucks up the elegant code
>===
>$_ = "word1 word2 word3 word4 cat word5 word6 word7";
>s/(\w+\W+){2}cat(\w+\W+){2}/Z/xg;
>print $_;

I'll give you a hint because you're really close -- look up the meanings of \w and \W in the perlre man page and adjust what you added accordingly. (You do understand the purpose of the ()'s, right? The {2} applies to the entire sub expression within the ()'s.)

Rich
--
Technical Consultant | I speak for me, | 19055 Pruneridge Ave. Development Alliances Lab| *not* HP | MS 46TU2 ESPD / E-Serv. Partner Division +--------------+---- Cupertino, CA 95014
|
Sun, 19 Oct 2003 04:10:00 GMT |
|
 |
Bart Lateu #10 / 15
|
 How to: Create Regex which extracts N number of words before target word
Quote:
>Assume you have a target word, e.g. "cat", and you want to extract that word
>and a certain number of words before it. How is one to do this in a
>non-literal manner with a regular expression which will support any number
>of pre-words to be extracted?

Careful with that.

    $_ = 'dog cat mouse cat parrot';
    /((?:\S+\s+){2})cat\b/ and print $1;
    --> 'cat mouse'

I hope you like what it matched. I think not.

--
    Bart.
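A small sketch of the pitfall, together with one possible guard that is not from the post: a negative lookahead that stops the pre-words from swallowing the target itself, combined with an "at most two" quantifier. The variable names are assumptions.

```perl
use strict;
use warnings;

my $text = 'dog cat mouse cat parrot';

# Pattern from the post: exactly two "words" of any kind, then the target.
my ($exact) = $text =~ /((?:\S+\s+){2})cat\b/;

# Assumed alternative: up to two words, none of which is the target itself.
my ($guarded) = $text =~ /\b((?:(?!cat\b)\S+\s+){0,2})cat\b/;

print "exactly two words:      [", $exact,   "]\n";
print "at most two, no target: [", $guarded, "]\n";
```

The first pattern skips past the first "cat" because only one word precedes it; the guarded version settles for the single word "dog " before the first "cat".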
|
Sun, 19 Oct 2003 17:25:50 GMT |
|
 |
Barry #11 / 15
|
 How to: Create Regex which extracts N number of words before target word
Quote:
> Thanks -- and thanks for updating your munging! One last request for
> now -- give up the habit of top posting. It really annoys the locals,
> and you're likely to get more help when you don't annoy the locals.
> For a quoting guide: http://www.netmeister.org/news/learn2quote2.html

Great tip! Thanks for the reference. Will do.

+++++++++++++++++++++++++++++++

Rich, I took your implicit suggestion to reverse the order of "\w\W", but Bart pointed out a great counterexample which breaks the regex:

== START CODE ===
$string = "word1 word2 word3 cat word5 cat word7 word8 word9";

== END CODE =====

THIS SHOULD RETURN:
word2 word3 cat word5 cat
cat word5 cat word7 word8

BUT INSTEAD RETURNS:
word3
 cat
|
Sun, 19 Oct 2003 21:28:45 GMT |
|
 |
Ren Maddo #12 / 15
|
 How to: Create Regex which extracts N number of words before target word
Quote:
> $string = "word1 word2 word3 cat word5 cat word7 word8 word9";
> == END CODE =====
> THIS SHOULD RETURN:
> word2 word3 cat word5 cat
> cat word5 cat word7 word8
A single invocation of a regex cannot give overlapping matches. You'll have to be more clever.

Quote:
> BUT INSTEAD RETURNS:
> word3
>  cat

This output is mostly a result of the capturing parens you have in the regex, and their interaction with the {2}. Change it to:

    / (?: \w+ \W+ ){2} cat (?: \W+ \w+ ){2} /gx

and you will at least get the first match that you expect. As for handling overlapping matches, here's one way to do it:

    $string =~ m{
        (                  # start saving match
          \b               # needed to limit back-track
          (?:\w+\W+){2}    # match two words
          cat
          (?:\W+\w+){2}    # match two words
          \b               # needed to limit back-track
        )                  # ok, that's a match

        (?!)               # force back-track for other matches
    }x;

Aren't experimental regex features fun!

--
Ren Maddox
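The listing as archived shows the forced failure but not the step that records each match. One plausible reading, and it is only an assumption, is an embedded code block (?{ ... }) placed just before the (?!). A sketch along those lines, with @found as a hypothetical collector:

```perl
use strict;
use warnings;

my $string = "word1 word2 word3 cat word5 cat word7 word8 word9";
my @found;   # hypothetical collector, not in the original post

$string =~ m{
    (                          # start saving match
        \b                     # needed to limit back-track
        (?:\w+\W+){2}          # two words before
        cat
        (?:\W+\w+){2}          # two words after
        \b                     # needed to limit back-track
    )
    (?{ push @found, $1 })     # assumed step: record the current match
    (?!)                       # always fail, forcing a retry at the next position
}x;

print "$_\n" for @found;
```

Because the pattern never succeeds as a whole, the engine retries from every start position, which is what lets the two overlapping windows around each "cat" both be recorded.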
|
Mon, 20 Oct 2003 00:04:07 GMT |
|
 |
Richard J. Rauenza #13 / 15
|
 How to: Create Regex which extracts N number of words before target word
Quote:
>Rich, I took your implicit suggestion to reverse the order of "\w\W", but
>Bart pointed out a great counterexample which breaks the regex:

But now you've also changed your requirements from a replacement expression to a matching expression. This still works as I would expect it to -- I don't know what you expect:

    #!/usr/bin/perl -w
    use strict;
    $_ = "word1 word2 word3 word4 cat word5 cat word6 word7 word8";
    s/(\w+\W+){2}cat(\W+\w+){2}/Z/;
    print $_;

    word1 word2 Z word6 word7 word8

Quote:
>== START CODE ===
>$string = "word1 word2 word3 cat word5 cat word7 word8 word9";
>== END CODE =====
>THIS SHOULD RETURN:
>word2 word3 cat word5 cat
>cat word5 cat word7 word8
>BUT INSTEAD RETURNS:
>word3
> cat

Of course -- because a regex will return what you 'captured' in the ()'s. Try it again with

    m|((\w+ \W+){2}cat(\W+ \w+){2})|xg
      ^                           ^

Now the tricky part which Bart was probably pointing out is that after you've captured the first cat, you're now past the second one. You somehow need to back up. I'm not familiar with doing that in RE's, so I won't offer a solution using RE's. At this point, I'd personally use an RE to split up the words and use an array and a simple algorithm to find the matches:

    #!/usr/bin/perl -wl
    use strict;

    my $b = 2; # number of before words
    my $a = 2; # number of after words

    }
    }
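The listing above is incomplete in this copy. Purely as an illustration of the split-and-scan idea it describes, and not a reconstruction of the original code, a sketch might look like this; every name here (@words, $before, $after, @contexts) is an assumption.

```perl
use strict;
use warnings;

my $text   = "word1 word2 word3 cat word5 cat word7 word8 word9";
my $target = "cat";
my ($before, $after) = (2, 2);   # assumed counts of context words

my @words = split /\W+/, $text;
my @contexts;

for my $i (0 .. $#words) {
    next unless $words[$i] eq $target;

    # Clamp the window to the ends of the word list.
    my $lo = $i - $before;
    $lo = 0 if $lo < 0;
    my $hi = $i + $after;
    $hi = $#words if $hi > $#words;

    push @contexts, join ' ', @words[$lo .. $hi];
}

print "$_\n" for @contexts;
```

Working on an array sidesteps the overlap problem entirely, since each occurrence of the target gets its own window regardless of where the previous one ended.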
Rich --
Technical Consultant | I speak for me, | 19055 Pruneridge Ave. Development Alliances Lab| *not* HP | MS 46TU2 ESPD / E-Serv. Partner Division +--------------+---- Cupertino, CA 95014
|
Mon, 20 Oct 2003 08:49:32 GMT |
|
 |
Bart Lateu #14 / 15
|
 How to: Create Regex which extracts N number of words before target word
Quote:
>Now the tricky part which Bart was probably pointing out is that after
>you've captured the first cat, you're now past the second one.

Yes. If the intention was to capture the first "cat", and get the (at most) n words before it, then you've gone too far. The regex tries to match n words, whatever they are, and then "cat".

Quote:
> At this point, I'd personally use an
>RE to split up the words and use an array and a simple algorithm to find
>the matches:

I think I'd use split.

    my ($pre, $match, $post) = split /(cat)/, $_, 2;

These behave just like $` and $' but without overhead for the other regexes. Now, continue matching with $pre:

    $pre =~ /((?:\S+\s+){0,2})$/;

This will match AT MOST 2 words at the end of the prefix. If you want to continue processing the string where you left off, continue with $post.

--
    Bart.
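Put together as a runnable sketch (the sample string and the names $text and $context are assumptions; the two expressions are the ones from the post):

```perl
use strict;
use warnings;

my $text = 'dog cat mouse cat parrot';

# Split once around the first "cat"; the parentheses keep the separator.
my ($pre, $match, $post) = split /(cat)/, $text, 2;

# Up to two words at the end of the prefix.
my ($context) = $pre =~ /((?:\S+\s+){0,2})$/;

print "context: [", $context, $match, "]\n";
print "rest:    [", $post, "]\n";
```

Note that /(cat)/ as written would also split inside a longer word such as "concatenate"; adding \b on both sides of the target is one way to avoid that, if whole-word matching is what is wanted.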
|
Mon, 20 Oct 2003 15:54:37 GMT |
|
 |
Barry #15 / 15
|
 How to: Create Regex which extracts N number of words before target word
|> #!/usr/bin/perl -wl
|> use strict;
|>
|> my $t = "word1 word2 word3 cat word5 cat word7 word8 word9 cat word11";
|> split(/\W+/, $t);
|>
|> my $b = 2; # number of before words
|> my $a = 2; # number of after words
|>
|> }
|> }

Much obliged to Rich, Bart, and Ren for their masterly attempts to solve this problem. In my ignorance, I thought this was a short one-line solution. I now see I have a lot of work to do.

Barry Krusch
|
Mon, 20 Oct 2003 21:00:14 GMT |
|
|