How to: Create Regex which extracts N number of words before target word 

Assume you have a target word, e.g. "cat", and you want to extract that word
and a certain number of words before it. How is one to do this in a
non-literal manner with a regular expression which will support any number
of pre-words to be extracted?

The following does not work. It should replace the target and the previous
two words, to wit:   word1 word2 Z

#code

$_ =  "word1 word2 word3 word4 cat";

s: \b.+\b{2}?cat :Z:xg;

print $_;

#end code



Fri, 17 Oct 2003 21:17:45 GMT  
 How to: Create Regex which extracts N number of words before target word


Quote:
> Assume you have a target word, e.g. "cat", and you want to extract
> that word and a certain number of words before it. How is one to do
> this in a non-literal manner with a regular expression which will
> support any number of pre-words to be extracted?

> Following does not work. It should replace target and previous two
> words, to wit:   word1 word2 Z

> #code

> $_ =  "word1 word2 word3 word4 cat";

> s: \b.+\b{2}?cat :Z:xg;

The quantifier ({2}) is quantifying the atom `\b'.  I don't think
that's what you meant.  The `?' following the quantifier {2} makes no
sense, since the {2} is not allowed any latitude -- it forces exactly
two.  

Here's one way to do what I think you want:

  $ perl -wle '$_="word1 word2 word3 word4 cat";' \
  >  -e 's/\b(?:\w+ +){2}cat/Z/; print'
  word1 word2 Z
  $

Obligatory mention: see perlre.  
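
Since the question asks for any number of preceding words, here is a minimal sketch that interpolates the count into the quantifier. The names $n, $target, $re and $out are illustrative assumptions, not part of the original post, and \s+ is used instead of a literal space so that tabs also separate words.

```perl
use strict;
use warnings;

my $n      = 2;        # assumed: how many preceding words to take
my $target = 'cat';    # assumed: the target word

# {$n} is interpolated before the pattern is compiled, so this becomes {2}.
# \Q...\E quotes any regex metacharacters in the target word.
my $re = qr/\b(?:\w+\s+){$n}\Q$target\E\b/;

my $text = "word1 word2 word3 word4 cat";
(my $out = $text) =~ s/$re/Z/;

print "$out\n";
```

With $n set to 2 this should print "word1 word2 Z", the same result as the one-liner above.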

--
Garry Williams



Fri, 17 Oct 2003 22:34:53 GMT  
 How to: Create Regex which extracts N number of words before target word
Alas, this won't even compile on my machine.




|>
|>  Assume you have a target word, e.g. "cat", and you want to extract
|>  that word and a certain number of words before it. How is one to do
|>  this in a non-literal manner with a regular expression which will
|>  support any number of pre-words to be extracted?
|>
|>  Following does not work. It should replace target and previous two
|>  words, to wit:   word1 word2 Z
|>
|>  #code
|>
|>  $_ =  "word1 word2 word3 word4 cat";
|>
|>  s: \b.+\b{2}?cat :Z:xg;
|>
|>  The quantifier ({2}) is quantifying the atom `\b'.  I don't think
|>  that's what you meant.  The `?' following the quantifier {2} makes no
|>  sense, since the {2} is not allowed any latitude -- it forces exactly
|>  two.
|>

GW:
|>  Here's one way to do what I think you want:
|>
|>    $ perl -wle '$_="word1 word2 word3 word4 cat";' \
|>    >  -e 's/\b(?:\w+ +){2}cat/Z/; print'
|>    word1 word2 Z
|>    $
|>
|>  Obligatory mention: see perlre.
|>
|>   --
|>  Garry Williams



Sat, 18 Oct 2003 19:36:04 GMT  
 How to: Create Regex which extracts N number of words before target word
This works for two words:

$_ = "aaahhh dog whale cat male xxx yyy";

s|[A-Za-z0-9\.]+ [A-Za-z0-9\.]+ cat [A-Za-z0-9\.]+ [A-Za-z0-9\.]+ |X|;

print $_;

=======

Quote:
> |>  Assume you have a target word, e.g. "cat", and you want to extract
> |>  that word and a certain number of words before it. How is one to do
> |>  this in a non-literal manner with a regular expression which will
> |>  support any number of pre-words to be extracted?



Sat, 18 Oct 2003 20:27:22 GMT  
 How to: Create Regex which extracts N number of words before target word

<off topic>
Be aware that someone out there (tucows, to be specific)
owns fake.com and has to handle (i.e., bounce) all of your
spam/misdirected replies.  If you want to use a fake email address,

see the email munging faq.
</off topic>

Quote:
>Assume you have a target word, e.g. "cat", and you want to extract that word
>and a certain number of words before it. How is one to do this in a
>non-literal manner with a regular expression which will support any number
>of pre-words to be extracted?

>Following does not work. It should replace target and previous two words, to
>wit:   word1 word2 Z

You would do yourself a great service by having perl give you warnings
about your code:

#!perl -w
use strict;

Quote:

>#code

>$_ =  "word1 word2 word3 word4 cat";

>s: \b.+\b{2}?cat :Z:xg;

>print $_;

>#end code

How about...

#!/usr/bin/perl -w
use strict;

$_ =  "word1 word2 word3 word4 cat";

s/(\w+\W+){2}cat/Z/xg;

print $_;

--

Technical Consultant     | I speak for me,     |   19055 Pruneridge Ave.
Development Alliances Lab|            *not* HP |                MS 46TU2
ESPD / E-Serv. Partner Division +--------------+---- Cupertino, CA 95014



Sat, 18 Oct 2003 08:00:44 GMT  
 How to: Create Regex which extracts N number of words before target word
That's quite brilliant. Most elegant.

I tried to extend this to also include 2 words after the word, but this
mucks up the elegant code

===

$_ = "word1 word2 word3 word4 cat word5 word6 word7";

s/(\w+\W+){2}cat(\w+\W+){2}/Z/xg;

print $_;

===



Quote:

> How about...

> #!/usr/bin/perl -w
> use strict;

> $_ =  "word1 word2 word3 word4 cat";

> s/(\w+\W+){2}cat/Z/xg;

> print $_;

> --

> Technical Consultant     | I speak for me,     |   19055 Pruneridge Ave.
> Development Alliances Lab|            *not* HP |                MS 46TU2
> ESPD / E-Serv. Partner Division +--------------+---- Cupertino, CA 95014



Sat, 18 Oct 2003 21:40:59 GMT  
 How to: Create Regex which extracts N number of words before target word

Quote:

> I tried to extend this to also include 2 words after the word, but
> this mucks up the elegant code

> ===

> $_ = "word1 word2 word3 word4 cat word5 word6 word7";

> s/(\w+\W+){2}cat(\w+\W+){2}/Z/xg;

> print $_;

The "\w+\W+" bit matches, in this case, a word and the space(s) after
it.  But for after "cat" you want to reverse that and match the
space(s) before the word and then the word:

s/(\w+\W+){2}cat(\W+\w+){2}/Z/g; # no x needed since no spaces in regex

--
Ren Maddox



Sat, 18 Oct 2003 22:48:46 GMT  
 How to: Create Regex which extracts N number of words before target word
: Assume you have a target word, e.g. "cat", and you want to extract that word
: and a certain number of words before it. How is one to do this in a
: non-literal manner with a regular expression which will support any number
: of pre-words to be extracted?
:
: Following does not work. It should replace target and previous two words, to
: wit:   word1 word2 Z
:
: $_ =  "word1 word2 word3 word4 cat";
: s: \b.+\b{2}?cat :Z:xg;
: print $_;

  $_ =  'word1 word2 word3 word4 cat';
  s/(\w+\s+){2}cat/Z/g;
  print;

Output is 'word1 word2 Z'.

--
   |   Craig Berry - http://www.cinenet.net/~cberry/
 --*--  "When the going gets weird, the weird turn pro."
   |               - Hunter S. Thompson



Sun, 19 Oct 2003 04:20:56 GMT  
 How to: Create Regex which extracts N number of words before target word

Quote:

>That's quite brilliant. Most elegant.

Thanks -- and thanks for updating your munging!  One last request for
now -- give up the habit of top posting.  It really annoys the locals,
and you're likely to get more help when you don't annoy the locals.

For a quoting guide:  http://www.netmeister.org/news/learn2quote2.html

Quote:
>I tried to extend this to also include 2 words after the word, but this
>mucks up the elegant code

>===

>$_ = "word1 word2 word3 word4 cat word5 word6 word7";

>s/(\w+\W+){2}cat(\w+\W+){2}/Z/xg;

>print $_;

I'll give you a hint because you're really close -- look up the meanings
of \w and \W in the perlre man page and adjust what you added
accordingly.

(You do understand the purpose of the ()'s, right?  The {2} applies to
the entire sub expression within the ()'s.)

Rich

--

Technical Consultant     | I speak for me,     |   19055 Pruneridge Ave.
Development Alliances Lab|            *not* HP |                MS 46TU2
ESPD / E-Serv. Partner Division +--------------+---- Cupertino, CA 95014



Sun, 19 Oct 2003 04:10:00 GMT  
 How to: Create Regex which extracts N number of words before target word

Quote:

>Assume you have a target word, e.g. "cat", and you want to extract that word
>and a certain number of words before it. How is one to do this in a
>non-literal manner with a regular expression which will support any number
>of pre-words to be extracted?

Careful with that.

        $_ = 'dog cat mouse cat parrot';
        /((?:\S+\s+){2})cat\b/ and print $1;
-->
        'cat mouse'

I hope you like what it matched. I think not.

--
        Bart.



Sun, 19 Oct 2003 17:25:50 GMT  
 How to: Create Regex which extracts N number of words before target word


Quote:

> Thanks -- and thanks for updating your munging!  One last request for
> now -- give up the habit of top posting.  It really annoys the locals,
> and you're likely to get more help when you don't annoy the locals.

> For a quoting guide:  http://www.netmeister.org/news/learn2quote2.html

Great tip! Thanks for the reference. Will do.

+++++++++++++++++++++++++++++++

Rich, I took your implicit suggestion to reverse the order of "\w\W", but
Bart pointed out a great counterexample which breaks the regex:

== START CODE ===

$string = "word1 word2 word3 cat word5 cat word7 word8 word9";


== END CODE =====

THIS SHOULD RETURN:

word2 word3 cat word5 cat
cat word5 cat word7 word8

BUT INSTEAD RETURNS:

word3
 cat



Sun, 19 Oct 2003 21:28:45 GMT  
 How to: Create Regex which extracts N number of words before target word

Quote:

> $string = "word1 word2 word3 cat word5 cat word7 word8 word9";



> == END CODE =====

> THIS SHOULD RETURN:

> word2 word3 cat word5 cat
> cat word5 cat word7 word8

A single invocation of a regex cannot give overlapping matches.
You'll have to be more clever.

Quote:
> BUT INSTEAD RETURNS:

> word3
>  cat

This output is mostly a result of the capturing parens you have in the
regex, and their interaction with the {2}.  Change it to:

/ (?: \w+ \W+ ){2} cat (?: \W+ \w+ ){2} /gx

and you will at least get the first match that you expect.
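
To see why the capturing parens produce "word3" and " cat", here is a small sketch comparing the two forms. The variable names @inner and $whole are assumptions for illustration only. A quantified capture group keeps only the text of its last repetition, so wrapping the whole pattern in one outer group is what returns the full window.

```perl
use strict;
use warnings;

my $string = "word1 word2 word3 cat word5 cat word7 word8 word9";

# Quantified capturing groups: $1 and $2 hold only the LAST repetition.
my @inner = $string =~ /(\w+\W+){2}cat(\W+\w+){2}/;

# Non-capturing groups inside one outer capture: the whole window is kept.
my ($whole) = $string =~ /((?:\w+\W+){2}cat(?:\W+\w+){2})/;

print "inner: [$inner[0]] [$inner[1]]\n";
print "whole: [$whole]\n";
```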

As for handling overlapping matches, here's one way to do it:

$string =~ m{
             (                         # start saving match
              \b                       # needed to limit back-track
              (?:\w+\W+){2}            # match two words
              cat
              (?:\W+\w+){2}            # match two words
              \b                       # needed to limit back-track
             )                         # ok, that's a match

             (?!)                      # force back-track for other matches
            }x;

Aren't experimental regex features fun!
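
The listing above shows no step that records each match before the forced failure. One way to fill that in, as an assumption rather than a reconstruction of the original, is an embedded code block (?{ ... }) that pushes $1 onto an array. The array name @found is hypothetical; it is declared with "our" so the code block can reach it on older perls too.

```perl
use strict;
use warnings;

my $string = "word1 word2 word3 cat word5 cat word7 word8 word9";
our @found;    # package variable, visible inside the embedded code block

$string =~ m{
             (                         # start saving match
              \b                       # limit back-tracking to word starts
              (?:\w+\W+){2}            # two words before
              cat
              (?:\W+\w+){2}            # two words after
              \b                       # limit back-tracking to word ends
             )
             (?{ push @found, $1 })    # assumed step: record this window
             (?!)                      # always fails, forcing a retry
            }x;

print "$_\n" for @found;
```

The match as a whole always fails, so only the side effect in the code block matters. For the sample string this should record both overlapping windows.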

--
Ren Maddox



Mon, 20 Oct 2003 00:04:07 GMT  
 How to: Create Regex which extracts N number of words before target word

Quote:

>Rich, I took your implicit suggestion to reverse the order of "\w\W", but
>Bart pointed out a great counterexample which breaks the regex:

But now you've also changed your requirements from a replacement
expression to a matching expression.  This still works as I would expect
it to -- I don't know what you expect:

#!/usr/bin/perl -w
use strict;

$_ =  "word1 word2 word3 word4 cat word5 cat word6 word7 word8";

s/(\w+\W+){2}cat(\W+\w+){2}/Z/;

print $_;

word1 word2 Z word6 word7 word8

Quote:
>== START CODE ===

>$string = "word1 word2 word3 cat word5 cat word7 word8 word9";



>== END CODE =====

>THIS SHOULD RETURN:

>word2 word3 cat word5 cat
>cat word5 cat word7 word8

>BUT INSTEAD RETURNS:

>word3
> cat

Of course -- because a regex will return what you 'captured' in the
()'s.  Try it again with m|((\w+ \W+){2}cat(\W+ \w+){2})|xg
                           ^                           ^

Now the tricky part which Bart was probably pointing out is that after
you've captured the first cat, you're now past the second one.  You
somehow need to back up.  I'm not familiar with doing that in RE's, so I
won't offer a solution using RE's.  At this point, I'd personally use an
RE to split up the words and use an array and a simple algorithm to find
the matches:

#!/usr/bin/perl -wl
use strict;



my $b = 2; # number of before words
my $a = 2; # number of after words






   }
}
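
The listing above is incomplete in this archive, so here is a separate minimal sketch of the split-into-an-array idea, not the original code. The names @words, @windows, $before, $after, $lo and $hi are assumptions, as is the choice to clamp the window at the ends of the word list.

```perl
use strict;
use warnings;

my $text   = "word1 word2 word3 cat word5 cat word7 word8 word9 cat word11";
my $target = 'cat';
my ($before, $after) = (2, 2);    # words to keep on each side

my @words = split /\W+/, $text;
my @windows;

for my $i (0 .. $#words) {
    next unless $words[$i] eq $target;

    # Clamp the window so it never runs off either end of the list.
    my $lo = $i - $before < 0       ? 0       : $i - $before;
    my $hi = $i + $after  > $#words ? $#words : $i + $after;

    push @windows, join ' ', @words[$lo .. $hi];
}

print "$_\n" for @windows;
```

Because every occurrence is looked up by index, overlapping windows need no special handling here.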

Rich
--

Technical Consultant     | I speak for me,     |   19055 Pruneridge Ave.
Development Alliances Lab|            *not* HP |                MS 46TU2
ESPD / E-Serv. Partner Division +--------------+---- Cupertino, CA 95014


Mon, 20 Oct 2003 08:49:32 GMT  
 How to: Create Regex which extracts N number of words before target word

Quote:

>Now the tricky part which Bart was probably pointing out is that after
>you've captured the first cat, you're now past the second one.

Yes. If the intention was to capture the first "cat" and get the (at
most) n words before it, then you've gone too far. The regex tries to
match exactly n words, whatever they are, and then "cat".

Quote:
> At this point, I'd personally use an
>RE to split up the words and use an array and a simple algorithm to find
>the matches:

I think I'd use split.

        my ($pre, $match, $post) = split /(cat)/, $_, 2;

These behave just like $` and $', but without the overhead those
variables impose on every other regex in the program.

Now, continue matching with $pre:

        $pre =~ /((?:\S+\s+){0,2})$/;

This will match AT MOST 2 words at the end of the prefix.

If you want to continue processing the string where you left off,
continue with $post.
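
Put together, the split approach could be sketched as a loop like the one below. The names $n, $rest, $context and @hits are assumptions, and \b is added around "cat" so that words merely containing those letters are not treated as the target.

```perl
use strict;
use warnings;

my $n    = 2;                             # at most this many words before
my $rest = 'dog cat mouse cat parrot';
my @hits;

while (1) {
    # Limit 2 plus the capture group yields (prefix, target, remainder).
    my ($pre, $match, $post) = split /\b(cat)\b/, $rest, 2;
    last unless defined $match;           # no further target word

    # Up to $n whitespace-separated words at the end of the prefix.
    my ($context) = $pre =~ /((?:\S+\s+){0,$n})$/;
    push @hits, "$context$match";

    $rest = $post;                        # carry on after this target
}

print "$_\n" for @hits;
```

Note that continuing with $post discards everything up to and including the previous target, so the context for a later "cat" never reaches back across an earlier one.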

--
        Bart.



Mon, 20 Oct 2003 15:54:37 GMT  
 How to: Create Regex which extracts N number of words before target word


|>  #!/usr/bin/perl -wl
|>  use strict;
|>
|>  my $t =  "word1 word2 word3 cat word5 cat word7 word8 word9 cat word11";
|>  split(/\W+/, $t);
|>

|>  my $b = 2; # number of before words
|>  my $a = 2; # number of after words
|>





|>      }
|>   }

Much obliged to Rich, Bart, and Ren for their masterly attempts to solve
this problem. In my ignorance, I thought this was a short one-line solution.
I now see I have a lot of work to do.

Barry Krusch



Mon, 20 Oct 2003 21:00:14 GMT  
 