Extracting a range of words!
Author |
Message |
vivek_1231 #1 / 9
|
I need Perl help. Say I have:

$text = qq (I confirm that sufficient information and detail have been reported in this technical report, that it is scientifically sound, and that appropriate conclusions have been included)

I find the index for "sound". After that I just need the substring from {-5, +5} WORDS around that indexof(sound), i.e. my final answer should be:

report, that it is scientifically sound, and that appropriate conclusions have

Is there a strategy for this, or do I have to do it in basic steps?
|
Mon, 03 Jun 2013 16:52:35 GMT |
|
 |
Tad McClella #2 / 9
|
Quote:
> $text = qq (I confirm that sufficient information and detail have been
> reported in this technical report, that it is scientifically sound,
> and that appropriate conclusions have been included)
> i find the index for "sound".
> After that I just need the substring from {-5, +5} WORDS around that
> indexof(sound)

What do you want to do if there is only 1 word before the target word, e.g. if searching for "confirm"?

I will assume: show 0-5 words before and after.

What do you want to do if the target word is in the string at more than one place, e.g. "been"?

I will assume: find the first one.

What do you want if "sound" appears within some other word such as "resoundingly"?

I will assume: do not match.

my $word = 'sound';
$text =~ s/.*?              # leading stuff to strip
           (                # $1 is stuff to keep
             (\w+\W+){0,5}  # 0-5 words
             \b$word\b\W*   # the word to search for
             (\w+\W*){0,5}  # 0-5 words
           )
           .*               # trailing stuff to strip
          /$1/sx;

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.
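For readers who would rather not overwrite $text, the same pattern can be used as a plain match. The following is only a sketch under the three assumptions stated above; the sub name context_around and its optional third argument are made up for illustration and are not part of the posted code.

```perl
use strict;
use warnings;

# Return up to $n words on either side of the first whole-word
# occurrence of $word in $text, or undef if the word is absent.
# (Hypothetical helper; a "word" here is a run of \w characters.)
sub context_around {
    my ($text, $word, $n) = @_;
    $n = 5 unless defined $n;

    my ($hit) = $text =~ /
        (                         # capture group 1: the stretch to keep
            (?: \w+ \W+ ){0,$n}   # up to n words before the target
            \b \Q$word\E \b       # the target as a whole word
            \W*
            (?: \w+ \W* ){0,$n}   # up to n words after the target
        )
    /x;

    return undef unless defined $hit;
    $hit =~ s/\s+\z//;            # drop trailing whitespace
    return $hit;
}
```

Because the match is unanchored, the regex engine finds the leftmost starting point from which at most $n words lead up to the target, so no explicit "leading stuff to strip" is needed. \Q...\E is there only so that a target containing regex metacharacters is taken literally.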
|
Mon, 03 Jun 2013 17:30:47 GMT |
|
 |
Uri Guttma #3 / 9
|
  TM> my $word = 'sound';
  TM> $text =~ s/.*?              # leading stuff to strip
  TM>            (                # $1 is stuff to keep
  TM>              (\w+\W+){0,5}  # 0-5 words
  TM>              \b$word\b\W*   # the word to search for
  TM>              (\w+\W*){0,5}  # 0-5 words
  TM>            )
  TM>            .*               # trailing stuff to strip
  TM>           /$1/sx;

that was pretty much the regex i would write. but do you need the \b's in there? assuming $word is really \w chars, then the preceding \W will obviate the need for the \b. same for the trailing one.

also i would use \s+\S+ since written words could contain apostrophes and some other punctuation.

uri

--
----- Perl Code Review , Architecture, Development, Training, Support ------ --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
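The \s+\S+ suggestion can be sketched as follows. This is one possible reading of it, not code from the thread: the sub name context_ws is invented here, and the rule that the target may carry punctuation on either side but must otherwise be a whole whitespace-delimited token is an added assumption.

```perl
use strict;
use warnings;

# Whitespace-delimited variant: a "word" is any run of non-whitespace,
# so "it's" and "conclusion's" each count as one word.
# (Hypothetical helper, written for illustration only.)
sub context_ws {
    my ($text, $word, $n) = @_;
    $n = 5 unless defined $n;

    my ($hit) = $text =~ /
        (
            (?: \S+ \s+ ){0,$n}      # up to n tokens before the target
            (?<! \S ) [^\w\s]*       # token start, optional punctuation
            \Q$word\E
            [^\w\s]* (?! \S )        # optional punctuation, token end
            (?: \s+ \S+ ){0,$n}      # up to n tokens after the target
        )
    /x;
    return $hit;                     # undef when nothing matched
}
```

With this definition "sound," and "\"sound\"" both count as hits for 'sound', while "resoundingly" and "sounds" do not.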
|
Mon, 03 Jun 2013 18:41:40 GMT |
|
 |
ccc3180 #4 / 9
|
Quote:
> $text = qq (I confirm that sufficient information and detail have been
> reported in this technical report, that it is scientifically sound,
> and that appropriate conclusions have been included)
> After that I just need the substring from {-5, +5} WORDS around that
> indexof(sound)

Here's a pretty mindless way to do it. Split the string into an array, then iterate through the array looking for 'sound'. If necessary, you can use the word boundary markers. Then, starting at the index of the array element you found, print the ten elements starting at 'index - 5'. Like this:

my $text = qq (I confirm that sufficient information and detail have been reported in this technical report, that it is scientifically sound, and that appropriate conclusions have been included);
my $index = 0;
{
    last if ($element =~ /sound/);
    $index++;
}
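The split-and-slice idea described above can be sketched in full like this. It is an illustration rather than the code as originally posted: the sub name context_split, the use of List::Util, and the clamping of both ends of the slice are assumptions added here.

```perl
use strict;
use warnings;
use List::Util qw(first max min);

# Split on whitespace, find the first token containing the target as a
# whole word, and return that token with up to $n neighbours per side.
# (Hypothetical helper, written for illustration only.)
sub context_split {
    my ($text, $word, $n) = @_;
    $n = 5 unless defined $n;

    my @words = split ' ', $text;    # split on runs of whitespace

    my $index = first { $words[$_] =~ /\b\Q$word\E\b/ } 0 .. $#words;
    return undef unless defined $index;

    # Clamp both ends so the slice never leaves the array.
    my $from = max(0,       $index - $n);
    my $to   = min($#words, $index + $n);
    return join ' ', @words[ $from .. $to ];
}
```

The clamping matters when the target sits within five words of either end of the text; without it the slice would reach outside the array.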
CC.
|
Mon, 03 Jun 2013 19:31:14 GMT |
|
 |
s.. #5 / 9
|
Quote:
>I need perl help...
>say
>$text = qq (I confirm that sufficient information and detail have been
>reported in this technical report, that it is scientifically sound,
>and that appropriate conclusions have been included)
>i find the index for "sound".
>After that I just need the substring from {-5, +5} WORDS around that
>indexof(sound)
>i.e.
>my final answer shud be = report, that it is scientifically sound, and
>that appropriate conclusions have
>Is There a strategy or I have to do it in basic steps ?

The only real strategy is that you have to know what WORDS are. It is something like parsing a language. It's not enough to split on spaces, so you need to define the language first. That means there is a relationship between punctuation and whitespace, the usual separators of language WORDs.

It's not easy. Free-flowing, wild, English-like text with bad spelling, punctuation, etc. will not make this easy. Since you have no basis for a grammar, an approximation is the best you can do.

I like this one: it uses punctuation and it enforces some rules. But it is impossible to get it always correct.

-sln

use strict;
use warnings;

my $text = qq (I confirm that sufficient information and detail have been reported in this technical report, that it' is "scientifically" sound, and that appropriate conclusion's have been included);

if ( $text =~ /
   (                 #1
     (               #2
       (?:
          (?:^|\s) [[:punct:]]* \w [\w[:punct:]]* [\s[:punct:]]*
       ){0,5}
     )
     sound
     (               #3
       (?:
          [\s[:punct:]]* \w [\w[:punct:]]* (?:$|\s)
       ){0,5}
     )
   )
  /x )
{
print <<RES;
\r 1= '$1'\n\n
\r 2= '$2'\n\n
\r 3= '$3'\n
RES
}
|
Mon, 03 Jun 2013 20:45:54 GMT |
|
 |
Tad McClella #6 / 9
|
Quote:
> TM> my $word = 'sound';
> TM> $text =~ s/.*?              # leading stuff to strip
> TM>            (                # $1 is stuff to keep
> TM>              (\w+\W+){0,5}  # 0-5 words
> TM>              \b$word\b\W*   # the word to search for
> TM>              (\w+\W*){0,5}  # 0-5 words
> TM>            )
> TM>            .*               # trailing stuff to strip
> TM>           /$1/sx;
> that was pretty much the regex i would write. but do you need the \b's
> in there? assuming $word is really \w chars, then the preceding \W will
> obviate the need for the \b. same for the trailing one.

It is needed for the trailing one, since the \W is zero or more, and it is zero or more so that the last $word in $text can be matched.

Quote:
> also i would use \s+\S+ since written words could contain apostrophes
> and some other punctuation.

That would be an improvement, I'll use that in the future.

--
Tad McClellan
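The point about the trailing \b is easy to check in isolation. Below is a small sketch; the helper name hits and its flag argument are invented for the demonstration and only the target-word part of the pattern is reproduced.

```perl
use strict;
use warnings;

# Report whether $text contains $word, with or without a trailing \b.
# (Hypothetical helper, written for illustration only.)
sub hits {
    my ($text, $word, $trailing_b) = @_;
    my $re = $trailing_b
        ? qr/\b\Q$word\E\b\W*/    # as posted: boundary after the word
        : qr/\b\Q$word\E\W*/;     # trailing \b dropped
    return $text =~ $re ? 1 : 0;
}

for my $sample ('perfectly sound', 'sounds good') {
    printf "%-16s with \\b: %d   without: %d\n",
        $sample, hits($sample, 'sound', 1), hits($sample, 'sound', 0);
}
```

Since \W* is allowed to match nothing, only the trailing \b stops "sound" from matching the first five letters of "sounds".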
|
Mon, 03 Jun 2013 20:57:52 GMT |
|
 |
Tad McClella #7 / 9
|
Quote:
>> $text = qq (I confirm that sufficient information and detail have been
>> reported in this technical report, that it is scientifically sound,
>> and that appropriate conclusions have been included)
>> After that I just need the substring from {-5, +5} WORDS around that
>> indexof(sound)
> Here's a pretty mindless way to do it. Split the string into an array,
> then iterate through the array looking for 'sound'. If necessary, you
> can use the word boundary markers. Then, starting at the index of the
> array element you found, print the ten elements starting at 'index -
> 5'. Like this:
> my $text = qq (I confirm that sufficient information and detail have
> been reported in this technical report, that it is scientifically
> sound, and that appropriate conclusions have been included);
> my $index = 0;
> {
> last if ($element =~ /sound/);

Try it with:

last if ($element =~ /confirm/);

so that $index = 1

Quote:
> $index++;
> }

^^^^^^^^^^
Negative indexes count backwards from the end of the array...

--
Tad McClellan
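The negative-index pitfall can be reproduced with a short list. This is a sketch with a made-up seven-word array and a window of two words per side, chosen only to keep the output small.

```perl
use strict;
use warnings;

my @words = qw(I confirm that sufficient information and detail);
my $index = 1;                       # position of "confirm"

# Naive window: the range starts at -1, which Perl reads as
# "last element of the array".
my @naive = @words[ $index - 2 .. $index + 2 ];

# Clamped window: never start before element 0.
my $from  = $index - 2 < 0 ? 0 : $index - 2;
my @fixed = @words[ $from .. $index + 2 ];

print "naive: @naive\n";             # naive: detail I confirm that sufficient
print "fixed: @fixed\n";             # fixed: I confirm that sufficient
```

The naive slice silently wraps around and pulls "detail" from the end of the array, which is why the lower bound has to be clamped.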
|
Mon, 03 Jun 2013 21:03:05 GMT |
|
 |
s.. #8 / 9
|
Quote:
>> TM> my $word = 'sound';
>> TM> $text =~ s/.*?              # leading stuff to strip
>> TM>            (                # $1 is stuff to keep
>> TM>              (\w+\W+){0,5}  # 0-5 words
>> TM>              \b$word\b\W*   # the word to search for
>> TM>              (\w+\W*){0,5}  # 0-5 words
>> TM>            )
>> TM>            .*               # trailing stuff to strip
>> TM>           /$1/sx;
>> that was pretty much the regex i would write. but do you need the \b's
>> in there? assuming $word is really \w chars, then the preceding \W will
>> obviate the need for the \b. same for the trailing one.
>It is needed for the trailing one, since the \W is zero or more,
>and it is zero or more so that the last $word in $text can be
>matched.
>> also i would use \s+\S+ since written words could contain apostrophes
>> and some other punctuation.
>That would be an improvement, I'll use that in the future.
Neither \s+\S+ nor \w+\W+ will work separately; they have to be used together. But since they overlap, it's impossible to use them together. This leaves \w plus \s plus punctuation as the foundation.

-sln
|
Mon, 03 Jun 2013 21:08:30 GMT |
|
 |
Justin #9 / 9
|
Quote:
> I need perl help...
> say
> $text = qq (I confirm that sufficient information and detail have been
> reported in this technical report, that it is scientifically sound,
> and that appropriate conclusions have been included)
> i find the index for "sound".
> After that I just need the substring from {-5, +5} WORDS around that
> indexof(sound)
> i.e.
> my final answer shud be = report, that it is scientifically sound, and
> that appropriate conclusions have
> Is There a strategy or I have to do it in basic steps ?
my $i = indexof(sound); # which you said you had
for my $word ( 0 .. $#words ) {
}

Justin.

--
Justin C, by the sea.
|
Tue, 04 Jun 2013 09:53:57 GMT |
|
|
|