Extracting a range of words! 

I need perl help...

say

$text = qq (I confirm that sufficient information and detail have been
reported in this technical report, that it is scientifically sound,
and that appropriate conclusions have been included)

i find the index for "sound".

After that I just need the substring from {-5, +5} WORDS around that
indexof(sound)

i.e.

my final answer should be = report, that it is scientifically sound, and
that appropriate conclusions have

Is there a strategy, or do I have to do it in basic steps?



Mon, 03 Jun 2013 16:52:35 GMT  

Quote:

> $text = qq (I confirm that sufficient information and detail have been
> reported in this technical report, that it is scientifically sound,
> and that appropriate conclusions have been included)

> i find the index for "sound".

> After that I just need the substring from {-5, +5} WORDS around that
> indexof(sound)

What do you want to do if there is only 1 word before the target word,
eg. if searching for "confirm"?

I will assume: show 0-5 words before and after.

What do you want to do if the target word is in the string
at more than one place, eg "been"?

I will assume: find the first one.

What do you want if "sound" appears within some other word
such as "resoundingly"?

I will assume: do not match.

    my $word = 'sound';
    $text =~ s/.*?              # leading stuff to strip
               (                # $1 is stuff to keep
                (\w+\W+){0,5}   # 0-5 words
                \b$word\b\W*    # the word to search for
                (\w+\W*){0,5}   # 0-5 words
               )
               .*               # trailing stuff to strip
              /$1/sx;
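
For anyone who wants to try the substitution above against the three assumptions, here is a minimal sketch. The sub name words_around is made up for illustration, the \Q...\E quoting is an addition so that a target containing regex metacharacters is taken literally, and the inner groups are made non-capturing; otherwise the pattern is the one shown above.

```perl
use strict;
use warnings;

# words_around() is a hypothetical wrapper, not part of the original post.
sub words_around {
    my ($text, $word) = @_;
    $text =~ s/.*?                 # leading stuff to strip
               (                   # group 1 is the stuff to keep
                (?:\w+\W+){0,5}    # 0-5 words before
                \b\Q$word\E\b\W*   # the word to search for
                (?:\w+\W*){0,5}    # 0-5 words after
               )
               .*                  # trailing stuff to strip
              /$1/sx
        or return;                 # target not found
    return $text;
}

my $text = 'I confirm that sufficient information and detail have been '
         . 'reported in this technical report, that it is scientifically sound, '
         . 'and that appropriate conclusions have been included';

for my $target (qw(sound confirm been resound)) {
    my $found = words_around($text, $target);
    print "$target: ", defined $found ? "[$found]" : '(no match)', "\n";
}
```

The kept text can end in whitespace, because the last \W* also swallows the space after the fifth following word.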

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.



Mon, 03 Jun 2013 17:30:47 GMT  

  TM>     my $word = 'sound';
  TM>     $text =~ s/.*?              # leading stuff to strip
  TM>                (                # $1 is stuff to keep
  TM>                 (\w+\W+){0,5}   # 0-5 words
  TM>                 \b$word\b\W*    # the word to search for
  TM>                 (\w+\W*){0,5}   # 0-5 words
  TM>                )
  TM>                .*               # trailing stuff to strip
  TM>               /$1/sx;

that was pretty much the regex i would write. but do you need the \b's
in there? assuming $word is really \w chars, then the preceding \W will
obviate the need for the \b. same for the trailing one.

also i would use \s+\S+ since written words could contain apostrophes
and some other punctuation.
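
A rough sketch of what the whitespace-based variant could look like. The sample text is made up, and the lookarounds and the [[:punct:]]* allowances around the target are assumptions added here so that the target still matches when a comma or quote is glued to it but not when it sits inside a longer word.

```perl
use strict;
use warnings;

# Made-up sample text with apostrophes and quotes attached to words.
my $text = q{I confirm that it's all "scientifically" sound, and that }
         . q{the author's conclusions don't overreach at all};
my $word = 'sound';

my ($context) = $text =~ /
    (?<!\S)                  # start at the beginning of a token
    (
      (?:\S+\s+){0,5}        # up to five whitespace-delimited tokens before
      [[:punct:]]*           # punctuation glued to the front of the target
      \Q$word\E
      [[:punct:]]*           # punctuation glued to the end, like the comma
      (?!\S)                 # the token must end here
      (?:\s+\S+){0,5}        # up to five tokens after
    )
/x;

print "$context\n" if defined $context;
```

Here "it's" and "author's" each count as one token, whereas \w+\W+ would count each of them as two words.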

uri




Mon, 03 Jun 2013 18:41:40 GMT  

Quote:
> $text = qq (I confirm that sufficient information and detail have been
> reported in this technical report, that it is scientifically sound,
> and that appropriate conclusions have been included)
> After that I just need the substring from {-5, +5} WORDS around that
> indexof(sound)

Here's a pretty mindless way to do it. Split the string into an array,
then iterate through the array looking for 'sound'. If necessary, you
can use the word boundary markers. Then, starting at the index of the
array element you found, print the ten elements starting at 'index -
5'. Like this:

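my $text = qq (I confirm that sufficient information and detail have
been reported in this technical report, that it is scientifically
sound, and that appropriate conclusions have been included);

# @words is an assumed name; split on whitespace as described above
my @words = split ' ', $text;
my $index = 0;

foreach my $element (@words)
{
  last if ($element =~ /sound/);
  $index++;
}

# the ten elements starting at 'index - 5'
print "@words[$index - 5 .. $index + 4]\n";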



CC.



Mon, 03 Jun 2013 19:31:14 GMT  

Quote:

>I need perl help...

>say

>$text = qq (I confirm that sufficient information and detail have been
>reported in this technical report, that it is scientifically sound,
>and that appropriate conclusions have been included)

>i find the index for "sound".

>After that I just need the substring from {-5, +5} WORDS around that
>indexof(sound)

>i.e.

>my final answer should be = report, that it is scientifically sound, and
>that appropriate conclusions have

>Is there a strategy, or do I have to do it in basic steps?

The only real strategy is that you have to know what WORDS are.
It is something like parsing a language. It's not enough to split
on spaces, so you need to define the language first.

That means defining the relationship between punctuation and whitespace,
the usual separators of language WORDs.
It's not easy. Free-flowing, wild, English-y text with bad spelling and
punctuation will not make this easy. Since you have no basis for a
grammar, an approximation is the best you can do.

I like this one; it uses punctuation and enforces some rules.
But it is impossible to always get it right.

-sln

use strict;
use warnings;

my $text = qq (I confirm that sufficient information and detail have been
reported in this technical report, that it' is "scientifically" sound,
and that appropriate conclusion's have been included);

 if ( $text =~ /
      (  #1
          ( #2
             (?:
                  (?:^|\s)
                  [[:punct:]]*
                  \w
                  [\w[:punct:]]*
                  [\s[:punct:]]*
             ){0,5}
          )
          sound
          ( #3
             (?:
                  [\s[:punct:]]*
                  \w
                  [\w[:punct:]]*
                  (?:$|\s)
              ){0,5}
           )
      )
  /x )
 {
    print <<RES;
    \r 1= '$1'\n\n
    \r 2= '$2'\n\n
    \r 3= '$3'\n
RES
 }



Mon, 03 Jun 2013 20:45:54 GMT  

Quote:


>  TM>     my $word = 'sound';
>  TM>     $text =~ s/.*?              # leading stuff to strip
>  TM>                (                # $1 is stuff to keep
>  TM>                 (\w+\W+){0,5}   # 0-5 words
>  TM>                 \b$word\b\W*    # the word to search for
>  TM>                 (\w+\W*){0,5}   # 0-5 words
>  TM>                )
>  TM>                .*               # trailing stuff to strip
>  TM>               /$1/sx;

> that was pretty much the regex i would write. but do you need the \b's
> in there? assuming $word is really \w chars, then the preceding \W will
> obviate the need for the \b. same for the trailing one.

It is needed for the trailing one, since the \W is zero or more,
and it is zero or more so that the last $word in $text can be
matched.
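
A small sketch of that point, using made-up test strings rather than the original text. It only exercises the tail of the pattern (the target plus what follows it), with a plain \W standing in for the separator that precedes the target.

```perl
use strict;
use warnings;

my $word = 'sound';

# Without the trailing \b, \W* may match nothing, so a longer word matches.
my $loose  = ' sounding' =~ /\W$word\W*/     ? 1 : 0;
# With the trailing \b, the longer word is rejected.
my $strict = ' sounding' =~ /\W$word\b\W*/   ? 1 : 0;
# \W* (rather than \W+) still lets the target be the last word of the text.
my $last   = 'is sound'  =~ /\W$word\b\W*\z/ ? 1 : 0;
# \W+ would need a trailing non-word character and fails here.
my $plus   = 'is sound'  =~ /\W$word\W+/     ? 1 : 0;

print "loose=$loose strict=$strict last=$last plus=$plus\n";
# prints: loose=1 strict=0 last=1 plus=0
```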

Quote:
> also i would use \s+\S+ since written words could contain apostrophes
> and some other punctuation.

That would be an improvement, I'll use that in the future.

--
Tad McClellan



Mon, 03 Jun 2013 20:57:52 GMT  

Quote:


>> $text = qq (I confirm that sufficient information and detail have been
>> reported in this technical report, that it is scientifically sound,
>> and that appropriate conclusions have been included)
>> After that I just need the substring from {-5, +5} WORDS around that
>> indexof(sound)

> Here's a pretty mindless way to do it. Split the string into an array,
> then iterate through the array looking for 'sound'. If necessary, you
> can use the word boundary markers. Then, starting at the index of the
> array element you found, print the ten elements starting at 'index -
> 5'. Like this:

> my $text = qq (I confirm that sufficient information and detail have
> been reported in this technical report, that it is scientifically
> sound, and that appropriate conclusions have been included);

> my @words = split ' ', $text;
> my $index = 0;

> foreach my $element (@words)
> {
>   last if ($element =~ /sound/);

Try it with:

    last if ($element =~ /confirm/);

so that $index = 1

>   $index++;
> }

With $index = 1, 'index - 5' is -4, and negative indexes count
backwards from the end of the array...
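
One way around that is to clamp both ends of the slice to the array bounds. The following is only a sketch: the sub name window and the array name @words are assumptions, and it takes five words on each side of the target (eleven in all) to match what the original question asked for.

```perl
use strict;
use warnings;
use List::Util qw(first max min);

my $text = 'I confirm that sufficient information and detail have been '
         . 'reported in this technical report, that it is scientifically sound, '
         . 'and that appropriate conclusions have been included';
my @words = split ' ', $text;

# window() is a hypothetical helper: the words within $span positions of
# the first element containing $target as a whole word, clamped to the
# array bounds so the slice never wraps around.
sub window {
    my ($words, $target, $span) = @_;
    my $i = first { $words->[$_] =~ /\b\Q$target\E\b/ } 0 .. $#$words;
    return unless defined $i;
    my $from = max(0, $i - $span);
    my $to   = min($#$words, $i + $span);
    return @{$words}[$from .. $to];
}

print join(' ', window(\@words, 'sound',   5)), "\n";
print join(' ', window(\@words, 'confirm', 5)), "\n";
```

For 'confirm' this gives the single word before it plus the five after it, instead of elements pulled from the end of the array.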

--
Tad McClellan



Mon, 03 Jun 2013 21:03:05 GMT  

Quote:



>>  TM>     my $word = 'sound';
>>  TM>     $text =~ s/.*?              # leading stuff to strip
>>  TM>                (                # $1 is stuff to keep
>>  TM>                 (\w+\W+){0,5}   # 0-5 words
>>  TM>                 \b$word\b\W*    # the word to search for
>>  TM>                 (\w+\W*){0,5}   # 0-5 words
>>  TM>                )
>>  TM>                .*               # trailing stuff to strip
>>  TM>               /$1/sx;

>> that was pretty much the regex i would write. but do you need the \b's
>> in there? assuming $word is really \w chars, then the preceding \W will
>> obviate the need for the \b. same for the trailing one.

>It is needed for the trailing one, since the \W is zero or more,
>and it is zero or more so that the last $word in $text can be
>matched.

>> also i would use \s+\S+ since written words could contain apostrophes
>> and some other punctuation.

>That would be an improvement, I'll use that in the future.

Neither \s+\S+ nor \w+\W+ will work separately; they have to be used together.
But since they overlap, it's impossible to use them together. This leaves
\w plus \s plus punctuation as the foundation.
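
As a rough illustration of "define the words first", here is a sketch that fixes one possible definition, records where each word starts and ends, and then cuts the substring by offsets. The word definition, the sample text, and the variable names are all assumptions made for this example; a different definition of a word will give different results.

```perl
use strict;
use warnings;

my $text = q{I confirm that sufficient detail has been reported, that it's }
         . q{"scientifically" sound, and that the author's conclusions don't overreach};
my $target = 'sound';

# Assumed definition: a word is a run of \w characters, optionally joined
# by internal apostrophes or hyphens.
my $word_re = qr/\w+(?:['-]\w+)*/;

# Record the start offset, end offset, and text of every word.
my @spans;
while ($text =~ /$word_re/g) {
    push @spans, { start => $-[0], end => $+[0],
                   word  => substr($text, $-[0], $+[0] - $-[0]) };
}

# Index of the first word that is exactly the target.
my ($i) = grep { $spans[$_]{word} eq $target } 0 .. $#spans;

my $snippet;
if (defined $i) {
    my $from = $i - 5 < 0       ? 0       : $i - 5;
    my $to   = $i + 5 > $#spans ? $#spans : $i + 5;
    $snippet = substr($text, $spans[$from]{start},
                      $spans[$to]{end} - $spans[$from]{start});
    print "$snippet\n";
}
```

Punctuation between the first and last word of the window is kept as it appears in the text; punctuation outside them is dropped.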

-sln



Mon, 03 Jun 2013 21:08:30 GMT  

Quote:
> I need perl help...

> say

> $text = qq (I confirm that sufficient information and detail have been
> reported in this technical report, that it is scientifically sound,
> and that appropriate conclusions have been included)

> i find the index for "sound".

> After that I just need the substring from {-5, +5} WORDS around that
> indexof(sound)

> i.e.

> my final answer should be = report, that it is scientifically sound, and
> that appropriate conclusions have

> Is there a strategy, or do I have to do it in basic steps?


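use List::Util qw(first);

my @words = split ' ', $text;   # assumed: whitespace-separated words

# stand-in for the indexof(sound) which you said you had
my $i = first { $words[$_] =~ /\bsound\b/ } 0 .. $#words;

for my $word ( 0 .. $#words ) {
    # assumed loop body: print the words within 5 of the target
    print "$words[$word] " if abs( $word - $i ) <= 5;
}
print "\n";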

   Justin.

--
Justin C, by the sea.



Tue, 04 Jun 2013 09:53:57 GMT  
 