Reg Exp. Problem 

I'm fairly new to regular expressions and PHP, but am enthusiastic about
both!  I've had some success thus far, but I've come across a problem
that's stumping me. I suppose the best way to
explain my predicament is by example, so here goes:

I have a paragraph I'd like to parse. It's saved in a variable as one
string. Ideally it looks like this:

$variable =

"Apples some text following\n
Oranges some text following\n
Bananas some text following\n"

However, at times it may look like this:

$variable =

"Apples some text\n
following\n
Oranges some text following\n
Bananas some \n
text following\n"

I'd like to match all of the text from either Apples, Oranges, or
Bananas up to BUT NOT INCLUDING the next occurrence of Apples, Oranges,
or Bananas, or the end of the paragraph. So, I'd like to wind up with

array ([0]=> Apples blah blah [1]=> Oranges blah blah [2]=> Bananas blah
  blah).

The problem is the "up to but not including" part. I've tried the following:

preg_match("/(apples|oranges|bananas)(.+?)+[^apples|oranges|bananas]/si",
                    $variable, $new_variable);

Of course, this doesn't work, or else I wouldn't be posting. I can
successfully match from apples, oranges, or bananas to the end of each
line by using

preg_match("/(apples|oranges|bananas)(.+?)+/mi",
$variable, $new_variable);

so I'm fairly sure it's the way I'm trying to set up the "but not
including" portion of my expression that is flawed. Of course, I miss
lines if I do it just to the end of the line, which isn't good. I've
forced a work-around by splitting my variable at each newline, then
looping through the resulting array to concatenate all strings that
don't begin with apples, oranges, or bananas to lines that do. It works
fine, but seems very clunky and inelegant to me. I'd like to educate
myself further and make the regular expressions work properly. Might
anyone have any insight as to how I can go about doing this?

Thanks to all,
Nick



Sat, 30 Apr 2005 10:16:36 GMT  

Quote:
> $variable =

> "Apples some text\n
> following\n
> Oranges some text following\n
> Bananas some \n
> text following\n"

> I'd like to match all of the text from either Apples, Oranges, or
> Bananas up to BUT NOT INCLUDING the next occurrence of Apples, Oranges,
> or Bananas, or the end of the paragraph. So, I'd like to wind up with

> array ([0]=> Apples blah blah [1]=> Oranges blah blah [2]=> Bananas blah
>   blah).
> The problem is the "up to but not including" part. I've tried the following:

> preg_match("/(apples|oranges|bananas)(.+?)+[^apples|oranges|bananas]/si",
>                     $variable, $new_variable);

The reason it doesn't work is that [^ ... ] is a negated character class.
It matches a single character that is not one of the characters listed inside
the square brackets (here the | is just another listed character).  You are
trying to use it to exclude substrings, which it cannot do.
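
To make the difference concrete, here is a small sketch (the test strings are made up for illustration) showing that the bracketed expression only ever tests one character against a set of letters plus the | symbol:

```php
<?php
// Inside [...] the | is a literal character and the words are just letters,
// so this class means "one character that is none of: a b e g l n o p r s |".
$class = '/^[^apples|oranges|bananas]$/';

$matches_x    = preg_match($class, 'x');   // 'x' is not in the set
$matches_a    = preg_match($class, 'a');   // 'a' is in the set
$matches_pipe = preg_match($class, '|');   // '|' is in the set too

var_dump($matches_x, $matches_a, $matches_pipe);
```

Only the first test should succeed, which is why the class cannot stand in for "not followed by one of these words".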

Quote:
> Of course, this doesn't work, or else I wouldn't be posting. I can
> successfully match from apples, oranges, or bananas to the end of each
> line by using

> preg_match("/(apples|oranges|bananas)(.+?)+/mi",
> $variable, $new_variable);

This would be better as
preg_match_all("/(apples|oranges|bananas)(.*?)$/mi" ... );

Of course, it still doesn't do what you want.

Quote:
> so I'm fairly sure it's the way I'm trying to set up the "but not
> including" portion of my expression that is flawed.

That is one of the reasons.  Excluding substrings from regexp is something I
haven't figured out just yet ...

However, the (.*?) or (.+?) is too non-greedy.  It says 'grab as little as
you can'.

Consider : /Hello(.*?)/

on the string "Hello lovely world".

the (.*?) will match as little as possible.  Nothing. (.+?) will match
exactly one character.

In order for (.*?) [or (.+?)] to work as intended, it must be followed by
something:

/Hello(.*?)o/

on the string "Hello lovely world".

The (.*?) will capture " l" (whole match "Hello lo") while (.*) would capture
" lovely w" (whole match "Hello lovely wo").
(as little as possible vs. as much as possible)
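
A minimal sketch of the lazy versus greedy comparison, using the same string (the variable names are only for illustration):

```php
<?php
$text = "Hello lovely world";

// Lazy: the group stops before the first "o" that lets the pattern succeed.
preg_match('/Hello(.*?)o/', $text, $lazy);

// Greedy: the group runs up to the last "o" that lets the pattern succeed.
preg_match('/Hello(.*)o/', $text, $greedy);

var_dump($lazy[1], $greedy[1]);
```

Note that the trailing "o" is part of the overall match ($lazy[0], $greedy[0]) but not of the captured group.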

Quote:
> Of course, I miss
> lines if I do it just to the end of the line, which isn't good. I've
> forced a work-around by splitting my variable at each newline, then
> looping through the resulting array to concatenate all strings that
> don't begin with apples, oranges, or bananas to lines that do. It works
> fine, but seems very clunky and inelegant to me. I'd like to educate
> myself further and make the regular expressions work properly. Might
> anyone have any insight as to how I can go about doing this?

I played with this for quite a while, and the only solutions I came up with
were hackish at best.  I'd like to see anything you came up with.

Another suggestion would be to take this question to a Perl group
(comp.lang.perl.misc).  You'll find regexp heavyweights there.  Of course,
you may get a Perl answer involving splits and loops and such, but if you
are familiar with Perl, you should be able to translate back to PHP.

regards,
reggie.



Sat, 30 Apr 2005 15:06:41 GMT  

I had a feeling the square brackets weren't the way to go, but saw no
other way of excluding a substring. I suppose the way I did it is
probably the easiest. I simply split the entire paragraph at each \n,
then tested each member of the resulting array to see if it began with my
keyword (e.g. Apples, Oranges, or Bananas). If not, I concatenated its
contents to the string prior to it in the array. So, it flows something
like this:

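$lines = preg_split("/\n/", $my_paragraph);
$last_item = count($lines) - 1;

// Walk backwards so each continuation line is folded into the line before it.
for ($i = $last_item; $i >= 0; $i--){
        $lines[$i] = trim($lines[$i]);
        // $i > 0 guards against writing to $lines[-1] on the first line
        if ($i > 0 && !preg_match("/(apples|oranges|bananas)/i", $lines[$i])){
                $lines[$i-1] = $lines[$i-1]." ".$lines[$i];
                unset($lines[$i]);   // drop the line that was just merged
                }
      }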

I had to start with the last item in the array, or else I'd run into
indexing problems, again resulting in missed lines. Doing it this way,
it seems to run peachy. I'm sure there's an easier way of doing this as
well. One thing I forgot to mention in my last post is that the keywords
(apples, oranges, and bananas) only occur at the beginning of each line,
hence I can get away with not anchoring my reg. exp. to match occurrences
of those words only at the beginning of the string. If this weren't the
case and I were forced to match them at the beginning of the string,
could I get away with

preg_match("/^(apples|oranges|bananas)/i", $lines[$i])

or would I have to use

preg_match("/(^apples|^oranges|^bananas)/i", $lines[$i])
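
A quick check, as a sketch with made-up sample lines, suggests the two patterns behave the same, since a ^ placed before the group applies to every alternative inside it:

```php
<?php
$single  = '/^(apples|oranges|bananas)/i';      // one anchor before the group
$perWord = '/(^apples|^oranges|^bananas)/i';    // one anchor per alternative

$samples = array("Apples some text", "some Apples text", "bananas", "text following");

$agree = true;
foreach ($samples as $line) {
    // Compare the two results line by line
    if (preg_match($single, $line) !== preg_match($perWord, $line)) {
        $agree = false;
    }
}
var_dump($agree);
```

The shorter first form is usually the easier one to read and maintain.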

Thanks for your time and patience with my naive questions.

Nick



Sun, 01 May 2005 03:38:52 GMT  
Hi Nick,

You said:
[...huge friggin snip...]

Quote:
>I had a feeling the square brackets weren't the way to go, but saw no
>other way of excluding a substring.

Try this:

| $variable1 = "Apples some text following\nOranges some text following\nBananas some text following";
| $variable2 = "Apples some text\nfollowing\nOranges some text following\nBananas some \ntext following";
|
| preg_match_all("/(apples|oranges|bananas).*?(?=apples|oranges|bananas)/is", $variable1 . "\napples", $array1);

Note the kludge here: "\napples" is appended to $variable1, because you have
to append one of the three magic words for the last element to be matched.

However, this seems to work from what I tried.
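
One possible way to avoid appending a keyword, offered as a sketch rather than a tested drop-in, is to let the lookahead also accept the end of the string with \z. The $found and $blocks names are only for illustration:

```php
<?php
$variable2 = "Apples some text\nfollowing\nOranges some text following\nBananas some \ntext following";

// \z in the lookahead lets the last block end at the end of the string,
// so no extra keyword has to be appended to the input.
preg_match_all(
    '/(?:apples|oranges|bananas).*?(?=apples|oranges|bananas|\z)/is',
    $variable2,
    $found
);

// Each block keeps its trailing newline, so trim the ends.
$blocks = array_map('trim', $found[0]);
print_r($blocks);
```

This still assumes the keywords never appear in the middle of a block, exactly as the original pattern does.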

--
Joseph Birr-Pixton



Sun, 01 May 2005 04:48:25 GMT  

Quote:
> $lines = preg_split("/\n/m", $my_paragraph);
> $last_item = count($lines) - 1;

> for ($i = $last_item; $i >= 0; $i--){
> $lines[$i] = trim($lines[$i]);
> if (!preg_match("/(apples|oranges|bananas)/i", $lines[$i])){
> $lines[$i-1] = $lines[$i-1]." ".$lines[$i];
> }
>       }

Got an answer from the Perl guys.

$arr = preg_split("/\n(?=apple|banana|orange)/i", $my_paragraph);

Much nicer.

This says 'split on newlines followed by apple, banana, or orange'.

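They also fixed the loop to:
$lines = preg_split("/\n/", $my_paragraph);
$result = array();
foreach ($lines as $value) {
    // A line that starts with a keyword begins a new entry
    if (preg_match("/^(apple|orange|banana)/i", $value)) {
        array_push($result, $value);
    } else {
        // Otherwise it continues the previous entry
        // (assumes the very first line starts with a keyword)
        $result[count($result)-1] .= "\n$value";
    }
}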

Both a little nicer than your original, but the split route is much better.

Also, this regexp will do the whole thing (using preg_match_all; note the x
modifier, which is needed so the whitespace and comments are ignored):

 /
    ^                                   # start of a 'line'
    (?: apple | orange | banana )       # one of these words
    .*                                  # the rest of the line
    (?:                                 # this chunk...
      \n                                # a newline
      (?! apple | orange | banana ) .*  # text that doesn't start with
                                        # any of those words
    )*                                  # ...zero or more times
  /mix;
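
As a sketch of how the commented pattern might be run from PHP (the sample text and the $pattern and $found names are assumptions for illustration), it can be placed in a single-quoted string with the m, i and x modifiers:

```php
<?php
$text = "Apples some text\nfollowing\nOranges some text following\nBananas some\ntext following";

// Single quotes keep \n as a regex escape; the comments avoid quote characters.
$pattern = '/
    ^                                   # start of a line
    (?: apple | orange | banana )       # one of these words
    .*                                  # the rest of the line
    (?:                                 # this chunk...
      \n                                # a newline
      (?! apple | orange | banana ) .*  # a line not starting with a keyword
    )*                                  # ...zero or more times
  /mix';

preg_match_all($pattern, $text, $found);
print_r($found[0]);
```

Each element of $found[0] should hold one keyword line together with any continuation lines that follow it.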

You may not want to thank me,
instead thank Jeff "japhy" Pinyan from c.l.p.misc

regards,
reggie.



Sun, 01 May 2005 07:35:23 GMT  

Nonetheless, thank you very much for your effort! Much nicer approach.
I'm very grateful for your input!

Nick



Sun, 01 May 2005 09:34:48 GMT  

Reggie,

I tried out the revised split approach outlined in your last. I thought
this might be a problem, and sure enough, it is -- such a split will
discard the keywords which trip the match. Unfortunately, I have to know
what they are in order to make sense of each line. Also, I don't think
the preg_match_all will work either -- I'm parsing html text retrieved
from a remote site, so I have no idea ahead of time if the blocks of
text I retrieve will be in an ideal format (i.e. each line starting with
a keyword and ending with a newline), or not (i.e. text from some lines
wrapping onto lines of their own, and therefore not beginning with a
keyword). So I think I might be safest splitting on newlines alone, and
stepping through them as mentioned to piece them together into my ideal
format.

If only there were an easy way to exclude substrings in a match, an "up
to but not including" exclusion.

Thanks again for your time and effort. I'll talk on perl as well to see
what else I might try. Like I said, it works as is, but I think it could
be done in a more elegant manner. At the very least, though, it's been
fun to ponder!

Thanks again,
Nick



Sun, 01 May 2005 10:29:02 GMT  


Quote:
> Reggie,

> I tried out the revised split approach outlined in your last. I thought
> this might be a problem, and sure enough, it is -- such a split will
> discard the keywords which trip the match.

Nope - the (?= makes sure that doesn't happen.  Cut and paste the above
example to see it in action.

$text= "Apples some text
following
Oranges some text following
Bananas some
text following";

$arr = preg_split("/\n(?=apple|banana|orange)/i", $text);

print_r($arr);

Quote:
> I'm parsing html text retrieved
> from a remote site, so I have no idea ahead of time if the blocks of
> text I retrieve will be in an ideal format (i.e. each line starting with
> a keyword and ending with a newline), or not (i.e. text from some lines
> wrapping onto lines of their own, and therefore not beginning with a
> keyword).

That is a problem.  In order for a regexp to work properly, you must define
the dataset explicitly.

The regexp works on the premise that (orange|banana|apple) will be preceded
immediately by a newline.

Quote:
> So I think I might be safest splitting on newlines alone, and
> stepping through them as mentioned to piece them together into my ideal
> format.

Your split-on-newlines algorithm and:
$arr = preg_split("/\n(?=apple|banana|orange)/i", $text);
are functionally equivalent.

If you are worried about whitespace, you could add \s*? before the (?=
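
A small sketch of that whitespace variant, with indentation added to the sample text purely for illustration:

```php
<?php
$text = "Apples some text\nfollowing\n   Oranges some text following\n\tBananas some\ntext following";

// \s*? swallows any indentation after the newline, so the lookahead
// is tested right at the keyword and the indentation is dropped.
$arr = preg_split('/\n\s*?(?=apple|banana|orange)/i', $text);
print_r($arr);
```

Because the whitespace is part of the separator, it is removed from the pieces, while the keywords themselves are kept by the lookahead.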

Quote:
> If only there were an easy way to exclude substrings in a match, an "up
> to but not including" exclusion.

(?! is a negative lookahead, and it works to exclude a substring in Perl -
although I haven't tested this in php

i.e.

(?!apple|orange|banana) succeeds only at a position where the text that
follows does not start with apple, orange or banana

This is evident in:
preg_match_all(
    "/^(?:apples|oranges|bananas).*(?:\n(?!apple|orange|banana).*)*/mi",
    $text, $arr);

print_r($arr);
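
For the general "up to but not including" case, one common idiom is to test the negative lookahead before every single character. This is a sketch under the assumption that the keywords may appear anywhere, not only after a newline; the $keywords, $found and $blocks names are illustrative:

```php
<?php
$text = "Apples some text\nfollowing\nOranges some text following\nBananas some \ntext following";

$keywords = 'apples|oranges|bananas';

// A keyword, then any run of characters, each one checked
// so that it does not begin another keyword.
$pattern = "/(?:$keywords)(?:(?!$keywords).)*/is";

preg_match_all($pattern, $text, $found);
$blocks = array_map('trim', $found[0]);
print_r($blocks);
```

The s modifier lets the dot cross newlines, so wrapped lines stay with the keyword that precedes them.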

Quote:
> Thanks again for your time and effort. I'll talk on perl as well to see
> what else I might try. Like I said, it works as is, but I think it could
> be done in a more elegant manner. At the very least, though, it's been
> fun to ponder!

Yes - regular expressions are my current pet project, so this has been
educational for me too.

regards,
reggie.



Sun, 01 May 2005 11:32:19 GMT  

Quote:
> If you are worried about whitespace, you could add \s*? before the (?=

Well, what do you know. It works as advertised! Turns out the split was
failing because of an HTML tag I wasn't seeing in front of one of my
keywords (I'm new to PHP, and assumed a var_dump would show HTML tags as
text. Dumb mistake. I picked it up by viewing the page source on a hunch).
Once I refined my initial matching, it works great, and your advice
helped me eliminate 15-20 lines of code! Very nice. If I've learned
anything, it's to assume nothing, and to beef up my knowledge of regular
expressions.

Thanks again!

Nick



Tue, 03 May 2005 00:18:25 GMT  
 