Reg Exp: Need advice concerning "greediness" 

Hello all,

I want to change the font colors of headings of a certain level in HTML files.

I have a line containing a heading level 1, e.g.: <h1><font
COLOR="#FF0000">Heading Level 1</font></h1>.

Now I want to split this into 3 groups: Everything before "COLOR=xyz",
"COLOR=xyz" itself, and everything after "COLOR=xyz".

I tried:
sRslt = "<h1><font COLOR="#FF0000">Heading Level 1</font></h1>";
print re.findall(re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', re.I |
re.S), sRslt);

This returns [("<h1><font, , COLOR="#FF0000">Heading Level 1</font></h1>)].
I'd expected to receive [("<h1><font , COLOR="#FF0000", >Heading Level
1</font></h1>)].

It works if I replace (COLOR=.*?)* with (COLOR=.*?). But I need the '*'
because there may be headings without the color attribute but with a face
attribute.

As I have understood it so far, '*' means 'zero or more of the preceding, but as
many as possible'. If a color attribute is present, 'as many as possible'
means 'the one that is there', doesn't it? If there is no such attribute,
well - then it's 'zero'.

What did I miss?

Best regards
Franz



Wed, 19 Mar 2003 03:00:00 GMT  

Quote:
> I tried:
> sRslt = "<h1><font COLOR="#FF0000">Heading Level 1</font></h1>";
> print re.findall(re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', re.I |
> re.S), sRslt);

> This returns [("<h1><font, , COLOR="#FF0000">Heading Level 1</font></h1>)].
> I'd expected to receive [("<h1><font , COLOR="#FF0000", >Heading Level
> 1</font></h1>)].

For me, using python2.0, I get this answer

[('<h1><font', '', ' COLOR=')]

which is different from what you got, and what you expected.  Also, what
you got is not syntactically correct, I think.  Could you paste the
output directly from the interpreter?

In general, for this sort of thing, you are better off learning to use
the htmllib module, imo.  It'll take you about the same amount of time
to learn it this time as to get the regexp correct, and you'll have a
far more appropriate framework for the next such problem that comes
along.

Alex.

--
Speak softly but carry a big carrot.



Wed, 19 Mar 2003 03:00:00 GMT  
Thanks for your response!

Quote:
> For me, using python2.0, I get this answer

Sorry, I forgot to mention the platform: python 1.5.2 on NT4.0/SP 5.

Quote:
> the htmllib module, imo.  It'll take you about the same amount of time

Sounds promising. It's a Python standard module, isn't it? Yet I could not find
sample scripts showing how to use it. Any idea how to get started?

Thanks in advance
Franz GEIGER



Thu, 20 Mar 2003 03:00:00 GMT  

Quote:
> Sounds promising. It's a Python std module, isn't it? Yet I could not
> find sample scripts showing me how to use it? Any idea how to begin
> with?

Yeah, the docs for it are a bit hard to figure out.  Here's some sample
code from Alex Martelli.

http://www.python.org/pipermail/python-list/2000-August/114566.html

Alex.

--
Speak softly but carry a big carrot.



Thu, 20 Mar 2003 03:00:00 GMT  
: Hello all,

: I want to exchange font colors of headings of a certain level in HTML files.

: I have a line containing a heading level 1, e.g.: <h1><font
: COLOR="#FF0000">Heading Level 1</font></h1>.

: Now I want to split this into 3 groups: Everything before "COLOR=xyz",
: "COLOR=xyz" itself, and everything after "COLOR=xyz".

: I tried:
: sRslt = "<h1><font COLOR="#FF0000">Heading Level 1</font></h1>";
: print re.findall(re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', re.I |
: re.S), sRslt);

Beware of quotes in your example:

>>> sRslt = "<h1><font COLOR="#FF0000">Heading Level 1</font></h1>"
>>> sRslt
'<h1><font COLOR='

(That explains weird results reported here)
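
To make the quoting problem concrete, here is a minimal sketch. It is written for Python 3, which is an assumption on my part (the thread itself uses Python 1.5.2 and 2.0), and the names broken and fixed are only for illustration. Everything after the second double quote is read as a comment, so the assignment silently yields a truncated string; delimiting the literal with single quotes avoids that.

```python
# The second double quote ends the literal; "#FF0000..." becomes a comment.
broken = "<h1><font COLOR="#FF0000">Heading Level 1</font></h1>"

# Single quotes as delimiters let the double quotes stay inside the string.
fixed = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'

print(repr(broken))
print(repr(fixed))
```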

As for your regexp, the following works:

>>> print re.findall(re.compile(r'(.*?FONT[^">]+?)(COLOR=.*?)?([ |>].*)', re.I | re.S), sRslt);
[('<h1><font ', 'COLOR="#FF0000"', '>Heading Level 1</font></h1>')]

I used a negated character class to force the first group to end just before
a possible COLOR attribute.  Otherwise, what I think is happening is that your
non-greedy search is indeed non-greedy, but the null match of '(COLOR=.*?)*'
is included in it. BTW, I changed that '*' to '?', which is what you meant,
if I read correctly.
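
The difference between the two patterns can be checked side by side. The sketch below assumes Python 3 and a correctly quoted input string; the variable names are mine. With the original pattern the lazy '.*?' after FONT first tries to match nothing, the starred group then cannot start at the space, so it settles for zero repetitions, and '[ |>]' happily takes that space. The engine accepts the first overall match it finds, so "as many as possible" never gets a chance to pull in the COLOR attribute. Note also that '|' inside a character class is a literal bar, so '[ >]' was probably intended.

```python
import re

html = '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>'

# Pattern from the original question: the optional group matches zero times.
original = re.compile(r'(.*?FONT.*?)(COLOR=.*?)*([ |>].*)', re.I | re.S)

# Pattern from this reply: group 1 must consume at least one character
# that is neither '"' nor '>', so it ends right before COLOR.
revised = re.compile(r'(.*?FONT[^">]+?)(COLOR=.*?)?([ |>].*)', re.I | re.S)

original_result = original.findall(html)
revised_result = revised.findall(html)

print(original_result)
# [('<h1><font', '', ' COLOR="#FF0000">Heading Level 1</font></h1>')]
print(revised_result)
# [('<h1><font ', 'COLOR="#FF0000"', '>Heading Level 1</font></h1>')]
```

The revised pattern still relies on COLOR being the first attribute after the tag name, so treat it as a fix for this particular line rather than a general solution.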

HTH, DCA

-- Daniel Calvelo Aros



Fri, 21 Mar 2003 03:00:00 GMT  
Thanks, that clarified the use of htmllib (at least I think so). It is a great
module if one wants to extract certain information from an HTML file!

As I played around with it, I saw that it does not fully fit my needs: I do not want
to extract information, I want to change or insert information. If I got
everything right, I would have to define methods for EVERY tag occurring in
the HTML file, because otherwise the parser would omit information.

Or is there a way to let HTMLParser write the unextracted data into a buffer
for displaying it?

I want to change certain parts of HTML files e.g. font tags. All other parts
should remain unchanged. After having done my changes I'd like to save the
unchanged parts and the changed parts into a file for publishing on a
server.

Do I have to stick with regexp?

Or is it a good idea to write my own parser, taking idea and concept from
HTMLParser (really good idea to call hooks on the occurrence of certain
tags)?
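
For the "change some parts, keep the rest" requirement, one possible way to stay with regexps is re.sub with a function as the replacement, done in two steps: first locate the font start tag that directly follows an h1, then edit only that tag's text. The following is a rough sketch under stated assumptions: it targets Python 3, the function name set_h1_font_color and both pattern names are hypothetical, and it only handles a font tag that immediately follows the h1 start tag. It will not cope with every piece of real-world HTML.

```python
import re

# Step 1: an <h1 ...> start tag immediately followed by a <font ...> start tag.
FONT_IN_H1 = re.compile(r'(<h1\b[^>]*>\s*)(<font\b[^>]*>)', re.I)

# Step 2: the color attribute inside one tag (quoted or unquoted value).
COLOR_ATTR = re.compile(r'''(\bcolor\s*=\s*)("[^"]*"|'[^']*'|[^\s>]+)''', re.I)


def set_h1_font_color(html, new_color):
    """Return html with the font color of level-1 headings set to new_color."""

    def fix_tag(match):
        prefix, font_tag = match.group(1), match.group(2)
        # Replace an existing color value, keeping the attribute name as written.
        new_tag, count = COLOR_ATTR.subn(
            lambda m: m.group(1) + '"%s"' % new_color, font_tag, count=1)
        if count == 0:
            # No color attribute yet: add one just before the closing '>'.
            new_tag = font_tag[:-1] + ' color="%s">' % new_color
        return prefix + new_tag

    # Text outside the matched tags is copied through untouched by re.sub.
    return FONT_IN_H1.sub(fix_tag, html)


print(set_h1_font_color(
    '<h1><font COLOR="#FF0000">Heading Level 1</font></h1>', '#0000FF'))
print(set_h1_font_color(
    '<h1><font FACE="Arial">Heading Level 1</font></h1>', '#0000FF'))
```

Splitting the work this way sidesteps the greediness question entirely, because the optional attribute is searched for inside an already isolated tag instead of being an optional group in one large pattern.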

Many thanks
and best regards
Franz GEIGER



Sat, 22 Mar 2003 03:00:00 GMT  

Quote:
> As I played around I saw, that it does not fully fit my needs: I do not want
> to extract information, I do want to change or insert information. If I got
> everything right, I would have to define methods for EVERY tag occurring in
> the HTML file, because otherwise the parser would omit information.

You're right, it's not a great fit, after all.  My apologies.

The sgmllib gets closer, but you would still have to define 5 or so
methods, which is a lot of overhead, and you would lose any formatting
of the html in the output.

Alex.

--
Speak softly but carry a big carrot.



Sat, 22 Mar 2003 03:00:00 GMT  
Here is some example code using sgmllib. Note that you should make
sure you are submitting valid html. This will also change the case of
your element and attribute names to lower case.

Make changes and additions where the comment tells you to.
Add member variables to the class as required to keep track of where
you are.

Bob

#######################

from sgmllib import SGMLParser
import string

class MySGMLParser(SGMLParser):
    def __init__(self, verbose=0, outfile=None):
        # outfile must be a file-like object; all handlers below write to it
        if not hasattr(outfile, 'write'):
            raise TypeError("outfile must have attribute write")
        self.outfile = outfile
        SGMLParser.__init__(self, verbose)

    def handle_data(self, data):
        self.outfile.write(data)

    def handle_comment(self, data):
        self.outfile.write('<!--%s-->' % data)

    def unknown_starttag(self, tag, attrs):
        if not attrs:
            self.outfile.write('<' + tag + '>')
        else:
            self.outfile.write('<' + tag)
            for attr in attrs:
                self.outfile.write(' %s="%s"' % attr)
            self.outfile.write('>')

    def unknown_endtag(self, tag):
        self.outfile.write('</%s>' % tag)

    def unknown_entityref(self, ref):
        self.outfile.write('&%s;' % ref)
    # so known refs do not get translated
    handle_entityref = unknown_entityref

    def unknown_charref(self, ref):
        self.outfile.write('&#%s;' % ref)
    # so known refs do not get translated
    handle_charref = unknown_charref

    def close(self):
        SGMLParser.close(self)

    ## put tag handlers here,
    ## for my sample code I took the  www.python.org homepage and
    ## changed the bgcolor of the wrapper tables
    ## define start and end tag handlers as start_TAGNAME, end_TAGNAME

    def start_td(self, attrs):
        if not attrs:
            self.outfile.write('<td>')
        else:
            self.outfile.write('<td')
            for name, val in attrs:
                if string.lower(name) == 'bgcolor':
                    self.outfile.write(' %s="%s"' % (name, '#ffcc99'))
                else:
                    self.outfile.write(' %s="%s"' % (name, val))
            self.outfile.write('>')

    def end_td(self):
        self.outfile.write('</td>')

if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:
        print "usage: python changeattr.py infile, outfile"
        raise SystemExit
    infile = sys.argv[1]
    outfile = sys.argv[2]
    ofp = open(outfile, 'w')
    # this is a one shot parser
    p = MySGMLParser(outfile=ofp)
    p.feed(open(infile).read())
    p.close()
    ofp.close()
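
For readers on Python 3, where sgmllib and htmllib are no longer part of the standard library, roughly the same pass-through idea can be sketched with html.parser.HTMLParser. This is an illustration under assumptions, not a port of the code above: the class name FontColorRewriter and its parameters are hypothetical, and it rewrites the color of font tags inside h1 instead of td bgcolor. It uses get_starttag_text() to copy untouched start tags exactly as written; end tags are re-emitted in lower case, and rebuilt font tags get lower-case attribute names and double-quoted values.

```python
import io
from html.parser import HTMLParser


class FontColorRewriter(HTMLParser):
    """Copy HTML to `out`, changing only the color of <font> tags inside <h1>."""

    def __init__(self, out, new_color):
        # convert_charrefs=False keeps &amp; and &#38; as written in the input.
        super().__init__(convert_charrefs=False)
        self.out = out
        self.new_color = new_color
        self.h1_depth = 0  # > 0 while we are inside an <h1> element

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.h1_depth += 1
        if tag == "font" and self.h1_depth > 0:
            # Rebuild only this tag; all other start tags are copied verbatim.
            parts = ["<font"]
            for name, value in attrs:
                if name == "color":
                    value = self.new_color
                if value is None:
                    parts.append(" %s" % name)
                else:
                    parts.append(' %s="%s"' % (name, value))
            parts.append(">")
            self.out.write("".join(parts))
        else:
            self.out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim.
        self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h1" and self.h1_depth > 0:
            self.h1_depth -= 1
        self.out.write("</%s>" % tag)

    def handle_data(self, data):
        self.out.write(data)

    def handle_comment(self, data):
        self.out.write("<!--%s-->" % data)

    def handle_entityref(self, name):
        self.out.write("&%s;" % name)

    def handle_charref(self, name):
        self.out.write("&#%s;" % name)

    def handle_decl(self, decl):
        self.out.write("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.write("<?%s>" % data)


buf = io.StringIO()
parser = FontColorRewriter(buf, "#0000FF")
parser.feed('<h1><font COLOR="#FF0000">Heading Level 1</font></h1><p>a &amp; b</p>')
parser.close()
print(buf.getvalue())
```

One known limitation of this sketch: attribute values arrive with character references already decoded, so a rebuilt font tag whose value contains a double quote would need escaping before being written back.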



Sun, 23 Mar 2003 11:08:49 GMT  

Quote:
> You're right, it's not a great fit, after all.  My apologies.

After all, I gained experience with a great module that shows good design ideas
for parsing. So there's no reason to apologize at all. You showed me how to
do it. Thanks again.

Thanks to all who contributed!

Best regards
Franz GEIGER



Sun, 23 Mar 2003 14:35:49 GMT  

Quote:
> I used a negated character class to force an end for the first group before
> a possible COLOR tag.  Otherwise, what I think is happening is that your

That did the trick.

Quote:
> is included into it. BTW, I changed that '*' to '?', which is what you meant,
> if I read correctly.

Yes.

As fascinating as regular expressions are, they are not always easy to understand and use,
especially for newbies.

Thanks a lot and
best regards

Franz




Sun, 23 Mar 2003 03:00:00 GMT  
That was definitely what I was looking for! Alex mentioned the SGML parser
already but supposed that considerable work would remain. But it was
rather painless to fit my stuff into this framework.

Thank you all again, great community!

Best regards
Franz GEIGER



Sun, 23 Mar 2003 03:00:00 GMT  
 