Pattern matching using "re" 

Hello!

I have a problem using the re module for a simple task.

I scan C++ files looking for all "words" ( [_a-zA-Z][_a-zA-Z0-9]* )
and I need to access each of the words found, but everything I have tried
so far has failed.
Maybe I'm not using the correct module/functions?

def ScanFiles(ScanList):
        for FileName in  ScanList:
                File = open(FileName)
                if not File: continue
                while 1:
                        Line = File.readline()
                        if not Line: break

                        Match = re.search("[_a-zA-Z][_a-zA-Z0-9]*", Line)
                        if Match == None: continue
                        # HERE: what do I do with match object to access
                        # each match and print it?
                        ???

                File.close()

TIA,
______________________________________________________
   Gaetan Corneau
   Software Developer (Toolsmiths Team)
   BaaN  Supply Chain Solutions  


   ICQ Number: 7395494            
   Tel: (418) 654-1454 ext. 252          
______________________________________________________
"Profanity is the one language all programmers know best"



Fri, 23 Mar 2001 03:00:00 GMT  

Quote:

> Hello!

> I have a problem using the re module for a simple task.

> I scan C++ files looking for all "words" (  [_a-zA-Z][_a-zA-Z0-9]*  )
> and I need to access each of the word found, but everything I have tried
> doesn't work.
> Maybe I'm not using the correct module/functions?

> def ScanFiles(ScanList):
>    for FileName in  ScanList:
>            File = open(FileName)
>            if not File: continue
>            while 1:
>                    Line = File.readline()
>                    if not Line: break

>                    Match = re.search("[_a-zA-Z][_a-zA-Z0-9]*", Line)
>                    if Match == None: continue
>                    # HERE: what do I do with match object to access
> each match and print it?
>                    ???

You could try re.split("[^_a-zA-Z][^_a-zA-Z0-9]*",Line) which will
split the Line into a list of "words", but be aware:

1) It tends to return empty strings at the start and end of the
list. No real problem here.  

2) It will get _all_ the words, including those inside comments and
strings. If this is a problem comments and strings can probably be
stripped out with a couple of regular expressions. It does add to the
complexity though.

(If you're being really pedantic, or handling comments, you'll need to
look at backslash continuation too)
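
As a rough sketch of the "couple of regular expressions" idea, assuming Python 3 and fairly simple C++ input: the pattern and the name strip_comments_and_strings are illustrative only, not from the original post. It does not handle character literals such as '"', raw string literals, or a // comment continued onto the next line with a backslash.

```python
import re

# One alternation, tried left to right at each position, so a "//" inside a
# string literal is consumed as part of the string rather than as a comment.
STRIP = re.compile(
    r'''
      //[^\n]*                # line comment, up to (not including) the newline
    | /\*.*?\*/               # block comment, shortest match, may span lines
    | "(?:\\.|[^"\\\n])*"     # double-quoted string literal with escapes
    ''',
    re.DOTALL | re.VERBOSE,
)

WORD = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")


def strip_comments_and_strings(source):
    """Replace each comment or string literal with a single space."""
    return STRIP.sub(" ", source)


sample = 'int a; // uses MSG_OLD\nf("MSG_STR // x"); /* MSG_BLOCK\n more */ g(MSG_LIVE);'
print(WORD.findall(strip_comments_and_strings(sample)))
```

Replacing with a space rather than an empty string keeps the words on either side of a comment from running together.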

Hope this helps

--
Michael Hudson
Jesus College
Cambridge



Fri, 23 Mar 2001 03:00:00 GMT  

GC> Hello!
GC> I have a problem using the re module for a simple task.

GC> I scan C++ files looking for all "words" (  [_a-zA-Z][_a-zA-Z0-9]*  )
GC> and I need to access each of the word found, but everything I have tried
GC> doesn't work.
GC> Maybe I'm not using the correct module/functions?

GC> def ScanFiles(ScanList):
GC>  for FileName in  ScanList:
GC>          File = open(FileName)
GC>          if not File: continue
GC>          while 1:
GC>                  Line = File.readline()
GC>                  if not Line: break

GC>                  Match = re.search("[_a-zA-Z][_a-zA-Z0-9]*", Line)
GC>                  if Match == None: continue
GC>                  # HERE: what do I do with match object to access
GC> each match and print it?
GC>                  ???

The library documentation (4.2.4 Match Objects) tells you the methods that
a match object has.

In this case Match.span() would give you the begin/end indices of the
matched word and Match.group() would give you the string.

Note that the Line could contain more matches so you will have to continue
after the match.
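
A minimal sketch of "continue after the match", assuming a current Python 3 where a compiled pattern's search() accepts a start position. The helper name words_in_line is made up for illustration.

```python
import re

WORD = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")


def words_in_line(line):
    """Collect every word in line by searching again after each match."""
    words = []
    pos = 0
    while True:
        match = WORD.search(line, pos)
        if match is None:
            break
        words.append(match.group())  # the matched string
        pos = match.end()            # resume just past this match
    return words


print(words_in_line("int foo2 = bar_3;"))
```

The pattern always matches at least one character, so match.end() moves forward on every pass and the loop terminates.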

--

URL: http://www.cs.uu.nl/~piet [PGP]



Fri, 23 Mar 2001 03:00:00 GMT  
Michael,

> You could try re.split("[^_a-zA-Z][^_a-zA-Z0-9]*",Line) which will
> split the Line into a list of "words", but be aware:

[GC]
I discovered that "by accident" just after I sent my message. But I
still don't understand why I must use the caret "^": it seems the split
function always gives me the exact opposite of what I want. And I find
the re and regex documentation quite obscure compared with the other
modules.

> 2) It will get _all_ the words, including those inside comments and
> strings.

[GC]
It is not a problem (it's even a feature). The complete program parses
a header file containing all the constants that serve as message IDs,
which we use to retrieve messages for a particular language. Then it
scans all the files in the project and checks whether each word found
is a message ID: if so, it is removed from the list. At the end, I'm
left with the unused message IDs, so I can clean the database. A call
to the message database may have been commented out temporarily, so it
is a good thing that I don't treat comments specially.
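
The bookkeeping described above could be sketched roughly like this in a current Python 3. The function name, its parameters, and the sample IDs are all hypothetical, and reading the header file and the project files is left out.

```python
import re

WORD = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")


def unused_message_ids(message_ids, sources):
    """Return the IDs that never appear as a word in any source text.

    message_ids: iterable of ID names taken from the header file.
    sources: iterable of strings, one per scanned project file.
    """
    unused = set(message_ids)
    for text in sources:
        # Every word found in the file is struck off the candidate set.
        unused.difference_update(WORD.findall(text))
    return unused


ids = ["MSG_A", "MSG_B", "MSG_C"]
files = ["show(MSG_A);", "// show(MSG_C);"]
print(unused_message_ids(ids, files))
```

Because comments are scanned like any other text, the commented-out MSG_C still counts as used.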

> Hope this helps

[GC]
Yes, thanks!



Fri, 23 Mar 2001 03:00:00 GMT  
Piet,

> The library documentation (4.2.4 Match Objects) tells you the
> methods that a match object has.

[GC]
I know, but I don't understand much of it. It is not very clear (IMHO), but
I am a newbie. Worst, I'm french, so I sometimes have problems understanding
the documentation. In this case, I just realised that "Split string by
the occurrences of pattern" means that my regular expression represents the
separators.

Sorry :)



Fri, 23 Mar 2001 03:00:00 GMT  

Gaetan Corneau a ecrit:

Quote:
> I am a newbie. Worst, I'm french,

How rude of me to enjoy a good laugh at Gaetan's choice of (English)
words here, but this made me laugh.

Oh, well.  Even worse, I am American,
sue  :-)  :-)



Fri, 23 Mar 2001 03:00:00 GMT  
Sue,

Quote:
> > I am a newbie. Worst, I'm french,

> How rude of me to enjoy a good laugh at Gaetan's choice of (English)
> words here, but this made me laugh.

[GC]
I'm happy I made you laugh :)
What I was trying to say (I'm sure you understood anyway) was "I'm a
newbie AND I have trouble reading the English documentation" or "ma langue
maternelle n'est pas l'anglais" (my mother tongue is not English). My
apologies to all French-speaking readers who read "I am a newbie, and I am
stoopid"  ;)

Quote:
> Oh, well.  Even worse, I am American,

[GC]  :)
But you wrote "Gaetan Corneau a ecrit": how nice!  :)


Fri, 23 Mar 2001 03:00:00 GMT  
Here is a little test app I wrote ...

#!/usr/local/bin/python
# -*- Mode: Python; tab-width: 4; indent-tabs-mode: nil; py-indent-offset: 4 -*-

""" Prints out the 'words' of a document
: Limitations: it wont find last word of the document that is less than
three letters. """

usage =  "Usage : %s filename"

import sys
import os

def test_dowordscan(buf):
    import re
    import string

    wordlist = []
    reword = re.compile("[\W][a-zA-Z]")
    rebound = re.compile("\W")

    newpos = 0

    # see if there is a start of the word at the beginning of the buf
    caughtfirst = 0

    if (buf[0] in string.letters):
        caughtfirst = 1
        firststart = 0
        firstend = 1
    # end if

    result = None
    endresult = None

    while 1:

        if not caughtfirst:
            result = reword.search(buf[newpos:])
            if result is None:
                break
            # end if

            firststart, firstend = result.span(0)
        # end if

        caughtfirst = 0
        newpos = newpos + firstend

        endresult = rebound.search(buf[newpos:])
        if endresult is None:
            break
        # end if

        start, end = endresult.span(0)
        word = buf[newpos - 1: newpos - 1 + end]

        lowerword = string.lower(word)

        wordlist.append(lowerword)

        newpos = newpos + end - 1

        # somebody help me with python memory (I aint gotta clue)
        if result <> None:
            del result
        # end if

        if endresult <> None:
            del endresult
        # end if

    # end while
    return wordlist
# end def test_dowordscan

if __name__ == '__main__':
    args = sys.argv[1:]
    filename = None

    if (len(args) > 0):
        filename = args[0]
    # end if

    if (filename == None):
        print usage % sys.argv[0]
        print __doc__
        sys.exit(0)
    # end if

    try:
        import dirutil
        if (dirutil.isdirectory(filename)):
            filenames = dirutil.get_filenames(filename)
        else:
            filenames = [filename]
        # end if
    except:
        filenames = [filename]
    # end try

    for filename in filenames:
        buf = open(filename).read()
        wordlist = test_dowordscan(buf)
        i = 1
        for word in wordlist:
            print i, word
            i = i + 1
        # end for
        del buf
    # end for
# end if

Quote:
-----Original Message-----

Newsgroups: comp.lang.python

Date: Monday, October 05, 1998 9:39 AM
Subject: Re: Pattern matching using "re"



python test_words.py test_words.py
1 usr
2 local
3 bin
4 python
5 mode
6 python
7 tab
8 width
9 indent
10 tabs
11 mode
12 nil
13 py
14 indent
15 offset
16 prints
17 out
18 the
19 words
20 of
21 a
22 document
23 usage
24 usage
25 s
26 filename
27 import
28 sys
29 import
30 os
31 def
32 test_dowordscan
33 buf
34 import
35 re
36 import
37 string
38 wordlist
39 buf
40 buf
41 reword
42 re
43 compile
44 w
45 a
46 za
47 z
48 rebound
49 re
50 compile
51 w
52 newpos
53 see
54 if
55 there
56 is
57 a
58 start
59 of
60 the
61 word
62 at
63 the
64 beginning
65 of
66 the
67 buf
68 caughtfirst
69 if
70 buf
71 in
72 string
73 letters
74 caughtfirst
75 firststart
76 firstend
77 end
78 if
79 result
80 none
81 endresult
82 none
83 while
84 if
85 not
86 caughtfirst
87 result
88 reword
89 search
90 buf
91 newpos
92 if
93 result
94 is
95 none
96 break
97 end
98 if
99 firststart
100 firstend
101 result
102 span
103 end
104 if
105 caughtfirst
106 newpos
107 newpos
108 firstend
109 endresult
110 rebound
111 search
112 buf
113 newpos
114 if
115 endresult
116 is
117 none
118 break
119 end
120 if
121 start
122 end
123 endresult
124 span
125 word
126 buf
127 newpos
128 newpos
129 end
130 lowerword
131 string
132 lower
133 word
134 wordlist
135 append
136 lowerword
137 newpos
138 newpos
139 end
140 somebody
141 help
142 me
143 with
144 python
145 memory
146 i
147 aint
148 gotta
149 clue
150 if
151 result
152 none
153 del
154 result
155 end
156 if
157 if
158 endresult
159 none
160 del
161 endresult
162 end
163 if
164 end
165 while
166 return
167 wordlist
168 end
169 def
170 test_dowordscan
171 if
172 args
173 sys
174 argv
175 filename
176 none
177 if
178 len
179 args
180 filename
181 args
182 end
183 if
184 if
185 filename
186 none
187 print
188 usage
189 sys
190 argv
191 print
192 sys
193 exit
194 end
195 if
196 try
197 import
198 dirutil
199 if
200 dirutil
201 isdirectory
202 filename
203 filenames
204 dirutil
205 get_filenames
206 filename
207 else
208 filenames
209 filename
210 end
211 if
212 except
213 filenames
214 filename
215 end
216 try
217 for
218 filename
219 in
220 filenames
221 buf
222 open
223 filename
224 read
225 wordlist
226 test_dowordscan
227 buf
228 i
229 for
230 word
231 in
232 wordlist
233 print
234 i
235 word
236 i
237 i
238 end
239 for
240 del
241 buf
242 end
243 for
244 end
245 if



Sat, 24 Mar 2001 03:00:00 GMT  

Quote:

> Michael,

>    You could try re.split("[^_a-zA-Z][^_a-zA-Z0-9]*",Line) which will
>    split the Line into a list of "words", but be aware:
>    [GC]  
>    I discovered that "by accident" just after I sent my message. But I
> still don't understand why I must use the caret "^": it seems the split
> function always gives me the exact opposite of what I want. I don't
> understand. And I find the re and regex documentation to be quite obscure
> when compared with the other modules.

Think of string.split.

If you type string.split("foo,fooey,fooeyer",",") you expect ['foo',
'fooey', 'fooeyer'], not [ ',',',',',' ], i.e. you specify the
delimiter, not the field contents. When you're using re.split, it's
less clear whether you should specify the contents or the delimiter,
but I imagine the current interface was chosen to be consistent with
string.split.
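
To make the delimiter-versus-contents point concrete, here is a small sketch for a current Python 3, where string.split has become the str.split method and re.findall is generally available. The sample line is invented for illustration.

```python
import re

line = "int foo2 = bar;"

# str.split and re.split both take the delimiter, not the field contents.
fields = "foo,fooey,fooeyer".split(",")

# The split pattern from earlier in the thread treats the digit in foo2 as
# the start of a delimiter, so that identifier comes back as plain "foo".
split_words = re.split("[^_a-zA-Z][^_a-zA-Z0-9]*", line)

# findall takes the pattern for the words themselves.
found_words = re.findall("[_a-zA-Z][_a-zA-Z0-9]*", line)

print(fields)       # ['foo', 'fooey', 'fooeyer']
print(split_words)  # ['int', 'foo', 'bar', '']
print(found_words)  # ['int', 'foo2', 'bar']
```

Describing the words directly avoids both the trailing empty string and the lost digit, since negating the two character classes separately does not give the exact complement of the word pattern.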

I should think you find the regular expression documentation obscure
because regular expressions are very hard to explain. If you're on
Unix, try reading `man grep' or `man awk' for a different attempt at
explaining them.

--
Michael Hudson
Jesus College
Cambridge



Sat, 24 Mar 2001 03:00:00 GMT  

Quote:
Gaetan Corneau writes:
>    I discovered that "by accident" just after I sent my message. But I
>still don't understand why I must use the caret "^": it seems the split
>function always gives me the exact opposite of what I want. I don't
>understand. And I find the re and regex documentation to be quite obscure
>when compared with the other modules.

        As the first character of a character class, ^ means to match
anything that isn't listed in the class.  For example, [a-z] matches
any lowercase letter, and [^a-z] matches any character that *isn't* a
lowercase letter.
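
A short illustration, assuming a current Python 3 with re.findall. The sample string is made up.

```python
import re

sample = "aB3_z"

# Inside a character class, a leading ^ negates the class.
lower = re.findall("[a-z]", sample)
not_lower = re.findall("[^a-z]", sample)

# Outside a class, ^ is an anchor: it matches only at the start of the string.
anchored = re.findall("^[a-z]", sample)

print(lower)      # ['a', 'z']
print(not_lower)  # ['B', '3', '_']
print(anchored)   # ['a']
```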

        You might find the Regular Expression HOWTO, at
http://www.python.org/doc/howto/regex/, helpful; it's a bit less
compressed than the Library Reference's section on re.  You can also
point out sections of text that are unclear, and suggest that they be
improved.

--
A.M. Kuchling                   http://starship.skyport.net/crew/amk/
    Q. Does Kibo believe in furniture?
    A. No. Go away, furniture!
    -- The alt.religion.kibology FAQ



Sat, 24 Mar 2001 03:00:00 GMT  

Quote:
Andrew Hanson writes:
>Here is a little test app I wrote ...
>""" Prints out the 'words' of a document
>: Limitations: it wont find last word of the document that is less than
>three letters. """

        Erm... you might not be aware that the re module has a \b
sequence which matches at word boundaries.  For example, if your re
module supports the findall method (only in 1.5.2 alpha, though):

>>> re.findall(r'\b\w+\b', 'this is a test f')
['this', 'is', 'a', 'test', 'f']

        It's very important to use a raw string when using \b, because
in regular Python strings \b is a backspace.
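
The difference is easy to check in a current Python 3. The variable names here are only for illustration.

```python
import re

text = 'this is a test f'

raw_pattern = r'\b\w+\b'     # backslash + 'b': the regex word-boundary escape
plain_pattern = '\b\\w+\b'   # here \b is a single backspace character (0x08)

# The raw form keeps both characters; the plain form collapses to one.
print(len(r'\b'), len('\b'))            # 2 1

print(re.findall(raw_pattern, text))    # ['this', 'is', 'a', 'test', 'f']
print(re.findall(plain_pattern, text))  # []
```

The second pattern only matches words wrapped in literal backspace characters, so it finds nothing in ordinary text. Note also that \w accepts digits, so unlike [_a-zA-Z][_a-zA-Z0-9]* it will return tokens that start with a digit.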

--
A.M. Kuchling                   http://starship.skyport.net/crew/amk/
Testing? That's scheduled for first thing after 3.0 ships. Quality is job
Floating Point Error; Execution Terminated.
    -- Benjamin Ketcham, on applications for Microsoft Windows, in
       _comp.os.unix.advocacy_.



Sat, 24 Mar 2001 03:00:00 GMT  

GC> Piet,
GC>  The library documentation (4.2.4 Match Objects) tells you the
GC> methods that
GC>  a match object has.
GC>  [GC]  
GC> I know, but I don't understand much of it. It is not very clear (IMHO), but
GC> I am a newbie. Worst, I'm french, so I sometimes have problems understanding
GC> the documentation.
I can imagine. I have the same feeling with French (the language).
The documentation is reasonably clear if you know what it is about. But if
the material is new to you, you need a tutorial.
--

URL: http://www.cs.uu.nl/~piet [PGP]



Sat, 24 Mar 2001 03:00:00 GMT  

Quote:
> ...
> I think you find the regular expression documentation obscure
> because regular expressions are very hard to explain. If you're on
> Unix, try reading `man grep' or `man awk' for a different attempt
> at explaining them.

The re module's regexps mimic Perl's closely, and one of the reasons the
String SIG moved in that direction was so that people could take advantage
of the excellent regexp writeups available for Perl.

The ref manual strives more to be precise than gentle (it's a reference
manual, not a tutorial).  It is indeed a big topic, and requires more
introductory words than anyone had time to write -- or inclination, given
that the Perl folk already did it.

a-good-corporate-apologist-leaves-the-user-feeling-guilty-ly y'rs  - tim



Sun, 25 Mar 2001 03:00:00 GMT  
 