Newbie: searching an English dictionary for a reg exp 

I want to search through the entire English language for a given
regexp.

So, I'll have say a 'dictionary.txt' file which contains every English
word.

Since I don't understand how python regexps work I'm kinda stuck.

My guess is I should be doing something like:
#####################################
import re
import string

# heres our regpattern - once I get it going I'll read
# regpatterns from a file
regpattern = 'abas?' # should return 'abash' and 'abase'

dictionary = []
dictfile = open('dictionary.txt', 'r')
line = dictfile.readline()
while line != "":
    dictionary.append(line)
    line = string.strip(dictfile.readline())

reg = re.compile(regpattern)

# search thru dictionary - print all matches
for word in dictionary:
    m = reg.match(regpattern)
    if m:
        print m.group()

#####################################
This doesn't work properly - so how do I print each word that matches?

Have I gone about this in the correct way? Having the entire
dictionary in memory could be dodgy I guess...

Thanks
-Matt



Thu, 20 May 2004 17:31:48 GMT  
I see a couple of problems.  First, the regular expression you're using,
"abas?", actually means "match 'aba' with zero or one occurrences of 's'
after it."  If you mean "match 'abas' with precisely one occurrence of any
character after it," you want "abas.".
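A small sketch of the difference between the two patterns. It is written for Python 3 (the code in the question is Python 2), the word list is made up for illustration, and re.fullmatch is used only so that the whole word has to match:

```python
import re

# Hypothetical sample words, chosen only for illustration.
words = ["aba", "abas", "abase", "abash", "abate"]

optional_s = re.compile(r"abas?")  # 'aba' followed by zero or one 's'
any_char = re.compile(r"abas.")    # 'abas' followed by exactly one character

# fullmatch succeeds only if the pattern covers the entire word.
matches_optional = [w for w in words if optional_s.fullmatch(w)]
matches_any = [w for w in words if any_char.fullmatch(w)]

print(matches_optional)
print(matches_any)
```

Only the second pattern picks out 'abase' and 'abash'.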

Second, "reg.match(regpattern)" never looks at the dictionary at all: it
matches the compiled pattern against its own pattern string, so the result
is the same on every pass through the loop (and with this pattern it always
matches).  You wanted to match the pattern against each dictionary word in
the loop, so you want to replace "regpattern" in that line with "word", like:

Quote:
> # search thru dictionary - print all matches
> for word in dictionary:
>     m = reg.match(word)    #[EG] change is in this line
>  ....

Third (and this may have been intentional), the pattern you've chosen is
case-sensitive, because the "re" module does case-sensitive matching unless
you tell it otherwise.  To do case-insensitive matching you would either
change your pattern from "abas." to "(?i)abas." or change the line where you
compile it from "reg = re.compile(regpattern)" to
"reg = re.compile(regpattern, re.I)".
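Both ways of switching on case-insensitive matching can be compared side by side. This is a minimal Python 3 sketch; the test word "Abase" is an assumption picked for illustration:

```python
import re

word = "Abase"  # hypothetical capitalised dictionary entry

# Option 1: inline flag at the very start of the pattern.
inline_flag = re.compile(r"(?i)abas.")
# Option 2: flag passed to re.compile (re.I is short for re.IGNORECASE).
compile_flag = re.compile(r"abas.", re.I)

case_sensitive_hit = re.match(r"abas.", word) is not None
inline_hit = inline_flag.match(word) is not None
flag_hit = compile_flag.match(word) is not None

print(case_sensitive_hit, inline_hit, flag_hit)
```

The plain pattern misses the capitalised word, while both case-insensitive variants find it.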

On another note, if you're going to be searching through the entire
dictionary every time, you might do the matching as you're going through the
dictionary line-by-line and only store in your list the words that match.
(Unless you needed to have stored the other words for some other purpose, I
mean.)
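The match-as-you-read idea could look roughly like this in Python 3. The helper name matching_words is made up for this sketch, and io.StringIO stands in for the real dictionary.txt so the example runs on its own:

```python
import io
import re

def matching_words(lines, pattern):
    """Yield each stripped, non-empty word that the pattern matches from its start."""
    reg = re.compile(pattern)
    for line in lines:
        word = line.strip()  # drop the trailing newline
        if word and reg.match(word):
            yield word

# Stand-in for open('dictionary.txt'); a real file object can be
# iterated line by line in the same way.
sample_file = io.StringIO("abase\nabash\nabate\nzebra\n")
found = list(matching_words(sample_file, r"abas."))
print(found)
```

Because the function is a generator, only the current line and the matches are held in memory, not the whole dictionary.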

And finally, there are a couple of different ways to use the "re" module for
most problems.  In this reply you've seen two ways to turn on
case-insensitivity (I don't know if the one using re.compile is faster than
the other), and the Python 2.1 docs, section 4.2.3 ([re] Module Contents),
say right at the top that you can use re.compile or just pass the pattern
directly to re.match; compiling once is the more efficient choice when the
same pattern is reused many times.  Note also the subtle difference between
re.search and re.match: re.match only matches at the start of the string,
while re.search looks for a match anywhere in it.  I've tried to use re.match
more than once when I really wanted re.search because I wanted to search in
the middle of the string....
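The re.match versus re.search difference is easy to see on a word that contains the pattern in the middle. A minimal Python 3 sketch, with "unabashed" as an assumed example word:

```python
import re

word = "unabashed"  # hypothetical example; contains 'abash' in the middle

at_start = re.match(r"abas.", word)   # only tries position 0
anywhere = re.search(r"abas.", word)  # scans the string for the first match

print(at_start)
print(anywhere.group(), anywhere.start())
```

re.match returns None here because the word does not begin with "abas", while re.search finds the match starting at index 2.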

    -Eugene





Fri, 21 May 2004 05:54:24 GMT  
On Sun, 2 Dec 2001 13:54:24 -0800, "Eugene" wrote:

Quote:

>I see a couple of problems.  First, the regular expression you're using,
>"abas?", actually means "match 'aba' with zero or one occurrences of 's'
>after it."  If you mean "match 'abas' with precisely one occurrence of any
>character after it," you want "abas.".

Oops, typo. Thanks a lot for your help Eugene, it works now.
Now all I need is a .txt file containing every English word... I'm
searching now - anyone?

Hmm, I just found web2 at
ftp://ftp.funet.fi/pub/doc/dictionaries/English/Webster

which looks like what I'm after, except being Webster's it'll no doubt
be American spelling - I'm after British or New Zealand English...

-Matt



Fri, 21 May 2004 13:23:32 GMT  

Oops, looks like I spoke too soon with my last post.
The matches aren't working properly, in that it seems to be finding
too many incorrect matches:

def searchfor(regpattern):
    print "Searching for",regpattern,
    matchcount = 0
    reg = re.compile(regpattern, re.I)
    m = ""
    for word in dictionary:
        m = reg.match(word)
        if m:
            outfile.write(m.group()+' ')
            matchcount += 1

    print "Found",matchcount
    return matchcount

Produces:
Read 1187 words from the dictionary
Read 1 clues from clues.txt
Searching for abac. Found 14

Now, in my dictionary there are only 2 words that would match that
criterion - abaca and aback, right? It seems to be returning every word
that starts with 'abac' plus at least one more letter (something like
abac.+, I guess):
dictionary =
...
Ababua
abac
abaca
abacate
abacay
abacinate
abacination
abaciscus
abacist
aback
abactinal
abactinally
abaction
abactor
abaculus
abacus
Abadite
...

I get the feeling I'm barking up the wrong tree by using m.group().
All I want is the word that matches! IMO the doco for MatchObject
(match-objects.html) is useless (or maybe I am).

Thanks for any help

-Matt



Fri, 21 May 2004 14:40:59 GMT  
Oh, sorry about that.  "re.match" only requires that the pattern match at
the start of the target string; it does not also require that the pattern
match all the way to the end.  Put "$" at the end of your pattern and that
will fix it.  (The regular expression "$" means end-of-string.*)

And if you just want to print the word that matched, just put
"outfile.write( word + ' ' )" in place of what you have in your "if" block.
Once you know the word matches, that's good enough for you, right? :)
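Here is a short Python 3 sketch of the effect of the trailing "$", using a few of the words from the dictionary excerpt in the question. The variable names are made up for illustration:

```python
import re

# A few entries taken from the dictionary excerpt in the question.
words = ["Ababua", "abac", "abaca", "abacate", "abacay", "aback", "abacus", "Abadite"]

prefix_only = re.compile(r"abac.", re.I)  # matches any word that starts this way
whole_word = re.compile(r"abac.$", re.I)  # '$' also pins the end of the word

loose = [w for w in words if prefix_only.match(w)]
exact = [w for w in words if whole_word.match(w)]

print(loose)
print(exact)
```

Without the "$", longer words such as 'abacate' and 'abacus' slip through; with it, only the five-letter words 'abaca' and 'aback' remain. Note that the whole word is collected, not m.group(), which would only return the matched prefix.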

    -Eugene

* in Python's regular expressions and in Perl's.  I have no idea what "$"
means in other languages' regular expression engines.





Sat, 22 May 2004 12:57:58 GMT  
Check out:

http://py-howto.sourceforge.net/regex/regex.html

I've found it very useful.

-Chris

--
Christopher Barker, Ph.D.
Water Resources Engineering
Coastal and Fluvial Hydrodynamics



Sun, 23 May 2004 01:42:41 GMT  
On Mon, 3 Dec 2001 20:57:58 -0800, "Eugene" wrote:

Quote:

>Oh, sorry about that.  "re.match" only requires that the pattern match at
>the start of the target string; it does not also require that the pattern
>match all the way to the end.  Put "$" at the end of your pattern and that
>will fix it.  (The regular expression "$" means end-of-string.*)

Thanks, I didn't know I needed the $.

Quote:
>And if you just want to print the word that matched, just put
>"outfile.write( word + ' ' )" in place of what you have in your "if" block.
>Once you know the word matches, that's good enough for you, right? :)

Duh, how'd I miss that. Thanks again. Problem solved!

-Matt



Sun, 23 May 2004 13:22:02 GMT  
 