comparing all values of a list to regex 

I have to compare all values of a list to a regex. If the regex
matches all list items, the list should be added to a new list. If the
regex matches just one list item, the list should be added to another
new list. And last but not least, if the regex doesn't match at all,
the list should be added to yet another new list.

Can someone help me with this? This is driving me crazy.

Thanks,
    Manuel

--
For ages I thought life was like fishing in a swimming pool. Now the water's
all drained out!



Sun, 13 Mar 2005 17:53:49 GMT  

Quote:

> I have to compare all values of a list to a regex. If the regex
> matches all list items the list should be added to a new list. If the
> regex matches just one list item, the list should be added to another
> new list. And last but not least, if the regex doesn't match at all,
> the list should be added again to another new list.

> Can someone help me with this? This is driving me crazy.

E.g.:

num_matches = [ there.match(item) is not None for item in alist ].count(1)

if num_matches == 0:
    againanothernewlist.append(alist)
elif num_matches == 1:
    anothernewlist.append(alist)
elif num_matches == len(alist):
    anewlist.append(alist)

Note that there's some ambiguity on what to do if alist is empty --
literally this is BOTH "matches all list items" AND "doesn't match
at all".  Here I've chosen the second interpretation, but depending
on your specs you can choose to test num_matches in different ways.
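
For readers on Python 3, a minimal sketch of the same counting idea follows. The pattern and the function name bin_for are made up for illustration; the empty list falls into the "none" bin, as in the snippet above.

```python
import re

# Hypothetical pattern, for illustration only: strings made of digits.
there = re.compile(r"\d+$")

def bin_for(alist, pattern=there):
    """Return 'none', 'one', 'all', or None when no bin applies."""
    # True counts as 1, so summing the booleans counts the matching items.
    num_matches = sum(pattern.match(item) is not None for item in alist)
    if num_matches == 0:
        return "none"   # the empty list also ends up here
    elif num_matches == 1:
        return "one"    # a one-item list that matches lands here, not in 'all'
    elif num_matches == len(alist):
        return "all"
    return None         # several matches, but not all: outside the specs

print(bin_for(["12", "345"]))
print(bin_for(["12", "ab"]))
print(bin_for([]))
```

Reordering the tests (for example, checking the "all" case first) is how the other interpretation of the empty list could be selected.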

Alex



Sun, 13 Mar 2005 18:26:53 GMT  

Quote:

> num_matches = [ there.match(item) is not None for item in
> alist ].count(1)

> if num_matches == 0:
>     againanothernewlist.append(alist)
> elif num_matches == 1:
>     anothernewlist.append(alist)
> elif num_matches == len(alist):
>     anewlist.append(alist)

I would actually prefer to do this another way.

match_all = []
match_one = []
match_some = []
match_none = []

# Expanded out for clarity - could be one line
matches = map(regexp.match, alist)
matches = filter(None, matches)
matches = len(matches)

# Find out which list to append to

match_list = match_all

if not matches:
    match_list = match_none
elif matches == 1:
    match_list = match_one
elif matches != len(alist):
    match_list = match_some

match_list.append(alist)

Note that I added an additional case - where more than one matched, but not
all.
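
On Python 3, filter() returns a lazy iterator, so len() cannot be applied to it directly. A minimal sketch of the same four-bin selection under that assumption; the pattern and the function name file_list are invented for the example:

```python
import re

regexp = re.compile(r"[a-z]+$")   # hypothetical pattern, illustration only

match_all, match_one, match_some, match_none = [], [], [], []

def file_list(alist):
    # Count matches by iterating, since filter() is lazy on Python 3.
    matches = sum(1 for _ in filter(regexp.match, alist))

    # Default to "all", then override for the other cases.
    match_list = match_all
    if not matches:
        match_list = match_none
    elif matches == 1:
        match_list = match_one
    elif matches != len(alist):
        match_list = match_some
    match_list.append(alist)

for candidate in (["ab", "cd"], ["ab", "12"], ["ab", "cd", "12"], ["1", "2"]):
    file_list(candidate)

print(match_all, match_one, match_some, match_none)
```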

Tim Delaney



Mon, 14 Mar 2005 07:36:18 GMT  

Quote:

> E.g.:

> num_matches = [ there.match(item) is not None for item in alist ].count(1)

> if num_matches == 0:
>     againanothernewlist.append(alist)
> elif num_matches == 1:
>     anothernewlist.append(alist)
> elif num_matches == len(alist):
>     anewlist.append(alist)

> Note that there's some ambiguity on what to do if alist is empty --
> literally this is BOTH "matches all list items" AND "doesn't match
> at all".  Here I've chosen the second interpretation, but depending
> on your specs you can choose to test num_matches in different ways.

To be honest I don't understand that completely. Maybe, I have to
describe my problem a little bit more in detail and show the things
I've already done.

This is the input: a "|"-separated text file with about 15000 lines
of pop3 accounts:

|Number|String|Number|String(Domain)|String(Account)|String(Login/Email)|String(Password)|String|String|

These are the fields I care about.

String(Domain): This is the domain of the Account


also be a * for a catchall account.

String(Login/Email): This is the local pop3 account or an email address,
or both, comma separated.

String(Password): This is the password if String(Login/Email) is a
pop3-account.

This should be the output:

Three text files: one with only forwardings (only email addresses in
String(Login/Email)), one with only pop3 accounts in
String(Login/Email), and one mixed, where String(Login/Email) has
both pop3 accounts and email addresses.

This is what I've done already:

#!/usr/bin/env python
#
import sys, string, re

inputfile = open(sys.argv[-1], "r")

lines = inputfile.readlines()

# deleting the first three lines and the last line
lines = lines[3:-1]

# empty list for the domains
domains = []

# empty list for the new lines
neededlines = []

for line in lines:
    fields = string.split(line, "|")

    # check if the domain field contains a regular domain
    p = re.compile("[0-9a-z\\.]+[a-z]{2,3}",
    re.IGNORECASE)
    m = p.match(string.strip(fields[4]))
    if m:
        # extract the needed fields in a  separate list and
        # split the gLogin to get  a list in the list
        neededfields = []
        neededfields = [string.strip(fields[4]),
                        string.strip(fields[5]),
                        string.split(string.strip(fields[6]),
                        ","),
                        string.strip(fields[7])]

        # put all the neededfields together into the neededlines list
        neededlines.append(neededfields)

        # this is to get a list of domains with each domain just once
        # (kept inside "if m:" so neededfields always refers to this line)
        if neededfields[0] not in domains:
            domains.append(neededfields[0])
#Actually, I wanted to do this with a dictionary and the lists in the
#dictionary, but I couldn't get this working.

# empty list for lines per all domains
linesperdomains = []

for domain in domains:

    # empty list for lines
    domainlines = []
    for neededline in neededlines:
        if domain == neededline[0]:
            domainlines.append(neededline)

    # empty list for lines per domain
    linesperdomain = [domain, domainlines]

    # put all linesperdomain together to a linesperdomains list
    linesperdomains.append(linesperdomain)

The program seems to be quite slow, but that's not that important.
What matters to me is getting this problem solved.
The next step is splitting the data up into the three different parts.

Can anyone help?

Thanks in advance,
    Manuel

--
It takes one tree to make a thousand matches, it takes one match to burn a
thousand trees.



Mon, 14 Mar 2005 19:23:22 GMT  
        ...

Quote:
>> num_matches = [ there.match(item) is not None for item in
>> alist ].count(1)
        ...
> # Expanded out for clarity - could be one line
> matches = map(regexp.match, alist)
> matches = filter(None, matches)
> matches = len(matches)

The list-comprehension equivalent of this would be:

matches = len([ item for item in alist if there.match(item) ])

and may indeed be clearer than the .count approach I originally
proposed (except that I think that naming the variable matches
is ambiguous -- num_matches, numMatches, etc. being preferable).

A more concise expression of your approach would of course be:

matches = len(filter(there.match, alist))

I don't see exactly what value the call to map adds in your case.
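
A small sketch comparing the three counting expressions on Python 3, where filter() returns an iterator and has to be wrapped in list() before len() is applied. The pattern and the sample list are made up for illustration:

```python
import re

there = re.compile(r"a")          # hypothetical pattern: items starting with "a"
alist = ["apple", "banana", "avocado"]

# The original .count approach (True compares equal to 1).
by_count = [there.match(item) is not None for item in alist].count(True)
# The list-comprehension form.
by_listcomp = len([item for item in alist if there.match(item)])
# The filter form; list() is needed on Python 3 only.
by_filter = len(list(filter(there.match, alist)))

print(by_count, by_listcomp, by_filter)
```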

Quote:
> # Find out which list to append to

> match_list = match_all

> if not matches:
>     match_list = match_none
> elif matches == 1:
>     match_list = match_one
> elif matches != len(alist):
>     match_list = match_some

> match_list.append(alist)

Yes, choosing a list first and then appending to the chosen
list is nicer, but you cannot do that within the original
specs (for all I know the match_some list might grow
unbearably large over time, in this case).

Quote:
> Note that I added an additional case - where more than one matched, but
> not all.

Yep, that's the part that's out of the original specs.

The concise equivalent alternative here might be (e.g.):

{0:match_none, 1:match_one, len(alist):match_all}.get(
    matches, match_some).append(alist)

or to get back in the original specs:

match_list = {0:match_none, 1:match_one, len(alist):match_all}.get(matches)
if match_list is not None: match_list.append(alist)
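
The dictionary lookup above can be tried out as follows; a minimal sketch in which the function name file_by_count and the sample lists are assumptions, and every sample list has at least two items so that the keys stay distinct:

```python
match_none, match_one, match_all = [], [], []

def file_by_count(matches, alist):
    # Keys are match counts; .get() returns None when no bin applies.
    match_list = {0: match_none, 1: match_one, len(alist): match_all}.get(matches)
    if match_list is not None:
        match_list.append(alist)

file_by_count(0, ["x", "y", "z"])
file_by_count(1, ["a", "y", "z"])
file_by_count(3, ["a", "b", "c"])
file_by_count(2, ["a", "b", "z"])   # two of three: silently dropped

print(match_none, match_one, match_all)
```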

Alex



Mon, 14 Mar 2005 21:55:58 GMT  

Quote:

> I don't see exactly what value the call to map adds in your case.

None at all in this case - good catch :) I'll blame it on feeling really
{*filter*}ed yesterday (am today too for some reason - thank goodness it's only
about 7 hours until the weekend :)

Quote:
> > # Find out which list to append to

> > match_list = match_all

> > if not matches:
> >     match_list = match_none
> > elif matches == 1:
> >     match_list = match_one
> > elif matches != len(alist):
> >     match_list = match_some

> > match_list.append(alist)

> Yes, choosing a list first and then appending to the chosen
> list is nicer, but you cannot do that within the original
> specs (for all I know the match_some list might grow
> unbearably large over time, in this case).

Hmm - why not? Ignoring that the original specs didn't actually *have* that
case, choosing the list to append to, or doing it in-place is merely an
implementation detail. Of course, you would need instead to do something
like:

match_list = None

if not matches:
    match_list = match_none
elif matches == 1:
    match_list = match_one
elif matches == len(alist):
    match_list = match_all

if match_list is not None:
    match_list.append(alist)

Tim Delaney



Tue, 15 Mar 2005 07:39:56 GMT  
        ...

Quote:
>> > elif matches != len(alist):
>> >     match_list = match_some

>> > match_list.append(alist)

>> Yes, choosing a list first and then appending to the chosen
>> list is nicer, but you cannot do that within the original
>> specs (for all I know the match_some list might grow
>> unbearably large over time, in this case).

> Hmm - why not? Ignoring that the original specs didn't actually *have*
> that case,

Not ignoring it, one would think the specs must be to do nothing
special in that case.

If your program was specified to "append to list1 if the time is
exactly 11:22:33, to list2 if the time is exactly 17:26:35", would
you consider it reasonable to make your program append to list3
if it's any other time of day?

The problem could be that the program is called a billion times,
of which, say, about seven fall in the first case, about three in
the second one, and all the rest in the "neither" category.  If you
extend the program's spec to append to another list when "neither"
applies, your program will overflow memory and crash, while any
proper implementation of the specs would work correctly.

Quote:
> choosing the list to append to, or doing it in-place is merely
> an implementation detail.

Except that, to meet the specs, you must NOT append to match_some,
ever.  It's quite possible of course that the specs were incorrect
or incomplete, but just assuming that seems strange to me.

Quote:
> Of course, you would need instead to do something like:

> match_list = None

> if not matches:
>     match_list = match_none
> elif matches == 1:
>     match_list = match_one
> elif matches == len(alist):
>     match_list = match_all

> if match_list is not None:
>     match_list.append(alist)

Right (I did much the same in my dictionary-version of this
snippet) although of course the need for the final if is not
most elegant.

A good design pattern for such cases is Null Object, which
Dinu Gherman showed in recipe:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/68205

To quote him,
"""
Null objects are intended to provide first-class citizens as
a replacement for the primitive value None. Using them you
can avoid conditional statements in your code and express
algorithms with less checking for special values.
"""

With this DP, your snippet would become:

import null
match_list = null.Null

if not matches:
    match_list = match_none
elif matches == 1:
    match_list = match_one
elif matches == len(alist):
    match_list = match_all

match_list.append(alist)

The need for the guard goes away because null.Null is
basically designed so you can call any method on it
innocuously.  Of course, you get more mileage out of
it in more complicated situations.

{0:match_none, 1:match_one, len(alist):match_all}.get(
    matches, null.Null).append(alist)

gets greater benefit wrt the None-plus-guard.  Of course,
this more-concise snippet has a problem when len(alist)
is 0 -- your snippet, like my original one, is not
ambiguous in this case (appends to match_none), while a
dictionary-display with two identical keys has behavior
that is less obvious (I don't recall whether the Python
language specifies which of the two repetitions of the
same key "takes", but in any case most readers and
maintainers of this code would be puzzled).  Fortunately,
the OP clarified (if I read his followup post correctly)
that len(alist)==0 is a "can't happen" in his case.
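
For reference, current Python defines that when a dictionary display repeats a key, the last value given for that key is the one kept. A short sketch showing the effect; strings stand in for the three lists and the helper name chosen is invented. Note that a one-item list hits the same duplicate-key situation as the empty list:

```python
match_none, match_one, match_all = "none", "one", "all"

def chosen(alist, matches):
    # With len(alist) equal to 0 or 1, a key is repeated and the last one wins.
    return {0: match_none, 1: match_one, len(alist): match_all}.get(matches)

print(chosen([], 0))           # key 0 given twice
print(chosen(["x"], 1))        # key 1 given twice
print(chosen(["x", "y"], 1))   # all keys distinct
```

So the concise form files an empty list, and a one-item list that matches, under match_all, whereas the if/elif chain puts them in match_none and match_one respectively.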

Alex



Tue, 15 Mar 2005 15:09:37 GMT  
<posted & mailed>

        ...

Quote:
> This is the input, it is "|" separated textfile with about 15000 lines
> of pop3 accounts:

|Number|String|Number|String(Domain)|String(Account)|String(Login/Email)|String(Password)|String|String|

Quote:

> These are the fields I care about.

> String(Domain): This is the domain of the Account

> also be a * for a catchall account.
> String(Login/Email): This is the local pop3-account or an email address,
> or both, comma separated.
> String(Password): This is the password if String(Login/Email) is a
> pop3-account.

> This should be the output:

> Three text files. One with only forwardings (only emailaddresses in
> the String(Login/Email). One with only pop3-accounts in the
> String(Login/Email), and one mixed, where String(Login/Email) has
> pop3-accounts and emailaddresses.

OK, for each line you want fields line.split('|')[4:8], each field
space-stripped (at least, you strip them in the code you posted,
though that's not in the specs you give above), _plus_ you need
to know the set of domains (uniquely) and classify per field 4
(domain) -- you build this in the code you posted, although, again,
I can't see any trace of that requirement in the specs.

So we'll want an auxiliary function to tell us which of the three
"bins" to put an entry to depending on the login/email field, e.g.:

def classify(login_email):
    l_e = login_email.split(',')
    assert 1 <= len(l_e) <= 2, "More than one comma in (%s)" % login_email
    if len(l_e)==2: return 2         # both

    else: return 0                   # other case (local account, I guess)
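
The middle branch of classify appears to be missing from the post as archived. A sketch of a complete version follows, under the assumption (a guess, not taken from the post) that a single entry containing '@' is a forwarding address and anything else is a local pop3 account. It raises ValueError rather than asserting, since it is the input being checked:

```python
def classify(login_email):
    """Return 0 (local account), 1 (email address only) or 2 (both)."""
    l_e = [part.strip() for part in login_email.split(',')]
    if not 1 <= len(l_e) <= 2:
        raise ValueError("More than one comma in (%s)" % login_email)
    if len(l_e) == 2:
        return 2            # both
    elif '@' in l_e[0]:
        return 1            # assumed: an email address, i.e. forwarding only
    else:
        return 0            # other case (local account)

print(classify("joe"), classify("joe@example.com"), classify("joe,joe@example.com"))
```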

Now the rest, net of imports and file open/close ops:

classified = [ [] for i in range(3) ] # 3 separate empty lists
per_domain = {}                       # initially-empty dict

for line in inputfile:
    # strip (not split) each field so that it stays a plain string
    fields = [ field.strip() for field in line.split('|')[4:8] ]
    per_domain.setdefault(fields[0],[]).append(fields)
    classified[classify(fields[2])].append(fields)

Now you only need the output -- presumably each line must be
output in the same way, e.g. with another auxiliary function:

def outline(fileobj, fields):
    fileobj.write('|'.join(fields))
    fileobj.write('\n')

so you only need to loop on each of the three lists of lists
in 'classified', and on the keys of dictionary 'per_domain'
(sort them too, if you wish, of course), in order to emit
the results to appropriate files.  Presumably your exact
specs are not quite as I tried to guess them from a mix of
what you wrote and what you coded, but I hope this outline
can still be useful to you.
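
Putting the pieces together, a self-contained sketch for Python 3. The sample lines, the function name sort_lines, and the '@' test inside classify are all assumptions made for the example, and io.StringIO objects stand in for the three output files:

```python
import io

def classify(login_email):
    # Simplified stand-in; the '@' test is an assumption, not from the specs.
    parts = login_email.split(',')
    if len(parts) == 2:
        return 2                              # both
    return 1 if '@' in parts[0] else 0        # email only / local account

def outline(fileobj, fields):
    fileobj.write('|'.join(fields))
    fileobj.write('\n')

def sort_lines(lines):
    classified = [[] for i in range(3)]       # 3 separate empty lists
    per_domain = {}                           # domain -> list of entries
    for line in lines:
        fields = [field.strip() for field in line.split('|')[4:8]]
        per_domain.setdefault(fields[0], []).append(fields)
        classified[classify(fields[2])].append(fields)
    return classified, per_domain

# Made-up sample input in the layout described in the thread.
sample = [
    "|1|x|2|example.com|joe|joe|secret|a|b|\n",
    "|2|x|2|example.com|info|boss@example.org||a|b|\n",
    "|3|x|2|example.net|ann|ann,ann@example.org|pw|a|b|\n",
]
classified, per_domain = sort_lines(sample)

# One in-memory "file" per bin; real code would open three files instead.
outputs = [io.StringIO() for _ in classified]
for out, entries in zip(outputs, classified):
    for fields in entries:
        outline(out, fields)

for out in outputs:
    print(out.getvalue(), end="")
print(sorted(per_domain))
```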

Alex



Tue, 15 Mar 2005 16:07:24 GMT  
Sorry for forgetting some details.

Quote:

> def classify(login_email):
>     l_e = login_email.split(',')

I think that I nearly understand what you are doing here. But I don't
understand the assert part. I already checked the documentation,
but that didn't help me at all.

Quote:
>     assert 1 <= len(l_e) <= 2, "More than one comma in (%s)" % login_email
>     if len(l_e)==2: return 2         # both

>     else: return 0                   # other case (local account, I guess)

Manuel

--
From '86 until the summer of last year, wherever I went, people would say,
"You would have made a great James Bond! Weren't you going to be James Bond?
You should have been, you could have been, you may have been." Yes, yes, yes,
yes, yes. It was like unfinished business in my life. I couldn't say no to it
this time around.
-Pierce Brosnan



Tue, 15 Mar 2005 17:13:25 GMT  

Quote:

> Sorry for forgetting some details.


>> def classify(login_email):
>>     l_e = login_email.split(',')

> I think that I nearly understand what you are doing here. But I don't
> understand the assert part. I already checked the documentation,
> but that didn't help me at all.

>>     assert 1 <= len(l_e) <= 2, "More than one comma in (%s)" %
>>     login_email

assert is just a sanity check.  I want the program to bail out
with a clear error message if it's ever presented with an input
file that does NOT respect the specs you've given (the only
ones which the program is prepared to handle), rather than
keep chugging and producing meaningless, misleading results.

assert is nice to use for this purpose, because when you run
the program with python -O it silently goes away and you pay
no performance price whatsoever for having assert there.

By the same token, assert may not be the right statement in
this case, because I'm validating sanity of *input*, NOT
sanity of my program.  Even when the program is fully valid
it may still be presented with invalid input, and it needs
to scream when that happens.

So, you may prefer to change the assert into, e.g.:

    L = len(l_e)
    if L<1 or L>2:
        raise ValueError("%s commas in (%r)" % (L-1, login_email))
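
A runnable sketch of that check on Python 3, wrapped in a hypothetical helper named split_login_email. Unlike an assert, the explicit raise stays in place when the program is run with python -O:

```python
def split_login_email(login_email):
    l_e = login_email.split(',')
    L = len(l_e)
    # str.split always returns at least one item, so L<1 is purely defensive.
    if L < 1 or L > 2:
        raise ValueError("%s commas in (%r)" % (L - 1, login_email))
    return l_e

try:
    split_login_email("a,b,c")
except ValueError as err:
    print("bad input:", err)
```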

Alex



Tue, 15 Mar 2005 18:57:37 GMT  

Quote:


>         ...
> >> > elif matches != len(alist):
> >> >     match_list = match_some

> >> > match_list.append(alist)

> >> Yes, choosing a list first and then appending to the chosen
> >> list is nicer, but you cannot do that within the original
> >> specs (for all I know the match_some list might grow
> >> unbearably large over time, in this case).

> > Hmm - why not? Ignoring that the original specs didn't
> actually *have*
> > that case,

> Not ignoring it, one would think the specs must be to do nothing
> special in that case.

> If your program was specified to "append to list1 if the time is
> exactly 11:22:33, to list2 if the time is exactly 17:26:35", would
> you consider it reasonable to make your program append to list3
> if it's any other time of day?

Well, of course. I thought you were talking about the implementation (since
you yourself mentioned the unspecified case) - not the conformance to the
specs.

Not having that case in the specs is not a cause for simply ignoring it.
What you should instead do is get an explicit statement of what is to occur
in that case into the specs (e.g. if more than one match is found, but not
everything, the line should be ignored).

It was an *incomplete* specification, which would require querying by the
(usually) system engineer in order to produce complete specifications. If
such specs made it to a developer I should {*filter*}y hope that they would say
"and what should I do in this case?".

Tim Delaney



Fri, 18 Mar 2005 08:31:39 GMT  
 