Performance while regexp pattern matching 
Author Message
 Performance while regexp pattern matching

First, my thanks to all those who helped with my qsort() questions.
I'm certainly keeping an archive of the thread.  On to the next
question... :)

I'm aware that there are a number of regex libraries out there, many
which make various claims of performance compared to the others.  I
imagine though, that performance also depends on how these library
functions are implemented.  Since all these libraries do the same basic
thing (compile a regexp, and then test a string against this compiled
regexp for a match), I imagine there might be some fundamentals in
implementing the functions of any of these libraries for best
performance.

I'm revising UQWK, a utility that processes email and usenet news
articles into a download packet suitable for use with several offline
newsreaders (and also processes the reply packets generated by these
same newsreaders).  UQWK includes scoring capability, where each news
article is evaluated according to rules contained in one or two score
files.  If the final score for a given article falls below a specified
value, that article is not included in the download packet.  UQWK uses
regexp pattern matching for this task.  I've built UQWK with GNU regex
0.12, and GNU rx 1.5, and in both cases scoring is painfully slow.  I'm
hoping that someone can suggest some improvements.

A UQWK scoring rule entry has the format

<score> ["pattern"] <Header> <regexp>

where <score> is an int that may be positive or negative, <Header> is
the particular header (From:, Newsgroups:, etc.) that is to be
examined, and <regexp> is the regexp that the specified header is to be
tested against.  For each score rule, if a match to <regexp> is found
in <Header>, then <score> is applied to the article.  The final article
score is the sum of the applied <score>s.  If <header> is the special
keyword "Header", then all header lines are tested for a match.

This scorefile format was adopted from one of the more popular offline
newsreaders, where the keyword "pattern" signifies that <regexp> is a
regexp.  Without the keyword, this newsreader takes <regexp> as a
literal string and makes a simple comparison to find a match.  UQWK
always treats <regexp> as a regexp, and ignores the "pattern" keyword
if encountered (for compatibility).

With all of that said, on to the code... :)

#ifdef  SCORES

#include "uqwk.h"
#include <regex.h>                        /* POSIX regular expressions */

struct scorest {
        struct scorest *next;
        regex_t rexp;
        int score;

Quote:
};

/* jeff: kill_list is built if a global score file exists, and
persists during news processing.  group_kill_list is rebuilt for
each individual newsgroup, if a score file for that group exists. */

struct scorest *kill_list = NULL;
struct scorest *group_kill_list = NULL;

/* jeff: these two functions are to be replaced by strtok() */

char *sws ( char *p ) {
char c;
        while((c = *p)) {
                if ((c == ' ') || (c == '\t')) ++p;
                else break;
        }
        return p;

Quote:
}

char *snf ( char *p ) {
char c;
        while((c = *p)) {
                if ((c != ' ') && (c != '\t')) ++p;
                else break;
        }
        return sws(p);

Quote:
}

int read_score_file ( const char *score_filename )
{
        FILE *score_fd;
        char score_pathname[PATH_LEN];
        char score_regex[BUF_LEN];
        char *p,*s;
        struct scorest *score_entry;
        struct scorest **score_link;
        int     global = 0;

        strcpy (score_pathname, kill_dir);
        strcat (score_pathname, "/");
        strcat (score_pathname, score_filename);

        global = !strcasecmp (score_filename, GLOBAL_SCORE_FILE);
        if(global) {
                score_link = &kill_list;
        } else {
                score_link = &group_kill_list;
        }

        if ((score_fd = fopen(score_pathname, "r")) == NULL) {
                printf(" ");
                return 0;                       /* no score file */
        }
        printf("+");

        /* jeff: Fgets() is a wrapper around fgets() that strips off
        the trailing \n or \r\n  */

        while (Fgets (buf, BUF_LEN, score_fd) != NULL) {
                p = sws (buf);

                /* jeff: skip blank lines and comments */
                if ( (*p == '#') || (*p == '\0') || (*p == ';') ) continue;

                /* jeff: Default "kill threshold" is 0.  Articles whose
                final score < kill_thresh are not packed for download. */
                if (!strncasecmp(p, "killthr", 7)) {
                        p = snf(p);
                        if (global) {
                                kill_thresh = atoi (p);
                        } else {
                                group_kill_thresh = atoi (p);
                        }
                        continue;
                }
                if ((score_entry = malloc (sizeof *score_entry)) == NULL) {
                        fprintf (stderr, "unable to allocate space for building score file\n");
                        return (0);
                }
                score_entry->next = NULL;

                score_entry->score = atoi(p);
                p = snf (p);
                if (strncasecmp(p, "pattern", 7) == 0) p = snf (p);
                s = score_regex;
                while (*p) {
                        if (*p == ' ') break;
                        *s++ = *p++;
                }
                *s = '\0';
                p = sws (p);

                /* jeff: This builds the actual regexp string that will
                be used.  "(<Header>)?.*<regexp>.*" doesn't make much
                sense to me, and I plan to make it
                "(^<Header>.*)?<regexp>"                 */

                if (strncasecmp(score_regex, "Header", 6) == 0)
                        *score_regex = '\0';
                else if (strncasecmp(score_regex, "Body", 4) == 0) {
                        fprintf (stderr, "Body scoring not supported\n");
                        free (score_entry);
                        return 0;
                }
                strcat(score_regex, ".*");
                strcat(score_regex, p);
                strcat(score_regex, ".*");

                if ( regcomp(&score_entry->rexp,
                     score_regex,
                     REG_ICASE | REG_NOSUB) ) {
                        fprintf(stderr, "bad regular expression\n");
                                free(score_entry);
                                return 0;
                } else {
                        *score_link = score_entry;
                         score_link = &score_entry->next;
                }
        }

        fclose (score_fd);
        return 1;

Quote:
}

int free_group_score(void)
{
        struct scorest *sp, *np;

        for (sp = group_kill_list ; sp ; sp = np)
        {
                np = sp->next;
                regfree(&sp->rexp);
                free(sp);
        }
        group_kill_list = NULL;

        return (0);

Quote:
}

/* jeff: Apply score rules to the current article... */
int Kill (FILE *art_fd)
{
        struct scorest *sp;
        int score = 0;

        /* scan through header lines, looking for regex matches */

        /* jeff: So we read the article file in, one line at a time,
        applying each score rule in turn to that line.  Might it be
        faster to read a number of lines into buf and then tell
        the regex functions to recognize \n as a line separator? */

        while (Fgets (buf, BUF_LEN, art_fd) != NULL) {
                if (strlen(buf) == 0) break; /* jeff: process headers only */
                for (sp = kill_list; sp ; sp = sp->next) {
                        if (regexec(&(sp->rexp), buf, 0, NULL, 0) == 0) {
                                score += sp->score;
                        }
                }
                for (sp = group_kill_list; sp ; sp = sp->next) {
                        if (regexec(&(sp->rexp), buf, 0, NULL, 0) == 0) {
                                score += sp->score;
                        }
                }
        }
        return score;

Quote:
}

#endif  /* SCORES */


Sat, 19 Jun 2004 06:01:18 GMT  
 Performance while regexp pattern matching

Quote:

>First, my thanks to all those who helped with my qsort() questions.
>I'm certainly keeping an archive of the thread.  On to the next
>question... :)

<snippity>

Quote:
>I'm revising UQWK, a utility that processes email and usenet news
>articles into a download packet suitable for use with several offline
>newsreaders (and also processes the reply packets generated by these
>same newsreaders).  UQWK includes scoring capability, where each news
>article is evaluated according to rules contained in one or two score
>files.  If the final score for a given article falls below a specified
>value, that article is not included in the download packet.  UQWK uses
>regexp pattern matching for this task.  I've built UQWK with GNU regex
>0.12, and GNU rx 1.5, and in both cases scoring is painfully slow.  I'm
>hoping that someone can suggest some improvements.

<really big snip>

First of all, I should point out that comp.lang.c discusses Standard C.
The program you posted uses a number of nonstandard extensions to the
language, not the least of which is POSIX regexps. That's not too big
a deal in this case, because the code itself *probably* isn't the problem.

Quote:
>    /* jeff: So we read the article file in, one line at a time,
>    applying each score rule in turn to that line.  Might it be
>    faster to read a number of lines into buf and then tell
>    the regex functions to recognize \n as a line separator? */

You're going to do m*n regexp comparisons, where m is the number of
lines and n is the number of score rules. That's going to get expensive,
and there isn't any way around it.

You didn't give any numbers for how slow scoring is, and "painfully slow"
is subjective. I'd suggest comparing UQWK against several newsreaders
which support similar scoring features. If you find UQWK to be much
slower, you might experiment with different regexp packages.

For what it's worth, on my machine slrn takes several minutes to score
the 500-1000 articles my newsserver normally has for comp.lang.c.

<remainder of code snipped>

--
Edgar Allan Poe?  Yeah, sure I love Poe.  He gives BOFHs some great
ideas on how to scare people shitless.
    Suresh Ramasubramanian in the scary devil monastary



Sat, 19 Jun 2004 17:00:19 GMT  
 
 [ 2 post ] 

 Relevant Pages 

1. Regular Expressions/Pattern Matching/Unordered pattern

2. RegExp - regular expression pattern

3. Pattern Matching in C -

4. Pattern Matching Tool

5. Pattern Matching in C

6. Help, Pattern used in Regex.Matches(...)

7. pattern matching and string replacement

8. help with a fast pattern matching utility requested

9. Pattern Matching in execution time

10. pattern matching with the function glob.

11. pattern matching in c

12. HTML scraping (WSDL HTML pattern matching)

 

 
Powered by phpBB® Forum Software