parsing words 



Quote:
> I have created a guest book using perl. I would like to know of a way of
> detecting bad words. I currently have the program search for bad words
> but some clever users started using spaces, symbols to prevent the
> program from catching the bad words. For example assume DUCK is a bad
> word. Normally, the guest book will replace it with blank spaces but if
> the user types in
> D U C K or D*U*C*K or D-U-C-K or D__U__C__K, the program will not catch
> it. Is there anyway I can prevent this from happening.

Sam Holden has given you an extended philosophical answer about why this
approach probably won't work.  But if you want to pursue it, a regex
like this:

    /\bD[\W_]*U[\W_]*C[\W_]*K\b/i

would match those occurrences and many others, regardless of the cases
of the letters.  It is a nice little exercise to generate a list of
regexes like that, or a compound regex, from a list of words
automatically.  (map, split, join, eval)
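
As a rough sketch of that exercise (not necessarily how it was meant to be done), the following builds one compound regex from a word list with split, map and join. The word list and the names word_to_pattern and censor are made up for illustration. It uses [\W_]* between letters rather than \W*, since \W alone does not match an underscore.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical word list; DUCK stands in for a real bad word, as in the question.
my @bad_words = qw(duck goose);

# Turn one word into a pattern that tolerates non-letter filler between letters.
sub word_to_pattern {
    my ($word) = @_;
    return join '[\W_]*', map { quotemeta($_) } split //, $word;
}

# Combine all the per-word patterns into one case-insensitive alternation.
my $alternation = join '|', map { word_to_pattern($_) } @bad_words;
my $bad_re      = qr/\b(?:$alternation)\b/i;

# Replace every match with a single blank, as the guest book does.
sub censor {
    my ($text) = @_;
    $text =~ s/$bad_re/ /g;
    return $text;
}

for my $sample ('D U C K', 'D*U*C*K', 'D-U-C-K', 'D__U__C__K', 'a duckling') {
    printf "%-12s => [%s]\n", $sample, censor($sample);
}
```

The \b anchors keep "duckling" intact, but they also mean a word glued to other letters gets through, so this is a starting point at best.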

I presume that tr/D/F/ would give a more realistic example.  :-)

--
(Just Another Larry) Rosler
Hewlett-Packard Laboratories
http://www.*-*-*.com/



Wed, 18 Jun 1902 08:00:00 GMT  

Quote:
> I have created a guest book using perl. I would like to know of a way
> of detecting bad words. I currently have the program search for bad
> words but some clever users started using spaces, symbols to prevent
> the program from catching the bad words. For example assume DUCK is a
> bad word. Normally, the guest book will replace it with blank spaces
> but if the user types in D U C K or D*U*C*K or D-U-C-K or D__U__C__K,
> the program will not catch it. Is there anyway I can prevent this from
> happening.

You need a program that's at least as smart as the people who are trying
to outsmart it. That's not a simple matter.

One system I heard of did a good job of suppressing bad words until the
dog breeders needed to talk about their {*filter*}es. Another wouldn't let
anyone discuss the region of Scunthorpe in Great Britain. Overkill.

I heard that someone made a closed caption decoder which tried to
substitute nicer words for the bad ones. It sanitized "Dick Van{*filter*}" into
"Jerk Van {*filter*}".

Sure, you could make a program which would strip the punctuation and scan
the rest, but the vandals will still find a way to sneak past with a few
     _ _      _                                 _
  __| (_)_ __| |_ _   _  __      _____  _ __ __| |___
 / _` | | '__| __| | | | \ \ /\ / / _ \| '__/ _` / __|
| (_| | | |  | |_| |_| |  \ V  V / (_) | | | (_| \__ \
 \__,_|_|_|   \__|\__, |   \_/\_/ \___/|_|  \__,_|___/
                  |___/
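
To make the strip-and-scan idea concrete, here is a minimal sketch. The function name contains_after_stripping and the sample lines are invented for the example. It throws away everything that is not a letter and then looks for the word, which catches the punctuated forms but also flags innocent text whose letters happen to run together.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 if $word appears in $text once every non-letter has been removed.
sub contains_after_stripping {
    my ($text, $word) = @_;
    (my $letters = lc $text) =~ s/[^a-z]//g;    # keep letters only
    return index($letters, lc $word) >= 0 ? 1 : 0;
}

for my $line ('D-U-C-K', 'We paid UC Kansas fees', 'DVCK') {
    printf "%-24s %s\n", $line,
        contains_after_stripping($line, 'duck') ? 'flagged' : 'passed';
}
```

The second sample is harmless but gets flagged ("paid UC K..." collapses to "...duck..."), and the third is an obvious dodge that passes.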

I suggest that you simply save each day's additions to a file for a human
being to check before appending. Cheers!
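
One way this could look, as a sketch only: the CGI script appends each submission to a pending file instead of to the guest book itself, and a person reviews that file later. The sub name queue_entry, the "%%" record separator and the file name in the usage note are all assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Append one submission to the pending file for later human review.
# An exclusive lock keeps simultaneous submissions from interleaving.
sub queue_entry {
    my ($pending_file, $entry) = @_;
    open my $fh, '>>', $pending_file
        or die "Cannot open $pending_file: $!";
    flock($fh, LOCK_EX)
        or die "Cannot lock $pending_file: $!";
    print {$fh} $entry, "\n%%\n";    # "%%" line marks the end of an entry
    close $fh
        or die "Cannot close $pending_file: $!";
    return;
}
```

The guest book script would then call something like queue_entry('pending.txt', $message), and only entries the reviewer approves get copied into the public page.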

--
Tom Phoenix       Perl Training and Hacking       Esperanto
Randal Schwartz Case:     http://www.*-*-*.com/




hi,

I have created a guest book using Perl. I would like to know of a way of
detecting bad words. I currently have the program search for bad words,
but some clever users started using spaces and symbols to prevent the
program from catching them. For example, assume DUCK is a bad
word. Normally, the guest book will replace it with blank spaces, but if
the user types in
D U C K or D*U*C*K or D-U-C-K or D__U__C__K, the program will not catch
it. Is there any way I can prevent this from happening?

Thanks a lot.

-rahul





Quote:

>hi,

>I have created a guest book using perl. I would like to know of a way of
>detecting bad words. I currently have the program search for bad words
>but some clever users started using spaces, symbols to prevent the
>program from catching the bad words. For example assume DUCK is a bad
>word. Normally, the guest book will replace it with blank spaces but if
>the user types in
>D U C K or D*U*C*K or D-U-C-K or D__U__C__K, the program will not catch
>it. Is there anyway I can prevent this from happening.

You could disallow all text, so that all posts would be blank.

That's about all you can do...

People tend to be better at recognising words masked like this than computers.

Consider :

DUK DLICK DU(K DVCK KCUD.

Adding a human to the loop is probably the only way, even if all that person
does is add new regexes to match bad words based on the ones which are
slipping through.
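
A small sketch of what that loop might look like, assuming the moderator keeps a growing list of patterns (the list, the sub name is_flagged and the added [UV] pattern are illustrative, not a recommendation). It runs the variants above against a separator-tolerant pattern, then adds one more pattern by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Patterns the moderator maintains; it starts with one separator-tolerant regex.
my @patterns = ( qr/\bD[\W_]*U[\W_]*C[\W_]*K\b/i );

# Returns 1 if any of the given patterns matches the text, 0 otherwise.
sub is_flagged {
    my ($text, @res) = @_;
    for my $re (@res) {
        return 1 if $text =~ $re;
    }
    return 0;
}

my @variants = ('DUK', 'DLICK', 'DU(K', 'DVCK', 'KCUD');

my @missed = grep { !is_flagged($_, @patterns) } @variants;
print "Missed at first: @missed\n";

# The moderator spots DVCK in the logs and adds a pattern that allows V for U.
push @patterns, qr/\bD[\W_]*[UV][\W_]*C[\W_]*K\b/i;

@missed = grep { !is_flagged($_, @patterns) } @variants;
print "Still missed:    @missed\n";
```

All five variants slip past the first pattern, and the hand-added one catches only DVCK, which is the point: the list is always one step behind.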

There is also the problem of false positives: foreign words (or English
words) which contain 'duck' in them somewhere, place names, people's
names, etc.

If you must censor things, then bulk deletion of messages with 'bad' words
will probably convince people not to bother.

Such a policy would make it hard to talk about wonderful things like
South Park though.

--
Sam

About the only thing most people know about black holes is they are
black, and now we have stuffed that up
        -- Dr Paul Francis (after reporting finding 'pink' holes)



:For example assume DUCK is a bad word. Normally, the guest book will
:replace it with blank spaces but if the user types in D U C K or D*U*C*K
:or D-U-C-K or D__U__C__K, the program will not catch it. Is there anyway
:I can prevent this from happening.

It has been said before that true filtering is next to impossible.
However, one possible regex is the following, which uses negated
character classes to say what _is_ allowed between the letters.
This, of course, is far from foolproof.

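#!/usr/local/bin/perl
use strict;
use warnings;
$|++;    # unbuffer STDOUT

# Replace D, U, C, K separated by any run of non-letters with one space.
# With /i, [^a-z] excludes uppercase letters as well.
while ( <DATA> ) {
    s/D[^a-z]*U[^a-z]*C[^a-z]*K/ /ig;
    print;
}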

__DATA__
hi,

I have createdD a guest book using perl. I would like to know of a way of
detecting bad words. I currently have the program search for bad words
but some clever users started Uusing spaces, symbols to prevent the
program from Ccatching the bad words. For example assume DUCK is a bad
word. Normally, the guest book will replace it with blank spaces but if
the user types in
D U C K or D*U*C*K or D-U-C-K or D__U__C__K, the program will not catch
it. Is there anyway I can prevent this from happening.

Thanks a lot.K

This is a DRUNCK too.

Hello, this DUC'K.

--
   Casey R. Tweten         Y2K solution
    Web Developer             (free)
HighVision Associates        s/y/k/i;


