Author |
Message |
Larry Rosler #1 / 5
|
 parsing words
Quote:
> I have created a guest book using perl. I would like to know of a way of
> detecting bad words. I currently have the program search for bad words
> but some clever users started using spaces, symbols to prevent the
> program from catching the bad words. For example assume DUCK is a bad
> word. Normally, the guest book will replace it with blank spaces but if
> the user types in
> D U C K or D*U*C*K or D-U-C-K or D__U__C__K, the program will not catch
> it. Is there anyway I can prevent this from happening.
Sam Holden has given you an extended philosophical answer about why this approach probably won't work. But if you want to pursue it, a regex like this:

    /\bD[\W_]*U[\W_]*C[\W_]*K\b/i

would match those occurrences and many others, regardless of the case of the letters. (The filler class is [\W_] rather than plain \W because the underscore counts as a word character, so \W* alone would not match D__U__C__K.)

It is a nice little exercise to generate a list of regexes like that, or a compound regex, from a list of words automatically. (map, split, join, eval)

I presume that tr/D/F/ would give a more realistic example. :-)

--
(Just Another Larry) Rosler
Hewlett-Packard Laboratories
http://www.*-*-*.com/
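The exercise mentioned above (building the pattern from a word list with map, split and join) might look something like the sketch below. The word list and the sub names `word_pattern` and `compound_regex` are made up for illustration. The filler class `[\W_]*` is an assumption chosen so that underscores count as filler too, since `\W` on its own does not match an underscore. `qr//` is used in place of `eval`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical word list; DUCK stands in for a real bad word, as in the question.
my @bad_words = qw(duck goose);

# Turn one word into a pattern fragment: its letters, each pair separated
# by any run of non-word characters or underscores.
sub word_pattern {
    my ($word) = @_;
    return join '[\W_]*', map { quotemeta($_) } split //, $word;
}

# Combine all words into a single case-insensitive, word-bounded regex.
sub compound_regex {
    my @words = @_;
    my $alternation = join '|', map { word_pattern($_) } @words;
    return qr/\b(?:$alternation)\b/i;
}

my $bad_re = compound_regex(@bad_words);

# Report which sample inputs the compound regex catches.
for my $text ('D U C K', 'D*U*C*K', 'D-U-C-K', 'D__U__C__K', 'duct tape') {
    printf "%-12s %s\n", $text, ($text =~ $bad_re ? 'caught' : 'clean');
}
```

Because the pattern is built from the list, adding a word means editing `@bad_words` only. It still shares every weakness discussed elsewhere in this thread: misspellings and look-alike characters pass straight through.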
|
Wed, 18 Jun 1902 08:00:00 GMT |
|
 |
Tom Phoenix #2 / 5
|
 parsing words
Quote:
> I have created a guest book using perl. I would like to know of a way
> of detecting bad words. I currently have the program search for bad
> words but some clever users started using spaces, symbols to prevent
> the program from catching the bad words. For example assume DUCK is a
> bad word. Normally, the guest book will replace it with blank spaces
> but if the user types in D U C K or D*U*C*K or D-U-C-K or D__U__C__K,
> the program will not catch it. Is there anyway I can prevent this from
> happening.
You need a program that's at least as smart as the people who are trying to outsmart it. That's not a simple matter.

One system I heard of did a good job of suppressing bad words until the dog breeders needed to talk about their {*filter*}es. Another wouldn't let anyone discuss the region of Scunthorpe in Great Britain. Overkill.

I heard that someone made a closed caption decoder which tried to substitute nicer words for the bad ones. It sanitized "Dick Van{*filter*}" into "Jerk Van {*filter*}".

Sure, you could make a program which would strip the punctuation and scan the rest, but the vandals will still find a way to sneak past with a few

     _ _      _                                 _
  __| (_)_ __| |_ _   _  __      _____  _ __ __| |___
 / _` | | '__| __| | | | \ \ /\ / / _ \| '__/ _` / __|
| (_| | | |  | |_| |_| |  \ V  V / (_) | | | (_| \__ \
 \__,_|_|_|   \__|\__, |   \_/\_/ \___/|_|  \__,_|___/
                  |___/

I suggest that you simply save each day's additions to a file for a human being to check before appending.

Cheers!

--
Tom Phoenix       Perl Training and Hacking       Esperanto
Randal Schwartz Case: http://www.*-*-*.com/
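The suggestion to hold each day's additions in a file until a human has checked them could be sketched roughly as follows. The `pending-YYYY-MM-DD.txt` file naming, the `%%` record separator, and the sub names `pending_file` and `queue_entry` are assumptions for illustration, not anything specified in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);
use POSIX qw(strftime);

# Name of the pending file for the (UTC) day containing $time.
# The naming scheme is a hypothetical convention.
sub pending_file {
    my ($dir, $time) = @_;
    return sprintf '%s/pending-%s.txt', $dir,
        strftime('%Y-%m-%d', gmtime($time));
}

# Append one submission to that day's pending file instead of the live
# guest book. Returns the name of the file that was written.
sub queue_entry {
    my ($dir, $name, $message, $time) = @_;
    $time = time unless defined $time;
    my $file = pending_file($dir, $time);

    # Keep the header on one line so a name cannot forge extra fields.
    (my $clean_name = $name) =~ s/[\r\n]+/ /g;

    open(my $fh, '>>', $file) or die "Cannot append to $file: $!";
    flock($fh, LOCK_EX)       or die "Cannot lock $file: $!";
    print {$fh} "From: $clean_name\n\n", "$message\n", "%%\n";
    close($fh)                or die "Cannot close $file: $!";
    return $file;
}

# Usage (hypothetical directory):
#   queue_entry('/var/guestbook/pending', $visitor_name, $visitor_text);
```

The lock matters because a CGI guest book can receive two submissions at once. A reviewer would then read the day's file, delete what should not appear, and append the rest to the real guest book.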
|
Wed, 18 Jun 1902 08:00:00 GMT |
|
 |
r.. #3 / 5
|
 parsing words
hi,

I have created a guest book using perl. I would like to know of a way of detecting bad words. I currently have the program search for bad words but some clever users started using spaces and symbols to prevent the program from catching the bad words. For example assume DUCK is a bad word. Normally, the guest book will replace it with blank spaces but if the user types in D U C K or D*U*C*K or D-U-C-K or D__U__C__K, the program will not catch it. Is there any way I can prevent this from happening?

Thanks a lot.
-rahul

Sent via Deja.com http://www.deja.com/
Before you buy.
|
Wed, 18 Jun 1902 08:00:00 GMT |
|
 |
Sam Holden #4 / 5
|
 parsing words
Quote:
>hi,
>I have created a guest book using perl. I would like to know of a way of
>detecting bad words. I currently have the program search for bad words
>but some clever users started using spaces, symbols to prevent the
>program from catching the bad words. For example assume DUCK is a bad
>word. Normally, the guest book will replace it with blank spaces but if
>the user types in
>D U C K or D*U*C*K or D-U-C-K or D__U__C__K, the program will not catch
>it. Is there anyway I can prevent this from happening.
You could disallow all text, so that all posts would be blank. That's about all you can do...

People tend to be better at recognising words masked like this than computers. Consider: DUK DLICK DU(K DVCK KCUD.

Adding a human to the loop is probably the only way, even if all that person does is add new regexes to match bad words based on the ones which are slipping through.

There is also the problem of false positives: foreign words (or English words) which contain 'duck' in them somewhere, place names, people's names, etc.

If you must censor things, then bulk deletion of messages with 'bad' words will probably convince people not to bother. Such a policy would make it hard to talk about wonderful things like South Park though.

--
Sam

About the only thing most people know about black holes is they are black, and now we have stuffed that up
        -- Dr Paul Francis (after reporting finding 'pink' holes)
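Both problems raised above, false positives and masked spellings that slip through, can be seen in a small experiment. The two patterns and the sub name `classify` are illustrative assumptions. 'Mr. Duckworth' is an invented example of a name containing the word, and the remaining samples are the masked forms listed in the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive check: matches the word anywhere, including inside other words.
my $substring = qr/duck/i;

# Whole-word check that tolerates punctuation or underscores between letters.
my $bounded = qr/\bd[\W_]*u[\W_]*c[\W_]*k\b/i;

# Report which of the two checks fires for a piece of text.
sub classify {
    my ($text) = @_;
    return {
        substring => ($text =~ $substring ? 1 : 0),
        bounded   => ($text =~ $bounded   ? 1 : 0),
    };
}

for my $text ('Mr. Duckworth', 'D-U-C-K', 'DUK', 'DLICK', 'DU(K', 'DVCK', 'KCUD') {
    my $result = classify($text);
    printf "%-14s substring=%d bounded=%d\n",
        $text, $result->{substring}, $result->{bounded};
}
```

The substring check flags the surname, while the word-bounded check does not. Neither check catches any of the five masked spellings, which is the argument for keeping a person in the loop.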
|
Wed, 18 Jun 1902 08:00:00 GMT |
|
 |
Casey R. Tweten #5 / 5
|
 parsing words
:For example assume DUCK is a bad word. Normally, the guest book will
:replace it with blank spaces but if the user types in D U C K or D*U*C*K
:or D-U-C-K or D__U__C__K, the program will not catch it. Is there anyway
:I can prevent this from happening.

It has been said before that true filtering is next to impossible.

However, one implementation of a regex might be the following, using negated character classes to describe what _is_ allowed to be between the letters (anything that is not itself a letter). This, of course, is far from foolproof.

    #!/usr/local/bin/perl
    use strict;
    $|++;    # unbuffer STDOUT

    while ( <DATA> ) {
        # Blank out D, U, C, K separated by any run of non-letters
        s/D[^a-z]*U[^a-z]*C[^a-z]*K/ /ig;
        print;
    }
    __DATA__
    hi,
    I have createdD a guest book using perl. I would like to know of a way of
    detecting bad words. I currently have the program search for bad words
    but some clever users started Uusing spaces, symbols to prevent the
    program from Ccatching the bad words. For example assume DUCK is a bad
    word. Normally, the guest book will replace it with blank spaces but if
    the user types in
    D U C K or D*U*C*K or D-U-C-K or D__U__C__K, the program will not catch
    it. Is there anyway I can prevent this from happening.

    Thanks a lot.K
    This is a DRUNCK too.
    Hello, this DUC'K.

--
Casey R. Tweten                 Y2K solution
Web Developer                   (free)
HighVision Associates           s/y/k/i;
|
Wed, 18 Jun 1902 08:00:00 GMT |
|
|
|