
complex regular expression question
Quote:
> I'm playing with something that compares two long documents together.
> It basically looks for new words that have been added, and new
> instances of words. Unfortunately the code is a bit unwieldy, and not
> that robust. I'm thinking of moving it all to using regular
> expressions. Does anyone have any ideas of the sorts of match
> patterns and logic that might work for this?
I'm not sure what your exact definition of a document is, but if it is (or
can be represented as) plain text...
The simple example below uses two locally defined string variables. Each
one holds the full text of one of the two documents. For example, they
could be the contents of two text files read using FSO methods.
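If the text does come from files, reading each one might look something like the sketch below. ReadAllText is a hypothetical helper name, not part of the example script. It relies on the FileSystemObject's OpenTextFile method and the TextStream's ReadAll method, and assumes the files are plain ANSI text.

```vbscript
' Hypothetical helper: returns the whole contents of a text file as a string.
Function ReadAllText(strPath)
    Const ForReading = 1
    Dim fso, ts
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile(strPath, ForReading)
    ' ReadAll raises an error on an empty file, so check for that first.
    If ts.AtEndOfStream Then
        ReadAllText = ""
    Else
        ReadAllText = ts.ReadAll
    End If
    ts.Close
End Function
```

With that in place, strThis and strThat could each be assigned from a call such as ReadAllText(path), where path is whatever file name applies in your case.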
A RegExp is used to parse the text into words. Two dictionary objects are
then used to keep a count of the occurrences of each unique word in each
text source. The dictionaries use text rather than binary compare mode, so
the words tracked are case insensitive.
The script then uses the two dictionaries to determine the words that are
common to both text sources and also the words that are unique to each.
strThis = "one two, three. four"
strThat = "seven one, nine. four"
Set sdThisWords = CreateObject("Scripting.Dictionary")
sdThisWords.comparemode = vbTextCompare
Set sdThatWords = CreateObject("Scripting.Dictionary")
sdThatWords.comparemode = vbTextCompare
set reWordParse = new regexp
reWordParse.pattern = "\w+"
reWordParse.global = true
for each match in reWordParse.execute(strThis)
  'counts occurrences of unique words (case insensitive)...
  word = cstr(match.value)
  sdThisWords(word) = sdThisWords(word) + 1
  'just tracks occurrences of unique words
  'sdThisWords(match.value) = true 'could be any value...
next
for each match in reWordParse.execute(strThat)
  'counts occurrences of unique words (case insensitive)...
  word = cstr(match.value)
  sdThatWords(word) = sdThatWords(word) + 1
  'just tracks occurrences of unique words
  'sdThatWords(match.value) = true 'could be any value...
next
wscript.echo "strThis=",strThis
wscript.echo "strThat=",strThat
wscript.echo
wscript.echo "words in strThis that are also in strThat"
for each word in sdThisWords.keys
  if sdThatWords.Exists(word) then
    wscript.echo word
  end if
next
wscript.echo
wscript.echo "words in strThis that are NOT in strThat"
for each word in sdThisWords.keys
  if Not sdThatWords.Exists(word) then
    wscript.echo word
  end if
next
wscript.echo
wscript.echo "words in strThat that are NOT in strThis"
for each word in sdThatWords.keys
  if Not sdThisWords.Exists(word) then
    wscript.echo word
  end if
next
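The question also mentions new instances of words. Because the dictionaries hold a count per word, one possible way to report those is to compare the counts. This is only a sketch: CountWords and ExtraOccurrences are hypothetical helper names, and the two sample strings are made up for illustration.

```vbscript
' Hypothetical helper: counts the words in a string, case insensitively.
Function CountWords(strText)
    Dim sd, re, match
    Set sd = CreateObject("Scripting.Dictionary")
    sd.CompareMode = vbTextCompare
    Set re = New RegExp
    re.Pattern = "\w+"
    re.Global = True
    For Each match In re.Execute(strText)
        ' A missing key reads as Empty, and Empty + 1 evaluates to 1.
        sd(match.Value) = sd(match.Value) + 1
    Next
    Set CountWords = sd
End Function

' Hypothetical helper: for each word that occurs more often in sdNew than
' in sdOld, stores the number of extra occurrences.
Function ExtraOccurrences(sdOld, sdNew)
    Dim sd, word, nOld
    Set sd = CreateObject("Scripting.Dictionary")
    sd.CompareMode = vbTextCompare
    For Each word In sdNew.Keys
        ' Use Exists here: reading a missing key would add it to sdOld.
        nOld = 0
        If sdOld.Exists(word) Then nOld = sdOld(word)
        If sdNew(word) > nOld Then sd(word) = sdNew(word) - nOld
    Next
    Set ExtraOccurrences = sd
End Function

' Illustrative strings only.
Set sdExtra = ExtraOccurrences(CountWords("one two, three. four"), _
                               CountWords("seven one, one. four"))
For Each word In sdExtra.Keys
    WScript.Echo word, sdExtra(word)
Next
```

A word that is entirely new has an old count of zero, so it is listed together with words that merely gained occurrences. Swapping the two arguments reports words that were removed or became less frequent.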
--
Michael Harris
Microsoft.MVP.Scripting
Seattle WA US