Code Parsing Redux - stateless regex-based subtractive tokenizing 

Another twist on tokenizing VBScript (actually useful for almost any text)...

The technique I use below is somewhat different. We read in an entire block of
text, then begin using a regex anchored to the start of the text.  If one type
of regex fails, then we go to the next.

The regexes here have one flaw with string-literal parsing. Still, given a
well-ordered and sufficiently complete set of regular expressions, this
approach is workable for tokenizing VBScript - it already does a reasonable job.
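Since the thread's code below is VBScript, which not everyone can run, here is a minimal sketch of the same anchored, subtractive idea in Python. The token set, patterns, and names are illustrative assumptions, not the VBScript script's exact definitions:

```python
import re

# Ordered (pattern, name) pairs; the first pattern that matches at the
# start of the remaining text wins.
TOKEN_TYPES = [
    (r'\r\n|\r|\n',             '<NewLine>'),
    (r"'[^\r\n]*",              '<Comment>'),
    (r'[ \f\t\v]+',             '<Whitespace>'),
    (r'"[^"\r\n]*"',            '<StringLiteral>'),
    (r'[0-9]+',                 '<IntegerLiteral>'),
    (r'[A-Za-z][A-Za-z0-9_.]*', '<Identifier>'),
    (r'\S',                     '<Other>'),
]

def tokenize(data):
    """Repeatedly match at the start, then subtract the match from the input."""
    tokens = []
    while data:
        for pattern, name in TOKEN_TYPES:
            m = re.match(pattern, data)   # re.match anchors at the start
            if m:
                tokens.append((name, m.group(0)))
                data = data[m.end():]     # "subtractive": drop what matched
                break
        else:
            raise ValueError("unrecognized input: %r" % data[:10])
    return tokens
```

The `break`/`else` pair mirrors the `Select Case True` fall-through in the VBScript below: try each token type in order, and complain only when nothing matches.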

sData = ReadFile(WScript.ScriptFullName)

Dim rx
set rx = new RegExp
rx.Multiline = False
rx.Global = True
rx.IgnoreCase = True

' next is our set of regex patterns
' array: pattern, symbolic name, whether to print literally
StringLiteral = Array("(" & Chr(34) & "[\x23-\x7f]*" & Chr(34) _
 & ")+", "<StringLiteral>", True)
NewLine = Array("\r*(\r|\n)\r*", "<NewLine>", False)
Whitespace = Array("[ \f\t\v]+(_[ \f\t\v]*(\r|\n|\r\n)){0,1}", _
 "<Linear whitespace>", False)
' this overlooks Rem, but REM is evil anyway...
Comment = Array("'.*(\r*(\r|\n)\r*)+","<Comment>", True)
IntegerLiteral = Array("[0-9]+", "<IntegerLiteral>", True)
Identifier = Array("[a-zA-Z]+[a-zA-Z.0-9_]*", "<Identifier>", True)
Other = Array("\S+", "<Other>", True)
ErrorCount = 0
PassCount = 0
Do While Len(sData) > 10
 PassCount = PassCount + 1
 Select Case true
  ' regex breakdown of data
  Case Found(NewLine)
   Process NewLine
  Case Found(Comment)
   Process Comment
  Case Found(Whitespace)
   Process Whitespace
  Case Found(StringLiteral)
   Process StringLiteral
  Case Found(IntegerLiteral)
   Process IntegerLiteral
  Case Found(Identifier)
   Process Identifier
  Case Found(Other)
   Process Other
  Case Else
   WScript.Echo "It's not whitespace, it's not NOT whitespace - ?"
   ErrorCount = ErrorCount + 1
   If ErrorCount > 10 then
    WScript.Echo "Bailing Will Robinson!"
    WScript.Echo Left(sData, 10)
    WScript.Quit
   End If
 End Select
Loop

Sub Process(regex)
 rx.Pattern = "^" & regex(0)
 Set Matches = rx.Execute(sData)   ' Execute search.
 For Each Match in Matches   ' Iterate Matches collection.
  If regex(2) Then
   WScript.Echo "pass", PassCount & ":", regex(1), _
    Match.Value
  else
   WScript.Echo "pass", PassCount & ":", regex(1), _
    Escape(Match.Value)
  end if
 Next
 sData = rx.Replace(sData, "")
End Sub

Function Found(regex)
 rx.Pattern = "^" & regex(0)
 'WScript.echo "pattern = " & regex(0)
 Found = rx.Test(sData)
End Function

Function ReadFile(FilePath)
 'Given the path to a file, will return entire contents
 ' works with either ANSI or Unicode
 Dim FSO, CurrentFile
 Const ForReading = 1, TristateUseDefault = -2, _
 DoNotCreateFile = False
 Set FSO = createobject("Scripting.FileSystemObject")
 If FSO.FileExists(FilePath) Then
  If FSO.GetFile(FilePath).Size>0 Then
   Set CurrentFile = FSO.OpenTextFile(FilePath, ForReading, _
   False, TristateUseDefault)
   If CurrentFile.AtEndOfStream <> True Then
    ReadFile = CurrentFile.ReadAll: CurrentFile.Close
   End If
  End If
 End If
End Function

--
Please respond in the newsgroup so everyone may benefit.
 http://dev.remotenetworktechnology.com
(email requests for support contract information welcomed)
 ----------
 Subscribe to Microsoft's Security Bulletins:
 http://www.microsoft.com/technet/security/bulletin/notify.asp



Tue, 16 Aug 2005 03:51:16 GMT  
 Code Parsing Redux - stateless regex-based subtractive tokenizing
Nothing like posting it to suddenly show me what I did wrong. ;)

Here's an updated version.

s = "this is a string with spaces"
sData = ReadFile(WScript.ScriptFullName)

Dim rx
set rx = new RegExp
rx.Multiline = False
rx.Global = True
rx.IgnoreCase = True

' next is our set of regex patterns
' array: pattern, symbolic name, whether to print literally
StringLiteral = Array("(" & Chr(34) & "[ \x09\x21\x23-\x7f]*" & Chr(34) _
 & ")+", "<StringLiteral>", True)
NewLine = Array("\r*(\r|\n)\r*", "<NewLine>", False)
Whitespace = Array("[ \f\t\v]+(_[ \f\t\v]*(\r|\n|\r\n)){0,1}", _
 "<Linear whitespace>", False)
' this overlooks Rem, but REM is evil anyway...
Comment = Array("'.*(\r*(\r|\n)\r*)+","<Comment>", True)
IntegerLiteral = Array("[0-9]+", "<IntegerLiteral>", True)
Identifier = Array("[a-zA-Z]+[a-zA-Z.0-9_]*", "<Identifier>", True)
Relation = Array("[=<>]+", "<Relation>", True)
Separator = Array("[)(.,:]", "<Separator>", True)
Operator = Array("[-\\\/*^+&]", "<Operator>", True)
Other = Array("\S+", "<Other>", True)
ErrorCount = 0
PassCount = 0

Do While Len(sData) > 0
 PassCount = PassCount + 1
 Select Case true
  ' regex breakdown of data
  Case Found(NewLine)
   Process NewLine
  Case Found(Comment)
   Process Comment
  Case Found(Whitespace)
   Process Whitespace
  Case Found(StringLiteral)
   Process StringLiteral
  Case Found(IntegerLiteral)
   Process IntegerLiteral
  Case Found(Identifier)
   Process Identifier
  Case Found(Relation)
   Process Relation
  Case Found(Separator)
   Process Separator
  Case Found(Operator)
   Process Operator
  Case Found(Other)
   Process Other
  Case Else
   WScript.Echo "It's not whitespace, it's not NOT whitespace - ?"
   ErrorCount = ErrorCount + 1
   If ErrorCount > 10 then
    WScript.Echo "Bailing Will Robinson!"
    WScript.Echo Left(sData, 10)
    WScript.Quit
   End If
 End Select
Loop

Sub Process(regex)
 rx.Pattern = "^" & regex(0)
 Set Matches = rx.Execute(sData)   ' Execute search.
 For Each Match in Matches   ' Iterate Matches collection.
  If regex(2) Then
   WScript.Echo "pass", PassCount & ":", regex(1), _
    Match.Value
  else
   WScript.Echo "pass", PassCount & ":", regex(1), _
    Escape(Match.Value)
  end if
 Next
 sData = rx.Replace(sData, "")
End Sub

Function Found(regex)
 rx.Pattern = "^" & regex(0)
 'WScript.echo "pattern = " & regex(0)
 Found = rx.Test(sData)
End Function

Function ReadFile(FilePath)
 'Given the path to a file, will return entire contents
 ' works with either ANSI or Unicode
 Dim FSO, CurrentFile
 Const ForReading = 1, TristateUseDefault = -2, _
 DoNotCreateFile = False
 Set FSO = createobject("Scripting.FileSystemObject")
 If FSO.FileExists(FilePath) Then
  If FSO.GetFile(FilePath).Size>0 Then
   Set CurrentFile = FSO.OpenTextFile(FilePath, ForReading, _
   False, TristateUseDefault)
   If CurrentFile.AtEndOfStream <> True Then
    ReadFile = CurrentFile.ReadAll: CurrentFile.Close
   End If
  End If
 End If
End Function

' just some math so I can see how it parses at the end...
f = 9 + 3
e = f*6^3
a = b Mod 2
a = f \ e



Tue, 16 Aug 2005 04:13:10 GMT  
 Code Parsing Redux - stateless regex-based subtractive tokenizing


Quote:
> Nothing like posting it to suddenly show me what I did wrong. ;)

> Here's an updated version.

I don't understand it all yet but I wanted to comment on how clever the
"Select case true" construct is!  Almost like a try/catch.


Wed, 17 Aug 2005 03:51:08 GMT  
 Code Parsing Redux - stateless regex-based subtractive tokenizing

Quote:


> > Nothing like posting it to suddenly show me what I did wrong. ;)

> > Here's an updated version.

> I don't understand it all yet but I wanted to comment on how clever the
> "Select case true" construct is!  Almost like a try/catch.

I probably picked that up from either Torgeir or Michael, of course...

The rest of it isn't that hard.  Actually, the title is a lie - this is
"lexing" (tokenizing the code), not really parsing.




Wed, 17 Aug 2005 05:39:19 GMT  
 Code Parsing Redux - stateless regex-based subtractive tokenizing
Hi, Alex
I like it.  Still can't figure out those regular expressions.

One question, though:
Did you intend that the newline at the end of comment lines be part of the comment?

-Paul Randall



Wed, 17 Aug 2005 10:44:14 GMT  
 Code Parsing Redux - stateless regex-based subtractive tokenizing


Quote:
> Hi, Alex
> I like it.  Still can't figure out those regular expressions.

Just use them without understanding them for a while, then get really
frustrated and stare at them when they don't work.  At least that's how I
finally figured them out. ;)

Quote:
> Did you intend that the newline at the end of comment lines be part of the
> comment?

I hadn't even noticed - thanks for pointing that out!  I _should_ have seen it,
since it causes irregular blank lines in the output.

Your question got me looking at this again.  For some reason I can't get
non-capturing matches to work in VBScript's RegExp, which makes producing really
regular output with separate terminators difficult to do.

That's definitely a detail that needs to be considered in parsing the tokenized
output.

I also found a weakness in the newline definition I was using.  I was making the
NewLine symbol correspond to Unicode.org's definition of a newline.  Paul Vick
of Microsoft pointed out, though, that to VBScript/VB.NET a _logical_ newline,
ignoring continuations, is actually of this form:

    ((\r\n)|\r|\n)

Here's an edited form of the NewLine, Whitespace, and Comment definitions:

NewLine = Array("((\r\n)|\r|\n)", "<NewLine>", False)
Whitespace = Array( _
 "[ \f\t\v]+(_[ \f\t\v]*((\r\n)|\r|\n)[ \f\t\v]*){0,1}", _
 "<Linear whitespace>", False)
' this overlooks Rem, but REM is evil anyway...
Comment = Array("'.*((\r\n)|\r|\n){0,1}","<Comment>", True)
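The logical-newline pattern can be checked quickly in Python, where non-capturing groups do work (written here as `(?:\r\n)|\r|\n`; this is a sanity check of the pattern, not part of the posted VBScript):

```python
import re

# The VBScript/VB.NET logical newline discussed above: CRLF, CR, or LF.
logical_newline = re.compile(r'(?:\r\n)|\r|\n')

# Each sample should contain exactly one logical newline,
# and "\r\n" must count as one newline, not two.
for sample in ('a\r\nb', 'a\rb', 'a\nb'):
    assert len(logical_newline.findall(sample)) == 1
```

The ordering inside the alternation matters: `\r\n` must be tried before the bare `\r`, or a CRLF pair would be split into two separate newlines.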




Wed, 17 Aug 2005 19:31:26 GMT  
 Code Parsing Redux - stateless regex-based subtractive tokenizing

Quote:
> Another twist on tokenizing VBScript (actually useful for almost any text)...

> The technique I use below is somewhat different. We read in an entire block of
> text, then begin using a regex anchored to the start of the text.  If one type
> of regex fails, then we go to the next.

This trick is kind of OK, and certainly has the advantage of "easy".
It has a few problems:
1) If you have hundreds of tokens (e.g., hundreds of regular expressions),
   it can be pretty slow.  Might not matter much for processing
   small scripts.  This is why people use LEX-like tools.
2) If one regular expression matches something much longer
   than another one that also matches, how do you decide which one
   is right?  For instance,
   For keyword matches "FOR"
   Identifier matches  "FOR17"
   You need some rule to choose.
   The problem with REM is one of these; is it an identifier? a keyword?
   a comment?
   Typical trick is to order the RE matches so the "winner" comes first.
3) Some RE may partially match a prefix much longer than
   another one does.  How much do you read ahead, to make sure
   you have enough to match?  VBScript tends to break tokens at line
   boundaries, so I suppose for VBScript you can read just one line
   and succeed.  Languages that have string literals containing line
   breaks aren't so nice for this.
4) One soon finds out that VBScript programs can be pure, or mixed
   into ASP/HTML scripts.   Since recognizing HTML isn't the same as
   recognizing VBScript, you can't do it with one set of regular expressions.
   ( <HTML/>  is NOT the sequence of tokens "<", IDENTIFIER, "/", ">"!).
   You need multiple lexing engines.
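Point 2 can be made concrete with a tiny sketch in Python; the two-rule token set and the `first_match` helper are hypothetical, for illustration only:

```python
import re

def first_match(rules, text):
    """Return (token_name, lexeme) for the first rule matching at the start."""
    for pattern, name in rules:
        m = re.match(pattern, text)
        if m:
            return name, m.group(0)
    return None

# Ordering alone is not enough: with the bare keyword pattern first,
# "FOR17" is mis-lexed as the keyword FOR.
naive = [(r'FOR', 'Keyword'), (r'[A-Za-z][A-Za-z0-9_]*', 'Identifier')]
assert first_match(naive, 'FOR17') == ('Keyword', 'FOR')      # wrong split

# A word boundary (or an explicit longest-match comparison) fixes it,
# while still keeping keywords ahead of identifiers in the ordering.
fixed = [(r'FOR\b', 'Keyword'), (r'[A-Za-z][A-Za-z0-9_]*', 'Identifier')]
assert first_match(fixed, 'FOR17') == ('Identifier', 'FOR17')
assert first_match(fixed, 'FOR i') == ('Keyword', 'FOR')
```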

And, as you point out in a later message, this is LEXing, not parsing.
Parsing requires another entire set of machinery.

I think something else is being missed here, though.  The purpose
of all this lexing/parsing/ ... is ... something.

Typically, one wants to analyze programs for dumb stuff,
and/or to modify programs containing dumb stuff so
they aren't so dumb.

To carry through on analysis/modification, one needs machinery
way beyond lexing and parsing.   You need ways to write down
the analysis you wish to do, to write down the changes you
wish to make (usually conditional on some analysis results),
and a way to regenerate the program with changes in it.

Most folks that want to do lex/parse  usually think that if
they just solve those two problems, the rest is easy,
and so they bash their heads repeatedly on trying to solve
the lex/parse problem.   But what they will find is that
success in lex/parse will lead them to bash their heads
on the *next* set of problems!  I'm not trying to actively
discourage anyone, but you should be aware of how much
machinery you *really* need.

So I'm going to insert here an egregious plug for our
tools, the DMS Software Reengineering Toolkit, which
is generalized compiler technology covering all phases
of software analysis and modification:
lexing/parsing/analyzing/translating/optimizing/formatting
See http://www.semdesigns.com/Products/DMS/DMSToolkit.html.

And yes, DMS can parse/prettyprint VBScript at this point.
Its other machinery could be used to build custom
analysis/modifications to VBScript programs.



Thu, 18 Aug 2005 02:39:01 GMT  
 Code Parsing Redux - stateless regex-based subtractive tokenizing


Quote:
> Alex,
> I have made a minor addition in your code, for the evil REM.
> You can check the attached text.
> Tsakalidis G. Evangelos

That was the one piece I had no desire to deal with myself - cool!

I'm reworking this as a class object now.




Thu, 18 Aug 2005 22:26:44 GMT  
 Code Parsing Redux - stateless regex-based subtractive tokenizing
Wow - especially your post in the 'Stripping VBScript prior to tokenizing'
thread... <g>

Some VBScript-specific comments below.



Quote:
> > ... using a regex anchored to the start of the text.  If one type
> > of regex fails, then we go to the next.
> This trick is kind of OK, and certainly has the advantage of "easy".
> 1) If you have hundreds of tokens (e.g., hundreds of regular expressions),
>    it can be pretty slow.  Might not matter much for processing
>    small scripts.  This is why people use LEX-like tools.

Oddly enough, it may be faster than a character-processor in VBScript since the
regex engine is a binary.  Obviously best-suited to tiny things with simple
grammar...

Quote:
> 2) If one regular expression matches something much longer
>    than another one that also matches, how do you decide which one
>    is right?
>    Typical trick is to order the RE matches so the "winner" comes first.

I didn't cover that, even though I ordered the regexes.  To me this is the
fundamental weakness in this approach - as "types" of interest increase, the
potential boundaries between them become more complex.

Quote:
> 4) One soon finds out that VBScript programs can be pure, or mixed
>    into ASP/HTML scripts.   Since recognizing HTML isn't the same as
>    recognizing VBScript, you can't do it with one set of regular expressions.
>    ( <HTML/>  is NOT the sequence of tokens "<", IDENTIFIER, "/", ">"!).
>    You need multiple lexing engines.

For doing something like this in VBScript, the usual trick is to _not_ do it in
VBScript.  COM is great for this since it helps compartmentalize issues - it's a
matter of a few lines of script to extract the script text using the HTML DOM.
(example at bottom).

This also illustrates the limits in VB family languages and certain Windows
tools areas - more below where it is more relevant.

Quote:
> And, as you point out in a later message, this is LEXing, not parsing.
> Parsing requires another entire set of machinery.

As I'm coming to understand this, thanks to Ira and to Peter Jinks, a lecturer
at the University of Manchester, this is another issue which was covered in
depth in the '70's or so - multiple stages to first "lex" characters into
tokens, then do "other things".

The "other things" _originally_ were targeted at compiling code, and were the
origin of lex and yacc.  One way this is done conceptually is like this, if I
have it right:

characters --> tokens --> statements --> structures --> program
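A toy sketch in Python of the first two arrows of that pipeline; the token tuples and the split-at-newline rule are assumptions for illustration, not a real VBScript grammar:

```python
# characters -> tokens is the lexer above; tokens -> statements can be as
# simple as splitting the token stream at <NewLine>, which roughly matches
# VBScript's line-oriented statement structure.
tokens = [
    ('<Identifier>', 'a'), ('<Other>', '='), ('<IntegerLiteral>', '1'),
    ('<NewLine>', '\n'),
    ('<Identifier>', 'b'), ('<Other>', '='), ('<Identifier>', 'a'),
]

def statements(token_stream):
    """Yield lists of lexemes, one list per logical statement."""
    stmt = []
    for name, value in token_stream:
        if name == '<NewLine>':
            if stmt:          # skip empty lines
                yield stmt
            stmt = []
        else:
            stmt.append(value)
    if stmt:
        yield stmt

assert list(statements(tokens)) == [['a', '=', '1'], ['b', '=', 'a']]
```

Later stages (statements -> structures -> program) need real grammar machinery, which is exactly the point being made above.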

Quote:
> I think something else is being missed here, though.  The purpose
> of all this lexing/parsing/ ... is ... something.

You'll be the first to know when I find out.<g>

Quote:
> Typically, one wants to analyze programs for dumb stuff,
> and/or to modify programs containing dumb stuff so
> they aren't so dumb.

> To carry through on analysis/modification, one needs machinery
> way beyond lexing and parsing.   You need ways to write down
> the analysis you wish to do, to write down the changes you
> wish to make (usually conditional on some analysis results),
> and a way to regenerate the program with changes in it.

> Most folks that want to do lex/parse  usually think that if
> they just solve those two problems, the rest is easy,
> and so they bash their heads repeatedly on trying to solve
> the lex/parse problem.   But what they will find is that
> success in lex/parse will lead them to bash their heads
> on the *next* set of problems!  I'm not trying to actively
> discourage anyone, but you should be aware of how much
> machinery you *really* need.

This is where the real limitation of code parsing in VBScript becomes apparent.
One of the key merits of tools like VBScript is the ability to do prototypical
analysis quickly.  The fundamental tools for this are simply not easily
leveraged by VBScript - and worst of all, analytic ability rests on a known
grammar for VBScript, something which does not exist.  (I don't mean it isn't
public, I mean that the language as-written by Microsoft does not have an
accurate grammar behind it; it is apparently a hand-coded tool.  The
prototypical grammar for VB languages Microsoft has made available is only an
approximation).

Where an approach like this does have inherent value is in parsing arbitrary
structured files which are not in a tagged form like XML.

Where it will run into problems is advanced analysis.  The performance of
VBScript drops off rapidly as a function of size and complexity of parsed data.
Although it is possible to port this to VB/VB.NET, the worst problem is that the
analytic tools for this are simply NOT easily accessible to people working in a
COM/.NET world.  There are lex/yacc/bison on *nix or the Cygwin tools if you can
handle C; there is ANTLR if you know Java.  There are of course mini-tools which
are pseudo-ported to C#, etc., but parsing is a huge, complex area which is
explored in depth by very few Windows programmers, and in which the vast
majority of the tools are primarily *nix-based to this day.

Quote:
> So I'm going to insert here an egregious plug for our
> tools, the DMS Software Reengineering Toolkit, which
> is generalized compiler technology covering all phases
> of software analysis and modification:
> lexing/parsing/analyzing/translating/optimizing/formatting
> See http://www.semdesigns.com/Products/DMS/DMSToolkit.html.

figured you would...;)




Fri, 19 Aug 2005 00:38:15 GMT  
 Code Parsing Redux - stateless regex-based subtractive tokenizing

Quote:




>>> Alex,
>>> I have made a minor addition in your code, for the evil REM.
>>> You can check the attached text.
>>> Tsakalidis G. Evangelos

>> I'm reworking this as a class object now.

> Done!

> Another triumph of Hellenic science. ;)

No kidding - I'd hate to see you two as co-authors on a paper or
presentation - especially if I had to pronounce it correctly. =) Then
again the head of my department is "Stamatios Krimigis", so maybe I'd
better start practicing...


Fri, 19 Aug 2005 03:18:34 GMT  
 Code Parsing Redux - stateless regex-based subtractive tokenizing

Quote:





> >>> Alex,
> >>> Tsakalidis G. Evangelos
> > Another triumph of Hellenic science. ;)

> No kidding - I'd hate to see you two as co-authors on a paper or
> presentation - especially if I had to pronounce it correctly. =) Then
> again the head of my department is "Stamatios Krimigis", so maybe I'd
> better start practicing...

With Americanized Greeks it's much easier.  Just say the name with confidence
and you can get away with any mangling you choose. :)




Fri, 19 Aug 2005 10:56:55 GMT  