Issues with interpolating regexs 

Hello all,

Having recently decided to update my version-3-ish Perl skills to
something more modern, and having a professional interest in XML, I
decided to tackle the problem of building an XML 1.0 compliant
non-validating XML processor in pure 5.6.x-compatible Perl.  (Yes, I am
aware of the XML::Parser module.  This project has educational and
philosophical leanings that overshadow simple practicality.)

On a lark, I decided to try to get Perl's regex facilities to match and
process an entire XML document.  To do so at all I am delving into code
assertions.  To make the code comprehensible and hopefully maintainable,
I am interpolating appropriately named qr// patterns into larger
patterns.  In hope of producing something with a semblance of
efficiency, I am working with non-backtracking subpatterns.  In all of
this I am using the third edition of the Camel book and Perl's
extensive manpages as my primary references.

My problem right now -- the one I want help with, at least -- is that
I am having trouble grasping the big picture of the semantics of
interpolating regex subpatterns.  I have found by experimentation that
I don't need to worry about index numbers of backreferences; i.e. I
can interpolate qr/(["']).*?\1/s anywhere without worrying that the \1
might suddenly refer to something other than the opening ["'].  I
think I saw the same to be true for $1, etc. in code assertions.  I
also found that code assertions seem to operate in the scope in which
they were declared (should I say compiled?) rather than in the scope
in which the larger pattern is matched.  Artificial example: the
following code never prints anything other than zero, no matter the
value of $text:

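        my $depth = 0;                                        # outer $depth
        # \{ is a literal brace; the code assertion refers to the outer $depth
        my $regex = qr/ (?: \{ (?{ $depth++ }) [^{]* )+ /x;
        # a fresh, inner $depth is declared here, and that is the one printed
        { my $depth = 0; $text =~ /${regex}/ && print $depth; }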

apparently because it is the $depth in the outer scope that is
incremented, while it is the one in the inner scope that is printed.  I
have also wondered, but not yet tested, how interpolating a regex
subpattern into a non-backtracking cluster will affect things.  E.g., in

        my $pat = qr/ [ \x20 \t \r \n ]+ [^a-z] /x;
        / (?> ${pat} | < | \x20 [a-z] ) /xo;

can the third alternative ever be matched?
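
A quick programmatic check for this one case might look like the sketch below.  It keeps the shape of the pattern above but wraps each alternative in its own capture group, which is an addition made only so the test can report which alternative matched.  The helper name which_alternative and the sample strings are invented for illustration, and the /o flag is dropped because the pattern is built once with qr//.

```perl
use strict;
use warnings;

my $pat = qr/ [ \x20 \t \r \n ]+ [^a-z] /x;

# Same alternation inside (?> ... ), but each alternative gets its own
# capture group (not in the original pattern) so we can see which one won.
my $re = qr/ (?> ( ${pat} ) | ( < ) | ( \x20 [a-z] ) ) /x;

# Returns 1, 2 or 3 for the alternative that matched, or 0 for no match.
sub which_alternative {
    my ($text) = @_;
    return 0 unless $text =~ $re;
    return defined $1 ? 1
         : defined $2 ? 2
         :              3;
}

for my $sample (" a", " A", "<") {
    printf "'%s' => alternative %d\n", $sample, which_alternative($sample);
}
```

If the first sample reports alternative 3, that would suggest the cluster only prevents backtracking into itself once it has matched, and that its alternatives are still tried in order when the interpolated subpattern fails outright.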

I think I could answer those and other, similar questions for myself
(without devising a programmatic test for each) if I better understood
in general what the semantics of performing such interpolations are
supposed to be.  I can't seem to find this level of detail in the
resources I am using, however.  Can anyone help?
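
One concrete handle on the general semantics may be to look at what a qr// object turns into when it is interpolated.  The sketch below prints the pattern text and then rechecks the backreference observation with an extra capture group placed in front of the interpolated subpattern.  As far as I understand it, capture groups are numbered across the whole assembled pattern, so the earlier observation may only hold when nothing precedes the subpattern.  The names $quoted and matches and the sample strings are made up for this test.

```perl
use strict;
use warnings;

# A quoted string whose closing quote must equal the opening one via \1.
my $quoted = qr/(["']).*?\1/s;

# Printing a qr// object shows the text that gets spliced into a larger
# pattern (the exact flag notation varies between Perl versions).
print "interpolates as: $quoted\n";

# Returns 1 if $text matches $re, otherwise 0.
sub matches {
    my ($text, $re) = @_;
    return $text =~ $re ? 1 : 0;
}

# Used alone: the only capture group is the opening quote.
my $alone = matches(q{'abc'}, qr/\A$quoted\z/);

# With a capture group in front: does \1 still mean the opening quote?
my $after_group = matches(q{x'abc'}, qr/\A(x)$quoted\z/);

# Same pattern, but the text ends in "x" instead of a closing quote.
my $renumbered = matches(q{x'abcx}, qr/\A(x)$quoted\z/);

print "alone:       $alone\n";
print "after group: $after_group\n";
print "renumbered:  $renumbered\n";
```

If "after group" comes out as 0 and "renumbered" as 1, then \1 inside the interpolated subpattern is referring to the (x) group of the enclosing pattern rather than to the opening quote.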


John Bollinger

Fri, 04 Jun 2004 23:58:26 GMT  