need URL parsing help 

i'm using Tom C's "plaintext URL -> hyperlinked URL" regexp from TPJ #1. sadly,
it doesn't handle the situation in which the URL is surrounded by HTML-encoded
characters (i've tested so far with angle brackets and with quotes).


given:

    &lt;http://www.*-*-*.com/&gt;

i want:

    &lt;<a href="http://www.*-*-*.com/">http://www.*-*-*.com/</a>&gt;

but the regexp (correctly) gives me:

    &lt;<a href="http://www.*-*-*.com/&gt">http://www.*-*-*.com/&gt</a>;

what's the best way to modify the regexp to handle this situation?

---- here's the sub:

sub _urlify {
    my $line = shift;
    my $urls = '(' . join ('|', qw{http telnet gopher file wais ftp}) . ')';
    my $ltrs = '\w';
    my $gunk = '/#~:.?+=&%@!\-';
    my $punc = '.:?\-';
    my $any = "${ltrs}${gunk}${punc}";
    $line =~ s{
       \b           # start at word boundary
       (            # begin capture 1
         $urls :    #  need resource and a colon
         [$any] +?  #  followed by one or more
                    #  of any valid character, but
                    #  be conservative and take
                    #  only what you need to....
       )            # end capture 1
       (?=          # a look-ahead,
                    #  non-consumptive assertion
           [$punc]* # either 0 or more punctuation
           [^$any]  #  followed by a non-url char
         |          # or else
            $       # then end of the string
       )            # end of look-ahead
     }{<a href="$1">$1</a>}igox;
    return $line;
}
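
one possible way to change the regexp itself, as a rough sketch rather than a tested fix: let the look-ahead also accept an encoded <, > or " as the end of the URL. the name _urlify_entities and the example.com URL are made up for illustration, and the $gunk list is an assumption about what the original regexp uses. &amp; is deliberately left out of the terminators, since it shows up inside encoded query strings.

```perl
use strict;
use warnings;

# Hypothetical variant of _urlify: only the look-ahead differs.
sub _urlify_entities {
    my $line = shift;
    my $urls = '(?:' . join('|', qw{http telnet gopher file wais ftp}) . ')';
    my $ltrs = '\w';
    my $gunk = '/#~:.?+=&%@!\-';    # assumed set of other URL characters
    my $punc = '.:?\-';
    my $any  = "${ltrs}${gunk}${punc}";
    $line =~ s{
        \b                       # start at word boundary
        (                        # capture the URL
          $urls :                #  scheme and a colon
          [$any]+?               #  as few URL chars as possible
        )
        (?=                      # look-ahead, consumes nothing
          [$punc]*               #  optional trailing punctuation, then
          (?: [^$any]            #   a non-URL character
            | &(?:lt|gt|quot);   #   or an encoded <, > or "
            | \z                 #   or end of string
          )
        )
    }{<a href="$1">$1</a>}igx;
    return $line;
}

print _urlify_entities('&lt;http://www.example.com/&gt;'), "\n";
# prints: &lt;<a href="http://www.example.com/">http://www.example.com/</a>&gt;
```

because the URL part is matched lazily, it stops at the first position where the look-ahead succeeds, which is now just before the &gt; instead of just before the semicolon.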



* brian moseley *
{ perl warrior | agent of chaos => critical path }

Fri, 16 Jun 2000 03:00:00 GMT  


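> given:
>     &lt;http://www.*-*-*.com/&gt;
> i want:
>     &lt;<a href="http://www.*-*-*.com/">http://www.*-*-*.com/</a>&gt;
> but the regexp (correctly) gives me:
>     &lt;<a href="http://www.*-*-*.com/&gt">http://www.*-*-*.com/&gt</a>;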

to solve that problem (for this case), you might want to extract the
data found between &lt; and &gt; and then run it through the
_urlify routine.

   s{   &lt;        # opening encoded angle bracket
        (?:URL:)?   # optional "URL:" prefix, not captured
        (.*?)       # anything, including whitespace
        (?=&gt;)    # up to the closing encoded angle bracket
    }{ '&lt;' . _urlify($1) }egxs;

and then add a line in the _urlify routine to strip out any
whitespace (as is allowed by the RFC).
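
putting those two pieces together might look something like the sketch below. it is only an illustration under a few assumptions: urlify_bracketed is a made-up wrapper name, the example.com URL is a placeholder, the $gunk list is a guess at the original, and _urlify here strips whitespace unconditionally, so it is only suitable for text already pulled out from between &lt; and &gt;.

```perl
use strict;
use warnings;

# _urlify with the extra whitespace-stripping line added at the top.
sub _urlify {
    my $line = shift;
    $line =~ s/\s+//g;              # input is one wrapped URL: drop whitespace
    my $urls = '(?:' . join('|', qw{http telnet gopher file wais ftp}) . ')';
    my $ltrs = '\w';
    my $gunk = '/#~:.?+=&%@!\-';    # assumed set of other URL characters
    my $punc = '.:?\-';
    my $any  = "${ltrs}${gunk}${punc}";
    $line =~ s{
        \b
        ( $urls : [$any]+? )        # scheme, colon, as few URL chars as needed
        (?= [$punc]* [^$any] | \z ) # stop before a non-URL char or at the end
    }{<a href="$1">$1</a>}igx;
    return $line;
}

# Hypothetical wrapper: urlify only what sits between &lt; and &gt;.
sub urlify_bracketed {
    my $text = shift;
    $text =~ s{
        &lt;          # opening encoded angle bracket
        (?:URL:)?     # optional prefix, not captured
        (.*?)         # anything, including whitespace
        (?=&gt;)      # up to the closing encoded angle bracket
    }{ '&lt;' . _urlify($1) }egxs;
    return $text;
}

print urlify_bracketed("see &lt;URL:http://www.example.com/a/\n  b&gt; ok"), "\n";
# prints: see &lt;<a href="http://www.example.com/a/b">http://www.example.com/a/b</a>&gt; ok
```

note that anything else sitting between &lt; and &gt; also loses its whitespace here, so a real version would probably want to check that the captured text looks like a URL first.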

good luck :)

brian d foy
perhaps the routine should be _referencify since the input is
already a URL.

Sat, 17 Jun 2000 03:00:00 GMT  