pattern matching with non-existence of a string 
Author Message
 pattern matching with non-existence of a string

Hi,

Sorry if this has been asked before, but I've taken this code from two parts
of the perl cookbook and I'm having problems making them work together.  My
goal is to take URLs in a block of text and re-write them according to the
following rules:

   1) Absolute links (containing a protocol and dns) are to be left alone
   2) links with an absolute path will have a protocol and DNS pre-pended
   3) relative links will have a protocol, DNS and patch pre-pended.

So, that's probably not entirely clear.  How about a code snippet:

$server = " http://www.*-*-*.com/ "
$path   = "/absolute/path/"
$html   = '
   <a href="/absolute/no/dns">absolute with no dns</a>
   <a href=" http://www.*-*-*.com/ ;> http://www.*-*-*.com/ ;/a>
   <a href="relative/without/dns.html">relative/without/dns.html</a>
   <a href="relative2/without/dns.html">relative2 without dns.html</a>
';

$html =~ s/
      (<\s*
         (?:a|img|area)
         [^>]+?(?:href|src)
         \s*=\s*
         ["']?
      )
      (
         [^'"\/>]
         [^'" >]+?
      )
      ([ '"]?>)
      /
      $1.sprintf("%s%s", $path, $2).$3
      /sigex;

This bit works okay for the first one and the last two, but the middle case
(http://) fails because (clearly) I don't have any case that tells it to
avoid a leading protocol string (something like http://, ftp://, gopher://,

and tried this:

$html =~ s/
      (<\s*
         (?:a|img|area)
         [^>]+?(?:href|src)      
         \s*=\s*
         ["']?
      )
      (
         [^'"\/>(http|telnet|gopher|file|wais|ftp)]
         [^'" >]+?
      )
      ([ '"]?>)
      /
      $1.sprintf("%s%s", $path, $2).$3
      /sigex;

Which is _far_ worse since now none of the cases matches.  (Well, not
entirely true, if I remove the non-match for a leading '/' I can get the
first case to match, but that's exactly not what I want.)

I guess this is my question:  How can I do a non-match on a string?  I want
to prevent the http:// links from matching, but I can't seem to get it to
play nice.  Has anyone else done this?

Oh, and don't worry about the full DNS pre-pending, it's the same problem so
when I fix one, the other comes for free.  But it someone might have a
suggestion on how I could do this all with one pass, I'd love to hear it, as
it is I'm planning on doing two passes, the first with the path, the second
with the DNS and protocol info.

Thanks,
-Joe.

--
-Joe MacDonald.
------------------------------------------------------------------------
X windows:
        Japan's secret weapon.



Fri, 05 Sep 2003 23:11:42 GMT  
 pattern matching with non-existence of a string

Quote:

> Hi,

> Sorry if this has been asked before, but I've taken this code from two parts
> of the perl cookbook and I'm having problems making them work together.  My
> goal is to take URLs in a block of text and re-write them according to the
> following rules:

>    1) Absolute links (containing a protocol and dns) are to be left alone
>    2) links with an absolute path will have a protocol and DNS pre-pended
>    3) relative links will have a protocol, DNS and patch pre-pended.

> So, that's probably not entirely clear.  How about a code snippet:

Sure it's clear, if a url does not have an http://host/path/ on the front,
add it, if it has a /path/ on the front, but no http://host/ add the
http://host/, if it has both an http://host/ and a /path/, leave it alone.

[snip]

Quote:
> Oh, and don't worry about the full DNS pre-pending, it's the same problem so
> when I fix one, the other comes for free.  But it someone might have a
> suggestion on how I could do this all with one pass, I'd love to hear it, as
> it is I'm planning on doing two passes, the first with the path, the second
> with the DNS and protocol info.

I think you may be able to do it with one pass, using HTML::Filter and URI.
The Filter of course is a filter, it reads stuff in and prints it out.
The URI module does all the pattern matching on the URLs.

#! perl -w
use strict;
package MakePathsAbsolute;
use base HTML::Parser;
use URI;

sub new {
        my $uri = shift or do {
                require Carp;
                Carp::croak( "Must pass a url for the basepath" ) };
        my $self = &SUPER::new;
        $uri = URI->new($uri) unless ref $uri;
        $self->{base_path} = $uri;
        $self;

Quote:
}

my %p = ( img => "src", a => "href", area => "href" );
sub start {

        if( $tag =~ /^(a|img|area)$/ ) {
                my $loc = \$attr->{ $p{$tag} };
                my $uri = URI->new_bas( $$loc, $self->{base_path} );
                $$loc = $uri->canonical->as_string;
        }
        &$self->SUPER::start;
Quote:
}

1;
package main;
use URI;

my $baseurl = URI->new "http://foo.com/bar/";
my $fixurls = MakePathsAbsolute->new $baseurl;
$fixurls->parse_file(\*DATA);
exit 1;

__DATA__
<a href="/absolute/no/dns">absolute with no dns</a>
<a href="http://absolute.with/dns.html">http://absolute.with/dns.html</a>
<a href="relative/without/dns.html">relative/without/dns.html</a>
<a href="relative2/without/dns.html">relative2 without dns.html</a>
__END__

Note that the above code is untested.

--

7835 1ddf07 23a871 72656b63); print map {$e=1;for(split//){$e*=$_};pack
'V',$e} qw(ptk ppppprre pperrrlau ppppphc ppjsa r);



Thu, 13 Nov 2003 03:49:30 GMT  
 
 [ 2 post ] 

 Relevant Pages 

1. Sharing BDE Session Between Delphi And Paradox

2. yet another runtime error 200 prob...

3. Pattern Match - substitute a string after the match

4. pattern matching and non-printable characters

5. matching non-ordered string

6. Non-Greedy String Match

7. Testing for the non-existence of a variable.

8. testing for non-existence in an associative array???

9. Testing for the non-existence of a variable.

10. pattern matching with a string variable?

11. pattern matching with string variable?

12. Help on pattern matching, string size

 

 
Powered by phpBB® Forum Software