Brain Dead & Parsing 

Folks,
I could really use some help. I am sure this is pretty simple for most
of you, but I have been trying to figure it out for a couple of hours.

I am trying to parse documents for my company from a remote web site. I am
just using lynx to get the source and then storing it in an array.

I am trying to search for a certain word like "Current:".
Once it finds "Current:", it should print the next 5 lines that have href
tags on them

or

once it finds "Current:", it should keep going until it hits the next word
it is searching for, like "Archive:".

I am trying to pull a list of all the links on a page. The links are grouped
under the titles Current and Archive. The 5 lines of href tags can change to 3
sometimes, so I would prefer to have it just print any href tags between
Current and Archive. Any help would be greatly appreciated.
I have tried a lot and have been unsuccessful so far. Thanks.

                                                -- Andy



Mon, 07 Jan 2002 03:00:00 GMT  

Quote:

>I am trying to parse documents for my company from a remote web site. I am
>just using lynx to get the source and then storing it in an array.

>I am trying to search for a certain word like "Current:".
>Once it finds "Current:", it should print the next 5 lines that have href
>tags on them
>or
>once it finds "Current:", it should keep going until it hits the next word
>it is searching for, like "Archive:".

How about this -- it should get you at least part of the way there, and can be
added to until it's exactly what you need.  Read the manpage for HTML::Parser
for more information.  I haven't paid too much attention to what you actually
need to pull from the page, so this code just grabs up to 5 anchor containers
between "Current:" and "Archive:".  Since it's just a quick cut and paste job
from some of my own code, it may contain an error or two, but the basic idea is
there.

...Steve

*** start of sample code ***

#!/bin/perl
#
# Extract interesting things from web pages
#

# Subclass HTML::Parser and do useful stuff with it:
#
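package MyParser;

use strict;
use HTML::Parser;

# Inherit from HTML::Parser.  Because new() is called with no arguments
# below, the parser runs in its version 2 compatibility mode and calls
# the start/end/text/comment/declaration methods defined here.
our @ISA = qw(HTML::Parser);

sub declaration {}
sub comment {}

# Called for every start tag.
sub start {
   my ($self, $tag, $attr, $attrseq, $origtext) = @_;

   return unless $self->{interested};

   if ($tag eq 'a') {
      # Count anchors only, and stop once 5 have been collected.
      if (++$self->{acount} > 5) {
         $self->{interested} = 0;
         return;
      }
      $self->{in_anchor} = 1;
   }
   return unless $self->{in_anchor};

   $self->{saved} .= $origtext;
}

# Called for every end tag.
sub end {
   my ($self, $tag, $origtext) = @_;

   $self->{saved} .= $origtext if $self->{in_anchor} && $self->{interested};
   $self->{in_anchor} = 0 if $tag eq 'a';
   return;
}

# Called for every chunk of plain text.
sub text {
   my ($self, $text) = @_;

   # "Current:" switches collection on, "Archive:" switches it off.
   $self->{interested} = 1 if $text =~ /Current:/;
   $self->{interested} = 0 if $text =~ /Archive:/;
   $self->{saved} .= $text if $self->{in_anchor} && $self->{interested};
}

# Return everything collected so far.
sub out { return $_[0]->{saved}; }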

#---------------------------------------------------------------------

package main;

use LWP::UserAgent;

use strict;

# Could use LWP::Simple here...
#
my $url = 'http://something....';
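my $ua  = LWP::UserAgent->new;
my $res = $ua->request(HTTP::Request->new(GET => $url));

# Report why the fetch failed instead of exiting silently.
die "Could not fetch $url: ", $res->status_line, "\n" unless $res->is_success;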

my $doc = $res->content;

# Parse and extract what you want:
#
my $w = MyParser->new();
$w->parse($doc);
$w->eof;

print $w->out;

*** end of sample code ***
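
Since the original post already has the lynx output in an array, a line-by-line version of the same idea may be enough for simple pages. What follows is only a sketch under a few assumptions: links_between is a made-up helper name, the sample @page data is invented for illustration, the "Current:" and "Archive:" markers are assumed to sit on their own lines, each anchor is assumed to fit on a single line, and the regex is a rough match for plain <a href=...> tags rather than a real HTML parser. Unlike the sample above, it has no 5-anchor limit and returns every link between the two markers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# links_between() is a hypothetical helper, not part of any module.
# It returns [url, title] pairs for anchors found on lines after a line
# matching $start_re and before the next line matching $end_re.
sub links_between {
    my ($start_re, $end_re, @lines) = @_;
    my $in_section = 0;
    my @links;

    for my $line (@lines) {
        # Marker lines switch collection on or off and are not scanned.
        if ($line =~ $end_re)   { $in_section = 0; next; }
        if ($line =~ $start_re) { $in_section = 1; next; }
        next unless $in_section;

        # Rough anchor match: capture the href value and the link text.
        while ($line =~ m{<a\s[^>]*?href\s*=\s*["']?([^"'\s>]+)[^>]*>(.*?)</a>}gis) {
            push @links, [$1, $2];
        }
    }
    return @links;
}

# Invented sample data standing in for the page source.
my @page = (
    '<a href="/home.html">Home</a>',
    '<b>Current:</b>',
    '<a href="/docs/jan.html">January report</a>',
    '<p>no link on this line</p>',
    '<A HREF="/docs/feb.html">February report</A>',
    '<b>Archive:</b>',
    '<a href="/old/2001.html">2001</a>',
);

# Print one "title<TAB>url" line per link in the Current section.
for my $link (links_between(qr/Current:/, qr/Archive:/, @page)) {
    my ($url, $title) = @$link;
    print "$title\t$url\n";
}
```

In real use, @page would hold the lines of the fetched page (for example, the output of lynx -source) instead of the sample list. If the markup is messier than this, with anchors split across lines or nested tags, the HTML::Parser approach above is likely the safer choice.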

--
Steve van der Burg
Technical Analyst, Information Services
London Health Sciences Centre
London, Ontario, Canada



Mon, 07 Jan 2002 03:00:00 GMT  