Extract link text from htmlpage 

Hi,

I would like to extract link urls + the link text from an html page.
The code below works if the link is formatted like this :
<a href="http://www.site.com/page.html">link text here</a>

But how can I modify this code so that it also works with eg.
<a href="http://www.site.com/page.html"><u>link</u> text here</a>
<a href="http://www.site.com/page.html"><font>link text here</font></a>
...

the markup tags within <a></a> are not important; I am only interested in
the link text.

Here is the code that I currently use:

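sub callback {
    my ($node, $start) = @_;      # traverse() passes the node and a start flag

    return 1 unless ref($node);   # skip plain text segments
    return 1 unless $start;       # skip the end-tag visit of each element
    if($node->tag() eq 'a') {
      print $node->attr('href');
      print "\n";

      print "\n";
    }

    return 1;                     # keep descending into child elements
}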

  use LWP::UserAgent;
  use HTTP::Request;
  use HTML::TreeBuilder;

  $ua = LWP::UserAgent->new();
  $ua->agent($myagent);
  $request = HTTP::Request->new('GET',$siteurl);
  $response = $ua->request($request);
  if($response->is_success()){
    $doc = $response->content();
    $tree = HTML::TreeBuilder->new();
    $tree->parse($doc);
    $tree->eof();                 # signal end of input so the tree is completed
    $tree->traverse(\&callback);
    $tree->delete();
    ....

Any help would be appreciated !
Thanks in advance.

Best Regards,
Pieter.



Thu, 12 Aug 2004 20:23:05 GMT  
...

Quote:
> I would like to extract link urls + the link text from an html page.
> The code below works if the link is formatted like this :
> <a href="http://www.site.com/page.html">link text here</a>

> But how can I modify this code so that it also works with eg.
> <a href="http://www.site.com/page.html"><u>link</u> text here</a>
> <a href="http://www.site.com/page.html"><font>link text here</font></a>
> ...

> the markup tags within <a></a> are not important; I am only interested in
> the link text.
...
> Pieter.

Well, you have already discovered how difficult HTML parsing is.  Try:

    use HTML::Parser;

You really need a real parser to properly parse HTML, particularly if it
is HTML done by someone besides yourself.
--
Bob Walton
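
As a side note to the advice above: HTML::TreeBuilder, which the original post already uses, is itself built on HTML::Parser, so the existing approach may only need a small change. The following is a minimal sketch, not the poster's code. It assumes that look_down() and as_text() from HTML::Element are acceptable here; the helper name extract_links and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns a list of [href, text] pairs,
# one for every <a> element that carries an href attribute.
sub extract_links {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @links;
    for my $anchor ($tree->look_down(_tag => 'a', href => qr/./)) {
        # as_text() joins all text below the element and
        # drops nested markup such as <u> or <font>.
        push @links, [ $anchor->attr('href'), $anchor->as_text() ];
    }
    $tree->delete();    # release the tree explicitly
    return @links;
}

# Sample input taken from the question.
my $html = <<'END_HTML';
<a href="http://www.site.com/page.html"><u>link</u> text here</a>
<a href="http://www.site.com/page.html"><font>link text here</font></a>
END_HTML

my @links = extract_links($html);
print "$_->[0]\t$_->[1]\n" for @links;
```

The point of the sketch is that the link text is read from the whole subtree under each <a> element, so it does not matter how many formatting tags are nested inside it.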



Thu, 12 Aug 2004 20:27:39 GMT  

Quote:
> I would like to extract link urls + the link text from an html page.
> The code below works if the link is formatted like this :
> <a href="http://www.site.com/page.html">link text here</a>

> But how can I modify this code so that it also works with eg.
> <a href="http://www.site.com/page.html"><u>link</u> text here</a>
> <a href="http://www.site.com/page.html"><font>link text here</font></a>
> ...

use HTML::Parser;

One of the examples which come with this module actually does exactly what
you are asking for. You just need to copy-and-paste it.

jue
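
For readers without the module's example directory at hand, here is a rough sketch of how an event-driven HTML::Parser solution could look. It is not the bundled example itself, only an illustration under the assumption that the version 3 handler API (start_h, text_h, end_h) is available; the variable names and the sample HTML are invented for this sketch.

```perl
use strict;
use warnings;
use HTML::Parser;

my @links;      # collected [href, text] pairs
my $current;    # pair being filled while inside <a>...</a>, else undef

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        # Start collecting when an <a> with an href opens.
        $current = [ $attr->{href}, '' ]
            if $tag eq 'a' && defined $attr->{href};
    }, 'tagname, attr' ],
    text_h => [ sub {
        my ($text) = @_;
        # Text inside nested tags (<u>, <font>, ...) arrives here as well.
        $current->[1] .= $text if $current;
    }, 'dtext' ],
    end_h => [ sub {
        my ($tag) = @_;
        # Closing </a> finishes the current link.
        if ($tag eq 'a' && $current) {
            push @links, $current;
            undef $current;
        }
    }, 'tagname' ],
);

# Sample input taken from the question.
$parser->parse(<<'END_HTML');
<a href="http://www.site.com/page.html"><u>link</u> text here</a>
<a href="http://www.site.com/page.html"><font>link text here</font></a>
END_HTML
$parser->eof;

print "$_->[0]\t$_->[1]\n" for @links;
```

Only <a> start and end tags change the collecting state, so any other tags inside the link are ignored and just their text is appended.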



Fri, 13 Aug 2004 19:20:21 GMT  
 