Extracting Hyperlink Title 

I'm trying to find a way of extracting the descriptive text between
the <A> and </A> anchors of a hyperlink in a page of HTML.

The following short CGI script retrieves a page of HTML and extracts
the URLs, converting relative URLs into absolute ones. (It's based on
examples in Clinton Wong's Web Client Programming.) However, it
doesn't retain the title of the URL, i.e. the text description between
the opening and closing anchors. I wonder if anyone knows if there is
any way of modifying the coding so that this text can be displayed?

     Alec Wearing

     -----------------------------------------------
     #!/usr/local/bin/perl

     # *** retrieve.cgi ***
     # *** Displays Anchor Links as Clickable Hyperlinks ***

     use LWP::Simple;
     use HTML::Parse;
     use HTML::Element;
     use URI::URL;

     $page = ' http://www.*-*-*.com/ ';

     $html = get $page;
     $parsed_html = HTML::Parse::parse_html($html);

     print "Content-type: text/html\n\n";



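     # extract_links returns a reference to a list of [URL, element] pairs
     foreach (@{ $parsed_html->extract_links('a') }) {
          $link = $_->[0];
          $url = new URI::URL $link;
          $full_url = $url->abs($page);
          print "<A HREF=\"$full_url\">$full_url</A>\n\n";
     }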
     --------------------------------------------------------



Wed, 27 Sep 2000 03:00:00 GMT  


Quote:

> I'm trying to find a way of extracting the descriptive text between
> the <A> and </A> anchors of a hyperlink in a page of HTML.

Alec,

extract_links actually returns a reference to an array of arrays.
The inner array is [ URL, reference to an HTML::Element object ].
In weird cases where the description of an anchor tag contains
tags (example: <a href="..."><b>SecurID</b></a>), one needs
to traverse the tree until a scalar is found.

This type of detail should be available in the second edition
of Web Client Programming.

Regards,
Clinton
------modified version of your code below-------------------------
#!/usr/local/bin/perl -w

use LWP::Simple;
use HTML::Parse;
use HTML::Element;
use URI::URL;
use strict;

my $page = 'http://www.perl.com/';

my $html = get $page;
my $parsed_html = HTML::Parse::parse_html($html);

print "Content-type: text/html\n\n";



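# extract_links returns a reference to a list of [ URL, element ] pairs
foreach (@{ $parsed_html->extract_links('a') }) {
  my $link = $_->[0];
  my $description = $_->[1];

  # Follow first children down until plain text (or nothing) is reached.
  # An empty anchor has no content, so guard against an undefined list.
  while (ref $description) {
    my $content = $description->content;
    $description = $content ? $content->[0] : undef;
  }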

  # if there isn't plain text between the anchor tags, use empty string.
  if (! defined $description) { $description='' }

  my $url = new URI::URL $link;
  my $full_url = $url->abs($page);
  print "<A HREF=\"$full_url\">$description</A>\n\n";

}
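
The loop above follows only the first child at each level, so an anchor such as <a href="..."><b>Secur</b>ID</a> would yield just "Secur". A minimal sketch of one alternative is shown below, assuming a reasonably recent HTML::Tree and URI distribution: it asks each anchor element for all of its text with as_text. The base URL and the sample HTML are made up for illustration and are not from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Hypothetical base URL and page content, used only for this sketch.
my $base = 'http://www.example.com/news/';
my $html = <<'HTML';
<html><body>
<a href="story1.html"><b>Secur</b>ID ships</a>
<a href="/about.html">About us</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @links;
for my $pair (@{ $tree->extract_links('a') }) {
    my ($href, $elem) = @$pair;       # [ URL, HTML::Element object, ... ]

    # as_text joins every text node under the element, nested tags included.
    my $text = $elem->as_text;
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;

    # Resolve relative URLs against the page they came from.
    push @links, [ URI->new_abs($href, $base)->as_string, $text ];
}
$tree->delete;                        # free the parse tree

print "$_->[1] => $_->[0]\n" for @links;
```

Note that as_text ignores images, so a link made up only of an <IMG> tag comes back as an empty string here.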



Thu, 28 Sep 2000 03:00:00 GMT  

Quote:

>                                I wonder if anyone knows if there is
> any way of modifying the coding so that this text can be displayed?

Take a look at this example:
----------------------------

use HTML::TreeBuilder;

$p = HTML::TreeBuilder->new;
$p->parse_file("xxx.html");
$p->traverse(\&extract_alinks, 1);

sub extract_alinks
{
   # traverse passes the current node as the first argument
   my($elem) = @_;
   return 1 unless ref($elem) && $elem->tag eq "a";

   my $link = $elem->attr('href');
   my $text = extract_atext($elem);
   print "$text => $link\n";

   return 0;  # no need to traverse further down
}

sub extract_atext
{
   my $node = shift;
   return $node unless ref($node);
   my $text = "";
   $node->traverse(
       sub {
           # text nodes arrive as plain strings, elements as objects
           my($elem) = @_;
           $text .= $elem, return 0 unless ref($elem);
           if ($elem->tag eq "img") {
              $text .= $elem->attr('alt') || "[Image]";
           }
           1;
       });
   # clean spaces in the string
   $text =~ s/\s+/ /g;
   $text =~ s/^\s+//;
   $text =~ s/\s+$//;
   $text;
}
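
For trying the same idea without a file on disk, here is a rough self-contained sketch. It is not the code above: it swaps the traverse callbacks for look_down and a small recursive helper over content_list, and it assumes a reasonably recent HTML::Tree. The helper name anchor_text and the sample HTML are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: collect the text under a node, using the ALT
# attribute (or a placeholder) wherever an image stands in for text.
sub anchor_text {
    my ($node) = @_;
    return $node unless ref $node;                        # plain text node
    return $node->attr('alt') || '[Image]' if $node->tag eq 'img';
    return join '', map { anchor_text($_) } $node->content_list;
}

# Made-up sample page covering an image with ALT text, an image
# without it, and an ordinary text link containing nested markup.
my $tree = HTML::TreeBuilder->new_from_content(<<'HTML');
<html><body>
<a href="a.html"><img src="a.gif" alt="Top story"></a>
<a href="b.html"><img src="b.gif"></a>
<a href="c.html">Plain <i>text</i> link</a>
</body></html>
HTML

my @found;
for my $anchor ($tree->look_down(_tag => 'a')) {
    my $text = anchor_text($anchor);
    $text =~ s/\s+/ /g;                                   # clean spaces
    $text =~ s/^\s+|\s+$//g;
    push @found, [ $text, $anchor->attr('href') ];
}
$tree->delete;

print "$_->[0] => $_->[1]\n" for @found;
```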

--
Gisle Aas


Thu, 28 Sep 2000 03:00:00 GMT  

Many Thanks.

I'm trying my hand at writing a script to retrieve, and amalgamate
into a single page, the links on web sites which carry details of news
stories of e-commerce developments; hence the need for extracting the
titles of URLs.

Incidentally, I'm one of those people referred to in the preface to
your fascinating book as Tinkerers :-)

Regards,

Alec






Fri, 29 Sep 2000 03:00:00 GMT  

Thank you.

The example you give is clearly very useful, not least because it
covers situations where links are in the form of clickable images.
From the point of view of retrieving links to news stories, News.com
(www.news.com) is one site which uses images to provide the hyperlink
for some of its stories.

Regards,

Alec





Fri, 29 Sep 2000 03:00:00 GMT  
 