Downloading images with HTTP/LWP libraries 
I'm trying to figure out how to download an embedded image
in a web page using the HTTP/LWP libraries.  Right now,
I'm resorting to submitting a request with the image's URL.
This creates two problems:
  1) The image I get when I save $response->content to a
         file is very distorted.
  2) In response to a request like http://www.foo.com/bar.jpg
         some websites give an html page with bar.jpg embedded.
         In this case, all I get in $response->content is the HTML code.

Here's the code I'm using now:

  use HTTP::Request;
  use LWP::UserAgent;

  $ua = LWP::UserAgent->new;
  $request = HTTP::Request->new('GET',
       'http://www.foo.com/bar.jpg');
  $request->header(Accept => "image/*");
  $response = $ua->request($request);

  open (OUTPUT, ">result.jpg");
  print OUTPUT $response->content;
  close OUTPUT;

Any suggestions on how to do this better?

Thanks.

-Rob



Tue, 21 Oct 2003 11:20:19 GMT  

Quote:

> I'm trying to figure out how to download an embedded image
> in a web page using the HTTP/LWP libraries.  Right now,
> I'm resorting to submitting a request with the image's URL.
> This creates two problems:
>   1) The image I get when I save $request->content to a
>          file is very distorted.
>   2) In response to a request like http://www.foo.com/bar.jpg
>          some websites give an html page with bar.jpg embedded.
>          In this case, all I get in $request->content is the HTML code.
<code deleted>
> Any suggestions on how to do this better?
...
> -Rob

Your image is probably distorted because (I'm guessing here) you are
using a brain-dead OS that requires the use of binmode to correctly
write binary files, like image files.  See perldoc -f binmode.

Also, test your opens for success by adding or die "Oops, $!" to them.
You'll be glad you did.  For that matter, test whether the fetch of
the web page worked before attempting to use the returned results.

As for getting HTML back when fetching an image file, that sounds like a
misconfigured web server to me.  Does the web page which references
these images work OK in your browser?  If so, is the web page actually
pointing to those image files, or perhaps to other image files
somewhere?  Maybe the web server is trying to be too smart by paying
attention to what kind of browser you say you are when making the
request.  If so, you could pretend to be whatever browser it is that
works with the web server (by using $ua->agent('whatever');).  Who knows
what the server thinks you are if you don't say?  Maybe it assumes
text-only?
--
Bob Walton
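A minimal sketch of the three checks suggested above (binmode on the output handle, a tested open, and a tested fetch). The helper name save_binary, the default file name, and the 'Mozilla/5.0' agent string are illustrative assumptions, not anything the server is known to require.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Write raw bytes to a file. binmode stops platforms such as Windows
# from translating "\n" to "\r\n" inside the image data.
sub save_binary {
    my ($path, $bytes) = @_;
    open(my $out, '>', $path) or die "Cannot open $path: $!";
    binmode($out);
    print {$out} $bytes or die "Cannot write $path: $!";
    close($out) or die "Cannot close $path: $!";
    return length $bytes;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my ($url, $file) = @ARGV;
    $file = 'result.jpg' unless defined $file;

    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/5.0');    # assumption: a browser-like agent string

    my $response = $ua->get($url);
    die "Fetch failed: ", $response->status_line, "\n"
        unless $response->is_success;

    save_binary($file, $response->content);
}
```

Run it as: perl fetch.pl http://www.foo.com/bar.jpg result.jpg (the script name is hypothetical).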



Tue, 21 Oct 2003 12:37:41 GMT  

Quote:

> I'm trying to figure out how to download an embedded image
> in a web page using the HTTP/LWP libraries.  Right now,
> I'm resorting to submitting a request with the image's URL.
> This creates two problems:
>   1) The image I get when I save $request->content to a
>          file is very distorted.
>   2) In response to a request like http://www.foo.com/bar.jpg
>          some websites give an html page with bar.jpg embedded.
>          In this case, all I get in $request->content is the HTML code.
> Here's the code I'm using now:

(snipped)

You are investing a lot of code effort for something
very simple. You have the needed URL addresses for
these images. Download them via a browser and save
them to your local system. This is quite logical.

Should you elect to do this in a less than logical
manner, binary mode your Standard Output. Cross
your fingers, this might help.

use LWP::Simple;

$image = get ("url/path/to/image.jpg");

binmode STDOUT;
open (FILEHANDLE, ">image.jpg");
print FILEHANDLE $image;
close (FILEHANDLE);

You should know Win32 systems will not correctly
store a binary image file using this method. You
may also encounter other operating systems which
exhibit problems storing binary images. Use of
binmode may or may not correct this problem.

On images embedded in an html page, you may have
to send a Referer header appropriate for the site.
Some sites do not allow offsite links to an image.

With little effort, you can extract image URLs
from an html page. There is really no need to
write such elaborate code for this task. You
will discover LWP::Simple works well.

The best approach is to simply download these images
via a browser and save them to a local disk drive.

Godzilla!



Tue, 21 Oct 2003 13:52:28 GMT  


Quote:
> Your image is probably distorted because (I'm guessing here) you are
> using a brain-dead OS that requires the use of binmode to correctly
> write binary files, like image files.  perldoc -f binmode.

This works.  Thanks.

Quote:
> As for getting HTML back when fetching an image file, that sounds like a
> mis-configured web server to me.  Does the web page which references
> these images work OK in your browser?  If so, is the web page actually
> pointing to those image files, or perhaps to other image files
> somewhere?

When I enter the image's URL in Internet Explorer, I get a web page with
a banner ad and a back button, along with the image.  The relative
pathname of the embedded image (/dir2/bar.jpg) is the end of the URL
of the page (http://www.foo.com/dir1/dir2/bar.jpg).

Quote:
> Maybe the web server is trying to be too smart by paying
> attention to what kind of browser you say you are when making the
> request.  If so, you could pretend to be whatever browser it is that
> works with the web server (by using $ua->agent('whatever');).

If I identify myself as 'Mozilla/5.0' I get different html (no banner ad)
than if I identify myself as 'IE/5.0' or if I don't identify myself.
It's still html in any case.

Is it possible to retrieve an html file and the embedded images with a
single HTTP request?

Thanks for your help.

-Rob



Tue, 21 Oct 2003 14:32:50 GMT  
Bob Walton wrote in comp.lang.perl.misc:

Quote:
}
} As for getting HTML back when fetching an image file, that sounds like a
} mis-configured web server to me.  Does the web page which references
} these images work OK in your browser?  If so, is the web page actually
} pointing to those image files, or perhaps to other image files
} somewhere?  Maybe the web server is trying to be too smart by paying
} attention to what kind of browser you say you are when making the
} request.  If so, you could pretend to be whatever browser it is that
} works with the web server (by using $ua->agent('whatever');).  Who knows
} what the server thinks you are if you don't say?  Maybe it assumes
} text-only?

Another guess: the web server issues a redirect from the JPEG file to
an HTML file when some header is not present (probably User-Agent or
Referer). Try using $ua->simple_request instead of $ua->request to
avoid being redirected, and check the status code of the response:
is it 200? Also try adding headers to the original request.

--
Rafael Garcia-Suarez / http://rgarciasuarez.free.fr/perl/biscuit.html
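The redirect check described above might look like the sketch below. simple_request sends one request and does not follow redirects, so a 3xx status and its Location header stay visible. The helper describe_response, the agent string, and the optional second argument (the page URL to send as Referer) are assumptions made for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;

# Summarize a response: status code, content type, and the redirect
# target if the server answered with a redirect.
sub describe_response {
    my ($response) = @_;
    my $summary = $response->code . ' '
                . ($response->content_type || 'unknown type');
    $summary .= ' -> ' . $response->header('Location')
        if $response->is_redirect;
    return $summary;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my ($image_url, $page_url) = @ARGV;

    my $ua      = LWP::UserAgent->new(agent => 'Mozilla/5.0');
    my $request = HTTP::Request->new(GET => $image_url);
    $request->header(Referer => $page_url) if defined $page_url;

    # One request, no automatic redirect handling.
    my $response = $ua->simple_request($request);
    my $summary  = describe_response($response);
    print "$summary\n";
}
```

A line such as "302 text/html -> ..." would support the redirect theory, while "200 text/html" would mean the server returns the HTML page directly for that URL.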



Tue, 21 Oct 2003 15:02:48 GMT  

Rafael Garcia-Suarez wrote:
{}  Bob Walton wrote in comp.lang.perl.misc:
{}  }
{}  } As for getting HTML back when fetching an image file, that sounds like a
{}  } mis-configured web server to me.  Does the web page which references
{}  } these images work OK in your browser?  If so, is the web page actually
{}  } pointing to those image files, or perhaps to other image files
{}  } somewhere?  Maybe the web server is trying to be too smart by paying
{}  } attention to what kind of browser you say you are when making the
{}  } request.  If so, you could pretend to be whatever browser it is that
{}  } works with the web server (by using $ua->agent('whatever');).  Who knows
{}  } what the server thinks you are if you don't say?  Maybe it assumes
{}  } text-only?
{}  
{}  Another guess : the web server issues a redirect from the JPEG file to
{}  an HTML file when some header is not present (probably User-Agent or
{}  Referer). Try to use $ua->simple_request instead of $ua->request to
{}  avoid being redirected and investigate the status code of the response :
{}  is it 200 ? Try to add headers to the original request.

Of course, people could make the false assumption that when a URL ends
in '.jpg', it will return a JPEG image.

On the HTTP level, there is no relationship between the URL and the
Content-Type.

Abigail
--
BEGIN {my $x = "Knuth heals rare project\n";
       $^H {integer} = sub {my $y = shift; $_ = substr $x => $y & 0x1F, 1;
       $y > 32 ? uc : lc}; $^H = hex join "" => 2, 1, 1, 0, 0}
print 52,2,10,23,16,8,1,19,3,6,15,12,5,49,21,14,9,11,36,13,22,32,7,18,24;
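Since the URL says nothing reliable about what comes back, one option is to inspect the Content-Type header of the response before saving anything. This is only a sketch: the helper name looks_like_image is made up here, and a server can still send a wrong header.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# True (1) only for a successful response whose Content-Type is image/*.
sub looks_like_image {
    my ($response) = @_;
    return $response->is_success
        && $response->content_type =~ m{^image/} ? 1 : 0;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $url      = $ARGV[0];
    my $response = LWP::UserAgent->new->get($url);
    my $type     = $response->content_type || 'no Content-Type';
    my $verdict  = looks_like_image($response)
                 ? 'image data'
                 : 'not an image, do not save it as .jpg';
    print "$url: $type ($verdict)\n";
}
```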



Tue, 21 Oct 2003 21:33:50 GMT  

Robert> I'm trying to figure out how to download an embedded image
Robert> in a web page using the HTTP/LWP libraries.  Right now,
Robert> I'm resorting to submitting a request with the image's URL.
Robert> This creates two problems:
Robert>   1) The image I get when I save $request->content to a
Robert>          file is very distorted.
Robert>   2) In response to a request like http://www.foo.com/bar.jpg
Robert>          some websites give an html page with bar.jpg embedded.
Robert>          In this case, all I get in $request->content is the HTML code.

Robert> Here's the code I'm using now:

Robert>   use HTTP::Request;
Robert>   use LWP::UserAgent;

Robert>   $ua = LWP::UserAgent->new;
Robert>   $request = HTTP::Request->new('GET',
Robert>        'http://www.foo.com/bar.jpg');
Robert>   $request->header(Accept => "image/*");
Robert>   $response = $ua->request($request);

Robert>   open (OUTPUT, ">result.jpg");
Robert>   print OUTPUT $response->content;
Robert>   close OUTPUT;

Robert> Any suggestions on how to do this better?

Yes, stop typing so much:

    use LWP::Simple;
    mirror("http://www.foo.com/bar.jpg", "result.jpg");

First and most important advice:

        read "perldoc lwpcook" from beginning to end, ONCE.

Shall I repeat that?

        read "perldoc lwpcook" from beginning to end, ONCE.

There are far too many tips on higher-level interfaces in that
document to not be aware of them.  It's a false Laziness not
to read it.

print "Just another Perl hacker,";

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095

Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
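The mirror call above returns an HTTP status code, which is worth checking. A small sketch, assuming a hypothetical helper named mirror_outcome; is_success comes from the HTTP::Status functions that LWP::Simple exports, and 304 is the "Not Modified" status that mirror returns when the local copy is already current.

```perl
use strict;
use warnings;
use LWP::Simple;

# Turn the status code returned by mirror() into a short verdict.
sub mirror_outcome {
    my ($rc) = @_;
    return 'unchanged' if $rc == 304;        # local copy already up to date
    return 'saved'     if is_success($rc);   # any 2xx status
    return 'failed';
}

# Only touch the network when arguments are given on the command line.
if (@ARGV) {
    die "usage: $0 URL FILE\n" unless @ARGV == 2;
    my ($url, $file) = @ARGV;
    my $rc = mirror($url, $file);
    print "$url: ", mirror_outcome($rc), " ($rc)\n";
}
```

Because mirror writes the file itself, there is no filehandle in your own code on which to forget binmode.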



Tue, 21 Oct 2003 22:20:43 GMT  

Godzilla!> use LWP::Simple;

Godzilla!> $image = get ("url/path/to/image.jpg");

Godzilla!> binmode STDOUT;

Of course, binmode on STDOUT has nothing to do with the following
code.  It's a red herring.  Waste of characters.  Distracting.
Vestigial no doubt.

Godzilla!> open (FILEHANDLE, ">image.jpg");

You need to insert binmode FILEHANDLE in here.

Godzilla!> print FILEHANDLE $image;
Godzilla!> close (FILEHANDLE);

The rest of Kira's advice is sound, however.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095

Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!



Tue, 21 Oct 2003 22:22:33 GMT  

Quote:

>Is it possible to retrieve an html file and the embedded images with a
>single HTTP request?

No, each URL has to be retrieved independently. But HTML::LinkExtor
(part of the HTML::Parser distribution) can extract embedded URLs from a
page and convert relative URLs to absolute ones.

--
        Bart.
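A short sketch of the HTML::LinkExtor approach mentioned above. The helper name image_urls is an assumption; passing a base URL as the second argument to new is what makes the module return absolute URLs.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Return the absolute URL of every <img src="..."> found in $html.
# $base is the URL the page was fetched from; relative links are
# resolved against it.
sub image_urls {
    my ($html, $base) = @_;
    my @urls;
    my $extractor = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            # Stringify, because links arrive as URI objects when a base is set.
            push @urls, "$attr{src}" if $tag eq 'img' && defined $attr{src};
        },
        $base,
    );
    $extractor->parse($html);
    $extractor->eof;
    return @urls;
}

# Example with an inline page, so no network access is needed.
my $page = '<html><body><a href="x.html">x</a>'
         . '<img src="/dir2/bar.jpg"><img src="pics/baz.png"></body></html>';
print "$_\n" for image_urls($page, 'http://www.foo.com/dir1/page.html');
```

Each URL returned this way still has to be fetched with its own request.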



Tue, 21 Oct 2003 23:02:04 GMT  

Quote:

>binmode STDOUT;
>open (FILEHANDLE, ">image.jpg");
>print FILEHANDLE $image;

Whoops, wrong filehandle. And use binmode() after the file has been
opened.

--
        Bart.



Tue, 21 Oct 2003 23:03:20 GMT  

Quote:


> > use LWP::Simple;
> > $image = get ("url/path/to/image.jpg");
> > binmode STDOUT;
> Of course, binmode on STDOUT has nothing to do with the following
> code.  It's a red herring.  Waste of characters.  Distracting.
> Vestigial no doubt.

Well shoot Randal, as a teenager, I lost my vestigial
in the backseat of a '63 Chevy Impala, hence my daughter.

Quote:
> > open (FILEHANDLE, ">image.jpg");
> You need to insert binmode FILEHANDLE in here.
> > print FILEHANDLE $image;
> > close (FILEHANDLE);
> The rest of Kira's advice is sound, however.

Much to my annoyance, neither my mind nor my rearend
are as sound as my advice.

* notices Randal raises an eyebrow *

Thanks for pointing out my error.

Godzilla!



Tue, 21 Oct 2003 23:55:35 GMT  

Quote:


> >binmode STDOUT;
> >open (FILEHANDLE, ">image.jpg");
> >print FILEHANDLE $image;
> Whoops, wrong filehandle.

Perhaps I should have used a broomhandle. This
would be more fitting for my personality.

Quote:
> And use binmode() after the file has been
> opened.

Yeah, Randal robbed me of my vestigial on this one.
Ya know, I thought something didn't look right but
I couldn't think of what. I glanced through some of
my books but failed to find my mistake.

Ok, I confess. I didn't do as much homework as
I should have. I will write on my chalkboard,
one-hundred times, "I will do my homework."

This reminds me of our two-room schoolhouse
teacher back in Oklahoma, when I was young
and still a vestigial.

She instructed me to write on her chalkboard,
one hundred times, "I will not talk in class."
Fine. Ok.

Instead of writing horizontally, I found it quicker
to write vertically...

I will not talk
I will not talk
I will not
I will not
I will not
...

As our Good Lord would have it, She sent a devilish
angel to whisper in our teacher's ear. I was caught,
her chalkboard erased and, I had to write my punishment
horizontally five-hundred times along with gritting
my teeth during a switching by my Grandpa at home,
later in the day. Devilish whispered messages do
tend to spread fast and far.

I never did stop talking in class. I became more
efficient at not being caught, just as I became
more efficient at writing punishment on her old
chalkboard, bless her heart. A fine woman she is.

Try this chalkboard punishment today and, as a teacher,
you will find yourself in civil court facing litigation
for emotionally abusing a student.

* thinks today's parents need a Grandpa switching *

Godzilla!



Wed, 22 Oct 2003 00:16:48 GMT  

Quote:



[snip]
> Is it possible to retrieve an html file and the embedded images with a
> single HTTP request?

Well, sort of.  You can use a keep-alive type connection, and perform
multiple requests over the same connection.  Since for most users, the
longest part of retrieving an http object is the connect, not the actual
downloading, this can speed things up significantly.

However, this is still multiple requests.  You ask for the first
document, passing a "Connection: keep-alive" header along with
everything else, then read the html and parse it to learn the URL
of the image, then send another request (for the path from that URL),
read the image, and then close the connection.

However, I don't see why you are downloading the html file, unless the
URL of the image changes from time to time.

--
Customer: "I would like to try on that suit in the window."
Salesman: "Sorry sir, you will have to use the dressing room."
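With LWP, the keep-alive idea above can be tried by giving the user agent a connection cache through its keep_alive option. A minimal sketch, assuming both URLs live on the same host and that the server agrees to keep the connection open; the helper name make_agent is made up for the example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent that keeps one connection open for reuse.
sub make_agent {
    return LWP::UserAgent->new(keep_alive => 1);
}

# Only touch the network when URLs are given on the command line.
if (@ARGV) {
    my $ua = make_agent();
    # Typically the page URL first, then the image URL it refers to.
    # Each one is still a separate HTTP request.
    for my $url (@ARGV) {
        my $response = $ua->get($url);
        printf "%s: %s\n", $url, $response->status_line;
    }
}
```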



Mon, 10 Nov 2003 13:30:18 GMT  
 