grabbing stuff from web pages 
 grabbing stuff from web pages

Part of my web site has recommended books. I use the cover jpegs from
Amazon as part of my page. The rule with Amazon is that you have to
download the picture and put it on your site, not link to theirs.

So I was starting out with Dave and Andy's example:

require 'net/http'

h = Net::HTTP.new('www.xprogramming.com', 80)
resp, data = h.get('/index.htm', nil)
if resp.message == "OK"
  puts data
  data.scan(/<img src="(.*?)"/) { |x| puts x }
end

 figuring that I could then come up with some way to get and save the
image ... I remember doing it once before as an experiment, though I
can't find the code now.
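
A minimal sketch of the image-address extraction in current Ruby. `image_sources` is a hypothetical helper name and the sample HTML is made up for illustration. Since `scan` with a capture group yields one-element arrays, the result is flattened into plain strings. A regex is only a rough tool for HTML, but it is usually enough for a quick script like this.

```ruby
# Collect the src attribute of every <img> tag in an HTML string.
# Handles single or double quotes and attributes in any order.
def image_sources(html)
  html.scan(/<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i).flatten
end

# Made-up sample page for illustration.
page = '<p><img src="/covers/a.jpg" alt="A"><IMG alt="B" SRC=\'b.gif\'></p>'
puts image_sources(page)
```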

But when you set up to hook to the amazon book page, which can be
accessed this way in a web browser

http://www.amazon.com/exec/obidos/ASIN/0201708426/

so that the code looks like this:

require 'net/http'

h = Net::HTTP.new('www.amazon.com', 80)
resp, data = h.get('/exec/obidos/ASIN/0201708426/', nil)
if resp.message == "OK"
  puts data
  data.scan(/<img src="(.*?)"/) { |x| puts x }
end

 You get an error 302. I suppose that it's all that magic redirection
that they're using to get you to the real page, but I don't know enough
to be sure.

So I'd appreciate help on

  a) getting the right page address given that I only know the number
that comes after ASIN, and
  b) saving the jpeg once I get its src address.
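
For (a), the page address can at least be assembled from the ASIN alone. This is a small sketch in current Ruby: `asin_page_uri` is a hypothetical name, and the host and path pattern are simply the ones used in the code above. Whether the server answers that address directly or redirects elsewhere is a separate matter.

```ruby
require "uri"

# Build the book page address from an ASIN, using the path pattern
# from the request shown earlier in this post.
def asin_page_uri(asin)
  URI::HTTP.build(host: "www.amazon.com", path: "/exec/obidos/ASIN/#{asin}/")
end

puts asin_page_uri("0201708426")
```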

Thanks!

Ronald E Jeffries
http://www.XProgramming.com
http://www.objectmentor.com
I'm giving the best advice I have. You get to decide whether it's true for you.



Sat, 03 Jul 2004 22:29:08 GMT  
 grabbing stuff from web pages

Quote:

> Part of my web site has recommended books. I use the cover jpegs from
> Amazon as part of my page. The rule with Amazon is that you have to
> download the picture and put it on your site, not link to theirs.

I have not tested it, but RAA contains a library called WebFetcher
which is designed to ease this kind of thing. I don't know if it will
solve the 'Error 302' issue though.

--
Pierre-Charles David (pcdavid <at> emn <dot> fr)
Computer Science PhD Student, École des Mines de Nantes, France
Homepage: http://purl.org/net/home/pcdavid



Sat, 03 Jul 2004 22:48:05 GMT  
 grabbing stuff from web pages
What an honor to help Ron Jeffries ;-)
Regarding b): just this morning I *hacked* together (i.e. still to be
refactored ;-)) this working code snippet:

require 'net/http'

def getJpeg(webserver, urlPath)
  print "getting #{urlPath}\n"
  h = Net::HTTP.new(webserver, 80)
  begin
    resp, data = h.get(urlPath, nil)
  rescue
    print "   FAILED!\n"
    return
  end
  # Write the JPEG in binary mode ("wb") so that Windows does not
  # translate newline bytes and corrupt the image.
  jpegFilename = urlPath.sub(/.*\/(.*?)$/, '\1')
  File.open(jpegFilename, "wb") { |fh|
    fh.syswrite(data)
  }
end
..

getJpeg should work for you, too.
Regards Clemens
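
The save step can also be sketched in current Ruby, assuming the image bytes have already been downloaded. `save_image` is a hypothetical helper name. `File.basename` takes the last path segment and `File.binwrite` writes in binary mode, so no newline translation happens on Windows. Note that `File.basename` does not strip a query string, so this assumes a plain path.

```ruby
# Derive a local file name from a URL path and write the bytes unchanged.
# Returns the path of the file that was written.
def save_image(url_path, bytes, dir = ".")
  name = File.basename(url_path)   # last segment of the URL path
  target = File.join(dir, name)
  File.binwrite(target, bytes)     # binary mode, no newline translation
  target
end
```

Called as `save_image(path_from_img_src, data)`, it writes into the current directory.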

Quote:
> -----Original Message-----

> Sent: Tuesday, 15 January 2002 15:33

> Subject: grabbing stuff from web pages




Sat, 03 Jul 2004 22:53:27 GMT  
 grabbing stuff from web pages
a) the following snippet is from the "Programming Ruby" book (just before
http://www.rubycentral.com/book/lib_network.html#Net::HTTP.new):
"
The code below illustrates the handling of an HTTP status 301, a redirect.
It uses Tomoyuki Kosimizu's URI package, available in the RAA.

h = Net::HTTP.new(ARGV[0] || 'www.ruby-lang.org', 80)
url = ARGV[1] || '/'

begin
  resp, data = h.get(url, nil) { |a| }
rescue Net::ProtoRetriableError => detail
  head = detail.data

  if head.code == "301"
    uri = URI.create(head['location'])

    host = uri['host']
    url  = uri['path']
    port = uri['port']

    h.finish
    h = Net::HTTP.new(host, port)

    retry
  end
end
"

A 302 response also seems to provide head['location'], which can be used
for the next attempt. I have not found the URI module mentioned here, though.
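
For comparison, a minimal sketch of the same idea in current Ruby, where `net/http` and `uri` both ship with the standard library. The names `resolve_location` and `fetch_following_redirects` are hypothetical, and the limit of five requests is an arbitrary assumption. It treats any redirect response that carries a Location header the same way, so it covers 302 as well as 301.

```ruby
require "net/http"
require "uri"

# Resolve a Location header (absolute or relative) against the URL
# that produced it.
def resolve_location(current_url, location)
  URI.join(current_url, location).to_s
end

# Fetch a URL and follow redirects, giving up after `limit` requests.
def fetch_following_redirects(url, limit = 5)
  raise ArgumentError, "too many redirects" if limit <= 0

  response = Net::HTTP.get_response(URI.parse(url))
  location = response["location"]

  if response.is_a?(Net::HTTPRedirection) && location
    fetch_following_redirects(resolve_location(url, location), limit - 1)
  else
    response
  end
end

# Usage (performs real network requests):
#   resp = fetch_following_redirects("http://www.ruby-lang.org/")
#   puts resp.code
```

Resolving the header with `URI.join` matters because a server may send a relative Location, which cannot be requested on its own.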

Clemens




Sat, 03 Jul 2004 23:46:40 GMT  
 grabbing stuff from web pages
[Pierre-Charles David]:

Quote:
> I have not tested it, but RAA contains a library called WebFetcher
> which is designed to ease this kind of things. I don't know if it will
> solve the 'Error 302' issue though.

Yes it does. I'm the author so I should know ;)

Here is a webfetcher script for downloading the cover image:

  require 'webfetcher'

  url = 'http://www.amazon.com/exec/obidos/ASIN/0201708426/'
  im = WebFetcher::Page.url(url).images.find {|i| i.url['MZ']}.save('cov.jpg')

If you want to try the module you can get the latest version (0.5.3)
here:

    http://www.acc.umu.se/~r2d2/programming/ruby/webfetcher/

The RAA entry is not completely up to date. I will fix it as soon as I
can remember my passphrase. ;)

// Niklas



Sun, 04 Jul 2004 00:02:41 GMT  
 grabbing stuff from web pages

Quote:

>what an honor to help Ron Jeffries ;-)

Just another fat old ignorant guy when it comes to this stuff.

Thanks!

Ronald E Jeffries
http://www.XProgramming.com
http://www.objectmentor.com
I'm giving the best advice I have. You get to decide whether it's true for you.



Sun, 04 Jul 2004 00:36:20 GMT  
 grabbing stuff from web pages
On Tue, 15 Jan 2002 14:48:05 GMT, Pierre-Charles David

Quote:

>I have not tested it, but RAA contains a library called WebFetcher
>which is designed to ease this kind of things. I don't know if it will
>solve the 'Error 302' issue though.

Thanks for the pointer ... it seems neat. Doesn't read my amazon pages
though ... but I'll make good use of it somehow!

Ronald E Jeffries
http://www.XProgramming.com
http://www.objectmentor.com
I'm giving the best advice I have. You get to decide whether it's true for you.



Sun, 04 Jul 2004 05:30:19 GMT  
 grabbing stuff from web pages


Quote:
>[Pierre-Charles David]:
>> I have not tested it, but RAA contains a library called WebFetcher
>> which is designed to ease this kind of things. I don't know if it will
>> solve the 'Error 302' issue though.

>Yes it does. I'm the author so I should know ;)

>Here is a webfetcher script for downloading the cover image:

>  require 'webfetcher'

>  url = 'http://www.amazon.com/exec/obidos/ASIN/0201708426/'
>  im = WebFetcher::Page.url(url).images.find {|i| i.url['MZ']}.save('cov.jpg')

>If you want to try the module you can get the latest version (0.5.3)
>here:

>    http://www.acc.umu.se/~r2d2/programming/ruby/webfetcher/

Did that, but it doesn't work on my machine ... says this:

GET http://www.amazon.com/exec/obidos/ASIN/0201708426/
/cygdrive/c/ruby/li...
`fetch_1_1': undefined method `response' for #<Net::ProtoRetriableError:0xa123530> (NameError)
        from /cygdrive/c/ruby/lib/ruby/site_ruby/1.6/webfetcher.rb:365:in `fetch'
        from /cygdrive/c/ruby/lib/ruby/site_ruby/1.6/webfetcher.rb:441:in `resp'
        from /cygdrive/c/ruby/lib/ruby/site_ruby/1.6/webfetcher.rb:456:in `html?'
        from /cygdrive/c/ruby/lib/ruby/site_ruby/1.6/webfetcher.rb:578:in `extract'
        from /cygdrive/c/ruby/lib/ruby/site_ruby/1.6/webfetcher.rb:656:in `images'
        from C:\Data\Ruby\CodeManager\workspace3.rb:4

Tool completed with exit code 1

Advice will be most welcome!

Ronald E Jeffries
http://www.XProgramming.com
http://www.objectmentor.com
I'm giving the best advice I have. You get to decide whether it's true for you.



Sun, 04 Jul 2004 05:41:02 GMT  
 grabbing stuff from web pages

Quote:
> Thanks for the pointer ... it seems neat. Doesn't read my amazon pages
> though ... but I'll make good use of it somehow!

I dinked with the redirects some, but didn't have time to figure it all out
(plus I think my problem was compounded by accessing a secure site).

If you're willing & able to use vbs & ie, the following uses xmlhttp (part
of ie) to download a file. xmlhttp has all the built-in smarts to properly
handle redirects. I used it to download all my mp3s from myplay when they
shut things down. The following is a Windows Scripting Component that I
called from ruby through win32ole. Requires MDAC 2.5, I believe, because the
adodb.stream object didn't exist prior to that. There's also a way to do it
without adodb.stream, but it requires a little more code. I believe the
google link in the comments can direct you there as well.

<?XML version="1.0"?>
<package>
<?component error="true" debug="true"?>

<component id="cLabsDownloadFile">
   <registration
      progid="cLabs.Util.DownloadFile"
      description="cLabs File Download Utility"
      version="1.0"
      clsid="{a911de59-3a11-47a9-94f9-88ed96cfca78}"/>

   <public>
      <method name="DownloadFile"/>
   </public>

   <comment>
     The object tag instantiates the specified progid
   </comment>
   <object id="oHTTP" progid="Microsoft.XMLHTTP" />
   <object id="stream" progid="adodb.stream" />

   <script language="VBScript">
      <![CDATA[
        Function DownloadFile(sSource, sDest)
          ' code modified from
          ' http://groups.google.com/groups?hl=en&selm=%23v1%23CmirAHA.2132%40tkm...5

          '--Begin user variables--
          'sSource = "http://myplay.winamp.com/mp/download/dl?isRio=1<id=CWAl5rkx3I7mkQup...00"
          'sDest = "d:\temp\acorn.mp3"
          '---End user variables---

          oHTTP.open "GET", sSource, False
          oHTTP.send

          const adTypeBinary = 1
          const adModeReadWrite = 3   ' ADO ConnectModeEnum value for read/write
          const adSaveCreateOverwrite = 2

          stream.type = adTypeBinary
          stream.mode = adModeReadWrite
          stream.open
          stream.write oHTTP.responseBody
          stream.savetofile sDest, adSaveCreateOverwrite
          stream.close

          DownloadFile = true
        End Function
      ]]>
   </script>
</component>
</package>



Sun, 04 Jul 2004 06:01:09 GMT  
 grabbing stuff from web pages

Quote:
>>> I have not tested it, but RAA contains a library called WebFetcher
>>> which is designed to ease this kind of things. I don't know if it will
>>> solve the 'Error 302' issue though.

> >Yes it does. I'm the author so I should know ;)

> Did that, but it doesn't work on my machine ... says this:

> GET http://www.amazon.com/exec/obidos/ASIN/0201708426/
> /cygdrive/c/ruby/lib/ruby/site_ruby/1.6/webfetcher.rb:413:in
> `fetch_1_1': undefined method `response' for
> #<Net::ProtoRetriableError:0xa123530> (NameError)

Hmm... someone else reported a similar error, but I have been unable to
reproduce it. The code in my post runs fine for me under 1.6.6, 1.6.5
and 1.7.1 (2001-10-22).

I would like to get this bug fixed. Which version of ruby are you running?

// Niklas



Sun, 04 Jul 2004 06:06:28 GMT  
 grabbing stuff from web pages


Quote:
>>>> I have not tested it, but RAA contains a library called WebFetcher
>>>> which is designed to ease this kind of things. I don't know if it will
>>>> solve the 'Error 302' issue though.

>> >Yes it does. I'm the author so I should know ;)

>> Did that, but it doesn't work on my machine ... says this:

>> GET http://www.amazon.com/exec/obidos/ASIN/0201708426/
>> /cygdrive/c/ruby/lib/ruby/site_ruby/1.6/webfetcher.rb:413:in
>> `fetch_1_1': undefined method `response' for
>> #<Net::ProtoRetriableError:0xa123530> (NameError)

>Hmm... someone else reported a similar error, but I have been unable to
>reproduce it. The code in my post runs fine for me under 1.6.6, 1.6.5
>and 1.7.1 (2001-10-22).

>I would like to get this bug fixed. Which version of ruby are you running?

1.6.4. Time to upgrade anyway I spose ... what would you like me to try?
(Don't make it anything hard, I may be old but I'm a Rewbie Newbie.)

Ronald E Jeffries
http://www.XProgramming.com
http://www.objectmentor.com
I'm giving the best advice I have. You get to decide whether it's true for you.



Sun, 04 Jul 2004 07:40:35 GMT  
 grabbing stuff from web pages
On Tue, 15 Jan 2002 22:01:09 GMT, "Morris, Chris"

Quote:

>If you're willing & able to use vbs

We'll hold off on that until I'm really desperate, I'd have to learn too
much!

Thanks though!

Ronald E Jeffries
http://www.XProgramming.com
http://www.objectmentor.com
I'm giving the best advice I have. You get to decide whether it's true for you.



Sun, 04 Jul 2004 07:45:48 GMT  
 grabbing stuff from web pages
[Ron Jeffries]:

Quote:
>>> Did that, but it doesn't work on my machine ... says this:

>>> GET http://www.amazon.com/exec/obidos/ASIN/0201708426/
>>> /cygdrive/c/ruby/lib/ruby/site_ruby/1.6/webfetcher.rb:413:in
>>> `fetch_1_1': undefined method `response' for
>>> #<Net::ProtoRetriableError:0xa123530> (NameError)
>>I would like to get this bug fixed. Which version of ruby are you running?
> 1.6.4. Time to upgrade anyway I spose ... what would you like me to try?
> (Don't make it anything hard, I may be old but I'm a Rewbie Newbie.)

Hopefully you won't have to try anything. I think I have squashed the
bug. It seems that a method in Net::ProtoRetriableError was renamed from
#data to #response between 1.6.4 and 1.6.5.

I have created a new version (0.5.4) of the module which uses #data if
#response is not available. It should work with ruby 1.6.4. You can
download it at:

    http://www.acc.umu.se/~r2d2/programming/ruby/webfetcher/
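
The fallback described above can be sketched roughly like this. It is not WebFetcher's actual code: `response_from` and the two stand-in classes are hypothetical, used only to show choosing the accessor with `respond_to?` at run time.

```ruby
# Return the HTTP response carried by an error object, using whichever
# accessor the running library version provides.
def response_from(error)
  if error.respond_to?(:response)
    error.response
  elsif error.respond_to?(:data)
    error.data
  end
end

# Stand-ins for the two generations of the error class (hypothetical).
OldStyleError = Struct.new(:data)
NewStyleError = Struct.new(:response)

puts response_from(OldStyleError.new("reached via #data"))
puts response_from(NewStyleError.new("reached via #response"))
```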

// Niklas



Sun, 04 Jul 2004 08:30:25 GMT  
 grabbing stuff from web pages


Quote:
>Hopefully you won't have to try anything. I think I have squashed the
>bug. It seems that a method in Net::ProtoRetriableError was renamed from
>#data to #response between 1.6.4 and 1.6.5.

>I have created a new version (0.5.4) of the module which uses #data if
>#response is not available. It should work with ruby 1.6.4. You can
>download it at:

>    http://www.acc.umu.se/~r2d2/programming/ruby/webfetcher/

I'll get right on it. Thanks!!

Ronald E Jeffries
http://www.XProgramming.com
http://www.objectmentor.com
I'm giving the best advice I have. You get to decide whether it's true for you.



Sun, 04 Jul 2004 09:01:54 GMT  
 grabbing stuff from web pages


Quote:
>I have created a new version (0.5.4) of the module which uses #data if
>#response is not available. It should work with ruby 1.6.4. You can
>download it at:

Works like a charm, thanks! Great! I owe you a beer or a book or
something!

Ronald E Jeffries
http://www.XProgramming.com
http://www.objectmentor.com
I'm giving the best advice I have. You get to decide whether it's true for you.



Sun, 04 Jul 2004 09:12:54 GMT  
 