Loading and parsing HTML page 
 Loading and parsing HTML page

Is there a simple way to load an HTML page from the web without
displaying it and make the HTML source available to my program?  I would
like to parse the page and display only certain information.

Any help would be greatly appreciated.

Kyle



Sun, 17 Mar 2002 03:00:00 GMT  
 Loading and parsing HTML page
Kyle,

You can use the Microsoft Internet Transfer Control to accomplish this.  The
example below just stores the HTML text from the Microsoft website in a
string variable.  You can also store the returned text in a byte array if
you wish (probably faster, but a little more coding).

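    Dim sHTML As String

    ' Inet1 is an Internet Transfer Control placed on the form
    With Inet1
        .URL = "http://www.microsoft.com"
        .AccessType = icDirect                  ' connect directly, no proxy
        sHTML = .OpenURL(datatype:=icString)    ' blocks until the page has been retrieved
    End With
    Debug.Print sHTML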
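
For the byte array variant mentioned above, a minimal sketch might look like this. It assumes the same Inet1 control on the form and that the page is plain single-byte (ANSI) text; the variable name bData is only for illustration.

    Dim bData() As Byte
    Dim sHTML As String

    ' Ask for the raw bytes instead of a string
    bData = Inet1.OpenURL("http://www.microsoft.com", icByteArray)

    ' Convert the ANSI bytes to a normal VB string before parsing
    sHTML = StrConv(bData, vbUnicode)
    Debug.Print Len(sHTML) & " characters received"

Keeping the bytes is mostly useful when the data is not text at all (images, zip files), in which case you would write bData straight to a file opened For Binary and skip the StrConv step.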

Regards,
Jake





Sun, 17 Mar 2002 03:00:00 GMT  
 Loading and parsing HTML page
You have three or four options (without splashing out more cash for
third-party controls, that is).

You could:

Investigate the Inet (MS Internet Transfer Control) for various ways of
doing this (asynchronously, and synchronously, the latter being the
simplest).
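
As a rough sketch of the asynchronous route (the synchronous one is just a single OpenURL call), assuming a form with an Inet control named Inet1 and a button named Command1, both names being placeholders:

    Private Sub Command1_Click()
        ' Returns immediately; progress is reported through StateChanged
        Inet1.Execute "http://www.microsoft.com", "GET"
    End Sub

    Private Sub Inet1_StateChanged(ByVal State As Integer)
        Dim vChunk As Variant
        Dim sHTML As String

        If State = icResponseCompleted Then
            ' Pull the buffered response out in 1 KB pieces
            vChunk = Inet1.GetChunk(1024, icString)
            Do While Len(vChunk) > 0
                sHTML = sHTML & vChunk
                vChunk = Inet1.GetChunk(1024, icString)
            Loop
            Debug.Print sHTML
        ElseIf State = icError Then
            Debug.Print "Error: " & Inet1.ResponseInfo
        End If
    End Sub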

If you're feeling brave, you could use the MS Winsock Control and "do it
raw", but then you have to handle things like "Transfer-Encoding: chunked",
which some web servers use, yourself. Inet will deal with this transparently.
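
To give an idea of the extra work, here is a hedged sketch of decoding a chunked body once the headers have been stripped off. DecodeChunked is a made-up helper name, and it assumes the whole body has already arrived and sits in a string with one character per byte; trailers after the last chunk are ignored.

    ' Each chunk is: <size in hex>[;extensions] CRLF <data> CRLF
    ' A chunk of size 0 marks the end of the body.
    Public Function DecodeChunked(ByVal sBody As String) As String
        Dim lPos As Long, lEol As Long, lSize As Long
        Dim sSize As String, sOut As String

        lPos = 1
        Do
            lEol = InStr(lPos, sBody, vbCrLf)
            If lEol = 0 Then Exit Do

            ' The size line, minus any ";extension" part
            sSize = Mid$(sBody, lPos, lEol - lPos)
            If InStr(sSize, ";") > 0 Then
                sSize = Left$(sSize, InStr(sSize, ";") - 1)
            End If

            ' The trailing "&" forces Val to treat the hex value as a Long
            lSize = Val("&H" & Trim$(sSize) & "&")
            If lSize = 0 Then Exit Do

            sOut = sOut & Mid$(sBody, lEol + 2, lSize)
            lPos = lEol + 2 + lSize + 2   ' skip the data and its closing CRLF
        Loop

        DecodeChunked = sOut
    End Function

With Winsock the chunks also arrive split across several DataArrival events, so in practice you would buffer everything first, which is part of why this route is harder.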

If you're feeling extremely brave, you could do it all with Win32API
calls (effectively skipping over part of the Inet control's upper-levels
to get at the OS calls directly), but this isn't really worth it in VB,
IMHO.

Depending on what kind of parsing you intend to do (and how much of your
own parser you're capable of writing), you might even find it's easiest
to use the WebBrowser control, (though this has a few "problems" with
being made invisible) or an invisible instance of Internet Explorer
called via Automation. The benefit of both of these is that the parsing
"comes for free", using the DHTML/Document Object Model. Basically you
can read collections of elements, raw-text, tags, links, etc in single
operations, and step through them using "For Each" code, which is really
handy.
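
A minimal sketch of the invisible Internet Explorer route, using late binding so no reference needs to be set. The Sub name ListLinks and the URL are just examples, and there is no timeout or error handling here.

    Public Sub ListLinks()
        Const READYSTATE_COMPLETE = 4
        Dim oIE As Object
        Dim oLink As Object

        Set oIE = CreateObject("InternetExplorer.Application")
        oIE.Visible = False
        oIE.Navigate "http://www.microsoft.com"

        ' Wait until the page has been loaded and parsed
        Do While oIE.Busy Or oIE.ReadyState <> READYSTATE_COMPLETE
            DoEvents
        Loop

        ' The parsing "comes for free": walk the links collection
        For Each oLink In oIE.Document.links
            Debug.Print oLink.href, oLink.innerText
        Next

        oIE.Quit
        Set oIE = Nothing
    End Sub

Other collections such as Document.images or Document.all can be walked the same way, and Document.body.innerText gives the page text with the tags stripped.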

The trouble with writing your own parser (as you would have to do for
the first three options) is that so many people write "bad HTML" that
you keep having to go back and code around it (whereas IE and
WebBrowser just ignore it). Believe me, spending three days
fault-finding and then re-coding a parser routine, just because some {*filter*}
decided to forget that closing comment tags were "-->" and not "--!>", is
not my idea of efficiency! ;-)

The downside of doing it with IE or WB is that both are pigs when it
comes to setting them up NOT to download Java applets, execute SCRIPT
tags, and fetch all the embedded page elements on the fly (SRC= images,
etc). However, you can turn these off, quite safely, either by
setting up IE's options yourself before running the app (the long way), or
by setting the appropriate Registry keys in your code before each run and
restoring them afterwards (the hard, but neat way).

FWIW, after many months of experimenting with Winsock raw, Inet (async
and sync modes) and now finally WebBrowser/IE, I'm sticking with the latter,
primarily because its parser makes up for all its other inconveniences.

YMMV.

And good luck - it's going to get scarier from here on in... ;-)

Cheers
Neil




--
Neil D. Jackson


Mon, 18 Mar 2002 03:00:00 GMT  
 Loading and parsing HTML page
Do you know how programs like Net Submitter Professional and
others submit to a search engine without giving away the fact
that a program is doing the submitting? I know that NSP uses
embedded functions that are included with IE4/5 but I'm not sure
what or how they do it.

Thanks,
Bill Karoly


Mon, 18 Mar 2002 03:00:00 GMT  
 Loading and parsing HTML page
Kyle,

The other option without cash is the SocketWrench (  http://www.catalyst.com  )
control. I'm playing with - make that evaluating - both it and Msinet.
Msinet works inconsistently. SocketWrench has its drawbacks too. For
instance, Msinet will resolve a local file and local host correctly without
your doing anything except giving it the name. SocketWrench is a bit more
involved in that respect. For internet access, SocketWrench seems to be more
consistent ....

dale sampson
www.dalesplace.net





Tue, 19 Mar 2002 03:00:00 GMT  
 Loading and parsing HTML page
Take a look at this sample on the Microsoft site. It parses and saves a web
page using WinInet.

http://msdn.microsoft.com/library/techart/msdn_vbnetget.htm
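
For a flavour of what calling WinInet directly involves, here is a rough sketch. It is not the code from the article above; GetPage and the "VB sample" agent string are made-up names, and error handling is kept to a minimum. It goes in a standard module.

    Private Declare Function InternetOpen Lib "wininet.dll" Alias "InternetOpenA" _
        (ByVal sAgent As String, ByVal lAccessType As Long, _
         ByVal sProxyName As String, ByVal sProxyBypass As String, _
         ByVal lFlags As Long) As Long
    Private Declare Function InternetOpenUrl Lib "wininet.dll" Alias "InternetOpenUrlA" _
        (ByVal hInternet As Long, ByVal sUrl As String, _
         ByVal sHeaders As String, ByVal lHeadersLength As Long, _
         ByVal lFlags As Long, ByVal lContext As Long) As Long
    Private Declare Function InternetReadFile Lib "wininet.dll" _
        (ByVal hFile As Long, ByVal sBuffer As String, _
         ByVal lBytesToRead As Long, lBytesRead As Long) As Long
    Private Declare Function InternetCloseHandle Lib "wininet.dll" _
        (ByVal hInet As Long) As Long

    Private Const INTERNET_OPEN_TYPE_PRECONFIG = 0   ' use IE's connection settings
    Private Const INTERNET_FLAG_RELOAD = &H80000000  ' bypass the cache

    Public Function GetPage(ByVal sUrl As String) As String
        Dim hOpen As Long, hUrl As Long
        Dim sBuffer As String * 2048
        Dim lRead As Long
        Dim sOut As String

        hOpen = InternetOpen("VB sample", INTERNET_OPEN_TYPE_PRECONFIG, _
                             vbNullString, vbNullString, 0)
        If hOpen = 0 Then Exit Function

        hUrl = InternetOpenUrl(hOpen, sUrl, vbNullString, 0, _
                               INTERNET_FLAG_RELOAD, 0)
        If hUrl <> 0 Then
            Do
                lRead = 0
                ' Returns 0 on failure; lRead = 0 means end of data
                If InternetReadFile(hUrl, sBuffer, Len(sBuffer), lRead) = 0 Then Exit Do
                If lRead = 0 Then Exit Do
                sOut = sOut & Left$(sBuffer, lRead)
            Loop
            InternetCloseHandle hUrl
        End If

        InternetCloseHandle hOpen
        GetPage = sOut
    End Function

Usage would be something like sHTML = GetPage("http://www.microsoft.com"). An empty string comes back if the connection or the request fails.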



Mon, 25 Mar 2002 03:00:00 GMT  
 Loading and parsing HTML page
Daniel's approach is much safer than the Inet Transfer Control.
If you want the document already parsed, and your only objection
is to displaying the content as WebBrowser does,
use IE as InternetExplorer.Application with .Visible
set to False. It's a bit slower due to the parsing,
but it may save you a lot of work.

HTH

Thomas


Quote:
>Look this sample in Microsoft site. It's parse and save the webpage using
>WinInet.

>http://msdn.microsoft.com/library/techart/msdn_vbnetget.htm





Thu, 28 Mar 2002 03:00:00 GMT  
 