SGML/HTML parsing tool 
 SGML/HTML parsing tool

Hey folks,

Recently I've found myself writing a lot of scripts that basically take
an HTML page that's up on the web somewhere, read in its contents,
parse out the HTML tags, and throw everything into a database.  (More
specifically, I usually have to extract the data from each individual
table cell from a certain table and put it in the right database
column.)  I do this by writing the scripts from scratch, but I can't
help but think that there has to be a better way.  I've seen a few
classes (like HTML::TreeBuilder) that will do simple HTML parsing.
However, I've had limited success finding much good sample source code
for HTML::TreeBuilder, so I haven't put a lot of effort into that.

What I want to know is:
1.  Has anyone else ever found themselves in the situation I describe?
(i.e., needing to parse a lot of HTML files and extract data from
different table cells)

2.  If so, what tools did you use?  What would you recommend?  What
tool did you find that was easiest to adapt to different purposes
(e.g., sometimes I only want the data from TD's that are on odd-
numbered rows, etc.)

3.  If any of your answers above involved HTML::TreeBuilder, can you
point me in the direction of some good documentation and/or sample code
that uses this module?
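To make the question concrete, here's roughly what my from-scratch scripts boil down to: a fragile regex sketch (the helper name odd_row_cells is just for illustration). The fact that this falls over on messy real-world markup is exactly why I want a proper parser:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fragile regex sketch: grab the <td> contents from odd-numbered rows.
# Regexes break on nested tables, attributes with '>', bad markup, etc.
sub odd_row_cells {
    my ($html) = @_;
    my @rows = $html =~ m{<tr[^>]*>(.*?)</tr>}gis;
    my @out;
    for my $i (0 .. $#rows) {
        next if $i % 2;    # 0-based index 0, 2, 4 = rows 1, 3, 5
        push @out, [ $rows[$i] =~ m{<td[^>]*>(.*?)</td>}gis ];
    }
    return @out;
}

my $html = '<table><tr><td>a</td><td>b</td></tr>'
         . '<tr><td>c</td><td>d</td></tr>'
         . '<tr><td>e</td><td>f</td></tr></table>';
print join("\t", @$_), "\n" for odd_row_cells($html);
```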

Thanks for any help anyone can provide.
Brad




Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

Quote:

> Recently I've found myself writing a lot of scripts that basically take
> an HTML page that's up on the web somewhere, read in its contents,
> parse out the HTML tags, and throw everything into a database.

Cool.  I had an idea last week for a roundish thing, a sort of disk, and
if you put a bar between two of them you could move stuff easily.
Seriously:

Quote:
> What I want to know is:
> 1.  Has anyone else ever found themselves in the situation I describe?
> (i.e., needing to parse a lot of HTML files and extract data from
> different table cells)

> 2.  If so, what tools did you use?  What would you recommend?  What
> tool did you find that was easiest to adapt to different purposes
> (e.g., sometimes I only want the data from TD's that are on odd-
> numbered rows, etc.)

Have you tried HTML::Parser for parsing HTML?  Or tried 'perl -MCPAN -e
shell' and searching for HTML or SGML there?
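For instance, a minimal callback sketch, assuming the CPAN HTML::Parser module is installed (the variable names are mine, not part of the module):

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, assumed installed

# Event-driven extraction: collect the text inside every <td>.
my @cells;
my $in_td = 0;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { $in_td++ if $_[0] eq 'td' },           'tagname' ],
    end_h   => [ sub { $in_td-- if $_[0] eq 'td' && $in_td }, 'tagname' ],
    text_h  => [ sub { push @cells, $_[0] if $in_td },        'dtext'   ],
);
$p->parse('<table><tr><td>42</td><td>foo</td></tr></table>');
$p->eof;
print join(', ', @cells), "\n";    # 42, foo
```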

-Chris
--
Christopher R. Maden, Solutions Architect
Exemplary Technologies
One Embarcadero Center, Ste. 2405
San Francisco, CA 94111



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

;;
;; Have you tried HTML::Parser for parsing HTML?

Yes, I have. It's about as useful for parsing HTML as a table is
to do shopping with.

HTML::Parser doesn't parse, nor does it have any HTML knowledge.

Abigail
--

  -----------== Posted via Newsfeeds.Com, Uncensored Usenet News ==----------
   http://www.newsfeeds.com       The Largest Usenet Servers in the World!
------== Over 73,000 Newsgroups - Including  Dedicated  Binaries Servers ==-----



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

kiosk62279> 1.  Has anyone else ever found themselves in the situation
kiosk62279> I describe?  (i.e., needing to parse a lot of HTML files
kiosk62279> and extract data from different table cells)

yes.

kiosk62279> 2.  If so, what tools did you use?  What would you recommend?  What
kiosk62279> tool did you find that was easiest to adapt to different purposes
kiosk62279> (e.g., sometimes I only want the data from TD's that are on odd-
kiosk62279> numbered rows, etc.)

The two hardest things about parsing HTML are:

1) deducing the missing close tags (for which you must understand the DTD)
2) handling poor markup (there's a lot of junk on the net that incorrectly
   MIMEs itself as "text/html")

For #1, I'm currently building a tool using Parse::RecDescent that
takes a DTD and generates a recursive descent parser; so far it seems
to correctly "close out" the missing end tags and validate the right
attributes.  (The first trivial application will be an HTML
"pretty-printer" for a column I'm writing for WT.)  It has horrible
error recovery though, so I'm thinking about how to modify it to
handle #2.

I can now see why SGML/HTML is a dead-end, and XML/XHTML will rock.
Those optional close-tags are *hard*, and XML has none such.
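As a toy illustration of what "closing out" means (a hard-wired rule table, nothing like the DTD-driven tool itself):

```perl
use strict;
use warnings;

# Toy version of deducing missing close tags: an incoming start tag
# implicitly closes certain open elements.  A real tool derives these
# rules from the DTD; here they are hard-wired.
my %implies_close = (
    p  => { p  => 1 },    # a new <p> closes an open <p>
    li => { li => 1 },    # a new <li> closes an open <li>
);

sub close_out {
    my ($html) = @_;
    my (@stack, $out);
    $out = '';
    for my $tok ($html =~ m{(</?\w+>|[^<]+)}g) {
        if ($tok =~ m{^</(\w+)>$}) {           # explicit end tag
            my $tag = lc $1;
            $out .= '</' . pop(@stack) . '>'
                while @stack and $stack[-1] ne $tag;
            pop @stack if @stack;              # $tok itself appended below
        }
        elsif ($tok =~ m{^<(\w+)>$}) {         # start tag
            my $tag = lc $1;
            $out .= '</' . pop(@stack) . '>'
                while @stack and $implies_close{$tag}{ $stack[-1] };
            push @stack, $tag;
        }
        $out .= $tok;
    }
    $out .= '</' . pop(@stack) . '>' while @stack;
    return $out;
}

print close_out('<ul><li>one<li>two</ul>'), "\n";
# prints <ul><li>one</li><li>two</li></ul>
```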

print "Just another Perl hacker and web-whacker,"

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095

Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

Quote:


>;; Have you tried HTML::Parser for parsing HTML?
>Yes, I have. It's about as useful for parsing HTML as a table is
>to do shopping with.
>HTML::Parser doesn't parse, nor does it have any HTML knowledge.

i'd hate to see the pathological cases you must have to deal with.

i've been using HTML::Parser in dozens of scripts and it hasn't
coughed up a hairball yet.  it has tackled hand generated code, as
well as junk from frontpage and dreamweaver with aplomb.

then again your definition of "parsing" may well be different from mine.

--
Jon Drukman
Director of Technology
GameSpot



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

    Randal> For #1, I'm currently building a tool using
    Randal> Parse::RecDescent that takes a DTD to generate a recursive

    <snip>

    Randal> I can now see why SGML/HTML is a dead-end, and XML/XHTML
    Randal> will rock.  Those optional close-tags are *hard*, and XML
    Randal> has none such.

Just curious, are you using Parse::RecDescent rather than SGMLS
because it allows a pure Perl solution, or is there some other
advantage?

Kent

--

                                 http://darwin.eeb.uconn.edu
-- Department of Ecology & Evolutionary Biology          
-- University of Connecticut, U-43                                      
-- Storrs, CT   06269-3043                                              



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

Randal> For #1, I'm currently building a tool using
Randal> Parse::RecDescent that takes a DTD to generate a recursive

kent>     <snip>

Randal> I can now see why SGML/HTML is a dead-end, and XML/XHTML
Randal> will rock.  Those optional close-tags are *hard*, and XML
Randal> has none such.

kent> Just curious, are you using Parse::RecDescent rather than SGMLS
kent> because it allows a pure Perl solution, or is there some other
kent> advantage?

Partially to increase my knowledge of how P::RD works, partially
because I want a flexible toolbase that I can build from to add
pretty-printing, error-recovery, general rewrites, scanning and
extracting, etc.  SGMLS probably doesn't have that.

print "Just another Perl hacker,"

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095

Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool



: Randal> For #1, I'm currently building a tool using
: Randal> Parse::RecDescent that takes a DTD to generate a recursive

: kent>     <snip>

: Randal> I can now see why SGML/HTML is a dead-end, and XML/XHTML
: Randal> will rock.  

   I was gonna mention something the first time I saw that, and
   since it is quoted here, I'll just throw in my comment now  :-)

   If the Web is the Final Destination for your document, then
   what Randal says there is true.

   I expect that is the case for most readers here.

   If however, the Web is merely _one_ of the ways you need
   your data output, then it is the inverse: XML is a dead-end,
   and SGML rocks!

   Some of the bits of SGML that got axed to create XML are
   exceedingly useful to publishers of large, long-lived,
   complex documents.

   Like most of my clients, heh, heh.   :-)

   I keep the "truth copy" of my data in SGML. It is a rather
   simple matter to translate it to HTML, XML, etc... whenever
   those output formats are required.

   (note also that the w3c _could_ have defined HTML as an SGML
    application and still disallowed omitted tags and whatnot,
    making it easy to parse.

    Those features can be configured in the "SGML Declaration".

    But they chose not to, probably in the spirit of
    "being generous in what you accept", even if that allows
    the data to be hard to process. Oh well.  <shrug>
   )

: Randal> Those optional close-tags are *hard*, and XML
: Randal> has none such.

: kent> Just curious, are you using Parse::RecDescent rather than SGMLS
: kent> because it allows a pure Perl solution, or is there some other
: kent> advantage?

: Partially to increase my knowledge of how P::RD works, partially
: because I want a flexible toolbase that I can build from to add
: pretty-printing, error-recovery, general rewrites, scanning and
: extracting, etc.  SGMLS probably doesn't have that.

   nsgmls (sgmls is way out of date) is an application built
   with a toolkit (in C++).

   SP (the toolkit) likely gives visibility to at least some of
   those things, but who wants to write C++ instead of Perl?

   I hope you are planning to hard-wire it to HTML, rather than
   try and make a general-purpose SGML parser, 'cause SGML has
   some features that make parsing hard. Sounds like you have
   already found out first hand though :-)

   Omitted tags are not at the top of the "hard things" to parse
   in arbitrary SGML though.

   I expect "exceptions", in particular "exclusions", to provide
   code-contortions. I think this feature rules out using standard
   compiler-compilers like yacc for making general-purpose SGML
   parsers. Best to avoid them if you can (they are, of course,
   disallowed in XML :-)

--
    Tad McClellan                          SGML Consulting

    Fort Worth, Texas



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

Quote:


> ;; Have you tried HTML::Parser for parsing HTML?

> Yes, I have. It's about as useful for parsing HTML as a table is
> to do shopping with.

> HTML::Parser doesn't parse, nor does it have any HTML knowledge.

Yeah, and you're a cranky geezer.  It's still useful.  In fact, its
non-parser status is helpful, since most HTML out there is such junk that
it doesn't parse cleanly, and an ad-hoc event thrower is in many ways more
useful.  At O'Reilly, I used it to clean up the junk spewed by
FrameMaker+SGML, to check links for CD-ROMs, and to mangle stuff into the
Rocket e-Book format.

It'll do what the original poster *really* wants (if he uses it
correctly), even if it doesn't fulfill the letter of his request.

-Chris
--
Christopher R. Maden, Solutions Architect
Exemplary Technologies
One Embarcadero Center, Ste. 2405
San Francisco, CA 94111



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

    Tad>    nsgmls (sgmls is way out of date) is an application built
    Tad> with a toolkit (in C++).

I was referring to SGMLS.pm, which provides a Perl interface to the
output of nsgmls.

Kent

--

                                 http://darwin.eeb.uconn.edu
-- Department of Ecology & Evolutionary Biology          
-- University of Connecticut, U-43                                      
-- Storrs, CT   06269-3043                                              



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

Quote:

>   Some of the bits of SGML that got axed to create XML are
>   exceedingly useful to publishers of large, long-lived,
>   complex documents.

We're off topic for this newsgroup, but... while we're at it: what
features?

All I know about SGML is that it is sooo horribly permissive in the
syntax it must accept, that it is impossible to write a generic SGML
parser from scratch. So the whole world is stuck with just one SGML
parser, written by one guy: James Clark. A very unhealthy "monopoly", in
my opinion. (Tell me, honestly: who but J. Clark has ever looked
into the source for his parser family?)

At least, it is very well possible to write a complete XML parser from
scratch in just a few days.

--
        Bart.



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

: >   Some of the bits of SGML that got axed to create XML are
: >   exceedingly useful to publishers of large, long-lived,
: >   complex documents.

: We're off topic for this newsgroup,

   Right.

   So this is my last followup to this sub-thread.

   If you want to move it (or start a new thread) in comp.text.sgml
   we can continue there.

: but... while we're at it: what
: features?

   exceptions, for one, which I already mentioned.

   several others:  short-ref, tag minimization, ...

: All I know about SGML is that it is sooo horribly permissive in the
: syntax it must accept,

   It is a meta-language, and so allows lots (perhaps too much) of
   flexibility in its support for the "real" language (that is, the
   language that you are specifying with SGML).

   As I mentioned before, there are lots of things in the "SGML
   Declaration" that can be configured.

   The "default" SGML Declaration has many features turned on.

   The SGML Declaration for XML has many features turned off.

   The w3c left many features turned on in their SGML Declaration
   for the language they defined, HTML.

: that it is impossible to write a generic SGML
: parser from scratch.

   A full-up ISO-8879 conforming parser is indeed a Big Job.

   With care in defining _your_ particular language (what you put
   in your SGML Declaration and DTD), you can arrange for it
   to be more easily parsed.

   Much easier to parse the language being described than it is
   to parse the meta-language that you use to do the describing.

   I wrote a parser for a milstd markup language defined in SGML
   in about a week several years ago.

: So the whole world is stuck with just one SGML
: parser, written by one guy: James Clark. A very unhealthy "monopoly", in
: my opinion.

   There are others, but James' is by far the most popular
   (and it is open, several of the others are commercial).

: (Tell me, honestly: who ever but J. CLark has ever looked
: into the source for his parser family?)

   Probably 1/2 to 3/4 of all of the companies that make SGML tools,
   as SP is embedded in a bunch of them.

   So I figure at least 100-200 people perhaps?

   :-)

: At least, it is very well possible to write a complete XML parser from
: scratch in just a few days.

   And, since XML is a subset of SGML, a parser _can be_ easy
   to write for a language defined in SGML too.

   You just avoid using features that are Too Hard compared
   to the Perceived Benefit (cost/benefit analysis).

   <plug>
      A Consultant can help you make that trade-off intelligently.
   </plug>

   heh, heh (I gotta live, you know :-)

--
    Tad McClellan                          SGML Consulting

    Fort Worth, Texas



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool



|| >;; Have you tried HTML::Parser for parsing HTML?
||
|| >Yes, I have. It's about as useful for parsing HTML as a table is
|| >to do shopping with.
|| >HTML::Parser doesn't parse, nor does it have any HTML knowledge.
||
|| i'd hate to see the pathological cases you must have to deal with.
||
|| i've been using HTML::Parser in dozens of scripts and it hasn't
|| coughed up a hairball yet.  it has tackled hand generated code, as
|| well as {*filter*}from frontpage and dreamweaver with aplomb.
||
|| then again your definition of "parsing" may well be different from mine.

So, tell me, what does HTML::Parser do?

All it does is do a callback when it encounters _tokens_.

That's not parsing.

Of course, you could write a piece of software that does something
when seeing:

     $i<--===>%i{&**]!!"flup fluP"

and call it a Perl parser, just because it recognizes some characters
as tokens....

Abigail
--





Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

Quote:

>So, tell me, what does HTML::Parser do?

>All it does is doing a call back when encountering _tokens_.

>That's not parsing.

Then HTML::Lexer might have been a better name for the module.

--
        Bart.



Wed, 18 Jun 1902 08:00:00 GMT  
 SGML/HTML parsing tool

^^
^^ >So, tell me, what does HTML::Parser do?
^^ >
^^ >All it does is doing a call back when encountering _tokens_.
^^ >
^^ >
^^ >That's not parsing.
^^
^^ Then HTML::Lexer might have been a better name for the module.

No. Because it has no frigging clue about HTML. The only part of
HTML::Parser that isn't a misnomer is the '::'.

Tokenization of HTML cannot be done context-free. A simple example:

    <p><br><br></p>

would have 4 "tokens". But

    <script><br><br></script>

only has 3. Which you can only know if you know about HTML.
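A throwaway tokenizer sketch (pure core Perl, not how any real module does it) makes the count difference concrete:

```perl
use strict;
use warnings;

# Toy tokenizer showing why context matters: inside <script> (or
# <style>), "<br>" is character data, not markup.
my %raw_content = ( script => 1, style => 1 );

sub tokens {
    my ($html) = @_;
    my @tok;
    while (length $html) {
        if ($html =~ s{^<(\w+)[^>]*>}{}) {
            my $tag = lc $1;
            push @tok, "<$tag>";
            # Raw-content element: everything up to its close tag is
            # a single text token.
            if ($raw_content{$tag} and $html =~ s{^(.*?)</$tag\s*>}{}is) {
                push @tok, $1 if length $1;
                push @tok, "</$tag>";
            }
        }
        elsif ($html =~ s{^</(\w+)\s*>}{}) { push @tok, '</' . lc($1) . '>' }
        elsif ($html =~ s{^([^<]+)}{})     { push @tok, $1 }
        else { push @tok, substr($html, 0, 1, '') }   # stray '<'
    }
    return @tok;
}

my @p  = tokens('<p><br><br></p>');            # 4 tokens
my @sc = tokens('<script><br><br></script>');  # 3 tokens
print scalar(@p), " vs ", scalar(@sc), "\n";   # prints "4 vs 3"
```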

Abigail
--
               split // => '"';





