HTML -> list of sentences? (semi-impossible task) 

Hello, all.

Here's an idea I'm toying with. Suggestions
are welcome.

I want to take an HTML document (reasonably
well-formed, but not guaranteed) and remove
all the tags from it...

...and get a list of the *sentences* in the
document.

There are, of course, several things that make
this difficult:
  - need to distinguish between end-of-sentence
    and embedded punctuation, including both
    abbreviations and textual references to
    Ruby methods such as eof? and split!
  - need to treat sentence fragments as sentences
  - need to ignore blocks of code
  - etc.

My current approach is to start with htmlsplit
from the RAA. This is fairly simplistic, but
at least it doesn't have any dependencies.

Not sure whether to do it in two steps or not:
1. Convert to text
2. Process

Might be just as easy to do it in one step if
I knew what I was doing.

Also not sure what is the best tool/library for
this job.

Comments welcome.

Hal

--
Hal Fulton



Mon, 28 Nov 2005 10:38:13 GMT  

Quote:

> Here's an idea I'm toying with. Suggestions
> are welcome.

> I want to take an HTML document (reasonably
> well-formed, but not guaranteed) and remove
> all the tags from it...

> ...and get a list of the *sentences* in the
> document.

Attached is a 30-minute hack of mine that does something like that.
The scanning part is really a kludge, but I've been using it with
acceptable results in a proxy that adds hints to webpages on the fly.

Quote:
> There are, of course, several things that make
> this difficult:
>   - need to distinguish between end-of-sentence
>     and embedded punctuation, including both
>     abbreviations and textual references to
>     Ruby methods such as eof? and split!

The only way I can think of to do that is having a list of methods and
abbreviations to ignore.
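
A minimal sketch of that ignore-list idea, not the attached scanhtml.rb. The two lists and the name split_sentences are made up for illustration; a real run would need much longer lists.

```ruby
# Hypothetical ignore lists: tokens that end in ".", "?" or "!" but
# should NOT be taken as the end of a sentence.
ABBREVIATIONS = %w[i.e. e.g. etc. Mr. Dr. vs.].freeze
RUBY_METHODS  = %w[eof? split! empty? chomp!].freeze

# Split plain text into sentences on whitespace-delimited words.
def split_sentences(text)
  sentences = []
  current = []
  text.split.each do |word|
    current << word
    # A listed abbreviation or method name never closes a sentence.
    next if ABBREVIATIONS.include?(word) || RUBY_METHODS.include?(word)
    # Terminal punctuation, optionally followed by a closing quote or paren.
    if word =~ /[.?!]["')]*\z/
      sentences << current.join(" ")
      current = []
    end
  end
  # Whatever is left over is kept as a sentence fragment.
  sentences << current.join(" ") unless current.empty?
  sentences
end

text = "Use eof? to test, i.e. before reading. Then call split! on it. Done"
split_sentences(text).each { |s| puts s }
```

One known weakness: an abbreviation such as "etc." that really does end a sentence is swallowed into the next one.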

Quote:
>   - need to treat sentence fragments as sentences
>   - need to ignore blocks of code

Kind of doable if you have a dictionary (/usr/share/dict/words should
be enough). For each candidate sentence, you see how many words are
there and take it if the percentage is above some threshold.
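
Roughly, the dictionary test could look like the sketch below. The helper names and the 0.6 threshold are assumptions picked for illustration, and the word-list path may not exist on every system.

```ruby
require "set"

# Load a word list (one word per line) into a Set of lowercase words.
def load_dictionary(path = "/usr/share/dict/words")
  File.foreach(path).map { |w| w.strip.downcase }.to_set
end

# Fraction of the candidate's words that appear in the dictionary.
def dictionary_ratio(candidate, dictionary)
  words = candidate.downcase.scan(/[a-z']+/)
  return 0.0 if words.empty?
  words.count { |w| dictionary.include?(w) }.fdiv(words.size)
end

# Accept the candidate as prose if enough of its words are known.
# The default threshold is a guess and would need tuning.
def prose?(candidate, dictionary, threshold = 0.6)
  dictionary_ratio(candidate, dictionary) >= threshold
end

# Tiny stand-in dictionary so the example runs anywhere.
dict = Set.new(%w[the cat sat on mat])
puts prose?("The cat sat on the mat.", dict)
puts prose?("x = foo.bar(baz)", dict)
```

With a real dictionary the threshold has to leave room for names like DavidBlack or ClassMethods, which will never be in /usr/share/dict/words.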

Quote:
>   - etc.

> My current approach is to start with htmlsplit
> from the RAA. This is fairly simplistic, but
> at least it doesn't have any dependencies.

> Not sure whether to do it in two steps or not:
> 1. Convert to text
> 2. Process

> Might be just as easy to do it in one step if
> I knew what I was doing.

IMHO it can be done in one pass.

----

Given http://www.rubygarden.org/ruby?ClassMethodsTutorial, my hack
returns

["This", "is", "simply", "an", "extract", "from", "a", "post", "to", "ruby-talk", "by", "DavidBlack", "on", "the", "topic", "of", "class", "methods."]
["It", "is", "stored", "here", "in", "the", "hope", "that", "it", "will", "be", "useful!"]
["It", "actually", "goes", "beyond", "the", "surface", "of", "class", "methods", "to", "describe", "the", "nature", "of", "classes", "as", "objects", "so", "is", "interesting", "reading", "for", "anyone", "progressing", "to", "intermediate-level", "Ruby."]
["See", "also", "ClassMethods", "for", "an", "overview", "of", "the", "options", "available", "in", "Ruby", "for", "creating", "class", "methods", "and", "SingletonTutorial", "for", "a", "detailed", "explanation", "of", "singleton", "methods."]
["Every", "object", "responds", "to", "certain", "messages", "i.e.", "can", "call", "methods", "with", "certain", "names", "."]
["Usually", "those", "methods", "are", "the", "instance", "methods", "defined", "by", "the", "object's", "class", "However", "it's", "also", "possible", "to", "add", "methods", "to", "individual", "objects", "Now", "c", "will", "respond", "to", "speak", "--", "but", "other", "instances", "of", "class", "C", "will", "not", "This", "means", "that", "speak", "is", "a", "singleton", "method", "of", "c."]
["now", "look", "at", "this", "Notice", "the", "similarity", "between", "the", "syntax", "involved", "in", "creating", "a", "new", "singleton", "method", "for", "c", "and", "creating", "a", "class", "method", "of", "class", "D", "In", "fact", "these", "are", "essentially", "the", "same", "thing."]
["In", "both", "cases", "what's", "happening", "is", "that", "a", "singleton", "method", "is", "being", "added", "to", "a", "particular", "object."]
["It", "just", "happens", "to", "be", "that", "in", "the", "second", "case", "the", "object", "getting", "the", "new", "method", "is", "a", "Class", "object", "as", "opposed", "to", "a", "String", "an", "Array", "an", "instance", "of", "MyClass", "?"]
["So", "now", "D", "responds", "to", "greet", "just", "as", "c", "responds", "to", "speak", "."]
["In", "other", "words", "the", "term", "class", "method", "is", "just", "a", "special", "term", "for", "something", "which", "you", "can", "do", "with", "any", "mutable", "object", "namely", "add", "a", "singleton", "method", "to", "it."]
["It", "has", "a", "special", "name", "because", "in", "actual", "program", "design", "class", "methods", "have", "a", "special", "role", "to", "play."]
["But", "what", "they", "are", "at", "heart", "is", "singleton", "methods", "defined", "on", "objects", "where", "those", "objects", "happen", "to", "be", "instances", "of", "a", "class", "called", "Class."]
["The", "use", "of", "uppercase", "names", "constants", "for", "classes", "can", "obscure", "the", "fact", "that", "classes", "are", "just", "objects."]
["Also", "the", "usual", "style", "is", "to", "put", "class", "method", "definitions", "inside", "the", "class", "definition", "which", "makes", "it", "look", "like", "they", "have", "some", "special", "status."]
["But", "look", "at", "this", "etc."]
["You", "can", "see", "that", "some", "of", "the", "special", "treatment", "of", "classes", "--", "constants", "as", "names", "the", "separate", "notion", "of", "class", "method", "for", "their", "singleton", "methods", "--", "is", "just", "that", "special", "treatment."]
["Underneath", "a", "class", "is", "indeed", "an", "object."]
["CategoryDocumentation", "CategoryTutorial", "HomePage", "RecentChanges", "Preferences", "RubyGarden", "Edit", "text", "of", "this", "page", "View", "other", "revisions", "Last", "edited", "May", "am", "diff", "Search"]

Note that "i.e." was recognized :)
However several problems are yet to be solved:
 * how to get rid of meaningful lone words? (last line)
 * what happens to things like
  As seen here:
    CODE
  bla bla bla.
 * etc

However, solving that would turn the 30-minute hack into a 1-hour kludge;
better it stays this way :)

--
 _           _                            
| |__   __ _| |_ ___ _ __ ___   __ _ _ __  
| '_ \ / _` | __/ __| '_ ` _ \ / _` | '_ \
| |_) | (_| | |_\__ \ | | | | | (_| | | | |
|_.__/ \__,_|\__|___/_| |_| |_|\__,_|_| |_|
        Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

Because I don't need to worry about finances I can ignore Microsoft
and take over the (computing) world from the grassroots.
        -- Linus Torvalds

Attachments:
  bloom.c (3K)
  extconf.rb (< 1K)
  scanhtml.rb (2K)

Mon, 28 Nov 2005 14:43:43 GMT  
It depends on what you mean by "sentence", 'ey?  Do you mean natural
language (English? Rumanian? Urdu? Hakka? Thai? Japanese?), or
artificial formalisms like programming languages (Perl, Ruby, FORTH)?

But someone went to a lot of trouble to carve up their perceptions of
reality (heh) into procrustean HTML, so you may as well begin there.  
Determine the major syntactical units  (TABLE, DIV, P, HR, PRE, TT, H1,
etc.).  Recursing, determine what is a "sentence" on semantic,
idiomatic (BR, B, U), or at least grammatical, grounds.
Collect these purely formal "sentences" and send the list to
post-processing (possibly human inspection) to be vetted and refined
(e.g., does your system account for utterances which are meaningful but
grammatically abbreviated, like "What up?" (MTV argot used by
advertisers to slide nickels out of pockets) or "Annta desu" (kids
choosing sides for oni in Osaka). )

If you have access to a page's CSS, your hints about what the author(s)
intended are much expanded.  Maybe not so impossible after all?  This
does not seem like a difficult task to me, but maybe I haven't
appreciated the context from which the question is posed?  Does the
solution have to be extremely general, or is it a one-shot?
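
One way to read "begin with the major syntactical units" is to treat block-level tags as hard boundaries, so that a heading with no final punctuation cannot run into the paragraph after it. The sketch below is regex-based and only an approximation; the tag list and the name text_blocks are illustrative assumptions.

```ruby
# Tags treated as block-level. The list is illustrative, not exhaustive.
BLOCK_TAGS = %w[table tr td div p hr pre h1 h2 h3 h4 h5 h6 li br].freeze
BLOCK_RE   = /<\/?(?:#{BLOCK_TAGS.join("|")})\b[^>]*>/i

# Return the text of an HTML string as a list of chunks, one per block.
def text_blocks(html)
  html.gsub(/\s+/, " ")        # source line breaks carry no meaning
      .gsub(BLOCK_RE, "\n")    # block tags become hard breaks
      .gsub(/<[^>]*>/, "")     # remaining (inline) tags just vanish
      .split("\n")
      .map(&:strip)
      .reject(&:empty?)
end

html = "<h1>Class methods</h1><p>Every <b>object</b> responds\n to messages.</p><p>Now look</p>"
text_blocks(html).each { |chunk| puts chunk }
```

Each chunk can then be split into sentences on its own, and a chunk with no terminal punctuation simply counts as a fragment.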

David


--

Cedar Rapids, Iowa       http://homepage.mac.com/dcoshel
``I think most pleasantly in metaphors, and smoking brings metaphors to
mind." - Augustus Srb, in Alexei Panshin's  _Star Well_


Mon, 28 Nov 2005 22:29:23 GMT  

Quote:

> Not sure whether to do it in two steps or not:
> 1. Convert to text
> 2. Process

I would parse into a tree, process there, then strip tags.  The reason
being, Ruby code and other nongrammatical entities are likely to be
offset by tags -- <pre>, <code>, <tt>, things like that.  Not always,
but it's a useful heuristic.
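
A rough sketch of the heuristic, with one simplification stated plainly: instead of building a full tree it scans the tags with StringScanner and keeps a nesting counter for the tags assumed to hold code. The names SKIP_TAGS and prose_text are made up for illustration.

```ruby
require "strscan"

# Tags assumed to mark non-prose content (taken from the list above).
SKIP_TAGS = %w[pre code tt].freeze

# Return the text of an HTML string, leaving out anything inside SKIP_TAGS.
def prose_text(html)
  scanner = StringScanner.new(html)
  depth = 0          # how many skip-tags we are currently inside
  out = +""
  until scanner.eos?
    if scanner.scan(/<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/)
      closing = scanner[1] == "/"
      name = scanner[2].downcase
      if SKIP_TAGS.include?(name)
        depth += closing ? -1 : 1
        depth = 0 if depth < 0     # tolerate stray closing tags
      end
      out << " "                   # every tag is replaced by a space
    elsif scanner.scan(/<!--.*?-->/m) || scanner.scan(/<![^>]*>/)
      out << " "                   # comments and doctype declarations
    elsif (text = scanner.scan(/[^<]+/))
      out << text if depth.zero?   # keep text only outside skip-tags
    else
      scanner.getch                # a lone "<" that starts no tag
    end
  end
  out.gsub(/\s+/, " ").strip
end

html = "<p>Call <tt>eof?</tt> first.</p><pre>x = 1 < 2\nputs x</pre><p>Done.</p>"
puts prose_text(html)
```

Dropping <tt> and <code> also drops inline method names from the middle of a sentence; if that loses too much, shrinking SKIP_TAGS to just "pre" keeps them.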

It's not a trivial task -- I've done a lot of natural-language work for
the Wiki that I run (its markup is one of the least code-like of any
wiki).  How good you need the results to be is a big deciding factor in
how to implement it, for sure.  Natural language parsing is a big CPU
cruncher.





Tue, 29 Nov 2005 00:16:50 GMT  

Quote:
----- Original Message -----


Sent: Thursday, June 12, 2003 1:43 AM
Subject: Re: HTML -> list of sentences? (semi-impossible task)

> Attached is a 30-minute hack of mine that does something like that.
> The scanning part is really a kludge, but I've been using it with
> acceptable results in a proxy that adds hints to webpages on the fly.

The world is full of kludges. One more won't hurt.

Quote:
> Given http://www.rubygarden.org/ruby?ClassMethodsTutorial, my hack
> returns

[snippage]

Great. I will look into your source.

Thanks,
Hal



Tue, 29 Nov 2005 07:09:12 GMT  

Quote:
----- Original Message -----


Sent: Thursday, June 12, 2003 9:29 AM
Subject: Re: HTML -> list of sentences? (semi-impossible task)

> It depends on what you mean by "sentence", 'ey?  Do you mean natural
> language (English? Rumanian? Urdu? Hakka? Thai? Japanese?), or
> artificial formalisms like programming languages (Perl, Ruby, FORTH)?

In this case, English sentences. Not as in formal grammars, or as
in prison sentences. Not that those two are so different.

Quote:
> But someone went to a lot of trouble to carve up their perceptions of
> reality (heh) into procrustean HTML, so you may as well begin there.
> Determine the major syntactical units  (TABLE, DIV, P, HR, PRE, TT, H1,
> etc.).  Recursing, determine what is a "sentence" on semantic,
> idiomatic (BR, B, U), or at least grammatical, grounds.

Quote:
>   Collect these purely formal "sentences" and send the list to
> post-processing (possibly human inspection) to be vetted and refined
> (e.g., does your system account for utterances which are meaningful but
> grammatically abbreviated, like "What up?" (MTV argot used by
> advertisers to slide nickels out of pockets) or "Annta desu" (kids
> choosing sides for oni in Osaka). )

I think even that is perhaps too much intelligence. I don't want to
build in knowledge about nouns and verbs.

Quote:
> If you have access to a page's CSS, your hints about what the author(s)
> intended are much expanded.  Maybe not so impossible after all?  This
> does not seem like a difficult task to me, but maybe I haven't
> appreciated the context from which the question is posed?

My parents sometimes quote a comedian from before I was born: "Easy for
you, difficult for me."

Quote:
>  Does the
> solution have to be extremely general, or is it a one-shot?

Ehh, somewhat general in the sense of several chapters. But very
one-shot in that I'm looking at one particular document, and it's
about Ruby. ;)

I think the replies I've got are fairly promising along with my
own dirty hack from last night.

Cheers,
Hal




Tue, 29 Nov 2005 07:16:07 GMT  

Quote:
----- Original Message -----


Sent: Thursday, June 12, 2003 11:16 AM
Subject: Re: HTML -> list of sentences? (semi-impossible task)

> I would parse into a tree, process there, then strip tags.  The reason
> being, Ruby code and other nongrammatical entities are likely to be
> offset by tags -- <pre>, <code>, <tt>, things like that.  Not always,
> but it's a useful heuristic.

> It's not a trivial task -- I've done a lot of natural-language work for
> the Wiki that I run (its markup is one of the least code-like of any
> wiki).  How good you need the results to be is a big deciding factor in
> how to implement it, for sure.  Natural language parsing is a big CPU
> cruncher.

Yes, in this case, large code fragments are always set off by "pre"
tags. That does simplify.

As I said, I'm not interested in true natural-language parsing.
Something "mostly" accurate is good enough.

Thanks,
Hal



Tue, 29 Nov 2005 07:18:33 GMT  
 