Javascript RegExp for "well-forming" HTML? 

IE regularly mangles HTML (presumably to save space): it removes closing
tags, unquotes certain attributes, and leaves certain attributes (e.g.
CHECKED) with no value. I have an engine which needs that HTML to be valid
XML, so I've been hunting for a quick JavaScript routine which simply turns
any inner/outerHTML into a well-formed XML string.  I've got an expression
which will quote any unquoted attributes, but I haven't figured out how to
add ="true" to any attributes which have no value.  Any takers?

Thanks,
Grant



Wed, 08 Sep 2004 19:33:40 GMT  
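Here's an example of how to use replacement functions to accomplish this.
Note that it gives a valueless attribute its own name as the value
(checked="checked") rather than ="true".
WARNING: The regular expressions are quick-and-dirty examples, not intended
for use in real code! Testing and fine tuning required!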

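  // test string
  var str = '<p><input type="radio" checked> Some text</p>';

  // match an opening tag that has at least one attribute:
  // $1 = tag name, $2 = everything between the tag name and ">"
  // (quoted values are skipped as a unit, so a ">" inside quotes is safe)
  var getTag = /<\s*(\w+)\s+((?:"[^"]*"|'[^']*'|[^'">])*)>/g;
  function fixTag($0, $1, $2) {
    // match one attribute: a name, optionally followed by "=" and a
    // double-quoted, single-quoted, or unquoted value
    var getAttr = /[\w:-]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?/g;
    function fixAttr($0) {
      // no "=" means no value: checked becomes checked="checked"
      return ($0.indexOf("=") == -1) ? $0 + '="' + $0 + '"' : $0;
    }
    return "<" + $1 + " " + $2.replace(getAttr, fixAttr) + ">";
  }

  WScript.echo(str.replace(getTag, fixTag));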

=-=-=
Steve
-=-=-
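
The same replacement-function idea can be stretched to cover both halves of the original question: quoting unquoted values and filling in valueless attributes with ="true". The following is only a sketch under the same caveats as above; `wellFormAttrs` and its `fillValue` parameter are made-up names for illustration, and it is still a regular expression, not an HTML parser.

```javascript
// Sketch only: normalizes attributes inside opening tags.
// wellFormAttrs and fillValue are hypothetical names, not from the thread.
function wellFormAttrs(html, fillValue) {
  // $1 = tag name, $2 = raw attribute text (quoted values skipped as a unit)
  var tagRe = /<([A-Za-z][\w:-]*)((?:"[^"]*"|'[^']*'|[^'">])*)>/g;
  // one attribute: name, then optionally "=" and a double-quoted,
  // single-quoted, or unquoted value (captured separately)
  var attrRe = /([\w:-]+)(?:\s*=\s*(?:("[^"]*")|('[^']*')|([^\s"'>]+)))?/g;

  return html.replace(tagRe, function (tag, name, attrs) {
    var fixed = attrs.replace(attrRe, function (m, attr, dq, sq, bare) {
      if (dq || sq) return attr + "=" + (dq || sq);   // already quoted
      if (bare) return attr + '="' + bare + '"';      // quote a bare value
      // valueless attribute: use fillValue, or the attribute's own name
      return attr + '="' + (fillValue === undefined ? attr : fillValue) + '"';
    });
    return "<" + name + fixed + ">";
  });
}
```

Calling `wellFormAttrs('<p><input type=radio checked> Some text</p>', "true")` should turn the input tag into `<input type="radio" checked="true">`; leaving out the second argument falls back to `checked="checked"`. Closing tags are not touched, so missing end tags still need separate handling.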






Wed, 08 Sep 2004 22:33:54 GMT  
Thanks, Steve.

Grant




Fri, 10 Sep 2004 00:43:25 GMT  



Quote:
> Since IE regularly mangles HTML (presumably to save space) by removing
> closing tags, unquoting certain attributes and having certain attributes
> (e.g. CHECKED) that have no value,

It simply has a different normalisation from the one you want; it doesn't
"mangle HTML". Its normalisation is well defined and appears consistent
across versions.

Quote:
> and I have an engine which needs that
> HTML to be valid XML, I've been hunting for a quick Javascript routine which
> simply makes any inner/outerHTML into a well formed XML string.  I've got
> the expression which will quote any unquoted attributes, but haven't
> successfully figured out how to add ="true" to any attributes which have no
> value.  Any takers?

Don't use innerHTML and regexps; that isn't going to guarantee that the
result is well-formed XML. Just construct your XML document fragment from
the HTML DOM yourself: much easier, and guaranteed XML.

Jim.
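
To make the DOM-walking suggestion concrete, here is one possible sketch. It is an assumption about what such a walk could look like, not Jim's code: it builds an XML string by recursion rather than creating nodes in an XML document, and the function names `toXml` and `escapeXml` are invented for the example. It relies only on the DOM Level 1 properties `nodeType`, `nodeName`, `nodeValue`, `attributes`, `childNodes` and the attribute's `specified` flag.

```javascript
// Sketch only: serialize a DOM-like node tree to a well-formed XML string.
// toXml and escapeXml are hypothetical helper names.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function toXml(node) {
  if (node.nodeType === 3) return escapeXml(node.nodeValue); // text node
  if (node.nodeType !== 1) return "";                        // skip comments etc.

  var name = node.nodeName.toLowerCase();
  var out = "<" + name;

  var attrs = node.attributes || [];
  for (var i = 0; i < attrs.length; i++) {
    var a = attrs[i];
    // skip entries the browser lists but that were never actually set
    if (a.specified === false) continue;
    out += " " + a.nodeName.toLowerCase() + '="' + escapeXml(a.nodeValue) + '"';
  }

  var kids = node.childNodes || [];
  if (kids.length === 0) return out + "/>";   // empty element, self-closed

  out += ">";
  for (var j = 0; j < kids.length; j++) out += toXml(kids[j]);
  return out + "</" + name + ">";
}
```

Because every element is closed and every attribute value is quoted and escaped by construction, the output does not depend on how the browser chooses to print innerHTML. A boolean attribute such as CHECKED comes out with whatever value the DOM reports for it, converted to a string.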



Fri, 10 Sep 2004 19:13:36 GMT  
I must politely beg to differ; I've seen that its treatment is consistent,
but my definition of mangling includes altering the content I explicitly
coded in the document :-)

Does your suggestion mean walking the DOM nodes and creating nodes in my
XML, copying attributes and content, etc.?  If so, I think that would be
perfect if I were having to deal with any HTML, and I'll keep it in mind for
that.  In our case I've got a limited set of tags to deal with and the
attributes will be controlled through my own client-side code, so I'll stick
with the RegExps for now (until it screws up enough ;-).  Unless, of course,
you think the time consumed by regexp matching would be comparable to that
of traversing the HTML nodes, 'cause that sounds a tad expensive to me, with
all the createNode/createElements.  What are your thoughts on the
efficiencies?

Grant





Fri, 10 Sep 2004 22:53:00 GMT  
Also, if I were to try to make XML nodes out of the HTML nodes, how would I
handle user attributes in the HTML?  I think we found in the past that
unless I knew the name of the attribute I was looking for, I couldn't fetch
the attribute, even using the attributes collection - am I wrong on that?

Thanks, Jim!

Grant
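
For what it's worth, the DOM attributes collection can be indexed by number as well as by name, so in principle the names need not be known in advance. The sketch below assumes that the browser exposes custom attributes there and marks the ones actually set with `specified`, which should be checked on the IE versions in question; `listAttributes` is a made-up name.

```javascript
// Sketch only: collect the attributes actually set on an element
// without knowing their names beforehand. listAttributes is hypothetical.
function listAttributes(el) {
  var result = [];
  var attrs = el.attributes || [];
  for (var i = 0; i < attrs.length; i++) {
    var a = attrs[i];
    // assumption: entries that were never set report specified === false
    if (a.specified === false) continue;
    result.push({ name: a.nodeName, value: String(a.nodeValue) });
  }
  return result;
}
```

Each entry of the returned array carries a `name` and a string `value`, which is enough to copy the attribute onto a newly created XML node.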






Sat, 11 Sep 2004 03:09:49 GMT  
 