Javascript RegExp for "well-forming" HTML?
Grant W Park #1 / 6
Since IE regularly mangles HTML (presumably to save space) by removing closing tags, unquoting certain attributes and having certain attributes (e.g. CHECKED) that have no value, and I have an engine which needs that HTML to be valid XML, I've been hunting for a quick JavaScript routine which simply makes any innerHTML/outerHTML into a well-formed XML string. I've got the expression which will quote any unquoted attributes, but haven't successfully figured out how to add ="true" to any attributes which have no value. Any takers?

Thanks,
Grant
Wed, 08 Sep 2004 19:33:40 GMT

Steve Fulto #2 / 6
Here's an example of how to use replacement functions to accomplish this.

WARNING: Regular expressions are Q&D examples, not intended for use in real code! Testing and fine tuning required!

    // test string
    str = '<p><input type="radio" checked> Some text</p>'

    // get HTML tags
    getTag = /<\s*(\w+)\s+(\w+\s*(?:=\s*(['"])?[\s\S]*?\2)*)>/g;

    function fixTag($0,$1,$2) {
      // get attributes in tags
      var getAttr = /\w+\s*(?:=(?:\s*(['"])[\s\S]*?\1)|\S+)?/g
      function fixAttr($0) {
        return ($0.indexOf("=") == -1) ? $0+'="'+$0+'"' : $0
      }
      return "<"+$1+" "+$2.replace(getAttr,fixAttr)+">";
    }

    WScript.echo(str.replace(getTag ,fixTag));

--
If the aborigine drafted an IQ test, all of Western civilization would presumably flunk it. -Stanley Marion Garn

=-=-= Steve -=-=-
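
The same replacement-function idea can be sketched a little more defensively. This is only an illustration, not the code from the post: the names fixBareAttributes, TAG and ATTR and the fill parameter are made up here, and the sketch assumes attribute values never contain "<" or ">". Passing "true" as fill gives the ="true" form asked for in the first post; leaving it out repeats the attribute name, as in checked="checked".

```javascript
// Sketch only: assumes no "<" or ">" appears inside attribute values.
// fixBareAttributes, TAG, ATTR and fill are illustrative names.

// An opening tag with at least one attribute: name, attribute text, optional "/".
var TAG = /<(\w+)(\s[^<>]*?)(\/?)>/g;
// One attribute: a name, optionally followed by a quoted or unquoted value.
var ATTR = /([\w:-]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))?/g;

function fixBareAttributes(html, fill) {
  return html.replace(TAG, function (whole, name, attrs, slash) {
    var fixed = attrs.replace(ATTR, function (attr, attrName, value) {
      if (!value) {
        // Valueless attribute such as CHECKED: give it a value.
        return attrName + '="' + (fill === undefined ? attrName : fill) + '"';
      }
      var first = value.charAt(0);
      if (first === '"' || first === "'") {
        return attr; // already quoted, leave untouched
      }
      return attrName + '="' + value + '"'; // quote an unquoted value
    });
    return "<" + name + fixed + slash + ">";
  });
}
```

Called as fixBareAttributes(str, "true"), this rewrites only opening tags that carry attributes; closing tags and text are passed through unchanged. It does not add missing closing tags or escape special characters, so the result is still not guaranteed to be well-formed XML.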
Wed, 08 Sep 2004 22:33:54 GMT

Grant W Park #3 / 6
Thanks, Steve.

Grant
Fri, 10 Sep 2004 00:43:25 GMT

Jim Le #4 / 6
Quote:
> Since IE regularly mangles HTML (presumably to save space) by removing
> closing tags, unquoting certain attributes and having certain attributes
> (e.g. CHECKED) that have no value,

It simply applies a different normalisation from the one you want. It doesn't "mangle HTML"; its normalisation is well defined and appears consistent across versions.

Quote:
> and I have an engine which needs that
> HTML to be valid XML, I've been hunting for a quick Javascript routine which
> simply makes any inner/outerHTML into a well formed XML string. I've got
> the expression which will quote any unquoted attributes, but haven't
> successfully figured out how to add ="true" to any attributes which have no
> value. Any takers?

Don't use innerHTML and regexps; that route is not going to guarantee that the result is well-formed XML. Just construct your XML document fragment from the HTML DOM yourself: much easier, and guaranteed to be XML.

Jim.
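
A rough sketch of what building the XML from the DOM could look like. It produces an XML string rather than XML nodes, which is one possible reading of the suggestion, not necessarily what was meant. The function names toXml and escapeXml are made up for illustration; only the standard DOM properties nodeType, nodeName, nodeValue, attributes and childNodes are relied on. The check on specified assumes that older IE lists attributes that were never written in the markup.

```javascript
// Sketch: serialise an HTML DOM subtree as well-formed XML text.
// toXml and escapeXml are illustrative names, not an existing API.

function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function toXml(node) {
  if (node.nodeType === 3) {
    return escapeXml(node.nodeValue); // text node
  }
  if (node.nodeType !== 1) {
    return ""; // comments and other node types are skipped
  }
  var name = node.nodeName.toLowerCase();
  var out = "<" + name;
  var attrs = node.attributes || [];
  for (var i = 0; i < attrs.length; i++) {
    var a = attrs[i];
    // Assumption: skip attributes the browser lists but the markup never set.
    if (a.specified === false) continue;
    out += " " + a.name.toLowerCase() + '="' + escapeXml(a.value) + '"';
  }
  var kids = node.childNodes || [];
  if (kids.length === 0) {
    return out + "/>"; // empty element
  }
  out += ">";
  for (var j = 0; j < kids.length; j++) {
    out += toXml(kids[j]);
  }
  return out + "</" + name + ">";
}
```

Because every element is closed by the same code that opened it and every value is quoted and escaped, the output is well formed by construction, whatever the browser's own serialisation looks like.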
Fri, 10 Sep 2004 19:13:36 GMT

Grant W Park #5 / 6
I must politely beg to differ; I've seen that its treatment is consistent, but my definition of mangling includes altering the content I explicitly coded in the document :-)

Does your suggestion mean walking the DOM nodes and creating nodes in my XML, copying attributes and content, etc.? If so, I think that would be perfect if I had to deal with arbitrary HTML, and I'll keep it in mind for that. In our case I've got a limited set of tags to deal with and the attributes will be controlled through my own client-side code, so I'll stick with the RegExps for now (until it screws up enough ;-).

Unless, of course, you think the time consumed by regexp matching would be comparable to that of traversing the HTML nodes, 'cause that sounds a tad expensive to me, with all the createNode/createElement calls. What are your thoughts on the efficiencies?

Grant
Fri, 10 Sep 2004 22:53:00 GMT

Grant W Park #6 / 6
Also, if I were to try to make XML nodes out of the HTML nodes, how would I handle user attributes in the HTML? I think we found in the past that unless I knew the name of the attribute I was looking for, I couldn't fetch the attribute, even using the attributes collection. Am I wrong on that?

Thanks, Jim!

Grant
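
On the attributes question: in the standard DOM the attributes collection can be walked by index, so the names do not need to be known in advance. A minimal sketch follows; specifiedAttributes is a made-up name, and whether a given IE version exposes custom (user-defined) attributes through this collection is not confirmed anywhere in this thread and would need testing.

```javascript
// Sketch: collect the attributes present on an element without knowing
// their names in advance. specifiedAttributes is an illustrative name.
function specifiedAttributes(element) {
  var found = {};
  var attrs = element.attributes || [];
  for (var i = 0; i < attrs.length; i++) {
    var a = attrs[i];
    // Assumption: older IE lists every attribute it knows about and flags
    // the ones present in the markup with specified === true.
    if (a.specified === false) continue;
    found[a.name] = a.value;
  }
  return found;
}
```

In a browser this would be called as specifiedAttributes(someElement), giving a plain object that maps each attribute name to its string value.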
Sat, 11 Sep 2004 03:09:49 GMT