a few more questions on XML and python 

Hi,
  thanks for the nice pointers regarding XML and Python. I have another
newbie question - I'm trying to get information out of a CML (Chemical
Markup Language) file, which is an XML-based format. My query is this: using
xml.parsers.expat I can get a list of tags, values, etc.

How can I indicate to the parser that if, say, it comes upon a tag like
<coordinate> it should call a certain handler to process the coordinate
data?

Or am I supposed to just do the handling by extracting the information
provided by expat? If so, is there any type of 'hook' system where I can
plug in my handlers?

I'm a little confused as to how expat is supposed to handle an arbitrary
XML file where the tags could be describing anything.

As you can see I'm pretty confused :(

TIA

--
-------------------------------------------------------------------

417 Davey Laboratory           | web  : www.rajarshi.outputto.com
Department of Chemistry        | ICQ  : 123242928
Pennsylvania State University  | AIM  : LoverOfPanda
-------------------------------------------------------------------
GPG Fingerprint: DCCB 4D1A 5A8B 2F5A B5F6  1F9E CDC4 5574 9017 AF2A
Public Key     : http://www.*-*-*.com/
-------------------------------------------------------------------
Mathematics consists of proving the most obvious thing
in the least obvious way.
                -- George Polya



Tue, 22 Jun 2004 07:06:28 GMT  

Quote:

> How can I indicate to the parser that if, say, it comes upon a tag like
> <coordinate> it should call a certain handler to process the coordinate
> data?

> Or am I supposed to just do the handling by extracting the information
> provided by expat? If so, is there any type of 'hook' system where I can
> plug in my handlers?

If you want to use expat directly, you need to get used to the notion
of event-based processing. The parser invokes a callback (aka handler)
for each chunk of XML input. That callback must perform all the
processing.  It will invoke the same callback for all opening tags, so
inside this callback, you must look at the element name. If it is
coordinate, you set a global variable remembering that you have just
seen the <coordinate> tag.

Each of the other callbacks needs to check that global variable. If
it is set, do whatever processing the coordinate data requires.

If the end-element callback is invoked, and the tag is coordinate,
clear the global variable.
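
As a rough illustration of the flag-based approach just described, here is a minimal sketch using xml.parsers.expat. It is written for Python 3, the sample document is made up for the example (it is not real CML), and the function name extract_coordinates is hypothetical. A small dictionary shared by the callbacks stands in for the global variable.

```python
import xml.parsers.expat

# Made-up sample input, for illustration only (not real CML).
SAMPLE = """<?xml version="1.0"?>
<molecule>
  <atom id="a1"><coordinate>1.0 2.0 3.0</coordinate></atom>
  <atom id="a2"><coordinate>4.0 5.0 6.0</coordinate></atom>
</molecule>
"""

def extract_coordinates(xml_text):
    """Return the text of every <coordinate> element, in document order."""
    results = []
    # Shared state playing the role of the "global variable".
    state = {"inside": False, "chunks": []}

    def start_element(name, attrs):
        # Called for every opening tag; we only react to <coordinate>.
        if name == "coordinate":
            state["inside"] = True
            state["chunks"] = []

    def end_element(name):
        # Called for every closing tag; clear the flag on </coordinate>.
        if name == "coordinate":
            results.append("".join(state["chunks"]))
            state["inside"] = False

    def char_data(data):
        # Text may arrive in several pieces, so collect them while the flag is set.
        if state["inside"]:
            state["chunks"].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)
    return results

coords = extract_coordinates(SAMPLE)
print(coords)  # ['1.0 2.0 3.0', '4.0 5.0 6.0']
```

The character-data callback appends to a list rather than assigning, because expat is free to deliver the text of one element in more than one call.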

Alternatively, you may want to look at the DOM API. There, the parser
reads the entire document first, and gives you a tree structure. You
can then traverse this tree structure as you want.

Regards,
Martin



Tue, 22 Jun 2004 10:38:32 GMT  
On Thursday 03 January 2002 21:38 in comp.lang.python Martin v. Loewis wrote:

Quote:

>> Or am I supposed to just do the handling by extracting the information
>> provided by expat? If so, is there any type of 'hook' system where I can
>> plug in my handlers?

[snip]
> Alternatively, you may want to look at the DOM API. There, the parser
> reads the entire document first, and gives you a tree structure. You
> can then traverse this tree structure as you want.

Thanks for the pointer regarding DOM - it was very useful.
I do have another question - how can I traverse the tree and see which
tags got read in? Where can I look up the methods available and
the actual tree structure? I tried reading the actual .py file but I got
all mixed up - is there any reference for the available methods?

TIA,

--
-------------------------------------------------------------------

417 Davey Laboratory           | web  : www.rajarshi.outputto.com
Department of Chemistry        | ICQ  : 123242928
Pennsylvania State University  | AIM  : LoverOfPanda
-------------------------------------------------------------------
GPG Fingerprint: DCCB 4D1A 5A8B 2F5A B5F6  1F9E CDC4 5574 9017 AF2A
Public Key     : http://pgp.mit.edu/
-------------------------------------------------------------------
A mathematician is a device for turning coffee into theorems.
                -- P. Erdos



Tue, 22 Jun 2004 12:41:51 GMT  

Quote:

> As you can see I'm pretty confused :(

  Why not use DOM instead of SAX?  XML is clearly tree-based, so
processing it in terms of trees (DOM) tends to be more easily
comprehensible than processing it in terms of streams and event
handlers (SAX).  SAX has definite resource utilization advantages over
DOM for read-only XML access, but I daresay that working code can eat
a great deal of memory and still perform better than a dysfunctional
collection of highly optimized fragments :)

  At its simplest, DOM parsing consists of setting up a series of
nested node iterations to reach the information you want, as in
(Python 2.2):
--------------------------------------------------------------
from StringIO import StringIO
import xml.dom.minidom as dom

theXML = """<?xml version="1.0"?>
<book edition="1">
    <title>The Fascist Menace</title>
    <authors>
        <author>
            <name_first>Josef</name_first>
            <name_last>Stalin</name_last>
        </author>
        <author>
            <name_first>Alexei</name_first>
            <name_last>Voloshnikov</name_last>
        </author>
    </authors>
    <description><![CDATA[Our fearless leader's electrifying call to
action.]]></description>
    <pages>1941</pages>
</book>
"""

doc = dom.parse(StringIO(theXML))

book = doc.documentElement
print 'edition:', book.attributes['edition'].value

# It is the first child (a text node) of the title element,
# rather than the title element itself, that contains the value.
title = book.getElementsByTagName('title')[0].firstChild.nodeValue
print 'title:', title

authors = book.getElementsByTagName('authors')[0]
for author in authors.getElementsByTagName('author'):
    # Each name lives in the text node below its element, as with the title.
    first = author.getElementsByTagName('name_first')[0].firstChild.nodeValue
    last = author.getElementsByTagName('name_last')[0].firstChild.nodeValue
    print "an author's name:", first, last

description = book.getElementsByTagName('description')[0].firstChild.nodeValue
print "The critics say [well, aside from omygodisthatakgvbofficer?]:",
print description
--------------------------------------------------------------

  Although that code is totally lax about error handling and quite
inefficient (overuse of getElementsByTagName, etc.), it's also brief,
and hopefully more comprehensible than the SAX you've been trying to
write.

  If you're processing huge data sets, DOM isn't going to cut it.  DOM
builds an in-memory representation of the entire document, whereas SAX
handles a single element at a time.  But after you have a bit of
Python and XML parsing under your belt, you can always move on to SAX
if necessary, eh?



Tue, 22 Jun 2004 12:54:22 GMT  
Hi,

Quote:
> [...] but I daresay that working code can eat
> a great deal of memory and still perform better than a dysfunctional
> collection of highly optimized fragments :)

Sure, in general, but there are a number of things about reading XML
that can be performed using a lean SAX-style implementation, e.g.
reading rather simple configuration files etc.

Quote:
> If you're processing huge data sets, DOM isn't going to cut it.  DOM
> builds an in-memory representation of the entire document, whereas SAX
> handles a single element at a time.  But after you have a bit of
> Python and XML parsing under your belt, you can always move on to SAX
> if necessary, eh?

To simplify parsing XML with SAX, I put the following two methods in
the class I use as the element handler. They dispatch the
calls of startElement/endElement to a bunch of methods called e1_start,
e1_end, e2_start, e2_end, ... for each element type (e.g. E1, E2)
occurring in the XML file (the element name is lowercased to build the
method name). That saves me a large if-construct. Very
simple, but I like to use it a lot.

    def startElement(self, name, attrs):
        mth_name = name.lower() + '_start'
        # remember the attributes so that endElement can hand them on too
        self.attr_st.append(attrs)
        if hasattr(self, mth_name):
            method = getattr(self, mth_name)
            method(attrs)
        else:
            if self.verbose:
                print 'Warning: Start of element %s skipped' % name

    def endElement(self, name):
        # attributes of the element that is being closed
        attrs = self.attr_st.pop()
        mth_name = name.lower() + '_end'
        if hasattr(self, mth_name):
            method = getattr(self, mth_name)
            method(attrs)
        else:
            if self.verbose:
                print 'Warning: End of element %s skipped' % name
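
For anyone who wants to try the idea end to end, here is a self-contained sketch of the same dispatch scheme. It is written for Python 3, and the class name DispatchingHandler, the option_start/option_end methods and the tiny config document are all made up for the illustration. It relies on the attribute stack described above so that the *_end methods also receive the attributes of the element being closed.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class DispatchingHandler(ContentHandler):
    """Dispatch startElement/endElement to <tag>_start / <tag>_end methods."""

    def __init__(self, verbose=False):
        super().__init__()
        self.verbose = verbose
        self.attr_st = []   # stack of attributes of the currently open elements
        self.seen = []      # what the hypothetical option handlers recorded

    def startElement(self, name, attrs):
        self.attr_st.append(attrs)
        method = getattr(self, name.lower() + "_start", None)
        if method is not None:
            method(attrs)
        elif self.verbose:
            print("Warning: Start of element %s skipped" % name)

    def endElement(self, name):
        attrs = self.attr_st.pop()
        method = getattr(self, name.lower() + "_end", None)
        if method is not None:
            method(attrs)
        elif self.verbose:
            print("Warning: End of element %s skipped" % name)

    # Handlers for a hypothetical <option name="..."/> element.
    def option_start(self, attrs):
        self.seen.append(("start", attrs.get("name")))

    def option_end(self, attrs):
        self.seen.append(("end", attrs.get("name")))

# Made-up configuration document for the example.
CONFIG = b'<config><option name="debug"/><option name="colour"/></config>'

handler = DispatchingHandler()
xml.sax.parseString(CONFIG, handler)
print(handler.seen)
# [('start', 'debug'), ('end', 'debug'), ('start', 'colour'), ('end', 'colour')]
```

The <config> element has no config_start or config_end method, so it is silently skipped unless verbose is set.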

Lars



Tue, 22 Jun 2004 16:39:06 GMT  

Quote:

> Where can I look up the methods available and the actual tree structure? I
> tried reading the actual .py file but I got all mixed up - is there any
> reference for the available methods?

  The standard library documentation, section 13.6 "xml.dom" describes
the library's DOM support and has links to the W3C DOM specification.
In particular, 13.6.2 "Objects in the DOM" is probably what you're
looking for when you say "reference to the available methods":
http://python.org/doc/current/lib/node438.html

Quote:
> I do have another question - how can I traverse the tree and see what
> available tags got read in?

------------------------------------------------------------------
from StringIO import StringIO
import xml.dom.minidom as dom

theXML = """<?xml version="1.0"?>
<book edition="1">
    <title>The Fascist Menace</title>
    <authors>
        <author>
            <name_first>Josef</name_first>
            <name_last>Stalin</name_last>

        </author>
        <author>
            <name_first>Alexei</name_first>
            <name_last>Voloshnikov</name_last>

        </author>
    </authors>
    <description><![CDATA[Our fearless leader's electrifying call to
action.]]></description>
    <pages>1941</pages>
</book>
"""

def printTagNames(node, indentationLevel=0):
    print indentationLevel * ' ' + node.tagName

    for child in node.childNodes:
        if child.nodeType == dom.Node.ELEMENT_NODE:
            printTagNames(child, indentationLevel+4)

doc = dom.parse(StringIO(theXML))

print '--- TAG DUMP ---'
printTagNames(doc.documentElement)
------------------------------------------------------------------
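
If all you need is a summary of which tags occur, rather than the indented dump, something along these lines may do. This is only a sketch for Python 3; the function name tag_counts and the shortened sample document are made up for the example. It uses getElementsByTagName('*'), which in minidom matches every element below the node it is called on.

```python
import xml.dom.minidom as dom

# Shortened, made-up variant of the book document used above.
SAMPLE = """<?xml version="1.0"?>
<book edition="1">
    <title>Some Title</title>
    <authors>
        <author><name_first>A</name_first><name_last>B</name_last></author>
        <author><name_first>C</name_first><name_last>D</name_last></author>
    </authors>
</book>
"""

def tag_counts(doc):
    """Map each tag name in the document to the number of times it occurs."""
    counts = {}
    # Called on the document, '*' yields every element, root included.
    for element in doc.getElementsByTagName("*"):
        counts[element.tagName] = counts.get(element.tagName, 0) + 1
    return counts

doc = dom.parseString(SAMPLE)
counts = tag_counts(doc)
for name in sorted(counts):
    print(name, counts[name])
```

This visits the whole tree once, so it shares the usual DOM caveat about memory for very large documents.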



Tue, 22 Jun 2004 19:31:03 GMT  

    ...

Quote:
> I'm a little confused as to how expat is supposed to handle an arbitrary
> XML file where the tags could be describing anything.

See http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/65248
for a specific example of using expat.  However, most often you would
instead use the more flexible interface SAX, as per example in
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/65127.

An excellent general introduction to XML and Python is at
http://www.oreilly.com/catalog/pythonxml/chapter/ch01.html (you'll
probably want to buy the whole book after reading this superb
first chapter).

Note that, be it with expat or SAX, the parsing is "event driven".
When an opening tag is encountered, what follows it is not known
yet.  So, what you normally do:

    in the start-handler for the tag of your interest, you save
    its attributes (all that's known so far) and prepare containers
    for embedded tags and text that may be needed;

    when tags or characters are received after you've seen the
    open-tag of interest but before the close-tag, you save the
    relevant information in the containers;

    at close-handler time, you process the saved information for
    the tag.

Let's take a simple example.  Say that an XML-marked-up text,
whatever else it may contain, has a tag called 'coordinate',
with a mandatory attribute 'name'; between <coordinate name="x">
and the corresponding </coordinate>, only character data may
be present (no need to worry about other embedded tags, except
maybe to diagnose an error and terminate processing).

Given such an XML file, you want to output coordinate data only
in the form of printing:

name -> coordinate data

to standard output.

OK so far?

Here's one possible approach, then:

import xml.sax

class handler(xml.sax.handler.ContentHandler):
    def startDocument(self):
        self.current_data = None
        self.current_name = None
    def endDocument(self):
        assert self.current_data is None
        assert self.current_name is None
    def startElement(self, name, attr):
        assert self.current_data is None
        assert self.current_name is None
        if name=='coordinate':
            self.current_data = []
            self.current_name = attr.get('name')
    def endElement(self, name):
        if name=='coordinate':
            assert self.current_data is not None
            assert self.current_name is not None
            print "%s -> %s" % (self.current_name,
                ''.join(self.current_data))
            self.current_data = None
            self.current_name = None
    def characters(self, content):
        if self.current_data is not None:
            self.current_data.append(content)
    def ignorableWhitespace(self, ws):
        if self.current_data is not None:
            self.current_data.append(ws)

# and some tiny self-testing...:
if __name__=='__main__':
    x = '''<?xml version="1.0" encoding="ISO8859-1"?>
    <blobof>
    One <coordinate name="a">23 45</coordinate> two <plik/>
    three <coordinate name="b">42 68</coordinate> four and
    <some>other</some> tag.
    </blobof>
    '''
    flob = open('someinput.xml', 'w')
    flob.write(x)
    flob.close()

    xml.sax.parse('someinput.xml', handler())

Normally, you want to process several different tags, and
testing for each case in startElement and endElement is
neither elegant nor productive.  Then, you can dispatch on tag name
in each of these methods, in several possible ways -- Python
makes it easy to do so via introspection, and that's a
common tack to take (by imitation of sgmllib for example).
E.g.,

    def startElement(self, name, attr):
        try: method = getattr(self, 'start_'+name)
        except AttributeError: pass
        else: method(attr)
    def endElement(self, name):
        # endElement receives no attributes, so the end-tag
        # methods are called without arguments
        try: method = getattr(self, 'end_'+name)
        except AttributeError: pass
        else: method()

and then you'd code the blocks that in the above example are
guarded by the "if name=='coordinate':" clauses into methods
    def start_coordinate(self, attr):
and
    def end_coordinate(self):

But this doesn't deeply change the nature of what's going on.
It's still basically a game of preparing object-state in the
start-tag methods, enriching it in method characters, and
processing the accumulated stuff in the end-tag methods.
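
Putting the pieces together, a complete handler along these lines might look like the sketch below. It is written for Python 3, the class name CoordinateHandler and the results list are made up for the illustration, and the end-tag methods take no arguments here since endElement receives only the tag name. Instead of printing in the end-tag method it collects (name, data) pairs, which is just a convenience for the example.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class CoordinateHandler(ContentHandler):
    """Dispatch on tag name to start_<tag> / end_<tag> methods, if present."""

    def __init__(self):
        super().__init__()
        self.current_name = None
        self.current_data = None
        self.results = []

    def startElement(self, name, attr):
        method = getattr(self, "start_" + name, None)
        if method is not None:
            method(attr)

    def endElement(self, name):
        method = getattr(self, "end_" + name, None)
        if method is not None:
            method()

    def characters(self, content):
        # Enrich the state only while inside a tag of interest.
        if self.current_data is not None:
            self.current_data.append(content)

    def start_coordinate(self, attr):
        # Prepare state: remember the attribute, start an empty container.
        self.current_name = attr.get("name")
        self.current_data = []

    def end_coordinate(self):
        # Process the accumulated text, then reset the state.
        self.results.append((self.current_name, "".join(self.current_data)))
        self.current_name = None
        self.current_data = None

SAMPLE = b"""<blobof>
One <coordinate name="a">23 45</coordinate> two <plik/>
three <coordinate name="b">42 68</coordinate> four and
<some>other</some> tag.
</blobof>"""

handler = CoordinateHandler()
xml.sax.parseString(SAMPLE, handler)
for name, data in handler.results:
    print("%s -> %s" % (name, data))
# a -> 23 45
# b -> 42 68
```

Tags without a matching start_/end_ method (blobof, plik, some) are simply ignored, which is what the AttributeError-swallowing version above does as well.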

Alex



Tue, 22 Jun 2004 21:30:18 GMT  

Quote:
> > [...] but I daresay that working code can eat
> > a great deal of memory and still perform better than a dysfunctional
> > collection of highly optimized fragments :)

> Sure, in general, but there are a number of things about reading XML
> that can be performed using a lean SAX-style implementation, e.g.
> reading rather simple configuration files etc.

  Certainly.  I wasn't suggesting that SAX should be generally
avoided, just that a newbie to both Python and XML parsing should
probably start with DOM because it tends to be more easily
comprehensible.

