Join the SAX Parsing Party

Posted by feydr | Posted in Uncategorized | Posted on 01-07-2010


Many people use the DOM-based parsing methods available in easy-to-use libraries in most languages, but few ever need the power of SAX. Let me invite you to the party!

We’ve all been there: we need to parse an XML document, so we pull out some XPath expression and get to work on grabbing our little bit of data from our document.

token = REXML::XPath.first(doc, "/books/title")

This works great and is easy to use no matter your language of choice for MOST things. Now, let’s pretend this particular document is 10 meg — fuck that, let’s pretend it’s 100 meg — actually let’s get real and call it 2.5 gig of xml.

2.5!!! WTF!? That won’t fit into my memory!?!?

No, it won’t if you are using DOM parsing since that is based upon building a tree. That is if you are willing to wait for the fucker to even load itself into the tree — forget the memory usage.
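
To make the tree-building point concrete, here is a minimal sketch of the DOM route in REXML. The sample document, the constant SAMPLE_BOOKS and the helper names first_title and all_titles are made up for illustration. The thing to notice is that REXML::Document.new has to chew through the entire input and build the whole tree before a single XPath query can run.

```ruby
require 'rexml/document'

# Hypothetical miniature document; the real one would be read from disk.
SAMPLE_BOOKS = <<~DOC
  <books>
    <title>First Book</title>
    <title>Second Book</title>
  </books>
DOC

# Build the full in-memory tree, then ask XPath for the first match.
def first_title(xml)
  doc = REXML::Document.new(xml)              # parses everything up front
  REXML::XPath.first(doc, "/books/title").text
end

# Same tree, but collect every matching element's text.
def all_titles(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "/books/title").map(&:text)
end

puts first_title(SAMPLE_BOOKS)
puts all_titles(SAMPLE_BOOKS).inspect
```

On a document this small the tree costs nothing. The cost only shows up when the input runs to gigabytes.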



I ran into this problem earlier today — memory hit a little over what the document size was in htop but the cpu just kept burning while the world was turning. I eventually killed it and contemplated splitting the xml apart — of course … I was going to run into the exact same problem if I continued to utilize the old worn out DOM parsing methods.



This is when I was turned on to SAX parsing by some random mailing list plea for help. All I need to know I learnt from man pages and mailing lists. I had run across SAX in the past but was not too excited that I had to write a whole handler class JUST to parse my document rather than pass it some tried and true XPath. Basically I was used to parsing HTML DOM, NOT million-line XML documents. I had read in the past that it was faster, but my feeble attempts at comparing documents around 200k in size were just an exercise in futility; at that size there was no meaningful difference to find.
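
For anyone who has not seen the event-driven style before, here is a rough sketch of what "writing a handler" means with REXML's StreamListener. The class name EventTrace, the helper trace_events and the one-line sample document are all hypothetical. The parser calls tag_start, text and tag_end as it reads, and nothing is kept in memory unless the listener chooses to keep it.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that simply records each event in arrival order.
class EventTrace
  include REXML::StreamListener
  attr_reader :events

  def initialize
    @events = []
  end

  # Fired when an opening tag is read.
  def tag_start(name, _attrs)
    @events << [:start, name]
  end

  # Fired for text between tags; whitespace-only runs are ignored here.
  def text(data)
    @events << [:text, data] unless data.strip.empty?
  end

  # Fired when a closing tag is read.
  def tag_end(name)
    @events << [:end, name]
  end
end

# Stream the source through the listener and return what it saw.
def trace_events(source)
  listener = EventTrace.new
  REXML::Document.parse_stream(source, listener)
  listener.events
end

p trace_events("<books><title>Dune</title></books>")
```

The listener here hoards every event only so there is something to print. A real handler would act on each event and throw it away.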


Needless to say knowing the power of SAX is like shooting Colombian cocaine, XPath is drinking a pot of coffee, and regex is getting punched in the nuts while you are asleep.

Now let me tell the audience what happened. That little job I killed, the one that was bitch-smacking one of my cores around for over 10 minutes and using over 10% of 6 gig of memory, did the EXACT same extraction in under 10 seconds with SAX, utilizing something like 0.1% of memory and only 70%ish CPU. I, to use the phrase lightly, shit a fucking brick!



Want to see an utterly contrived example of pulling 40+k urls from a sitemap which tied up my cores for-fucking-ever using old style DOM parsing but was done in less than 10 seconds with SAX?

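require 'rexml/document'
require 'rexml/streamlistener'
require 'mysql'
include REXML

# Placeholder credentials; swap in your own.
MY = Mysql.new("host", "luser", "assword", "dbname")
# Prepare the insert once and reuse it for every URL.
INSERT_LINK = MY.prepare("insert into links (name) values (?)")

class ParseThatShit
  include REXML::StreamListener

  # We don't care which tag we are in, only about the text inside it.
  def tag_start(*args)
  end

  # Called for every run of text between tags.
  def text(data)
    return if data =~ /^\w*$/   # skip text with an empty or single-word line (e.g. whitespace between tags)
    return if data =~ /\d\.\d/  # skip anything containing a decimal number such as 0.5
    INSERT_LINK.execute(data)
  end
end

pts = ParseThatShit.new
# The block form closes the file once parsing is done.
File.open("big-ass-sitemap.xml") do |xmlfile|
  Document.parse_stream(xmlfile, pts)
end
INSERT_LINK.close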

Remember boys and girls: this example is just an example to show you the power of SAX. Does it take more LOC than a 3 line XPath block? Yeh, but so does your mother.
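
If you want to play with the idea without a MySQL server lying around, here is a database-free sketch. It swaps the regex filtering for a different technique: it tracks whether the parser is currently inside a <loc> element and only keeps text seen there. LocCollector, sitemap_urls, SAMPLE_SITEMAP and the example.com URLs are all made up for illustration; this is not the code that produced the timings above.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: keeps only the text found inside <loc> elements.
class LocCollector
  include REXML::StreamListener
  attr_reader :urls

  def initialize
    @urls = []
    @in_loc = false
  end

  def tag_start(name, _attrs)
    return unless name == "loc"
    @in_loc = true
    @urls << String.new   # fresh buffer; text may arrive in several pieces
  end

  def text(data)
    @urls.last << data if @in_loc
  end

  def tag_end(name)
    @in_loc = false if name == "loc"
  end
end

# Accepts a String or an open File and returns the collected URLs.
def sitemap_urls(source)
  collector = LocCollector.new
  REXML::Document.parse_stream(source, collector)
  collector.urls.map(&:strip)
end

# Hypothetical two-entry sitemap for demonstration.
SAMPLE_SITEMAP = <<~XML
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://example.com/</loc><priority>0.5</priority></url>
    <url><loc>http://example.com/about</loc><changefreq>daily</changefreq></url>
  </urlset>
XML

puts sitemap_urls(SAMPLE_SITEMAP)
```

For a real file you would pass an open File instead of a string, for example File.open("big-ass-sitemap.xml") { |f| sitemap_urls(f) }. Note that this version still holds every URL in an array, so for truly huge sitemaps you would handle each URL inside tag_end instead of collecting them.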

Use the power of SAX and drink the spice while you slip into the higher dimension of parsing power!

  • anon_anon
    SAX is somewhat hard to use, have you looked at vtd-xml?
  • feydr
    Hi anon_anon -- for xml documents that are not well-formed I agree SAX is prob. going to make you pull out your hair -- but then again -- hopefully the only things that are not well formed are going to be webpages, in which case you can resort to DOM traversal instead. I have not heard of vtd-xml -- thanks for pointing that out to me. -- Having not tested vtd-xml take this with a grain of salt but I am a bit concerned over its claim of storing 1.3-1.5x the size of the xml document in memory -- that is definitely NOT something I want for large documents -- definitely worth a look though.