
Getting a standards-compliant XML parser into 2K lines is going to be a challenge if you're not going to cheat on what a "line" is. You must be able to deal with both UTF-8 and UTF-16 [1] (and remember UTF-16 can be in either endian order), you have several tables of things like which chars are valid where, you've got data structures to declare, and there are a lot of edge cases that may not leap to mind, but if you don't cover them you don't really have an XML parser: CDATA handling, entity loading from a DTD, parsing DTDs at least well enough to get those entities, processing instructions, etc. A useful subset? Certainly. Something I'd actually call a real XML parser? I'm skeptical. Not quite ready to write the idea off, but skeptical. (It's saved from me writing it off entirely because, if I read the spec correctly, a parser is not required to resolve external DTD references. If it were, it would be stick-a-fork-in-it done: you'd eat hundreds of lines just using raw sockets to do HTTP requests and manage them even halfway properly.)

A JSON parser? Heck yes, even with the UTF-8 handling. It wouldn't even necessarily suck.

[1]: http://www.w3.org/TR/REC-xml/#charsets
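
To give a feel for the UTF-8 / UTF-16 / either-endian point, here is a rough Python sketch of just the first step: guessing the encoding from the leading bytes. The function name sniff_xml_encoding is made up for illustration, and it only covers the byte-order mark and the "<?" of an XML declaration. It ignores UTF-32 and never reads the encoding="..." pseudo-attribute, so treat it as a sketch under those assumptions, not as what a conformant parser has to do.

```python
import codecs

def sniff_xml_encoding(data: bytes) -> tuple[str, int]:
    """Guess the encoding of an XML byte stream from its first bytes.

    Returns (codec_name, bytes_to_skip), where bytes_to_skip is the
    length of the byte-order mark, if one was found.
    """
    # A byte-order mark settles the question directly.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    # No BOM: "<?" in UTF-16 shows up as ASCII bytes interleaved with zeros.
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le", 0
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be", 0
    # Otherwise assume UTF-8, the default when nothing says otherwise.
    return "utf-8", 0

# Usage: decode a UTF-16 big-endian document that starts with a BOM.
doc = codecs.BOM_UTF16_BE + '<?xml version="1.0"?><a/>'.encode("utf-16-be")
encoding, skip = sniff_xml_encoding(doc)
text = doc[skip:].decode(encoding)
```

Even this toy version is a couple dozen lines, and the real decoding and char-validity tables come after it.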



A JSON parser at ~2K lines: https://github.com/johnezang/JSONKit I'm the author, so I'm obviously biased :). It has strict Unicode standard UTF-8 parsing ("passes" http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt) and has a whiz-bang LRU cache with a clever aging replacement policy that saves lots of time by reusing already seen and instantiated immutable objects (think JSON dictionaries: the keys in "key": value pairs tend to repeat an awful lot). A further benefit is this saves lots of memory too. For converting JSON to 'native' ObjC/Foundation objects, it's faster than the other ObjC JSON libraries by 2-10x (HEAVILY dependent on the JSON being parsed, typical is 2-4x), and serializing ObjC -> JSON is (usually) even faster.

So, in short, it's possible to write a very high performance JSON serializer and deserializer in ObjC in ~2K lines of code. For JSONKit, most of it is actually pure C (which in turn uses Core Foundation, a pure C API for the equivalent native ObjC objects), with public API ObjC stub bindings making use of the C code.
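
For anyone curious about the general idea of reusing already-seen keys, here is a minimal Python sketch. It is not JSONKit's code or its aging policy: the KeyCache class, its capacity default, and the plain least-recently-used eviction are all assumptions made for illustration. It only shows how a parser could hand back the same immutable string object for a repeated key instead of building a new one each time.

```python
from collections import OrderedDict

class KeyCache:
    """Tiny LRU cache mapping raw key bytes to one shared decoded string."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._items = OrderedDict()  # raw bytes -> decoded str
        self.hits = 0

    def get(self, raw: bytes) -> str:
        cached = self._items.get(raw)
        if cached is not None:
            # Seen before: mark as most recently used and reuse the object.
            self._items.move_to_end(raw)
            self.hits += 1
            return cached
        # First sighting: decode once and remember the result.
        value = raw.decode("utf-8")
        self._items[raw] = value
        if len(self._items) > self.capacity:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
        return value

# Usage: the "id" key repeated across objects is decoded only once.
cache = KeyCache(capacity=2)
first = cache.get(b"id")
second = cache.get(b"id")
```

The payoff depends heavily on how repetitive the keys are, which matches the "HEAVILY dependent on the JSON being parsed" caveat above.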


> Getting a standards-compliant XML parser into 2K lines is going to be a challenge

This is a pretty good argument against XML.


Yes, that point may have had some influence on the way I phrased my post.


I would say it's an argument for using JSON in situations where you need stuff to be small, fast and simple, and XML in situations where you need it: where you want to run queries easily over the document structure (which XPath is fairly decent at), interop with systems that insist on XML, and things that are more document-ish than data-ish.


I recently wrote a Turtle[1] parser and abbreviating serialiser[2] in just over 2K lines, which I'd say is roughly equivalent in complexity to doing the same for JSON. This is with UTF-8 support, full conformance, URI parsing/resolution, line/column error reporting, etc. A more kludgy job could be quite a bit smaller still...

[1]: http://www.w3.org/TeamSubmission/turtle/ [2]: http://drobilla.net/software/serd/


Sounds like a challenge (to someone ;) A good XML test suite to confirm "fullness" must exist?


A JSON parser (of sorts) in roughly 150 lines of C code. I'm not the author.

https://github.com/quartzjer/js0n



