XML Behavior in parsers

shadowayex · December 21, 2010

I'm working on a C++ XML parser (just because I can), but I'm unsure how it should handle certain things.

Case 1 - Misplaced '/'

I have my parser search a tag for a '/' to determine whether the tag is self-closing or not. The issue I hit is: what if the XML file has a misplaced '/', such as <foo val="bar" / > or (even worse) <foo / val="bar" />? My instinct is to check for a '>' immediately following the '/' that was found, and if there is not one, present an error message of some sort.

Case 2 - Invalid XML

I'm unsure of what to do here in general. This is mainly because I'm not sure whether it is the job of a parser to make sure the XML is valid. Theoretically, if a file were to have something like <$ (*)="foo" />, my parser would function, creating an XMLTag object with the name $ and one XMLAttribute, (*), with the value foo. Should I be checking for such things and generating error messages?

Case 3 - Oddities

I'm not even sure if this is valid or not, but what should be the result of something like:

<foo>
 Foo
 <bar />
 Bar
</foo>

My parser has three parts to each XMLTag: XMLAttributes, child XMLTags, and a string of just content. As it stands, my parser would end up with a child XMLTag called 'bar' and '\n Foo\n\n Bar\n' in the content field.

I'm sure there are other cases that may come up as I go, but these three are the ones I'm stuck on at the moment. I just don't know enough about XML parsers to know what they do about these cases.
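To make Case 3 concrete, here is a minimal sketch of one alternative layout, in which text runs are stored as children alongside elements so that document order is preserved. The names Node, textContent and makeFooExample are hypothetical and not taken from the parser described above; the tree is built by hand rather than parsed, and a std::vector of the enclosing type as a member assumes C++17 or a standard library that tolerates it.

```cpp
#include <string>
#include <vector>

// Hypothetical node type: either an element or a run of text.
struct Node {
    enum class Kind { Element, Text };
    Kind kind;
    std::string value;           // element name, or the text itself
    std::vector<Node> children;  // only meaningful when kind == Element
};

// Concatenates the direct text children of an element. This reproduces
// what a single "content" string would hold, but can be derived on demand.
std::string textContent(const Node& element) {
    std::string out;
    for (const Node& child : element.children) {
        if (child.kind == Node::Kind::Text) out += child.value;
    }
    return out;
}

// Builds the <foo> example by hand, using the whitespace quoted in the post.
Node makeFooExample() {
    Node foo{Node::Kind::Element, "foo", {}};
    foo.children.push_back({Node::Kind::Text, "\n Foo\n", {}});
    foo.children.push_back({Node::Kind::Element, "bar", {}});
    foo.children.push_back({Node::Kind::Text, "\n Bar\n", {}});
    return foo;
}
```

With this layout the position of bar between the two text runs is kept, while textContent(makeFooExample()) still yields the same string as the single content field.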

boen_robot · December 21, 2010

Case 1: The specification defines an empty element as

'<' Name (S Attribute)* S? '/>'

which means that "/>" is a single token. The specification also says that on such an error the parser must stop processing and should provide an error message. (I'm not sure exactly where it says that... but all parsers do it.) In other words, if you have anything after the "/" but ">", the parser must stop processing.

//break intentionally omitted :)

Case 2: It also places constraints on node names, so in the example you gave, the parser must stop with an error because of the invalid start tag name. All in all, if in doubt, consult the specification. That's why it's public.

break; // :)

Case 3: Text nodes are defined as a separate kind of node, which should be part of your parser's child collection. If you really want to also keep the content for convenience, I guess you're doing it right in this case.

FYI, dealing with Unicode is probably the hardest part of XML parsing... C++ can natively only deal with ANSI... do you have that covered?
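As a rough illustration of Cases 1 and 2, the checks could be sketched as below. This is only a sketch under stated assumptions: TagEnd, classifyTagEnd and isValidName are hypothetical names, the input is assumed to be the full text of one start or empty-element tag (not an end tag), and the name check is an ASCII-only approximation of the specification's name rules, which also admit many non-ASCII characters.

```cpp
#include <cstddef>
#include <string>

enum class TagEnd { Open, SelfClosing, Error };

// Classifies a start tag or empty-element tag given as "<...>" text.
// A '/' or '>' inside a quoted attribute value is ignored.
TagEnd classifyTagEnd(const std::string& tag) {
    char quote = '\0';  // active attribute-value delimiter, '\0' when outside a value
    for (std::size_t i = 0; i < tag.size(); ++i) {
        const char c = tag[i];
        if (quote != '\0') {
            if (c == quote) quote = '\0';  // closing quote ends the value
        } else if (c == '"' || c == '\'') {
            quote = c;
        } else if (c == '/') {
            // "/>" is one token: '>' must follow immediately and end the tag.
            const bool closesHere = (i + 2 == tag.size()) && tag[i + 1] == '>';
            return closesHere ? TagEnd::SelfClosing : TagEnd::Error;
        } else if (c == '>') {
            return (i + 1 == tag.size()) ? TagEnd::Open : TagEnd::Error;
        }
    }
    return TagEnd::Error;  // input ended before any '>'
}

// ASCII-only approximation of the characters allowed to start a name.
bool isNameStart(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_' || c == ':';
}

// ASCII-only approximation of the characters allowed after the first one.
bool isNameChar(char c) {
    return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
}

bool isValidName(const std::string& name) {
    if (name.empty() || !isNameStart(name[0])) return false;
    for (char c : name) {
        if (!isNameChar(c)) return false;
    }
    return true;
}
```

Under these assumptions, both <foo val="bar" / > and <foo / val="bar" /> classify as TagEnd::Error, and neither $ nor (*) passes isValidName.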

shadowayex · December 21, 2010

So, my parser has to just stop when it finds an error and provide a message. Simple enough.

I'm reading through the specification now. I will probably be back with more questions, but it might be a while, as I need to figure out how to read the specification before I can even attempt to go further.

Thank you for your help so far, and I'll be back (hopefully) soon.
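One possible way to "stop and provide a message" in C++ is to throw an exception that carries the position of the problem. The sketch below is an assumption about how that might look, not a description of any existing parser: XmlParseError and fail are hypothetical names, and columns are counted in bytes rather than characters.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical exception type carrying a 1-based line and column.
class XmlParseError : public std::runtime_error {
public:
    XmlParseError(const std::string& message, std::size_t line, std::size_t column)
        : std::runtime_error("line " + std::to_string(line) + ", column " +
                             std::to_string(column) + ": " + message),
          line_(line), column_(column) {}

    std::size_t line() const { return line_; }
    std::size_t column() const { return column_; }

private:
    std::size_t line_;
    std::size_t column_;
};

// Converts a byte offset in the input into a line/column pair and throws.
// Parsing ends at the first call, since nothing after the throw is executed.
[[noreturn]] void fail(const std::string& input, std::size_t offset,
                       const std::string& message) {
    std::size_t line = 1;
    std::size_t column = 1;
    for (std::size_t i = 0; i < offset && i < input.size(); ++i) {
        if (input[i] == '\n') {
            ++line;
            column = 1;
        } else {
            ++column;
        }
    }
    throw XmlParseError(message, line, column);
}
```

The caller would wrap the whole parse in a try block and print e.what() in the handler, so the first error ends processing and the user still sees where it happened.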
