Simplexml Parsing Errors


clonetrooper9494

I am trying to make a script that lets the user upload the exported file of their bookmarks, then grabs all of the links and stores them in a database. The problem is that these bookmark files have a lot of syntax errors; for example, most of the tags are never closed. Here is an example bookmark file: http://www.clonedrone.com/bookmarks/bookmark_IE.htm (view the source). I tried to fix them using str_replace, but there are still more errors. This is the code I am using:

<?php
$string = file_get_contents('bookmark_IE.htm');
$string = str_replace("<p>", "", $string);
$string = str_replace("</A>", "</A></DT>", $string);
$string = str_replace("</H3>", "</H3></DT>", $string);
echo $string;
$xml = simplexml_load_string($string);
?>

And these are the error messages I am getting:

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 6: parser error : Extra content at the end of the document in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 8
Warning: simplexml_load_string() [function.simplexml-load-string]: <H1>Bookmarks</H1> in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 8
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 8
Fatal error: Call to a member function children() on a non-object in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 11

Are there still some open tags or something?


What is in $string when there are problems? Could you post one such sample? From the first error message, I'd suggest that you simply have no element to serve as a wrapper, and XML documents must have exactly one root element. Explicitly wrapping everything in an element should fix that, like so:

simplexml_load_string("<div>{$string}</div>")

The rest of the errors appear to be a consequence of the first one.
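To illustrate the root element point, here is a minimal sketch. The fragment is a made-up stand-in for the real bookmark file, and the use of libxml_use_internal_errors() to collect parser messages instead of printing warnings is an addition, not something from the original script.

```php
<?php
// Collect libxml parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical fragment with two top-level elements, like the bookmark export.
$fragment = '<H1>Bookmarks</H1><DL><DT>First</DT></DL>';

// Without a single root element the parse fails and returns false.
$withoutRoot = simplexml_load_string($fragment);
if ($withoutRoot === false) {
    foreach (libxml_get_errors() as $error) {
        echo 'line ', $error->line, ': ', trim($error->message), "\n";
    }
    libxml_clear_errors();
}

// Wrapped in one root element, the same fragment parses.
$withRoot = simplexml_load_string("<div>{$fragment}</div>");
echo $withRoot->H1, "\n";
```

Checking the return value for false before using it also avoids the "Call to a member function children() on a non-object" fatal error, which happens when the script carries on after a failed parse.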


That is not the case. While I was waiting for a reply, I did some more work. Here is the new script:

<?php
$string = file_get_contents('bookmark_IE.htm');
$string = str_replace("<p>", "", $string);
$string = str_replace("</A>", "</A></DT>", $string);
$string = str_replace("</H3>", "</H3></DT>", $string);
$string = str_replace("FOLDED ", "", $string);
$string = "<start>\n" . $string . "\n</start>\n";
echo $string;
$xml = simplexml_load_string($string);
?>

There are now too many errors to post, but I will post a few, since they all say much the same thing:

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1641: parser error : EntityRef: expecting ';' in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 13
Warning: simplexml_load_string() [function.simplexml-load-string]: <A HREF="http://w3schools.invisionzone.com/index.php?showtopic=12509&st=0&p=9660 in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 13
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 13
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1641: parser error : CharRef: invalid decimal value in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 13
Warning: simplexml_load_string() [function.simplexml-load-string]: HREF="http://w3schools.invisionzone.com/index.php?showtopic=12509&st=0&p=96608 in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 13
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 13
Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1641: parser error : xmlParseCharRef: invalid xmlChar value 0 in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 13
Warning: simplexml_load_string() [function.simplexml-load-string]: HREF="http://w3schools.invisionzone.com/index.php?showtopic=12509&st=0&p=96608 in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 13
Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in /home/clonedro/public_html/bookmarks/bookmark_reader.php on line 13

Why is it expecting a ';'?


Ampersands (&) in URLs need to be encoded as entities (&amp;). An unescaped ampersand marks the start of an entity reference (notice the "&" in "&amp;"), and ";" marks the end of one (notice the ";" in "&amp;"). So in "&st=0&p=9660", the parser takes "st" as the start of an entity name and expects a ";" to close the reference, but it finds "=" instead, and shortly afterwards another "&" starts a new reference ("p=9660"). Failing to end one entity reference before starting another is an error, so you get these messages.

Solving this error is going to be tricky, because you need to turn every bare "&" into "&amp;" while leaving all actual entity references (like "&nbsp;", for example) intact.

I suggest you ditch your string replacing technique and use a specialized tool like Tidy. It should generate valid (X)HTML, from which you can then extract whatever you need.

@justsomeguy, I think the line "1641" may be from the XML string... we never saw it, and I suppose the reason is exactly its size (I mean, having at least 1641 lines, it would be huge to show in this forum).
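If a hand-rolled fix is still wanted, the "escape bare ampersands but keep real entity references" rule can be sketched with a negative lookahead. This is only a heuristic under stated assumptions: the function name escape_bare_ampersands() is made up for illustration, and anything that merely looks like a reference (a name or number followed by ";") is left alone whether or not it was meant as one.

```php
<?php
// Hypothetical helper: escape every "&" that does not start something
// shaped like an entity or character reference.
function escape_bare_ampersands(string $markup): string
{
    // Leave "&" alone when it is followed by a name (&amp;), a decimal
    // reference (&#160;) or a hex reference (&#xA0;) ending in ";".
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/';
    return preg_replace($pattern, '&amp;', $markup);
}

$href = 'index.php?showtopic=12509&st=0&p=9660';
echo escape_bare_ampersands($href), "\n";
// prints: index.php?showtopic=12509&amp;st=0&amp;p=9660
```

Note that a reference such as "&nbsp;" survives untouched, but an XML parser will still reject it unless it is declared, because XML itself only predefines amp, lt, gt, quot and apos.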


... sorry about that, boen_robot... this may sound kinda stupid, but I didn't see your post. Between the time I read justsomeguy's and posted, the page never refreshed, and I never scrolled up to see that you had posted after editing... sorry, my bad... :)


Googled it and found that '&' is a reserved character, meant to help escape other characters.
Khm... did you read what I said? Perhaps I didn't say it clearly enough... sorry about that. Still, let me repeat my earlier suggestion: use Tidy. To be more precise, revert the ampersands back, then run Tidy. It will intelligently escape the right ampersands and fix any other errors that may arise.
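A rough sketch of what the Tidy route could look like, assuming the tidy extension is installed (the code does nothing otherwise). The input fragment is a made-up sample, and the choice of the output-xhtml and numeric-entities options is an assumption about what suits this file, not a tested recipe for it.

```php
<?php
// Hypothetical sample standing in for the exported bookmark file.
$html = '<DL><p><DT><A HREF="http://example.com/?a=1&b=2">Example</A></DL>';

$hrefs = [];
if (function_exists('tidy_repair_string')) {
    $config = [
        'output-xhtml'     => true, // ask Tidy for well-formed XHTML
        'numeric-entities' => true, // write named entities as numeric references
    ];
    $xhtml = tidy_repair_string($html, $config, 'utf8');

    $xml = simplexml_load_string($xhtml);
    if ($xml !== false) {
        // XHTML elements live in a namespace, so XPath needs a prefix for it.
        $xml->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');
        foreach ($xml->xpath('//x:a') as $a) {
            $hrefs[] = (string) $a['href'];
        }
    }
}
print_r($hrefs);
```

With Tidy doing the repair, none of the str_replace() calls from the earlier scripts should be needed.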

I edited my last post to explain why I missed your last post, boen_robot. I looked into Tidy, and I am afraid that my host has not installed it... I will ask them to install it, but in the meantime I will try to work around this without Tidy.
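One possible workaround without Tidy, offered as a suggestion rather than something proposed in this thread, is to skip XML parsing altogether and use PHP's DOM extension, whose loadHTML() method uses a forgiving HTML parser that tolerates unclosed tags and bare ampersands. The fragment below is a made-up sample in the style of a bookmark export; the real script would pass in the result of file_get_contents() instead.

```php
<?php
// Hypothetical fragment in the style of an exported bookmark file.
$html = <<<HTML
<DL><p>
<DT><H3 FOLDED>Forums</H3>
<DL><p>
<DT><A HREF="http://example.com/index.php?showtopic=1&st=0&p=2">Topic</A>
<DT><A HREF="http://example.org/">Home</A>
</DL><p>
</DL><p>
HTML;

// Keep the parser's complaints about the sloppy markup out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The HTML parser lowercases tag and attribute names.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = [
        'url'   => $a->getAttribute('href'),
        'title' => trim($a->textContent),
    ];
}
print_r($links);
```

Each entry in $links can then be inserted into the database, with no string replacing needed beforehand.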


