Reading Html With Xslt


jdjohns74

You can read XHTML with XSLT, regardless of whether the document is served as "text/html" or not. (That is a primary reason why I keep recommending that people stay with XHTML instead of going back to HTML 4.01.) In the common case, though, you won't be able to read an HTML document. Most HTML documents contain at least one empty element that is left unclosed (such as <br> or <img>), or are not well-formed as XML because of an unescaped ampersand in an attribute value. If you find a valid HTML document that looks like XHTML (with no empty elements, or only ones that are closed with "/"), it will work.

In what environment do you want to do this XSLT processing? Can you use Tidy in it? If so, you can use Tidy to convert the HTML to XHTML, and then use XSLT to convert it into whatever you want.
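To see why typical HTML trips up an XML-based tool, here is a minimal sketch using only Python's standard library. It simply asks an XML parser whether the markup is well-formed, which is the precondition for handing it to an XSLT processor. The helper name `is_well_formed` and both sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML style: <br> is never closed and the "&" in the href is not escaped.
html_style = (
    '<html><body><p>Line one<br>line two</p>'
    '<a href="page?a=1&b=2">link</a></body></html>'
)

# XHTML style: <br/> is closed and the ampersand is written as &amp;.
xhtml_style = (
    '<html><body><p>Line one<br/>line two</p>'
    '<a href="page?a=1&amp;b=2">link</a></body></html>'
)

print(is_well_formed(html_style))   # False
print(is_well_formed(xhtml_style))  # True
```

Only the second string would be accepted as input by an XSLT processor that expects XML.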

If the HTML document is well-formed XHTML then you can certainly use it with XSLT, the same way you use other XML documents with XSLT, i.e. either as the primary input document or as a secondary input document that you read in with the XSLT document function or (with XSLT 2.0) the XPath 2.0 doc function.

If the HTML document is not well-formed XHTML then it depends on the XSLT processor you use and the kind of input it allows. For instance, XSLT processors implemented in Java often accept input in the form of an org.xml.sax.Parser or org.xml.sax.XMLReader, and there are tools like TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) that parse HTML but expose such an interface, so that way you can feed HTML to your XSLT stylesheet. In the .NET world there is SgmlReader, which can parse HTML and exposes a System.Xml.XmlReader API that you can pass to .NET APIs taking that type as input, e.g. XslCompiledTransform's Transform method.

In the browser world, XSLT 1.0 browsers like Mozilla or Opera do allow HTML DOM documents as the input to XPath and XSLT, but I think that is restricted to documents already loaded in a browser window or frame, as I don't think there is an API to load an HTML document solely in memory without rendering it.

Finally, if you use XSLT 2.0, David Carlisle has written an HTML tag soup parser in pure XSLT 2.0 (http://code.google.com/p/web-xslt/source/browse/trunk/htmlparse/htmlparse.xsl), so you can read in an HTML document as a string with the unparsed-text function and then use David's parser to process it.
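The idea behind tools like TagSoup and SgmlReader (read lenient HTML, hand well-formed XML to the next stage) can be sketched in Python with the standard library alone. This is only a rough illustration, not a substitute for those tools: the names `HtmlToTree`, `html_to_xml` and `VOID` are made up here, the sketch assumes a single root element, it drops comments and the doctype, and it does not apply HTML's implied end tag rules (a second <p> inside an unclosed <p> is simply nested).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HtmlToTree(HTMLParser):
    """Build an ElementTree from HTML, closing void and unclosed elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements still waiting for an end tag

    def handle_starttag(self, tag, attrs):
        # Valueless attributes such as "checked" get their own name as value.
        fixed = {k: (v if v is not None else k) for k, v in attrs}
        self.builder.start(tag, fixed)
        if tag in VOID:
            self.builder.end(tag)  # close empty elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag, or end tag of a void element: ignore
        # Close everything left open inside this element, then the element.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # ignore text outside the root element
            self.builder.data(data)


def html_to_xml(html: str) -> str:
    """Return a well-formed XML serialization of the given HTML string."""
    parser = HtmlToTree()
    parser.feed(html)
    parser.close()
    # Close whatever is still open at the end of the input.
    while parser.open_tags:
        parser.builder.end(parser.open_tags.pop())
    root = parser.builder.close()
    return ET.tostring(root, encoding="unicode")


sample = ('<html><body><p>Tom & Jerry<br>next line</p>'
          '<img src="a.png" alt=x><a href="x?a=1&b=2">link</a></body></html>')
print(html_to_xml(sample))
```

The printed string has the <br> and <img> elements closed and the ampersands escaped, so it can be saved and passed to any XSLT processor as ordinary XML.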

Thanks. Primarily, I'm working in Windows environments. I've been given a couple of technical books in HTML format, and I need to extract notes from them to create an XML document. I assume that if I can accomplish this, there will be other HTML documents that someone would want to convert or use to make XML docs. This is pretty new to me. I did install TidyCOM, including the GUI. It looked to me like it might be more work to learn it and get it working than to use Python lxml to parse the HTML. A Python guru had advised me that using XSLT might minimize what I would have to do in Python. I don't know much about any of these, so it may not matter what direction I go. I'm open to anything you might advise. Thanks again.