GrandiJoos Posted October 6, 2007

Hi,

I have never used XPath before and could not find an example that comes close to my problem.

I have the following XML file (a Wikinews dump):

<mediawiki>
  <page>
    <title>Image:Wiki.png</title>
    <id>1</id>
    <restrictions>move=sysop:edit=sysop</restrictions>
    <revision>
      <id>92375</id>
      <timestamp>2005-06-29T02:57:00Z</timestamp>
      <contributor>
        <username>NGerda</username>
        <id>2442</id>
      </contributor>
      <text xml:space="preserve">Wikinews logo.{{CopyrightByWikimedia}}</text>
    </revision>
  </page>
</mediawiki>

(with more <page> elements, of course)

I would like to select only those pages (their <title> and <text> elements) whose title does not contain 'Category:', 'Image:', 'Template:', etc., or simply does not contain ':' at all. (I do not need all the pages, only most of them.)

What would be the right XPath expression?

Any help is greatly appreciated!

GrandiJoos
boen_robot Posted October 6, 2007

If you're sure there isn't any <page/> with a real title containing ":", as in, a real article, you can use:

/mediawiki/page[contains(title,':')]
GrandiJoos (Author) Posted October 6, 2007

Quoting boen_robot:
  If you're sure there isn't any <page/> with a real title containing ":", as in, a real article, you can use:
  /mediawiki/page[contains(title,':')]

And how do I test that the <title> does not contain a ':' and then select only the <title> and <text> elements?

GrandiJoos
boen_robot Posted October 6, 2007

Oops... forgot we're searching for negatives. Simply wrap the test in not() to invert the match:

/mediawiki/page[not(contains(title,':'))]
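
On the follow-up about getting only the title and text of each matching page: one option is to add a step for each wanted element and join the two paths with the XPath union operator, for example /mediawiki/page[not(contains(title,':'))]/title | /mediawiki/page[not(contains(title,':'))]/revision/text. The Python sketch below is an illustration only and is not from the thread. It uses the standard library's ElementTree, whose XPath subset has no contains() or not(), so the colon test is done in plain Python instead. The names select_articles and SAMPLE are hypothetical, the second <page> is made up so the filter has something to keep, and the root is assumed to carry no XML namespace, as in the sample posted above.

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the shape of the dump from the question. The second page
# is invented for illustration; it is not part of the original post.
SAMPLE = """<mediawiki>
  <page>
    <title>Image:Wiki.png</title>
    <revision>
      <text xml:space="preserve">Wikinews logo.{{CopyrightByWikimedia}}</text>
    </revision>
  </page>
  <page>
    <title>Example article</title>
    <revision>
      <text xml:space="preserve">Example body text.</text>
    </revision>
  </page>
</mediawiki>"""


def select_articles(xml_source):
    """Return (title, text) pairs for pages whose title has no ':'.

    Mirrors /mediawiki/page[not(contains(title,':'))], then picks the
    title and revision/text children of each page that is kept.
    """
    root = ET.fromstring(xml_source)  # root is the <mediawiki> element
    articles = []
    for page in root.findall("page"):
        title = page.findtext("title", default="")
        if ":" in title:
            # Skip 'Image:', 'Category:', 'Template:' and similar pages.
            continue
        text = page.findtext("revision/text", default="")
        articles.append((title, text))
    return articles


if __name__ == "__main__":
    for title, text in select_articles(SAMPLE):
        print(title, "->", text)
```

As boen_robot's caveat implies, this also drops any real article whose title happens to contain a colon. With a full XPath 1.0 engine the expressions above can be used directly, without the Python-side test.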