xml recursive search


rahn

Hi folks, thanks for the assistance. I have a couple thousand XML docs, each in its own directory, and I am not sure about the best way to set up a simple keyword search through all of them. Can someone point me in the right direction? I have a typical LAMP setup, so I was thinking that the way to go would be to set up an XQuery, stick the results in an array, recurse through all the directories, and then pump it out with PHP. Or should I just do a regex search through all the docs, store references to the matched files in an array, and then call up each XML file with an XSLT? Or should I be a bad boy and stick all the XML docs into MySQL, search through the full text, and then call up the XML file with XSLT? I'm confused, but want to do this the right way! Thanks. R.


XQuery sounds like the best option to me, but I don't think PHP has any support for it just yet. It only supports XPath 1.0 and DOM, which could be enough if you use them wisely. I had never really thought about this before, but now that I think about it, here's the logic you'd need in PHP:

  1. Scan the directory for the targeted files. I think the scandir() function might be a good start, though it only lists a single directory, so each subdirectory needs its own call. I'm not sure how to keep only the XML files and/or filter those files even further (say, by directory).
  2. Loop through the array and load each file from it.
  3. Perform the desired XPath expression over the current file.
  4. Generate an XML (fragment) containing each result (and optionally some related data).
  5. Append the created fragment to a single, newly created DOM tree and go to the next result, so that every result ends up in that one tree.
  6. Take the newly created DOM tree as an XML input and use an XSLT file to create the result page.
  7. Show the user the transformation.
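Steps 1 to 5 can be sketched roughly as follows. This is only an illustration, written in Python with its standard library rather than PHP, and the names `find_xml_files`, `search`, `<results>` and `<hit>` are made up for the example. It also uses ElementTree's limited XPath subset rather than full XPath 1.0, and it stops before the XSLT step because Python's standard library has no XSLT processor.

```python
import os
import xml.etree.ElementTree as ET


def find_xml_files(root_dir):
    """Step 1: walk root_dir recursively and yield every path ending in .xml."""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # visit subdirectories in a predictable order
        for name in sorted(filenames):
            if name.lower().endswith(".xml"):
                yield os.path.join(dirpath, name)


def search(root_dir, path_expr, keyword):
    """Steps 2-5: load each file, evaluate path_expr, collect hits in one tree.

    path_expr must be in ElementTree's XPath subset, e.g. ".//title".
    The <results>/<hit> element names are hypothetical.
    """
    results = ET.Element("results")
    needle = keyword.lower()
    for path in find_xml_files(root_dir):
        try:
            tree = ET.parse(path)
        except ET.ParseError:
            continue  # skip documents that are not well-formed
        for node in tree.getroot().iterfind(path_expr):
            text = "".join(node.itertext())
            if needle in text.lower():
                # One <hit> per matching node, remembering the source file.
                hit = ET.SubElement(results, "hit", {"file": path})
                hit.text = text.strip()
    return results
```

Calling `search("docs", ".//title", "keyword")` (with a hypothetical `docs` directory) returns a single `<results>` element that could then be serialized with `ET.tostring()` and handed to whatever XSLT processor is available for steps 6 and 7.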

Tough? Maybe not as much as it sounds, but it's not a walk in the park... for me at least. When XPath 2.0 and XQuery 1.0 are supported in PHP, things will get easier. By using the standard doc() function, you'll be able to generate the tree without creating a new DOM instance for every document.


  • 2 weeks later...

Could the Google search engine be used on just-created files? Wow :) If so, can't Google simply search the XML files? Anyhow, that won't be useful if you want to have your own result page. It's also impractical if you're only searching for certain text at a known position.


Well, I found that parsing through all the XML docs is really slow, so what I ended up doing was extracting a few of the more important nodes and sticking those in MySQL. A search on a few fields is really fast; then Sablotron transforms the matching XML doc with an XSLT. What I meant by using Google was to do a Sablotron run on all the XML files to transform them and save them as static HTML. Google would pick them up a few days later, probably. But the current solution seems to be working OK, though Sablotron sometimes chokes on files.


I see why you didn't like my solution... it would only work with PHP5, and Sablotron is only part of PHP4. PHP5's libxslt is faster.
