
Search And Copy Xml Files Into A New Directory


crult


Hello, I'm using Oxygen 10.3 on Windows XP. I have the following problem: I have some folders with many XML files inside, and I want to find only the files that contain a specific annotation (for example <Secteur>SCI</Secteur>). The files are newspaper articles in XML format. The content SCI indicates a science article; other articles have, for example, <Secteur>SPO</Secteur> for sports. I want to find only the science articles and copy them into a new directory automatically. I used the Find option and got results, but I can't copy each file manually (there are 600 results). Is there a solution using XSLT? Thanks for your response.


Do you have any backend in place? PHP, ASP.NET, ColdFusion, etc.?

This can't be done by XSLT 1.0 alone. XSLT 1.0 can only produce one document as output, and you want multiple files as output.

If you have XSLT 2.0, you can store the locations of all files in an extra file, then for each location in that file, check this category, and then produce a new document that is the copy-of the document.

AFAIK, Oxygen's XSLT processor is XSLT 1.0. Does the editor have SAXON? SAXON is an XSLT 2.0 processor, so it could do the job (assuming that using the editor every time you want to trigger SAXON is OK for you).

Hi, no backend. Oxygen offers XPath 1.0, XPath 2.0 and XPath 2.0 SA to choose from; I don't know if that helps. How can I check the XSLT version? I will look for SAXON. If it isn't there, how can I find it (or another XSLT 2.0 processor)? I'm a beginner and it's so hard :) Thank you very much for the help!

I haven't worked with Oxygen XML, but from what I can see in its XSLT Editor page, you need to open the "Edit Scenario" window (see "Support for Multiple Transformations") and select "Saxon-** 9.*.*.*" (which one exactly you select doesn't really matter).

After that, in the <xsl:stylesheet> element, change the "version" attribute to "2.0".

Now, if you try to type in some XSLT in a template, you should see the <xsl:result-document> element. That's what you need. The "href" attribute of this element is the location where the new file will be written, and its content becomes the content of the new file. Using <xsl:copy-of select="document(.)" /> should be enough to copy the whole document (assuming that at that point "." points to the URL of the old document).


Let me describe the situation exactly. All these folders are in the same directory (the folder "XSL"):

1) The folder containing the XML files is named "01".
2) The folder where I want to put only the files containing <Secteur>SCI</Secteur> is named "SCI". The XPath is /Document/Article[1]/Secteur[1]. The XML documents look like this:

<?xml version='1.0' encoding='ISO-8859-15'?>
<Document xyurl='xyl://20040101N0001.xml'>
  <DocId>20040101N0001</DocId>
  <Article>
    <Page Lien='repository/2004/01/01/pages/04010120.pdf'>20</Page>
    <Date Annee='2004' Mois='01' Jour='01'/>
    <Publication>LeMonde</Publication>
    <Secteur>SCI</Secteur> <!-- here is the description of the category -->
    <Taille>34</Taille>
    <Corps>
      <Titraille>
        <Tetiere>AUJOURD'HUI VOYAGES</Tetiere>
        <Titres>
          <Surtitre>« Hermione », la frégate de Rochefort</Surtitre>
          <Titre>
            <P>A bord, la vie était rude</P>
          </Titre>
          <SousTitre/>
        </Titres>
      </Titraille>
      <Chapo/>
      <Origine/>
      <Texte>
        <P>Sur l' Hermione, les affûts de canon étaient peints en rouge pour faciliter le nettoyage du sang des hommes après la bataille. La « frégate de douze » était armée de 26 canons de douze (les boulets pèsent 6 kg) et 6 canons de six (boulets de 3 kg). Elle était beaucoup plus légère, rapide et maniable qu'un vaisseau taillé pour le combat avec 118 canons. A bord, l'eau est rationnée à trois pintes par homme et par jour. Les vers et les charançons infestent les biscuits de mer. L'absence de fruits et légumes frais rend le scorbut ravageur. La fièvre typhoïde, la petite vérole et la gangrène sont des maladies fréquentes. L'hygiène est absente, le sommeil mauvais. Deux matelots alternent dans un hamac, souvent trempé, à l'entrepont, espace confiné où vivent aussi les moutons embarqués vivants. Le capitaine prend soin de sa chair à canon comme d'un cheptel : il lui faut assez d'hommes vivants pour livrer combat. A cette époque, le service dans la marine est obligatoire - un an sur trois - dans les provinces maritimes du royaume.
        </P>
        <P/>
      </Texte>
      <SignaturePubliee/>
      <Note/>
      <Images/>
    </Corps>
  </Article>
  <Indexation>
    <TagAdmin1/>
    <TagAdmin2/>
    <TagAdmin3/>
    <TitreComplementaire>2 articles - description de la vie des matelots à bord de l'"Hermione"</TitreComplementaire>
    <Commentaire>Q0101/675650;</Commentaire>
    <Categories>
      <Categorie>DESCRIPTION</Categorie>
      <Categorie>ENCADRE</Categorie>
      <Categorie>ENSEMBLE</Categorie>
    </Categories>
    <Lien/>
    <Oeuvre>
      <TitresOeuvre/>
      <GenresOeuvre/>
      <AuteursOeuvre/>
    </Oeuvre>
    <SignaturesIndexees>
      <SignatureIndexee/>
    </SignaturesIndexees>
  </Indexation>
  <Etat Statut='EXPORTE'>
    <Documentaliste>DAR</Documentaliste>
    <MisesAJour> 31-12-2003 </MisesAJour>
  </Etat>
  <Historique>
    <France/>
    <Etranger/>
    <Personnes/>
  </Historique>
</Document>

3) I created a new XSL stylesheet, something like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
  <xsl:template match="/Document/Article[1]/Secteur[1]/SCI">
    <xsl:result-document href="">
      <xsl:copy-of select="document(.) "></xsl:copy-of>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>

Now I want to tell XSLT to search the folder "01", find only the files containing <Secteur>SCI</Secteur>, and copy them (without any changes) to the folder "SCI". Can you help me with the XSLT stylesheet? Another question: how can I apply my scenario to the whole folder "01"? I can't do this for every single XML file. Thank you very much!


If you have XSLT 2.0, you can store the locations of all files in an extra file, then for each location in that file, check this category, and then produce a new document that is the copy-of the document.
... for example, your index file (let's call it "index.xml") may look like:
<folder>
    <file name="20040101N0001.xml">01/20040101N0001.xml</file>
    <file name="20040101N0002.xml">01/20040101N0002.xml</file>
    <!-- etc. -->
</folder>

In that case, your XSLT should look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
    <xsl:template match="file">
        <xsl:if test="document(.)/Document/Article[1]/Secteur[1] = 'SCI'">
            <xsl:result-document href="SCI/{@name}">
                <xsl:copy-of select="document(.)" />
            </xsl:result-document>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>


Good morning, I have two questions:

- How can I automatically generate my index.xml file with the locations of all my XML files?
- My XSL is now:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
  <xsl:template match="file">
    <xsl:if test="document(.)/Document/Article[1]/Secteur[1] = 'SCI'">
      <xsl:result-document href="folder:///E:/PROJET/XSL_projet/SCI/{@name}">
        <xsl:copy-of select="document(.)" />
      </xsl:result-document>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

Is it wrong like this? I've tested it with this XML:

<?xml version="1.0" encoding="UTF-8"?>
<folder name="test">
  <file name="20040101N0001.xml">01/20040101N0001.xml</file>
  <file name="20040101N0002.xml">01/20040101N0002.xml</file>
  <file name="20040102N0063.xml">01/20040102N0063.xml</file>
</folder>

It doesn't work:

- With Saxon B 9.1.0.7: no results.
- With Saxon SA 9.1.0.7: warning, cannot validate <folder>, no element declaration available.


Your resulting path should start with "file:///", not "folder:///".

You should use Saxon B. The default configuration of Saxon SA in Oxygen appears to be one where Saxon requires the XML to be valid against a schema, which yours isn't.

You can't automatically generate the index without some kind of scripting language. However, if you do have a scripting language (see my first reply), you don't even need XSLT. You can simply use a script that opens every XML file in a directory, checks it, and copies it to wherever you want.


Perfect, that worked! I've tested it with only three files to understand the procedure. Can you please give me an example of a script, like the one you would use to generate the index? What kind of scripting language? Something like Perl? Can I have a link to instructions? Thank you very much, you have helped me so much so far. Good night! :)


You could try it with PHP. Download any PHP binary (within a ZIP archive) from http://windows.php.net/download/ (I recommend PHP 5.3 VC9 x86 Thread Safe, but any other build on that page would do fine too), and extract it into any folder.

Then create a new text file somewhere, and call it... let's say "xmlcopy.php". As its contents, type in the PHP script itself. The PHP for what you want to do would probably look something like this:

<?php
error_reporting(E_ALL | E_STRICT);
ini_set('display_errors', 'On');

// Folder that holds the XML files. It should not contain the destination folder,
// otherwise the recursive walk would visit the copies as well.
$sourceDir = 'E:\PROJET\XSL_projet\01';
$destinationDir = 'E:\PROJET\XSL_projet\SCI';
$criteriaNode = '/Document/Article[1]/Secteur[1]';
$criteriaValue = 'SCI';

/**
 * Calls a function for every file in a folder.
 *
 * @param string $callback The function to call. It must accept one argument that is a relative filepath of the file.
 * @param string $dir The directory to traverse, with a trailing directory separator.
 * @param array $types The file types to call the function for. Leave as NULL to match all types.
 * @param bool $recursive Whether to list subfolders as well.
 * @param string $baseDir String to prepend to every filepath that the callback will receive.
 */
function dir_walk($callback, $dir, $types = null, $recursive = false, $baseDir = '') {
    if ($dh = opendir($dir)) {
        while (($file = readdir($dh)) !== false) {
            if ($file === '.' || $file === '..') {
                continue;
            }
            if (is_file($dir . $file)) {
                if (is_array($types)) {
                    if (!in_array(strtolower(pathinfo($dir . $file, PATHINFO_EXTENSION)), $types, true)) {
                        continue;
                    }
                }
                $callback($baseDir . $file);
            } elseif ($recursive && is_dir($dir . $file)) {
                dir_walk($callback, $dir . $file . DIRECTORY_SEPARATOR, $types, $recursive, $baseDir . $file . DIRECTORY_SEPARATOR);
            }
        }
        closedir($dh);
    }
}

function xmlcopy($file) {
    global $sourceDir, $destinationDir, $criteriaNode, $criteriaValue;
    $dom = new DOMDocument();
    $fullFilePath = $sourceDir . DIRECTORY_SEPARATOR . $file;
    if ($dom->load($fullFilePath)) {
        $xpath = new DOMXPath($dom);
        $queryResult = $xpath->query($criteriaNode);
        if ($queryResult->length === 0) {
            echo "File\n{$fullFilePath}\nDoesn't have a node that matches the expression\n{$criteriaNode}\n\n";
        } else {
            if ($queryResult->item(0)->nodeValue === $criteriaValue) {
                // basename() drops any subfolder part, so the copy lands directly in the destination folder.
                copy($fullFilePath, $destinationDir . DIRECTORY_SEPARATOR . basename($file));
            }
        }
    } else {
        echo "Failed parsing {$fullFilePath}.\nThe parser gave the following errors:\n";
        foreach (libxml_get_errors() as $error) {
            $level = '';
            switch ($error->level) {
                case LIBXML_ERR_WARNING:
                    $level = 'Warning';
                    break;
                case LIBXML_ERR_ERROR:
                    $level = 'Error';
                    break;
                case LIBXML_ERR_FATAL:
                    $level = 'Fatal error';
                    break;
            }
            echo "{$level}\nFile {$error->file}, Line {$error->line}, Column {$error->column}\n{$error->code}: {$error->message}\n\n";
        }
        libxml_clear_errors();
    }
}

libxml_use_internal_errors(true);

if (!is_dir($sourceDir)) {
    echo "Directory\n{$sourceDir}\nDoes not exist.";
    exit(1);
}
if (!is_dir($destinationDir)) {
    echo "Directory\n{$destinationDir}\nDoes not exist. Creating... ";
    if (mkdir($destinationDir, 0777, true)) {
        echo 'Done.';
    } else {
        echo 'Failed.';
        exit(2);
    }
}

// dir_walk() concatenates $dir and the file name directly, so the trailing separator is required.
dir_walk('xmlcopy', $sourceDir . DIRECTORY_SEPARATOR, array('xml'), true);
?>

Adjust the variables at the top accordingly if you must.

To run the script, open up a command prompt (Start > (All) Programs > Accessories > Command Prompt), and type:

"path\to\php.exe" -f "path\to\xmlcopy.php"

e.g. if php.exe is at "E:\PHP\php.exe", and xmlcopy.php is at "E:\PROJET\XSL_projet\xmlcopy.php", the full line will be

"E:\PHP\php.exe" -f "E:\PROJET\XSL_projet\xmlcopy.php"

(Do not forget the quotes... they are especially important if there are spaces in the paths.)

The code is fairly bulletproof, so if you have any problems, there should be some kind of error message on screen.

BTW, this dir_walk() function is my own creation. I recently needed something like it for something else :) .
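
As for generating the index file automatically, a minimal PHP sketch in the same spirit could look like the following. The function name build_index() and its parameters are made up for this illustration, and it assumes the index format shown earlier in the thread (a <folder> element with one <file name="..."> child per XML file). It only lists files directly inside the given folder, not subfolders.

```php
<?php
/**
 * Writes an index file listing every .xml file found directly in $sourceDir.
 * Hypothetical helper; it is not part of the xmlcopy.php script.
 *
 * @param string $sourceDir Folder holding the articles.
 * @param string $indexPath Where the index file is written.
 * @param string $prefix    Path prefix stored in each <file> element, e.g. '01/'.
 * @return int Number of files listed.
 */
function build_index($sourceDir, $indexPath, $prefix = '') {
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->formatOutput = true;
    $folder = $dom->appendChild($dom->createElement('folder'));

    $names = scandir($sourceDir);
    if ($names === false) {
        return 0; // the folder could not be read
    }

    $count = 0;
    foreach ($names as $name) {
        $isXml = strtolower(pathinfo($name, PATHINFO_EXTENSION)) === 'xml';
        if (!$isXml || !is_file($sourceDir . DIRECTORY_SEPARATOR . $name)) {
            continue;
        }
        // <file name="x.xml">01/x.xml</file>
        $file = $dom->createElement('file');
        $file->setAttribute('name', $name);
        $file->appendChild($dom->createTextNode($prefix . $name));
        $folder->appendChild($file);
        $count++;
    }

    $dom->save($indexPath);
    return $count;
}
```

With the paths used in this thread, a call such as build_index('E:\PROJET\XSL_projet\01', 'E:\PROJET\XSL_projet\index.xml', '01/') at the end of the file should produce an index that the stylesheet above can consume.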


Hi, I resolved the memory problem, but I had to use a machine with 4 GB of memory. Now everything works perfectly. I need your help because I want to make some other transformations.

Can I add something to the XSLT code so that it still copies the specific file to the desired folder (as now) but also deletes the original file from the original folder after copying it?

A second question: the <Texte> tag holds the text of the article. Can I search for specific words in it? For example:

<Texte>
  <P>Sur l' Hermione, les affûts de canon étaient peints en rouge pour faciliter le nettoyage du sang des hommes après la bataille. La « frégate de douze » était armée de 26 canons de douze (les boulets pèsent 6 kg) et 6 canons de six (boulets de 3 kg). Elle était beaucoup plus légère, rapide et maniable qu'un vaisseau taillé pour le combat avec 118 canons. A bord, l'eau est rationnée à trois pintes par homme et par jour. Les vers et les charançons infestent les biscuits de mer. L'absence de fruits et légumes frais rend le scorbut ravageur. La fièvre typhoïde, la petite vérole et la gangrène sont des maladies fréquentes. L'hygiène est absente, le sommeil mauvais. Deux matelots alternent dans un hamac, souvent trempé, à l'entrepont, espace confiné où vivent aussi les moutons embarqués vivants. Le capitaine prend soin de sa chair à canon comme d'un cheptel : il lui faut assez d'hommes vivants pour livrer combat. A cette époque, le service dans la marine est obligatoire - un an sur trois - dans les provinces maritimes du royaume. </P>
  <P/>
</Texte>

I want to search all files, inside the <Texte> tag, for words such as provinces, deux and combat, and use these criteria to make the copy, as we did with the criterion <Secteur>SCI</Secteur>. Is that possible? It's useful for me because I want to make a second categorisation. After the first step I have my folder with all the SCI files (the science articles), and I want to categorise them further (for example science about cars, about the earth, etc.). Some words in the text might be useful for that.

Another question: in the resulting XML file, how can I copy only the <Texte> element that contains the article, instead of using <xsl:copy-of select="document(.)" />, which copies the whole document?

Thank you very much for the help so far, have a nice day! :)


Is that with PHP, or are you still trying it with XSLT? XSLT uses a tree under the hood, which can easily exhaust even powerful computers when dealing with large XML documents. I think that perhaps once consumed, an XML document isn't released, so in your case you have loaded all of your articles at once.

Using PHP will likely consume less RAM, as the document is (or at least should be) released as soon as the xmlcopy() function is over (i.e. when you're done dealing with the current file).

You can also compare files on other criteria by adding another "query()" call (see the script above), and do something if the file matches that criterion, or if it matches that and another criterion (whatever you want).
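
To make the "another query() call" idea concrete, here is a rough PHP sketch of three helpers that could be called from inside xmlcopy(). All three function names are invented for this example, and the XPath /Document/Article[1]/Corps/Texte is taken from the sample document posted earlier. The word check is a plain substring match that ignores case for unaccented letters only, so "deux" would also match inside a longer word.

```php
<?php
/**
 * Returns true if the first <Texte> element contains every word in $words.
 * Hypothetical helper: substring match, case-insensitive for ASCII letters only.
 */
function texte_has_words(DOMDocument $dom, array $words) {
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('/Document/Article[1]/Corps/Texte');
    if ($nodes->length === 0) {
        return false;
    }
    $text = $nodes->item(0)->textContent;
    foreach ($words as $word) {
        if (stripos($text, $word) === false) {
            return false;
        }
    }
    return true;
}

/**
 * Saves only the first <Texte> element of $dom as a new XML file,
 * instead of copying the whole document.
 */
function save_texte_only(DOMDocument $dom, $targetPath) {
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('/Document/Article[1]/Corps/Texte');
    if ($nodes->length === 0) {
        return false;
    }
    $out = new DOMDocument('1.0', 'UTF-8');
    // importNode() with true makes a deep copy that belongs to the new document.
    $out->appendChild($out->importNode($nodes->item(0), true));
    return $out->save($targetPath) !== false;
}

/**
 * Copies a file and deletes the original only if the copy succeeded.
 */
function move_file($from, $to) {
    if (!copy($from, $to)) {
        return false;
    }
    return unlink($from);
}
```

Inside xmlcopy(), a condition such as texte_has_words($dom, array('provinces', 'deux', 'combat')) could stand next to the existing <Secteur> check, and move_file() could replace the copy() call if the original should disappear from the source folder.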


Hi, I took your code and wrote this in order to exclude some categories and copy all the others... but it doesn't work. Do you have any idea why?

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
  <xsl:template match="file">
    <xsl:if test="document(.)/Document/Article[1]/Secteur[1][
        not((.,'ART')) and not((.,'ECO')) and not((.,'FRA')) and
        not((.,'INT')) and not((.,'LIV')) and not((.,'SCI')) and
        not((.,'SOC')) and not((.,'SPO')) and not((.,'TEL')) and
        not((.,'UNE')) and not((.,'ENT')) and not((.,'DER')) and
        not((.,'AGE')) and not((.,'HOR')) ]">
      <xsl:result-document href="file:///E:/PROJET/TELIKO/AUTRES/{@name}">
        <xsl:copy-of select="document(.)" />
      </xsl:result-document>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

:)


The conditions should be like

not(. = 'ART')

and not

not((.,'ART'))

Replace them all, and it should work.
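
For anyone following the PHP route suggested earlier in the thread, the same "everything except these categories" test can be sketched as below. The function name is_other_category() is made up for this example; the list of codes is the one from the stylesheet above.

```php
<?php
/**
 * Returns true when $secteur is none of the codes in $excluded.
 * Hypothetical helper; surrounding whitespace in the value is ignored.
 */
function is_other_category($secteur, array $excluded) {
    return !in_array(trim($secteur), $excluded, true);
}

// Category codes taken from the stylesheet in the question.
$excluded = array(
    'ART', 'ECO', 'FRA', 'INT', 'LIV', 'SCI', 'SOC',
    'SPO', 'TEL', 'UNE', 'ENT', 'DER', 'AGE', 'HOR',
);
```

In xmlcopy(), the comparison against $criteriaValue could then become is_other_category($queryResult->item(0)->nodeValue, $excluded).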


Archived

This topic is now archived and is closed to further replies.
