
VOTE script difficulty


tantachar07


Could you create a console app that takes a Word web page and cleans up the code? I tried that in Perl, but it's a lot more complicated than I thought...


Here is a doc that needed to be cleaned; I ended up just copy-pasting from Word to DW. All I need is clean <p> tags: no styles, classes, or anything else in there. The same goes for any tag, actually. No style tag in the head section, no matter what is in it. No empty <p> tags (<p> </p>). I can try and search for the original Word doc if you need it. Thanks! :)

<html><head><meta http-equiv=Content-Type content="text/html; charset=windows-1252"><meta name=Generator content="Microsoft Word 11 (filtered)"><title>PREFACE</title><style><!-- /* Font Definitions */ @font-face	{font-family:"Times New \(W1\)";	panose-1:2 2 6 3 5 4 5 2 3 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal	{margin:0cm;	margin-bottom:.0001pt;	punctuation-wrap:simple;	text-autospace:none;	font-size:12.0pt;	font-family:"Times New Roman";}h1	{margin-top:12.0pt;	margin-right:0cm;	margin-bottom:3.0pt;	margin-left:21.6pt;	text-indent:-21.6pt;	page-break-after:avoid;	punctuation-wrap:simple;	text-autospace:none;	font-size:16.0pt;	font-family:Arial;}@page Section1	{size:612.0pt 792.0pt;	margin:72.0pt 90.0pt 72.0pt 90.0pt;}div.Section1	{page:Section1;} /* List Definitions */ ol	{margin-bottom:0cm;}ul	{margin-bottom:0cm;}--></style></head><body lang=EN-CA><div class=Section1><h1 style='margin:0cm;margin-bottom:.0001pt;text-indent:0cm'><aname="_Toc138070756"></a><a name="_Toc95619545">PREFACE</a></h1><p class=MsoNormal style='margin-right:-22.5pt;text-align:justify'> </p><p class=MsoNormal style='margin-right:-22.5pt;text-align:justify'>The purposeof this guidebook is to provide information to government agencies, industryand industrial associations, non-governmental organizations and the publicabout the methods used for the compilation of area source emissions for the 2002National Emissions Inventory of Criteria Air Contaminants. The 2002 emissionsinventory was compiled by the provincial and territorial representatives of theEmissions and Projections Working Group of the Canadian Council of Ministers ofthe Environment and published in the summer of 2005. 
Relevant area sourceemission sectors are identified with sources of base quantity, emission factorsand other information used to derive emissions of TPM, PM<sub>10</sub>, PM<sub>2.5</sub>,SO<sub>x</sub>, NO<sub>x</sub>, VOCs and CO (and NH<sub><span style='font-family:"Times New \(W1\)"'>3</span></sub><span style='font-family:"Times New \(W1\)"'>,<sub></sub></span>where data were available) for the following major sourcecategories:</p><p class=MsoNormal style='margin-right:-22.5pt;text-align:justify'> </p><p class=MsoNormal style='margin-top:0cm;margin-right:-22.5pt;margin-bottom:0cm;margin-left:18.0pt;margin-bottom:.0001pt;text-align:justify;text-indent:-18.0pt'><span style='font-size:10.0pt;font-family:Symbol'>·<spanstyle='font:7.0pt "Times New Roman"'>        </span></span>industrial sector;</p><p class=MsoNormal style='margin-top:0cm;margin-right:-22.5pt;margin-bottom:0cm;margin-left:18.0pt;margin-bottom:.0001pt;text-align:justify;text-indent:-18.0pt'><span style='font-size:10.0pt;font-family:Symbol'>·<spanstyle='font:7.0pt "Times New Roman"'>        </span></span>non-industrial fuel combustion sector; </p><p class=MsoNormal style='margin-top:0cm;margin-right:-22.5pt;margin-bottom:0cm;margin-left:18.0pt;margin-bottom:.0001pt;text-align:justify;text-indent:-18.0pt'><span style='font-size:10.0pt;font-family:Symbol'>·<spanstyle='font:7.0pt "Times New Roman"'>        </span></span>transportation sector;</p><p class=MsoNormal style='margin-top:0cm;margin-right:-22.5pt;margin-bottom:0cm;margin-left:18.0pt;margin-bottom:.0001pt;text-align:justify;text-indent:-18.0pt'><span style='font-size:10.0pt;font-family:Symbol'>·<spanstyle='font:7.0pt "Times New Roman"'>        </span></span>incineration sector;</p><p class=MsoNormal style='margin-top:0cm;margin-right:-22.5pt;margin-bottom:0cm;margin-left:18.0pt;margin-bottom:.0001pt;text-align:justify;text-indent:-18.0pt'><span style='font-size:10.0pt;font-family:Symbol'>·<spanstyle='font:7.0pt "Times New Roman"'>        
</span></span>miscellaneous source sector; and </p><p class=MsoNormal style='margin-top:0cm;margin-right:-22.5pt;margin-bottom:0cm;margin-left:18.0pt;margin-bottom:.0001pt;text-align:justify;text-indent:-18.0pt'><span style='font-size:10.0pt;font-family:Symbol'>·<spanstyle='font:7.0pt "Times New Roman"'>        </span></span>open source sector.</p><p class=MsoNormal style='margin-top:0cm;margin-right:-22.5pt;margin-bottom:0cm;margin-left:36.0pt;margin-bottom:.0001pt;text-align:justify'> </p><p class=MsoNormal style='margin-right:-22.5pt;text-align:justify'>Theguidebook is arranged using these categories, with sections corresponding tospecific contributing sectors within the above categories.</p><p class=MsoNormal style='margin-right:-22.5pt;text-align:justify'> </p><p class=MsoNormal style='margin-right:-22.5pt;text-align:justify'>Contributionsto this guidebook were prepared by scientists and engineers from the CriteriaAir Contaminants Section of the Pollution Data Division within EnvironmentCanada. The work was originally edited by Canadian ORTECH Environmental Inc.and prepared for publication by Graphics Plus+.</p><p class=MsoNormal> </p><p class=MsoNormal style='text-align:justify'><b><span lang=EN-US>Note toReader:</span></b></p><p class=MsoNormal style='text-align:justify'><b><span lang=EN-US> </span></b></p><p class=MsoNormal style='text-align:justify'><span lang=EN-US>The informationcontained in the guidebook may have been updated owing to more recentinformation that may have become available after the publication of thisdocument.  </span></p><p class=MsoNormal style='text-align:justify'><span lang=EN-US> </span></p><p class=MsoNormal style='text-align:justify'><span lang=EN-US>Please see the‘Addendum’ for a description of any changes that were brought about to theguide since its publication.</span></p><p class=MsoNormal><span lang=EN-US> </span></p></div></body></html>
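
For the "no empty <p> tags" requirement, a minimal C++ sketch could look like the one below. The helper name removeEmptyParagraphs is hypothetical, and the sketch assumes attributes have already been stripped, so every paragraph opens with exactly <p> in lowercase. It treats ordinary whitespace and &nbsp; entities as empty, and it does not handle paragraphs that contain only empty <span> or <b> elements.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the thread): drops <p>...</p> pairs whose
// body is only whitespace or &nbsp; entities. Assumes attributes were
// stripped beforehand, so every paragraph opens with exactly "<p>".
std::string removeEmptyParagraphs(const std::string& html)
{
    const std::string open = "<p>";
    const std::string close = "</p>";
    const std::string nbsp = "&nbsp;";
    std::string out;
    std::size_t pos = 0;

    while (pos < html.size())
    {
        std::size_t p = html.find(open, pos);
        if (p == std::string::npos)
            break;
        out.append(html, pos, p - pos);   // copy everything before this <p>

        // Skip whitespace and &nbsp; entities after the opening tag.
        std::size_t i = p + open.size();
        while (i < html.size())
        {
            if (std::isspace(static_cast<unsigned char>(html[i])))
                ++i;
            else if (html.compare(i, nbsp.size(), nbsp) == 0)
                i += nbsp.size();
            else
                break;
        }

        if (html.compare(i, close.size(), close) == 0)
        {
            pos = i + close.size();       // empty paragraph: skip it entirely
        }
        else
        {
            out += open;                  // real content follows: keep the tag
            pos = p + open.size();
        }
    }
    out.append(html, pos, std::string::npos);  // copy whatever is left
    return out;
}
```

Running this after the attribute clean-up, rather than before, keeps the matching simple, because a paragraph such as <p class=MsoNormal> </p> has by then been reduced to <p> </p>.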


OK, here is what I have so far; it is not complete yet. It removes <style></style> and <meta>, and catches most tags with attributes in them, but misses tags like this

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    string fileName;
    string contents;

    cout << "Enter path and file name to load: ";
    getline(cin, fileName);  // getline, so paths containing spaces work

    //cout << "Enter path and file name to create: ";
    //cin >> outFile;

    // binary mode, so the size from ftell matches what fread returns
    FILE * pFile = fopen(fileName.c_str(), "rb");
    if (pFile == NULL)
    {
        cout << "Could not open file! Exiting..." << endl << endl;
        system("pause");  // Windows-only console pause
        exit(1);
    }

    //obtain file size.
    fseek(pFile, 0, SEEK_END);
    long lSize = ftell(pFile);
    if (lSize < 0)
        lSize = 0;
    rewind(pFile);

    //allocate memory to contain the whole file.
    char * buffer = (char*) malloc(lSize > 0 ? lSize : 1);
    if (buffer == NULL)
    {
        cout << "Error initializing buffer! Exiting..." << endl << endl;
        fclose(pFile);
        system("pause");
        exit(2);
    }

    //copy the file into the buffer.
    //the buffer is not null-terminated, so pass the length explicitly.
    size_t bytesRead = fread(buffer, 1, lSize, pFile);
    contents.assign(buffer, bytesRead);

    //remove <style></style>
    size_t start = contents.find("<style");
    if (start != string::npos)
    {
        size_t end = contents.find("</style>", start);
        if (end != string::npos)
            contents.erase(start, end + 8 - start);
    }

    //clean remaining HTML tags
    //each tag is rewritten as [tag] so the loop does not find it again
    string tmp = contents;
    int loop = 0;
    string oldTag;
    string newTag;
    while (tmp.find("<") != string::npos && loop < 100000)
    {
        start = tmp.find("<");
        size_t end = tmp.find(">", start);
        if (end == string::npos)
            break;  // unterminated tag: stop instead of letting replace() throw
        tmp.replace(start, 1, "[");
        tmp.replace(end, 1, "]");
        oldTag = tmp.substr(start, end - start + 1);
        size_t s = oldTag.find(" ");
        if (s != string::npos)
        {
            //keep only the tag name; everything after the first space goes
            newTag = oldTag.substr(0, s) + "]";
            tmp.replace(start, oldTag.size(), newTag);
            //cout << newTag << endl;
        }
        loop++;
    }

    //turn the [tag] markers back into <tag>
    //note: literal [ and ] in the document text get converted as well
    loop = 0;
    while (tmp.find("[") != string::npos && loop < 100000)
    {
        start = tmp.find("[");
        size_t end = tmp.find("]", start);
        if (end == string::npos)
            break;
        tmp.replace(start, 1, "<");
        tmp.replace(end, 1, ">");
        loop++;
    }

    //remove the now attribute-free <meta> tags
    while (tmp.find("<meta>") != string::npos)
    {
        tmp.erase(tmp.find("<meta>"), 6);
    }
    contents = tmp;

    cout << tmp << endl << endl;
    system("pause");

    //close file and free memory
    fclose(pFile);
    free(buffer);
    return 0;
}
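
As an alternative to swapping < and > for brackets and back, the attributes can be dropped in a single pass over the text. The sketch below is not the code from the post: the function name stripAttributes is hypothetical. It cuts each tag at the first whitespace character of any kind, on the assumption (a guess, since the example of a missed tag did not survive in the post) that some tags slip through because the tag name is followed by a line break or tab rather than a space. It also assumes the <style> block and any comments have been removed first, and that no attribute value contains a > character.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not the thread author's code): copies the input once
// and keeps only the name of every tag, so <p class=x> becomes <p>.
std::string stripAttributes(const std::string& html)
{
    std::string out;
    std::size_t pos = 0;

    while (pos < html.size())
    {
        std::size_t lt = html.find('<', pos);
        if (lt == std::string::npos)
            break;
        std::size_t gt = html.find('>', lt);
        if (gt == std::string::npos)
            break;                             // unterminated tag: copy the rest as-is

        out.append(html, pos, lt - pos);       // text before the tag

        // The tag name ends at the first whitespace character or at '>'.
        std::size_t nameEnd = lt + 1;
        if (nameEnd < gt && html[nameEnd] == '/')
            ++nameEnd;                         // closing tag such as </p>
        while (nameEnd < gt &&
               !std::isspace(static_cast<unsigned char>(html[nameEnd])))
            ++nameEnd;

        out.append(html, lt, nameEnd - lt);    // "<p" or "</p"
        out += '>';
        pos = gt + 1;
    }
    out.append(html, pos, std::string::npos);  // remaining text
    return out;
}
```

Because nothing is ever rewritten to [ or ], square brackets in the document text pass through unchanged, and each character is copied only once instead of the string being searched from the start for every tag.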


This topic is now closed to further replies.