Extracting XML table data from a file with generic node names

I would like to extract data from Associated Press files and put it in tabular form for my work at a newspaper. I understand how to extract data when the node names are descriptive, such as <booklist> or <author>, but the files from the AP contain only <table>, <th>, <tr>, and <td>, and I can't seem to extract the data. The file format is called NITF XML. I want to lay the data out in a table similar in style to one that would be printed in a newspaper. Can anyone help me with this? Thank you!

Here is a sample file as it appears when downloaded from their wire:

<nitf xmlns="http://ap.org/schemas/03/2005/nitf"> <head> <meta name="ap-transref" content="s0225" /> <meta name="ap-origin" content="dss" /> <meta name="ap-selector" content="-----" /> <meta name="ap-category" content="s" /> <meta name="ap-format" content="at" /> <!-- Routing Type="Passcode" Expanded="true" Outed="false" --> <meta name="ap-routing" content="s,s1,sag" /> <meta name="ap-cycle" content="BC" /> <meta name="ap-xhl" content="BBO--Baseball Expanded Glance" /> <docdata> <doc-id regsrc="AP" /> <del-list> <from-src level-number="s0225" /> </del-list> <urgency ed-urg="3" /> <date.issue norm="2012812TZ" /> <du-key key="BC-BBO--Baseball Expanded Glance" /> <doc.rights owner="http://www.ap.org" agent="http://license.icopyright.net" type="none" /> <doc.copyright /> </docdata> </head> <body> <body.head> <hedline> <hl1>Baseball Expanded Standings</hl1> <byline>The Associated Press<byttl></byttl></byline> </hedline> <distributor>The Associated Press</distributor> </body.head> <body.content> <block> <table> <tr> <th>AMERICAN LEAGUE</th> </tr> <tr> <th>East Division</th> </tr> <tr> <th></th> <th>W</th> <th>L</th> <th>Pct</th> <th>GB</th> <th>WCGB</th> <th>L10</th> <th>Str</th> <th>Home</th> <th>Away</th> </tr> <tr> <td>New York</td> <td>67</td> <td>46</td> <td>.593</td> <td>—</td> <td>—</td> <td>7-3</td> <td>W-4</td> <td>34-22</td> <td>33-24</td> </tr> <tr> <td>Tampa Bay</td> <td>61</td> <td>52</td> <td>.540</td> <td>6</td> <td>—</td> 
<td>8-2</td> <td>W-5</td> <td>32-27</td> <td>29-25</td> </tr> <tr> <td>Baltimore</td> <td>61</td> <td>53</td> <td>.535</td> <td>6½</td> <td>½</td> <td>6-4</td> <td>L-1</td> <td>29-28</td> <td>32-25</td> </tr> <tr> <td>Boston</td> <td>56</td> <td>59</td> <td>.487</td> <td>12</td> <td>6</td> <td>3-7</td> <td>L-1</td> <td>29-34</td> <td>27-25</td> </tr> <tr> <td>Toronto</td> <td>53</td> <td>60</td> <td>.469</td> <td>14</td> <td>8</td> <td>2-8</td> <td>L-5</td> <td>28-25</td> <td>25-35</td> </tr> <tr> <th>Central Division</th> </tr> <tr> <th></th> <th>W</th> <th>L</th> <th>Pct</th> <th>GB</th> <th>WCGB</th> <th>L10</th> <th>Str</th> <th>Home</th> <th>Away</th> </tr> <tr> <td>Chicago</td> <td>61</td> <td>51</td> <td>.545</td> <td>—</td> <td>—</td> <td>6-4</td> <td>L-1</td> <td>31-26</td> <td>30-25</td> </tr> <tr> <td>Detroit</td> <td>61</td> <td>53</td> <td>.535</td> <td>1</td> <td>½</td> <td>7-3</td> <td>L-1</td> <td>33-23</td> <td>28-30</td> </tr> <tr> <td>Cleveland</td> <td>53</td> <td>61</td> <td>.465</td> <td>9</td> <td>8½</td> <td>3-7</td> <td>W-1</td> <td>30-28</td> <td>23-33</td> </tr> <tr> <td>Kansas City</td> <td>49</td> <td>64</td> <td>.434</td> <td>12½</td> <td>12</td> <td>6-4</td> <td>W-1</td> <td>21-32</td> <td>28-32</td> </tr> <tr> <td>Minnesota</td> <td>49</td> <td>64</td> <td>.434</td> <td>12½</td> <td>12</td> <td>5-5</td> <td>L-3</td> <td>23-34</td> <td>26-30</td> </tr> <tr> <th>West Division</th> </tr> <tr> <th></th> <th>W</th> <th>L</th> <th>Pct</th> <th>GB</th> <th>WCGB</th> <th>L10</th> <th>Str</th> <th>Home</th> <th>Away</th> </tr> <tr> <td>Texas</td> <td>66</td> <td>46</td> <td>.589</td> <td>—</td> <td>—</td> <td>7-3</td> <td>W-1</td> <td>35-22</td> <td>31-24</td> </tr> <tr> <td>Oakland</td> <td>61</td> <td>52</td> <td>.540</td> <td>5½</td> <td>—</td> <td>5-5</td> <td>W-1</td> <td>34-26</td> <td>27-26</td> </tr> <tr> <td>Los Angeles</td> <td>60</td> <td>54</td> <td>.526</td> <td>7</td> <td>1½</td> <td>3-7</td> <td>L-1</td> <td>31-23</td> 
<td>29-31</td> </tr> <tr> <td>Seattle</td> <td>52</td> <td>63</td> <td>.452</td> <td>15½</td> <td>10</td> <td>4-6</td> <td>W-1</td> <td>25-29</td> <td>27-34</td> </tr> </table> <p>___</p> <table> <tr> <th>NATIONAL LEAGUE</th> </tr> <tr> <th>East Division</th> </tr> <tr> <th></th> <th>W</th> <th>L</th> <th>Pct</th> <th>GB</th> <th>WCGB</th> <th>L10</th> <th>Str</th> <th>Home</th> <th>Away</th> </tr> <tr> <td>Washington</td> <td>71</td> <td>43</td> <td>.623</td> <td>—</td> <td>—</td> <td>9-1</td> <td>W-8</td> <td>32-22</td> <td>39-21</td> </tr> <tr> <td>Atlanta</td> <td>66</td> <td>47</td> <td>.584</td> <td>4½</td> <td>—</td> <td>7-3</td> <td>W-3</td> <td>32-26</td> <td>34-21</td> </tr> <tr> <td>New York</td> <td>54</td> <td>60</td> <td>.474</td> <td>17</td> <td>9½</td> <td>4-6</td> <td>L-2</td> <td>27-30</td> <td>27-30</td> </tr> <tr> <td>Miami</td> <td>52</td> <td>62</td> <td>.456</td> <td>19</td> <td>11½</td> <td>4-6</td> <td>W-1</td> <td>28-28</td> <td>24-34</td> </tr> <tr> <td>Philadelphia</td> <td>51</td> <td>62</td> <td>.451</td> <td>19½</td> <td>12</td> <td>5-5</td> <td>L-1</td> <td>25-33</td> <td>26-29</td> </tr> <tr> <th>Central Division</th> </tr> <tr> <th></th> <th>W</th> <th>L</th> <th>Pct</th> <th>GB</th> <th>WCGB</th> <th>L10</th> <th>Str</th> <th>Home</th> <th>Away</th> </tr> <tr> <td>Cincinnati</td> <td>68</td> <td>46</td> <td>.596</td> <td>—</td> <td>—</td> <td>5-5</td> <td>W-2</td> <td>36-20</td> <td>32-26</td> </tr> <tr> <td>Pittsburgh</td> <td>63</td> <td>50</td> <td>.558</td> <td>4½</td> <td>—</td> <td>4-6</td> <td>L-3</td> <td>35-20</td> <td>28-30</td> </tr> <tr> <td>St. 
Louis</td> <td>62</td> <td>52</td> <td>.544</td> <td>6</td> <td>1½</td> <td>6-4</td> <td>W-1</td> <td>34-23</td> <td>28-29</td> </tr> <tr> <td>Milwaukee</td> <td>51</td> <td>61</td> <td>.455</td> <td>16</td> <td>11½</td> <td>5-5</td> <td>L-2</td> <td>33-26</td> <td>18-35</td> </tr> <tr> <td>Chicago</td> <td>44</td> <td>68</td> <td>.393</td> <td>23</td> <td>18½</td> <td>1-9</td> <td>L-2</td> <td>28-26</td> <td>16-42</td> </tr> <tr> <td>Houston</td> <td>38</td> <td>77</td> <td>.330</td> <td>30½</td> <td>26</td> <td>3-7</td> <td>W-2</td> <td>27-31</td> <td>11-46</td> </tr> <tr> <th>West Division</th> </tr> <tr> <th></th> <th>W</th> <th>L</th> <th>Pct</th> <th>GB</th> <th>WCGB</th> <th>L10</th> <th>Str</th> <th>Home</th> <th>Away</th> </tr> <tr> <td>San Francisco</td> <td>62</td> <td>52</td> <td>.544</td> <td>—</td> <td>—</td> <td>6-4</td> <td>W-1</td> <td>33-24</td> <td>29-28</td> </tr> <tr> <td>Los Angeles</td> <td>61</td> <td>53</td> <td>.535</td> <td>1</td> <td>2½</td> <td>5-5</td> <td>L-1</td> <td>33-25</td> <td>28-28</td> </tr> <tr> <td>Arizona</td> <td>57</td> <td>57</td> <td>.500</td> <td>5</td> <td>6½</td> <td>4-6</td> <td>L-2</td> <td>30-26</td> <td>27-31</td> </tr> <tr> <td>San Diego</td> <td>51</td> <td>64</td> <td>.443</td> <td>11½</td> <td>13</td> <td>7-3</td> <td>W-6</td> <td>27-30</td> <td>24-34</td> </tr> <tr> <td>Colorado</td> <td>41</td> <td>70</td> <td>.369</td> <td>19½</td> <td>21</td> <td>4-6</td> <td>L-1</td> <td>21-37</td> <td>20-33</td> </tr> </table> <p>___</p> <table> <tr> <th>AMERICAN LEAGUE</th> </tr> <tr> <th>Saturday's Games</th> </tr> </table> <p>N.Y. Yankees 5, Toronto 2</p> <p>Cleveland 5, Boston 2</p> <p>Kansas City 7, Baltimore 3</p> <p>Oakland 9, Chicago White Sox 7</p> <p>Tampa Bay 4, Minnesota 2</p> <p>Texas 2, Detroit 1</p> <p>Seattle 7, L.A. Angels 4</p> <table> <tr> <th>Sunday's Games</th> </tr> </table> <p>Boston at Cleveland, 1:05 p.m.</p> <p>N.Y. 
Yankees at Toronto, 1:07 p.m.</p> <p>Kansas City at Baltimore, 1:35 p.m.</p> <p>Oakland at Chicago White Sox, 2:10 p.m.</p> <p>Tampa Bay at Minnesota, 2:10 p.m.</p> <p>Detroit at Texas, 3:05 p.m.</p> <p>Seattle at L.A. Angels, 3:35 p.m.</p> <table> <tr> <th>Monday's Games</th> </tr> </table> <p>Texas (Dempster 1-0) at N.Y. Yankees (Undecided), 7:05 p.m.</p> <p>Chicago White Sox (Peavy 9-8) at Toronto (Villanueva 6-2), 7:07 p.m.</p> <p>Detroit (A.Sanchez 1-2) at Minnesota (Deduno 3-0), 8:10 p.m.</p> <p>Cleveland (Masterson 8-10) at L.A. Angels (C.Wilson 9-8), 10:05 p.m.</p> <p>Tampa Bay (Cobb 6-8) at Seattle (Beavan 7-6), 10:10 p.m.</p> <table> <tr> <th>Tuesday's Games</th> </tr> </table> <p>Boston at Baltimore, 7:05 p.m.</p> <p>Texas at N.Y. Yankees, 7:05 p.m.</p> <p>Chicago White Sox at Toronto, 7:07 p.m.</p> <p>Detroit at Minnesota, 8:10 p.m.</p> <p>Oakland at Kansas City, 8:10 p.m.</p> <p>Cleveland at L.A. Angels, 10:05 p.m.</p> <p>Tampa Bay at Seattle, 10:10 p.m.</p> <p>___</p> <table> <tr> <th>NATIONAL LEAGUE</th> </tr> <tr> <th>Saturday's Games</th> </tr> </table> <p>Cincinnati 4, Chicago Cubs 2</p> <p>San Francisco 9, Colorado 3</p> <p>Houston 6, Milwaukee 5, 10 innings</p> <p>San Diego 5, Pittsburgh 0</p> <p>St. Louis 4, Philadelphia 1</p> <p>Atlanta 9, N.Y. Mets 3</p> <p>Miami 7, L.A. Dodgers 3</p> <p>Washington 6, Arizona 5</p> <table> <tr> <th>Sunday's Games</th> </tr> </table> <p>L.A. Dodgers at Miami, 1:10 p.m.</p> <p>San Diego at Pittsburgh, 1:35 p.m.</p> <p>St. Louis at Philadelphia, 1:35 p.m.</p> <p>Milwaukee at Houston, 2:05 p.m.</p> <p>Cincinnati at Chicago Cubs, 2:20 p.m.</p> <p>Colorado at San Francisco, 4:05 p.m.</p> <p>Washington at Arizona, 4:10 p.m.</p> <p>Atlanta at N.Y. Mets, 8:05 p.m.</p> <table> <tr> <th>Monday's Games</th> </tr> </table> <p>L.A. 
Dodgers (Harang 7-7) at Pittsburgh (Karstens 4-2), 7:05 p.m.</p> <p>Philadelphia (Hamels 12-6) at Miami (Eovaldi 3-7), 7:10 p.m.</p> <p>San Diego (Stults 2-2) at Atlanta (Minor 6-8), 7:10 p.m.</p> <p>Houston (Galarraga 0-2) at Chicago Cubs (Samardzija 7-10), 8:05 p.m.</p> <p>Milwaukee (Fiers 6-4) at Colorado (Francis 3-4), 8:40 p.m.</p> <p>Washington (G.Gonzalez 14-6) at San Francisco (Vogelsong 10-5), 10:15 p.m.</p> <table> <tr> <th>Tuesday's Games</th> </tr> </table> <p>L.A. Dodgers at Pittsburgh, 7:05 p.m.</p> <p>N.Y. Mets at Cincinnati, 7:10 p.m.</p> <p>Philadelphia at Miami, 7:10 p.m.</p> <p>San Diego at Atlanta, 7:10 p.m.</p> <p>Houston at Chicago Cubs, 8:05 p.m.</p> <p>Arizona at St. Louis, 8:15 p.m.</p> <p>Milwaukee at Colorado, 8:40 p.m.</p> <p>Washington at San Francisco, 10:15 p.m.</p> <p /> </block> </body.content> <body.end /> </body></nitf>
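
One possible approach, sketched in Python with the standard library's xml.etree.ElementTree (the language choice is an assumption; the post does not say which tool is in use). The root element declares a default namespace (xmlns="http://ap.org/schemas/03/2005/nitf"), so in ElementTree an unqualified lookup such as "table" matches nothing; every element name has to be qualified with that namespace. The function names extract_tables and split_sections are hypothetical, and SAMPLE is a trimmed-down stand-in for the file above. The grouping step assumes, based only on this sample, that a row with a single cell is a heading.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the root <nitf> element of the sample file.
# The prefix "n" is arbitrary; it only exists inside these queries.
NS = {"n": "http://ap.org/schemas/03/2005/nitf"}

# Trimmed-down stand-in for the AP file, kept short for illustration.
SAMPLE = """<nitf xmlns="http://ap.org/schemas/03/2005/nitf">
<body><body.content><block>
<table>
<tr><th>AMERICAN LEAGUE</th></tr>
<tr><th>East Division</th></tr>
<tr><th></th><th>W</th><th>L</th><th>Pct</th></tr>
<tr><td>New York</td><td>67</td><td>46</td><td>.593</td></tr>
<tr><td>Tampa Bay</td><td>61</td><td>52</td><td>.540</td></tr>
</table>
</block></body.content></body></nitf>"""


def extract_tables(xml_text):
    """Return each <table> as a list of rows; each row is a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iterfind(".//n:table", NS):
        rows = []
        for tr in table.iterfind("n:tr", NS):
            # Iterating an element yields its children in document order,
            # so <th> and <td> cells keep their original sequence.
            rows.append(["".join(cell.itertext()).strip() for cell in tr])
        tables.append(rows)
    return tables


def split_sections(rows):
    """Group multi-cell rows under the most recent single-cell heading row.

    Assumption drawn from the sample only: a row with exactly one cell
    ("AMERICAN LEAGUE", "East Division") is a heading, not data.
    """
    sections = []
    current = None
    for row in rows:
        if len(row) == 1:
            current = {"title": row[0], "rows": []}
            sections.append(current)
        elif current is not None:
            current["rows"].append(row)
    return sections


if __name__ == "__main__":
    for table in extract_tables(SAMPLE):
        for section in split_sections(table):
            print(section["title"])
            for row in section["rows"]:
                # Tab-separated output pastes cleanly into a spreadsheet.
                print("\t".join(row))
```

With the full file, the league heading ends up as a section with no rows of its own, followed by one section per division whose first row is the column header (W, L, Pct, and so on). The scores and schedules are plain <p> elements rather than table rows, so they would need a separate pass over ".//n:p".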
