mysteriousmonkey29

trouble parsing just img src out of RSS feed?


Hello, I am trying to create an RSS reader based on this example:

 

http://www.w3schools.com/php/php_ajax_rss_reader.asp

 

Specifically, I am trying to modify this example so that the reader displays all the available comic images (and nothing else) from any given webcomic RSS feed. I realize the code may need to be at least a little site-specific, but I am trying to keep it as general-purpose as possible. So far, I have modified the example into a reader that displays all the comics from a given list of RSS feeds. However, it also displays other unwanted text that I am trying to get rid of. Here is my code so far, with a few feeds that are giving me trouble in particular:

 

index.php file:

 

<html>
<head>
<script>
function showRSS()
{
if (window.XMLHttpRequest)
{
// code for IE7+, Firefox, Chrome, Opera, Safari
xmlhttp=new XMLHttpRequest();
} else
{ // code for IE6, IE5
xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
}
xmlhttp.onreadystatechange=function()
{
if (xmlhttp.readyState==4 && xmlhttp.status==200)
{
document.getElementById("rssOutput").innerHTML=xmlhttp.responseText;
}
}
xmlhttp.open("GET","logger.php",true);
xmlhttp.send();
}
</script>
</head>
<body onload="showRSS()">
<div id="rssOutput"></div>
</body>
</html>
(pretty sure there's nothing wrong with this file; I think the problems arise in the next one although I included this one for completeness)
logger.php:
<?php
//function to get all comics from an rss feed
function getComics($xml)
{
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
foreach ($x as $x)
{
$comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
//output the comic
echo ($comic_image . "</p>");
echo ("<br>");
}
}
//create array of all RSS feed URLs
$URLs =
[
"garfieldMinusGarfield" => "http://garfieldminusgarfield.net/rss",
];
//Loop through all RSS feeds
foreach ($URLs as $xml)
{
getComics($xml);
}
?>
Because this method includes extra text in between the comic images (a lot of random stuff with SMBC, just a few advertisement links for gMg, and a copyright link for baby blues), I looked at the RSS feeds and concluded that the problem is that it's the description tag that includes the image source, but also includes other stuff. Next, I tried modifying the getComics function to scan directly for the image tag, rather than first looking for the description tag. I replaced the part in between the DOMDocument creation/loading and the URL list with:
$images=$xmlDoc->getElementsByTagName('img');
print_r($images);
foreach ($images as $image)
{
//echo $image->item(0)->getAttribute('src');
echo $image->item(0)->nodeValue;
echo ("<br>");
}
but apparently getElementsByTagName doesn't pick up the image tag embedded inside the description tag, because I get no comic images outputted, and the following output from the print_r statement:
DOMNodeList Object ( [length] => 0 ) DOMNodeList Object ( [length] => 0 )
Finally, I tried a combination of the two methods, using getElementsByTagName('img') inside the code that parses out the description tag contents. I replaced the line:
$comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
with:
$comic_image=$x->getElementsByTagName('description')->item(0)->getElementsByTagName('img');
print_r($comic_image);
But this also finds nothing, producing the output:
DOMNodeList Object ( [length] => 0 )
So sorry for the really long background, but I'm wondering if there is a way to parse just the img src out of a given RSS feed without the other text and links I don't want?
Help would be much appreciated

 


In most RSS feeds, the HTML inside the description is escaped and is not part of the DOM structure; it's just a string. You need to pull out that string and parse it separately.

 

When it is part of the DOM it's in a different namespace, so you need to use getElementsByTagNameNS and specify the namespace.
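
To illustrate both cases, here is a minimal sketch. The inline feed, its URLs, and the helper name imgSrcsFromHtml are made up for illustration and are not taken from any real comic feed. It assumes the description holds escaped HTML and that content:encoded is declared in the usual RSS content module namespace.

```php
<?php
// Hypothetical sample feed: escaped HTML in <description>, CDATA in <content:encoded>.
$rss = <<<'XML'
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Example strip</title>
      <description>&lt;p&gt;Ad text&lt;/p&gt;&lt;img src="http://example.com/a.png"&gt;</description>
      <content:encoded><![CDATA[<p><img src="http://example.com/b.png"></p>]]></content:encoded>
    </item>
  </channel>
</rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($rss);
$item = $feed->getElementsByTagName('item')->item(0);

// Escaped or CDATA markup is plain text in the feed's DOM, so no img elements exist here.
$imgsInFeedDom = $feed->getElementsByTagName('img')->length;

// Hypothetical helper: parse an HTML string and return every img src found in it.
function imgSrcsFromHtml(string $html): array
{
    $srcs = [];
    if (trim($html) === '') {
        return $srcs;
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect HTML warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    foreach ($doc->getElementsByTagName('img') as $img) {
        $srcs[] = $img->getAttribute('src');
    }
    return $srcs;
}

// Case 1: nodeValue hands back the description as an unescaped HTML string.
$description = $item->getElementsByTagName('description')->item(0)->nodeValue;
$fromDescription = imgSrcsFromHtml($description);

// Case 2: content:encoded is looked up by namespace URI plus local name.
$contentNs = 'http://purl.org/rss/1.0/modules/content/';
$encoded = $item->getElementsByTagNameNS($contentNs, 'encoded')->item(0)->nodeValue;
$fromEncoded = imgSrcsFromHtml($encoded);

echo $imgsInFeedDom, "\n";
echo implode("\n", array_merge($fromDescription, $fromEncoded)), "\n";
```

Either way, the image only becomes reachable once the string has been handed to a second parser; the feed's own DOM never contains an img node.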


Thanks, you were correct. Here is the code I ended up with. I replaced everything inside the foreach loop with a call to a getImageSrc function, which in turn calls a getImageTag function:

 

//function to find an image tag within a specific section if there is one
function getImageTag ($item,$tagName)
{
//pull desired section from given item
$section = $item->getElementsByTagName($tagName)->item(0);
//if the item has no such section (or it is empty), there is no image tag to find
if (is_null($section) || trim($section->nodeValue) === '')
{
return null;
}
//the HTML inside the section is escaped text rather than DOM nodes, so getElementsByTagName can't reach the img tag directly; decode it and parse it as a separate HTML document
$decoded_section = htmlspecialchars_decode($section->nodeValue);
$section_xml = new DOMDocument();
@$section_xml->loadHTML($decoded_section); //the @ is to suppress a bunch of warnings about characters this parser doesn't like
//pull image tag from section if there
$image_tag = $section_xml->getElementsByTagName('img')->item(0);
return $image_tag;
}
//function to get the image source URL from a given item
function getImageSrc ($item)
{
$image_tag = getImageTag($item,'description');
if (is_null($image_tag)) //if there was nothing with the tag name of image in the description section
{
//check in content:encoded section, because that's the next most likely place
$image_tag = getImageTag($item,'encoded');
if (is_null($image_tag)) //if there was nothing with the tag name of image in the encoded content section
{
//if the program gets here, it's probably because the feed is crap and doesn't include images,
//or it's because this particular item doesn't have a comic image in it
$image_src = '';
//THIS EXCEPTION WILL PROBABLY NEED TO BE HANDLED LATER TO AVOID POTENTIAL ERRORS
} else
{
$image_src = $image_tag->getAttribute('src');
}
} else
{
$image_src = $image_tag->getAttribute('src');
}
return $image_src;
}
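
For anyone wiring this into the feed loop, here is one possible sketch of the calling side. It is a condensed variant rather than the code above verbatim: firstImageSrc is a hypothetical stand-in for getImageSrc, and the two-item feed is invented. It assumes that items without an image should simply be skipped, which is one way to deal with the empty-string case flagged in the comments.

```php
<?php
// Hypothetical two-item feed; the second item has no image at all.
$rss = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <description>&lt;a href="http://example.com/ad"&gt;Ad&lt;/a&gt;&lt;img src="http://example.com/comic1.png"&gt;</description>
    </item>
    <item>
      <description>Text-only news post</description>
    </item>
  </channel>
</rss>
XML;

// Condensed stand-in for getImageSrc(): try each tag name in order and
// return the first img src found, or '' when the item has no image.
function firstImageSrc(DOMElement $item, array $tagNames): string
{
    foreach ($tagNames as $tagName) {
        $section = $item->getElementsByTagName($tagName)->item(0);
        if ($section === null || trim($section->nodeValue) === '') {
            continue; // section missing or empty, try the next tag name
        }
        $html = new DOMDocument();
        @$html->loadHTML($section->nodeValue); // @ hides warnings about sloppy HTML
        $img = $html->getElementsByTagName('img')->item(0);
        if ($img !== null && $img->getAttribute('src') !== '') {
            return $img->getAttribute('src');
        }
    }
    return '';
}

$feed = new DOMDocument();
$feed->loadXML($rss);

$output = '';
foreach ($feed->getElementsByTagName('item') as $item) {
    $src = firstImageSrc($item, ['description', 'encoded']);
    if ($src === '') {
        continue; // skip items without a comic instead of printing a broken image
    }
    // Escape the URL before putting it into an HTML attribute.
    $output .= '<img src="' . htmlspecialchars($src, ENT_QUOTES) . '"><br>' . "\n";
}
echo $output;
```

Building the img tag from the extracted src, instead of echoing the feed's own markup, is what keeps the ads and other links out of the page.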
