
trouble parsing just img src out of RSS feed?


mysteriousmonkey29


Hello, I am trying to create an RSS reader based on this example:

 

http://www.w3schools.com/php/php_ajax_rss_reader.asp

 

Specifically, I am attempting to modify this example so that the reader will access and display all the available comic images (and nothing else) from any given web comic RSS feed. I realize that it may be necessary to make the code at least a little site-specific, but I am trying to make it as general-purpose as possible. Currently, I have modified the initial example to produce a reader that displays all the comics from a given list of RSS feeds. However, it also displays other unwanted text that I am trying to get rid of. Here is my code so far, with a few feeds that are giving me trouble in particular:

 

index.php file:

 

<html>
<head>
<script>
function showRSS()
{
    if (window.XMLHttpRequest)
    {
        // code for IE7+, Firefox, Chrome, Opera, Safari
        xmlhttp=new XMLHttpRequest();
    } else
    {
        // code for IE6, IE5
        xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
    }
    xmlhttp.onreadystatechange=function()
    {
        if (xmlhttp.readyState==4 && xmlhttp.status==200)
        {
            document.getElementById("rssOutput").innerHTML=xmlhttp.responseText;
        }
    }
    xmlhttp.open("GET","logger.php",true);
    xmlhttp.send();
}
</script>
</head>
<body onload="showRSS()">
<div id="rssOutput"></div>
</body>
</html>
(I'm pretty sure there's nothing wrong with this file; I think the problems arise in the next one, but I included this one for completeness.)
logger.php:
<?php
// function to get all comics from an rss feed
function getComics($xml)
{
    $xmlDoc = new DOMDocument();
    $xmlDoc->load($xml);
    // keep the node list in its own variable instead of reusing $x for both list and element
    $items=$xmlDoc->getElementsByTagName('item');
    foreach ($items as $x)
    {
        $comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
        // output the comic
        echo ($comic_image . "</p>");
        echo ("<br>");
    }
}
// create array of all RSS feed URLs
$URLs =
[
    "garfieldMinusGarfield" => "http://garfieldminusgarfield.net/rss",
];
// loop through all RSS feeds
foreach ($URLs as $xml)
{
    getComics($xml);
}
?>
Because this method includes extra text in between the comic images (a lot of random stuff with SMBC, just a few advertisement links for gMg, and a copyright link for baby blues), I looked at the RSS feeds and concluded that the problem is that it's the description tag that includes the image source, but also includes other stuff. Next, I tried modifying the getComics function to scan directly for the image tag, rather than first looking for the description tag. I replaced the part in between the DOMDocument creation/loading and the URL list with:
$images=$xmlDoc->getElementsByTagName('img');
print_r($images);
foreach ($images as $image)
{
    // each $image is already a single DOMElement, so read its src attribute directly
    echo $image->getAttribute('src');
    echo ("<br>");
}
but apparently getElementsByTagName doesn't pick up the image tag embedded inside the description tag, because I get no comic images outputted, and the following output from the print_r statement:
DOMNodeList Object ( [length] => 0 ) DOMNodeList Object ( [length] => 0 )
Finally, I tried a combination of the two methods, using getElementsByTagName('img') inside the code that parses out the description tag contents. I replaced the line:
$comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
with:
$comic_image=$x->getElementsByTagName('description')->item(0)->getElementsByTagName('img');
print_r($comic_image);
But this also finds nothing, producing the output:
DOMNodeList Object ( [length] => 0 )
Sorry for the really long background, but I'm wondering if there is a way to parse just the img src out of a given RSS feed, without the other text and links I don't want?
Help would be much appreciated.

 


In most RSS feeds, the HTML inside the description is escaped and is not part of the DOM structure; it's just a string. You need to pull out that string and parse it separately.

 

When the HTML is part of the DOM, it's in a different namespace, so you need to use getElementsByTagNameNS and specify the namespace.
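
The two cases above can be sketched roughly as follows. This is only an illustration, not tested against any real feed: the inline sample feed, its URLs, and the helper name extractImageSrcs are all made up for the example, and the XHTML namespace is just one namespace a feed might use for embedded markup.

```php
<?php
// Hypothetical two-item feed; the URLs and markup are made up for illustration.
$feed = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <description>&lt;p&gt;Ad text&lt;/p&gt;&lt;img src="http://example.com/a.png"&gt;</description>
    </item>
    <item>
      <description><div xmlns="http://www.w3.org/1999/xhtml"><img src="http://example.com/b.png"/></div></description>
    </item>
  </channel>
</rss>
XML;

// Hypothetical helper: parse an HTML string and return the src of every <img> in it.
function extractImageSrcs(string $html): array
{
    $srcs = [];
    if (trim($html) === '') {
        return $srcs; // nothing to parse
    }
    $fragment = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $fragment->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    foreach ($fragment->getElementsByTagName('img') as $img) {
        $srcs[] = $img->getAttribute('src');
    }
    return $srcs;
}

$doc = new DOMDocument();
$doc->loadXML($feed);
$items = $doc->getElementsByTagName('item');

// Case 1: escaped HTML. The description holds only text, so a DOM search finds no <img>.
$first = $items->item(0)->getElementsByTagName('description')->item(0);
$directHits = $first->getElementsByTagName('img')->length;
$fromDescription = extractImageSrcs($first->textContent);

// Case 2: real elements in the XHTML namespace, found with getElementsByTagNameNS.
$fromXhtml = [];
$xhtmlNs = 'http://www.w3.org/1999/xhtml';
foreach ($items->item(1)->getElementsByTagNameNS($xhtmlNs, 'img') as $img) {
    $fromXhtml[] = $img->getAttribute('src');
}

echo $directHits, "\n";                  // 0
echo implode(',', $fromDescription), "\n"; // http://example.com/a.png
echo implode(',', $fromXhtml), "\n";       // http://example.com/b.png
```

Reading textContent already undoes the XML escaping, so the string handed to loadHTML is ordinary HTML.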


Thanks, you were correct. Here is the code I ended up with. I replaced everything inside the foreach loop with just a call to a getImageSrc function, which in turn calls a getImageTag function:

 

// function to find an image tag within a specific section of an item, if there is one
function getImageTag($item, $tagName)
{
    // pull desired section from given item
    $section = $item->getElementsByTagName($tagName)->item(0);
    if (is_null($section) || trim($section->nodeValue) === '')
    {
        // the item has no such section (or it is empty), so there is nothing to parse
        return null;
    }
    // the HTML inside the section is escaped text rather than DOM nodes, so
    // getElementsByTagName can't reach the image directly; reparse the text as HTML
    $decoded_section = htmlspecialchars_decode($section->nodeValue);
    $section_xml = new DOMDocument();
    @$section_xml->loadHTML($decoded_section); // the @ is to suppress a bunch of warnings about characters this parser doesn't like
    // pull the first image tag from the section, if there is one
    return $section_xml->getElementsByTagName('img')->item(0);
}
// function to get the image source URL from a given item
function getImageSrc($item)
{
    $image_tag = getImageTag($item, 'description');
    if (is_null($image_tag))
    {
        // nothing in the description, so check the content:encoded section, the next most likely place
        $image_tag = getImageTag($item, 'encoded');
    }
    if (is_null($image_tag))
    {
        // if the program gets here, it's probably because the feed doesn't include images,
        // or because this particular item doesn't have a comic image in it
        // THIS CASE WILL PROBABLY NEED TO BE HANDLED LATER TO AVOID POTENTIAL ERRORS
        return '';
    }
    return $image_tag->getAttribute('src');
}
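
One possible way to handle the empty-string case when printing the results is sketched below. The helper name renderComicImages and the example URLs are hypothetical; the sketch assumes the image URLs have already been collected into an array, with '' standing for items where no image was found.

```php
<?php
// Hypothetical helper: turn a list of image URLs into <img> markup,
// skipping the empty strings returned for items without an image.
function renderComicImages(array $imageSrcs): string
{
    $html = '';
    foreach ($imageSrcs as $src) {
        if ($src === '') {
            continue; // this item had no comic image
        }
        // Escape the URL so characters such as & and " cannot break the attribute.
        $html .= '<img src="' . htmlspecialchars($src, ENT_QUOTES) . '"><br>' . "\n";
    }
    return $html;
}

// Example URLs are made up for illustration.
$output = renderComicImages([
    'http://example.com/a.png',
    '',
    'http://example.com/b.png?x=1&y=2',
]);
echo $output;
```

Building the markup from the src alone keeps the surrounding ad links and copyright text out of the page, which was the original goal.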