
Picking out Information


dcole.ath.cx


I'm working on a crawler-like script. It first opens a page and puts the source code into a variable. I want to pick out the URLs that have <h3><a href=" in front of them and put them into a list. Is there a way to pick the information out of the variable without having to break it up and sift through it? Or what would be the best way for me to do this? Example:

$example = '<h1>hi</h1><h3><a href="http://example.com">Example Site</a></h3> blah blah blah'; // page source
$link = 'http://example.com'; // the value I want to pick out

Maybe the best way would be to use strpos() to find the beginning and end of the URL, then use substr() to get the URL.
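For reference, the strpos()/substr() idea might look roughly like the sketch below. The function name extractLinksWithStrpos is hypothetical and not from the thread. It assumes the marker appears exactly as <h3><a href=" with no extra whitespace or attributes in between.

```php
<?php
// Hypothetical helper (not from the thread): collect every URL that directly
// follows the literal marker <h3><a href=" using only string functions.
function extractLinksWithStrpos(string $html): array
{
    $marker = '<h3><a href="';
    $links  = [];
    $offset = 0;

    // Find each occurrence of the marker, resuming after the previous URL.
    while (($start = strpos($html, $marker, $offset)) !== false) {
        $start += strlen($marker);          // first character of the URL
        $end = strpos($html, '"', $start);  // closing quote of the attribute
        if ($end === false) {
            break;                          // unterminated attribute: stop
        }
        $links[] = substr($html, $start, $end - $start);
        $offset  = $end + 1;
    }

    return $links;
}

$example = '<h1>hi</h1><h3><a href="http://example.com">Example Site</a></h3> blah blah blah';
print_r(extractLinksWithStrpos($example));
```

This works for the exact marker only. Any variation in the markup, such as a space or a class attribute before href, would need extra handling.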


Use a regular expression; it will be a lot faster than the string functions.

<?php
$example = "<h1>hi</h1><h3><a href=\"http://example.com\">Example Site</a></h3> blah blah blah";
preg_match('@<h3><a href="([^"]+)"([^>]*)>@i', $example, $matches);
print_r($matches);
?>
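Note that preg_match() stops at the first match, while the original question asks for a list. A minimal sketch using preg_match_all() follows. The second link in the sample string is made up for illustration, and the unused second capture group is dropped.

```php
<?php
// Hypothetical sample with two matching links; only the first URL is from the thread.
$example = '<h1>hi</h1><h3><a href="http://example.com">Example Site</a></h3>'
         . '<h3><a href="http://example.org/page">Another</a></h3> blah blah blah';

// Same pattern as in the post, minus the unused second capture group.
$count = preg_match_all('@<h3><a href="([^"]+)"[^>]*>@i', $example, $matches);

// $matches[0] holds the full matches; $matches[1] holds the captured URLs.
$links = $count ? $matches[1] : [];
print_r($links);
```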


Thanks. So I want to pull out URLs that have more than the domain name, something such as:

try"><h3><a href="http://sub.example.com/~d/Folder/~3/123456789/">Example...

There may not be white space before <a href..., it may have a subdomain, and it may have many folders that could contain any acceptable URL character.

I'm currently messing around with something such as:

preg_match('%(<h3><a href="http)(.*?)(">)%is', $line, $matches);

My thoughts behind it were: it begins with <h3><a href="http, then has *anything*, then ends with ">. What should the match phrase look like?
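As a rough comparison, the sketch below runs the lazy pattern from this post against the sample line, next to one possible alternative that captures the whole URL in a single group. The alternative pattern is an assumption about what is wanted, not a definitive answer. It allows optional whitespace after <h3> and accepts both http and https.

```php
<?php
// Sample line taken from the post above.
$line = 'try"><h3><a href="http://sub.example.com/~d/Folder/~3/123456789/">Example...';

// The lazy pattern from the post: the URL ends up split across groups 1 and 2.
preg_match('%(<h3><a href="http)(.*?)(">)%is', $line, $lazy);
$lazyUrl = 'http' . $lazy[2];

// Alternative sketch: one capture group; [^"]+ cannot run past the closing quote.
preg_match('%<h3>\s*<a\s+href="(https?://[^"]+)"%i', $line, $strict);
$strictUrl = $strict[1];

// Both patterns should yield the same URL for this sample line.
var_dump($lazyUrl === $strictUrl);
```

One difference to keep in mind: the lazy version needs "> right after the URL, so if the anchor carries further attributes after href, the (.*?) group keeps going until the next "> and picks them up as well. The [^"]+ version stops at the first closing quote.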


If you have the DOM extension, you can just do:

$dom = new DOMDocument;
$dom->loadHTML($example);
$xpath = new DOMXPath($dom);
$links = $xpath->query('//h3/a/@href');

// The list is at $links. Now we just visualize it
echo '<ul>';
for ($i = 0, $l = $links->length; $i < $l; $i++) {
    echo '<li>', $links->item($i)->nodeValue, '</li>';
}
echo '</ul>';
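Since the original goal was a plain list of URLs, the DOM approach could be wrapped along these lines. The function name extractH3Links is hypothetical. Crawled pages are often not well-formed, so this sketch also silences libxml's parse warnings, which is an addition not mentioned in the thread.

```php
<?php
// Hypothetical wrapper around the DOM approach; the function name is illustrative.
function extractH3Links(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them,
    // then restore the previous setting once the document is loaded.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // '//h3/a/@href' selects the href attribute of each <a> directly inside an <h3>.
    foreach ($xpath->query('//h3/a/@href') as $href) {
        $links[] = $href->nodeValue;
    }

    return $links;
}

$example = '<h1>hi</h1><h3><a href="http://example.com">Example Site</a></h3> blah blah blah';
print_r(extractH3Links($example));
```

Unlike the regular expression, the XPath query does not depend on whitespace or attribute order inside the tags.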


Archived

This topic is now archived and is closed to further replies.
