dcole.ath.cx Posted February 8, 2008

I'm working on a crawler-like script. It first opens a page and puts the source code into a variable. I want to pick out the URLs that have <h3><a href=" in front of them and put them into a list. Is there a way to pick the information out of the variable without having to break it up and sift through it? Or what would be the best way for me to do this?

Example:
example = "<h1>hi</h1><h3><a href="http://example.com">Example Site</a></h3> blah blah blah";
link = "http://example.com";

Maybe the best way would be to use strpos() to find the beginning and end of the URL, then use substr() to get the URL.
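For reference, the strpos()/substr() idea from the post above could look roughly like the sketch below. The function name extractH3Links is hypothetical (it does not appear in the thread), and the type declarations assume PHP 7 or later. It only handles the exact prefix <h3><a href=" with double quotes.

```php
<?php
// Hypothetical helper (not from the thread): collect every URL that directly
// follows the literal prefix <h3><a href=" using only strpos() and substr().
function extractH3Links(string $html): array
{
    $prefix = '<h3><a href="';
    $links  = [];
    $offset = 0;

    // strpos() returns false once there is no further occurrence.
    while (($start = strpos($html, $prefix, $offset)) !== false) {
        $start += strlen($prefix);          // first character of the URL
        $end = strpos($html, '"', $start);  // closing quote of the href
        if ($end === false) {
            break;                          // unterminated attribute: stop
        }
        $links[] = substr($html, $start, $end - $start);
        $offset = $end + 1;                 // keep searching after this link
    }

    return $links;
}

$example = '<h1>hi</h1><h3><a href="http://example.com">Example Site</a></h3> blah blah blah';
print_r(extractH3Links($example));
```

This works, but every variation in the markup (different case, extra whitespace, single quotes) needs its own handling, which is where a pattern-based or DOM-based approach tends to be easier to maintain.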
justsomeguy Posted February 9, 2008

Use a regular expression; it will be a lot faster than the string functions.

<?php
$example = "<h1>hi</h1><h3><a href=\"http://example.com\">Example Site</a></h3> blah blah blah";
// Capture group 1 is the URL; group 2 is any remaining attributes of the <a> tag.
preg_match('@<h3><a href="([^"]+)"([^>]*)>@i', $example, $matches);
print_r($matches);
?>
dcole.ath.cx Posted February 9, 2008

Thanks. So I want to pull out URLs that have more than the domain name, something such as:

try"><h3><a href="http://sub.example.com/~d/Folder/~3/123456789/">Example...

There may not be whitespace before <a href..., it may have a subdomain, and it may have many folders that could contain any acceptable URL character.

I'm currently messing around with something such as:

preg_match('%(<h3><a href="http)(.*?)(">)%is', $line, $matches);

My thoughts behind it were: it begins with <h3><a href="http, then has *anything*, then ends with ">.

What should the match pattern look like?
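As a rough illustration of how the pattern in the post above behaves, the sketch below runs it with preg_match_all() to build a list. The contents of $line are made up for the example (the anchor text and the second link are assumptions, not taken from the thread).

```php
<?php
// Sample input: invented for this sketch, loosely based on the post.
$line = 'try"><h3><a href="http://sub.example.com/~d/Folder/~3/123456789/">Example</a></h3>'
      . '<h3><a href="http://example.com/other/">Other</a></h3>';

// Same pattern as in the post: literal prefix, lazy "anything", then ">.
// Group 1 is the prefix, group 2 the rest of the URL, group 3 the closing ">.
preg_match_all('%(<h3><a href="http)(.*?)(">)%is', $line, $matches, PREG_SET_ORDER);

$urls = [];
foreach ($matches as $m) {
    // The pattern consumes "http" as part of group 1, so put it back.
    $urls[] = 'http' . $m[2];
}

print_r($urls);
```

One caveat with the lazy (.*?) form: it stops only at the first "> sequence, so if the <a> tag carries further attributes after href, group 2 will include them. A negated class such as [^"]+ stops at the closing quote of the href regardless of what follows.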
boen_robot Posted February 9, 2008

If you have the DOM extension, you can just do:

$dom = new DOMDocument;
$dom->loadHTML($example);
$xpath = new DOMXPath($dom);
$links = $xpath->query('//h3/a/@href');
// The list is in $links. Now we just display it.
echo '<ul>';
for ($i = 0, $l = $links->length; $i < $l; $i++) {
    echo '<li>', $links->item($i)->nodeValue, '</li>';
}
echo '</ul>';
dcole.ath.cx Posted February 9, 2008

Well, I got the preg_match to work. It turns out I was looking at the wrong source code; I wanted to read XML... so my pattern is:

@(<link>.{5,150}</link>)@isx

and it works for what I'm doing.
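A small sketch of how the <link> pattern above behaves on a feed-like snippet. The XML in $xml is invented for illustration. Because .{5,150} is greedy and the s modifier lets the dot match newlines, two <link> elements that sit close together can be swallowed into a single match; the lazy variant shown second is one possible adjustment, not something proposed in the thread.

```php
<?php
// Sample input: invented for this sketch.
$xml = "<item><title>One</title><link>http://example.com/one</link></item>\n"
     . "<item><title>Two</title><link>http://example.com/two</link></item>";

// Pattern from the post: greedy quantifier, whole element captured.
$greedyCount = preg_match_all('@(<link>.{5,150}</link>)@isx', $xml, $greedy);

// Assumed variant: lazy quantifier, and only the URL inside the capture group.
$lazyCount = preg_match_all('@<link>(.{5,150}?)</link>@is', $xml, $lazy);

echo "greedy matches: $greedyCount\n";
echo "lazy matches:   $lazyCount\n";
print_r($lazy[1]);
```

If each <link> element in the real feed is on its own line and far enough from the next one, the original pattern gives the same matches; the difference only shows up when elements are within 150 characters of each other.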
justsomeguy Posted February 11, 2008

The pattern I gave will work for the h3/a tags. The first captured group ($matches[1]) will be everything after href=" up to the next double quote, because [^"]+ matches any character that is not a double quote.
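To make the description above concrete, here is a minimal sketch that applies that pattern to the longer URL from earlier in the thread. The class="title" attribute and the anchor text are assumptions added only to show what ends up in the second group.

```php
<?php
// Sample input: the URL is from the thread, the class attribute is invented.
$example = '<h3><a href="http://sub.example.com/~d/Folder/~3/123456789/" class="title">Example</a></h3>';

$url  = null;
$rest = null;
if (preg_match('@<h3><a href="([^"]+)"([^>]*)>@i', $example, $matches)) {
    $url  = $matches[1]; // everything between the quotes of href
    $rest = $matches[2]; // whatever else is inside the <a ...> tag
    echo $url, "\n";
}
```

Here $matches[0] holds the whole matched text (both opening tags), so the URL on its own is always read from $matches[1].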