dcole.ath.cx

Picking out Information

I'm working on a crawler-like script. It first opens a page and puts the source code into a variable. I want to pick out the URLs that have <h3><a href=" in front of them and put them into a list. Is there a way to pick the information out of the variable without having to break it up and sift through it? Or what would be the best way for me to do this? Example:

$example = '<h1>hi</h1><h3><a href="http://example.com">Example Site</a></h3> blah blah blah';
$link = "http://example.com";

Maybe the best way would be to use strpos() to find the beginning and end of the URL, then use substr() to get the URL.
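A rough sketch of that strpos()/substr() idea, assuming the marker is always exactly <h3><a href=" and that each URL ends at the next double quote. The function name extract_links_strpos is made up for illustration, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: collect every URL that follows the exact marker <h3><a href="
function extract_links_strpos(string $html): array
{
    $marker = '<h3><a href="';
    $links = [];
    $offset = 0;

    // Walk through the string, jumping from one marker to the next.
    while (($start = strpos($html, $marker, $offset)) !== false) {
        $start += strlen($marker);          // first character of the URL
        $end = strpos($html, '"', $start);  // closing quote of the href
        if ($end === false) {
            break;                          // unterminated attribute, stop
        }
        $links[] = substr($html, $start, $end - $start);
        $offset = $end + 1;                 // continue after this URL
    }

    return $links;
}

$example = '<h1>hi</h1><h3><a href="http://example.com">Example Site</a></h3> blah blah blah';
print_r(extract_links_strpos($example));
```

This only works while the markup matches the marker character for character; extra whitespace or a different attribute order would be missed.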


Use a regular expression; it will be a lot faster than the string functions.

<?php
$example = "<h1>hi</h1><h3><a href=\"http://example.com\">Example Site</a></h3> blah blah blah";
// $matches[1] will hold the URL, $matches[2] any other attributes of the <a> tag
preg_match('@<h3><a href="([^"]+)"([^>]*)>@i', $example, $matches);
print_r($matches);
?>
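Since the goal is a list of URLs rather than a single one, the same idea can be sketched with preg_match_all(), which collects every occurrence. The two-link input string below is made up for illustration, and the second capture group has been dropped because only the URL is needed here.

```php
<?php
// Illustrative input with two matching links.
$example = '<h1>hi</h1><h3><a href="http://example.com">Example Site</a></h3> blah '
         . '<h3><a href="http://sub.example.com/~d/Folder/~3/123456789/">Other</a></h3>';

// preg_match_all() finds every occurrence instead of stopping at the first.
preg_match_all('@<h3><a href="([^"]+)"[^>]*>@i', $example, $matches);

// $matches[0] holds the full matches, $matches[1] the captured URLs.
$links = $matches[1];
print_r($links);
```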


Thanks. I want to pull out URLs that have more than just the domain name, something such as:

try"><h3><a href="http://sub.example.com/~d/Folder/~3/123456789/">Example...

There may not be white space before <a href..., the URL may have a subdomain, and it may have many folders that could contain any acceptable URL character.

I'm currently messing around with something such as:

preg_match('%(<h3><a href="http)(.*?)(">)%is', $line, $matches);

My thoughts behind it were: it begins with <h3><a href="http, then has *anything*, then ends with ">

What should the match pattern look like?
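For reference, a small sketch of what that attempt captures. Because "http" sits in its own group, the second group holds only the rest of the URL, and the trailing "> means the href has to be the last attribute in the tag. The second pattern is one possible variation, not the only one: a single group around the whole URL, scheme included. The variable names $rest and $url are illustrative.

```php
<?php
$line = 'try"><h3><a href="http://sub.example.com/~d/Folder/~3/123456789/">Example';

// Original attempt: "http" is captured separately from the rest of the URL.
preg_match('%(<h3><a href="http)(.*?)(">)%is', $line, $m);
$rest = $m[2];

// Variation: one group around the whole URL, stopping at the closing quote.
preg_match('%<h3><a href="(http[^"]*)"%i', $line, $m2);
$url = $m2[1];

echo $rest, "\n", $url, "\n";
```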


If you have the DOM extension, you can just do:

$dom = new DOMDocument;
$dom->loadHTML($example);
$xpath = new DOMXPath($dom);
$links = $xpath->query('//h3/a/@href');
// The list is at $links. Now we just visualize it
echo '<ul>';
for ($i = 0, $l = $links->length; $i < $l; $i++) {
    echo '<li>', $links->item($i)->nodeValue, '</li>';
}
echo '</ul>';
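If a plain PHP array is more convenient than a DOMNodeList, the DOM approach might be adapted along these lines. This is a sketch that assumes the DOM extension is loaded; the libxml_use_internal_errors() call is an addition not in the post above, included because crawled pages are often invalid HTML and would otherwise trigger parser warnings.

```php
<?php
$example = '<h1>hi</h1><h3><a href="http://example.com">Example Site</a></h3> blah blah blah';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($example);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Copy each href attribute value into an ordinary array.
$links = [];
foreach ($xpath->query('//h3/a/@href') as $href) {
    $links[] = $href->nodeValue;
}
print_r($links);
```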


Well, I got the preg_match to work. It turns out I was looking at the wrong source code; I wanted to read XML. So my pattern is: @(<link>.{5,150}</link>)@isx and it works for what I'm doing.
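One caveat with .{5,150}: the quantifier is greedy, so when two <link> elements sit within 150 characters of each other, a single match can swallow both. A possible alternative, sketched here on a made-up feed fragment, is to match anything that is not a < between the tags:

```php
<?php
// Hypothetical feed fragment, for illustration only.
$xml = '<item><link>http://example.com/one</link></item>'
     . '<item><link>http://example.com/two</link></item>';

// [^<]+ cannot run past the closing tag, so neighbouring <link> elements stay separate.
preg_match_all('@<link>([^<]+)</link>@i', $xml, $matches);
$links = $matches[1];
print_r($links);
```

This assumes the link text is plain and not wrapped in CDATA or nested tags.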


The pattern I gave will work for the h3/a tags. The match it returns will be everything from href=" to the next double quote, because [^"]+ matches any character that is not a double quote.
