danposs86 Posted April 26, 2007
I need to do some web scraping for a project. Just wondered if anyone on here has used this (or a similar) technique. If so, it would be a great help if you could point me in the right direction, as I don't have a clue where to start; maybe a simple tutorial website or the like. Thanks loads, Dan. (PS: if this is in the wrong section, I'm sorry; feel free to move it.)
real_illusions Posted April 26, 2007
http://en.wikipedia.org/wiki/Web_scraping
danposs86 Posted April 26, 2007 (Author)
I have already read that; it tells me about it. I am after some code or something that I can look at and try to understand, so I can make my own scraping site.
justsomeguy Posted April 26, 2007
Do you already have a programming language picked out?
aspnetguy Posted April 26, 2007
I have scraped/crawled websites in ASP.NET (C#), if you decide to use that language.
danposs86 Posted April 26, 2007 (Author)
I've only really used PHP before, so PHP then, lol.
pulpfiction Posted April 26, 2007
A Google search gave this: http://phpclasses.goodphp.com/browse/package/1754.html
boen_robot Posted April 26, 2007
There is also the PEAR HTTP Request class. Documentation is also available.
danposs86 Posted April 29, 2007 (Author)
I have found this code on fluidthoughts:

    <html>
    <head>
      <title>Ann Arbor Weather</title>
      <link rel="stylesheet" href="styles.css" type="text/css">
    </head>
    <body bgcolor="#ffffff">
    <p><a href="http://www.crh.noaa.gov/forecasts/MIZ075.php">http://www.crh.noaa.gov/forecasts/MIZ075.php</a></p>
    <table cellpadding="7" cellspacing="0" border="0">
    <tr>
      <td valign="top">
        <img src="http://www.bmcmedia.net/webcam/bmccam.jpg" width="352" height="288" alt="" border="0" />
        <br /><br />
        <img src="http://weather.yahoo.com/images/northeast_sat_440x297.jpg" width="440" height="297" alt="" border="0" />
      </td>
      <td>
    <?php
    $src = 'http://www.wunderground.com/cgi-bin/findweather/getForecast?query=48103';
    $stop = 0;
    $start = 1;
    $fp = fopen($src, "r");
    while ((!feof($fp)) && (!$stop)) {
        $line = fgets($fp, 4096);
        if (preg_match("/Nowcast as of/", $line)) { $start = 0; }
        if (preg_match("/Forecast for Washtenaw/", $line)) { $start = 0; }
        if (preg_match("/Air Pollution/", $line)) { $stop = 1; }
        if (!$start && (preg_match("/smalltableheader/", $line))) { $stop = 1; }
        if (!$start) {
            if (preg_match("/<table /", $line)) {
                $stop = 1;
            } elseif (preg_match("/<\/?table[^>]*>/", $line)) {
                ;
            } else {
                $line = preg_replace("/<img src[^>]*>/", '', $line);
                $line = preg_replace("/<(\/)?td[^>]*>/", "<$1p>", $line);
                $line = preg_replace("/<\/?(tr[^>]*|font|center)>/", '', $line);
                $line = preg_replace("/<p><\/p>/", '', $line);
                echo $line;
            }
        }
    }
    fclose($fp);
    ?>
      </td>
    </tr>
    </table>
    </body>
    </html>

I have been trying to work out what each bit does, but it's getting the better of me. Can anyone who knows explain or comment the important lines, please? Brain's melting, lol.
reportingsjr Posted April 30, 2007
Well, it defines a variable with a web page to scrape, opens the page to get its HTML, then uses the Perl-compatible regular expression functions (preg_match, preg_replace) to find certain parts of the HTML/text. It stores the found text in a variable and echoes that variable. That's it.

I would actually recommend Python for this. I made a web crawler for links in PHP and it didn't turn out so well: it worked fine, but it could only crawl about 800 sites before quitting. The one I made in Python took about 1 day to make with almost no knowledge of Python beforehand (it's a lot like PHP). It also has a lot of nice libraries that you can use too. Although the one I'm using right now (BeautifulSoup) has a bug, and after crawling 50 or 60 websites it crashes. Still works though.
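
Since Python is recommended above, here is a minimal sketch of the same idea in Python using only the standard library: scan the page line by line, start collecting at a start marker, stop at a stop marker, and clean each kept line with regular expressions. The marker strings and patterns are copied from the PHP snippet quoted earlier in the thread; the names `extract_section` and `SAMPLE_PAGE`, and the sample HTML itself, are made up for illustration. It is a simplified rewrite, not a line-for-line port (for example, it stops before emitting the line that contains a stop marker).

```python
import re

# Marker strings taken from the PHP snippet quoted earlier in the thread.
START = re.compile(r"Nowcast as of|Forecast for Washtenaw")
STOP = re.compile(r"Air Pollution|smalltableheader")

# Clean-up patterns, also mirroring the PHP preg_replace calls.
TABLE_TAG = re.compile(r"</?table[^>]*>")
IMG_TAG = re.compile(r"<img src[^>]*>")
TD_TAG = re.compile(r"<(/)?td[^>]*>")
ROW_TAGS = re.compile(r"</?(tr[^>]*|font|center)>")


def extract_section(lines):
    """Return the cleaned lines between a start marker and a stop marker."""
    collecting = False
    kept = []
    for line in lines:
        if START.search(line):
            collecting = True          # wanted section begins on this line
        if not collecting:
            continue                   # still before the wanted section
        if STOP.search(line) or "<table " in line:
            break                      # reached the end of the wanted section
        if TABLE_TAG.search(line):
            continue                   # drop bare <table> / </table> lines
        line = IMG_TAG.sub("", line)             # remove images
        line = TD_TAG.sub(r"<\g<1>p>", line)     # <td> -> <p>, </td> -> </p>
        line = ROW_TAGS.sub("", line)            # remove tr/font/center tags
        line = line.replace("<p></p>", "")       # remove empty paragraphs
        kept.append(line)
    return kept


# Hypothetical sample input, standing in for the lines of a fetched page.
SAMPLE_PAGE = [
    "<html><body>",
    "<td>Ignored header</td>",
    "<td><b>Forecast for Washtenaw</b></td>",
    '<tr><td><img src="sun.gif">Sunny, high 70</td></tr>',
    "</table>",
    '<td class="smalltableheader">Air Pollution</td>',
    "<td>Never reached</td>",
]

if __name__ == "__main__":
    for cleaned in extract_section(SAMPLE_PAGE):
        print(cleaned)
```

To try it against a live page, the lines could come from something like `urllib.request.urlopen(url).read().decode("utf-8", errors="replace").splitlines()`. The markers will only match if the page still contains those exact strings, so expect to adjust them for whatever page you scrape.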
This topic is now archived and is closed to further replies.