
Web Scraping


danposs86


I need to do some web scraping for a project and wondered if anyone on here has used this (or a similar) technique. If so, it would be a great help if you could point me in the right direction, as I don't have a clue where to start; maybe a simple tutorial website or the like. Thanks loads, Dan. (PS: if this is in the wrong section, I'm sorry; feel free to move it.)


I have found this code on fluidthoughts:

<html>
<head>
	<title>Ann Arbor Weather</title>
	<link rel="stylesheet" href="styles.css" type="text/css">
</head>
<body bgcolor="#ffffff">
<p><a href="http://www.crh.noaa.gov/forecasts/MIZ075.php">http://www.crh.noaa.gov/forecasts/MIZ075.php</a></p>
<table cellpadding="7" cellspacing="0" border="0">
<tr>
	<td valign="top">
		<img src="http://www.bmcmedia.net/webcam/bmccam.jpg"
			width="352" height="288" alt="" border="0" />
		<br /><br />
		<img src="http://weather.yahoo.com/images/northeast_sat_440x297.jpg"
			width="440" height="297" alt="" border="0" />
	</td>
	<td>
<?php
	$src = 'http://www.wunderground.com/cgi-bin/findweather/getForecast?query=48103';
	$stop = 0;
	$start = 1;
	$fp = fopen ($src, "r");
	while ((!feof ($fp)) && (!$stop))
	{
		$line = fgets($fp, 4096);
		if (preg_match("/Nowcast as of/", $line)) { $start = 0; }
		if (preg_match("/Forecast for Washtenaw/", $line)) { $start = 0; }
		if (preg_match("/Air Pollution/", $line)) { $stop = 1; }
		if ( !$start && ( preg_match( "/smalltableheader/", $line )))
		{ $stop = 1; }
		if (!$start)
		{
			if ( preg_match( "/<table /", $line ))
			{ $stop = 1; }
			elseif ( preg_match( "/<\/?table[^>]*>/", $line )) {; }
			else
			{
				$line = preg_replace("/<img src[^>]*>/", '', $line);
				$line = preg_replace("/<(\/)?td[^>]*>/", "<$1p>", $line);
				$line = preg_replace("/<\/?(tr[^>]*|font|center)>/", '', $line);
				$line = preg_replace("/<p><\/p>/", '', $line);
				echo $line;
			}
		}
	}
	fclose($fp);
?>
	</td>
</tr>
</table>
</body>
</html>

I have been trying to work out what each bit does, but it's getting the better of me. Can anyone who knows explain or comment the important lines, please? My brain's melting, lol.


Well, it defines a variable holding the URL of the page to scrape, opens that page to read its HTML line by line, then uses PHP's Perl-compatible regular expression functions (preg_match and preg_replace) to find certain parts of the HTML/text. It keeps each matching line in a variable, strips some tags from it, and echoes that variable. That's it.

I would actually recommend Python for this. I made a web crawler for links in PHP and it didn't turn out so well: it worked fine, but it could only crawl about 800 sites before quitting. The one I made in Python took about a day to make with almost no knowledge of Python beforehand (it's a lot like PHP). It also has a lot of nice libraries that you can use too. Although the one I'm using right now (BeautifulSoup) has a bug, and after crawling 50 or 60 websites it crashes. Still works though :).
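To make that explanation a bit more concrete, here is a rough Python sketch of the same idea: walk through the page line by line, start keeping lines once a "start" marker is seen, stop at a "stop" marker, and tidy up the tags on the way. It is a simplification, not a port of the PHP: the table checks are left out, and the names `scrape_lines`, `clean_line`, and the `SAMPLE` page are made up for illustration. It uses only the standard `re` module and works on a hard-coded string so it runs without a network connection.

```python
import re

# Marker patterns modelled on the ones in the PHP snippet (assumed to
# still be meaningful for whatever page is being scraped).
START_PATTERNS = [re.compile(r"Nowcast as of"),
                  re.compile(r"Forecast for Washtenaw")]
STOP_PATTERNS = [re.compile(r"Air Pollution")]


def clean_line(line):
    """Tidy one line of HTML, roughly like the preg_replace calls do."""
    line = re.sub(r"<img src[^>]*>", "", line)              # drop images
    line = re.sub(r"<(/?)td[^>]*>", r"<\1p>", line)         # <td> -> <p>, </td> -> </p>
    line = re.sub(r"</?(tr[^>]*|font|center)>", "", line)   # drop layout tags
    line = re.sub(r"<p></p>", "", line)                     # drop empty paragraphs
    return line


def scrape_lines(lines):
    """Return cleaned lines from the first start marker up to a stop marker."""
    started = False
    kept = []
    for line in lines:
        if any(p.search(line) for p in START_PATTERNS):
            started = True
        if any(p.search(line) for p in STOP_PATTERNS):
            break  # simplified: the stop line itself is not kept
        if started:
            cleaned = clean_line(line)
            if cleaned.strip():
                kept.append(cleaned)
    return kept


# Hypothetical page content, standing in for the downloaded HTML.
SAMPLE = """<html><body>
<tr><td>Header junk</td></tr>
<tr><td><b>Forecast for Washtenaw</b></td></tr>
<tr><td><img src="sun.gif">Sunny, high of 75</td></tr>
<tr><td>Air Pollution</td></tr>
</body></html>"""

if __name__ == "__main__":
    for line in scrape_lines(SAMPLE.splitlines()):
        print(line)
```

In a real script the lines would come from the live page instead of `SAMPLE`, for example by reading the response of `urllib.request.urlopen(url)` and decoding it. Matching HTML with regular expressions is fragile if the site changes its markup, which is one reason a parsing library such as BeautifulSoup is often preferred.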


