Web Scraping

danposs86 · April 26, 2007

I need to do some web scraping for a project. Just wondered if anyone on here has used this (or a similar) technique. If so, it would be a great help if you could point me in the right direction, as I don't have a clue where to start; maybe a simple tutorial website or the like. Thanks loads, Dan. (PS: if this is in the wrong section, I'm sorry; feel free to move it.)

real_illusions · April 26, 2007

http://en.wikipedia.org/wiki/Web_scraping

danposs86 · April 26, 2007

I have already read that; it tells me what it is. I am after some code or something that I can look at and try to understand, so I can make my own scraping site.

justsomeguy · April 26, 2007

Do you already have a programming language picked out?

aspnetguy · April 26, 2007

I have scraped/crawled websites in ASP.NET (C#), if you decide to use that language.

danposs86 · April 26, 2007

I've only really used PHP before, so PHP then, lol.

pulpfiction · April 26, 2007

A Google search gave this: http://phpclasses.goodphp.com/browse/package/1754.html

boen_robot · April 26, 2007

There is also the PEAR HTTP Request class. Documentation is also available.

danposs86 · April 29, 2007

I have found this code on fluidthoughts:

<html>
<head>
    <title>Ann Arbor Weather</title>
    <link rel="stylesheet" href="styles.css" type="text/css">
</head>
<body bgcolor="#ffffff">
<p><a href="http://www.crh.noaa.gov/forecasts/MIZ075.php">http://www.crh.noaa.gov/forecasts/MIZ075.php</a></p>
<table cellpadding="7" cellspacing="0" border="0">
<tr>
    <td valign="top">
        <img src="http://www.bmcmedia.net/webcam/bmccam.jpg"
            width="352" height="288" alt="" border="0" />
        <br /><br />
        <img src="http://weather.yahoo.com/images/northeast_sat_440x297.jpg"
            width="440" height="297" alt="" border="0" />
    </td>
    <td>
<?php
    $src = 'http://www.wunderground.com/cgi-bin/findweather/getForecast?query=48103';
    $stop = 0;
    $start = 1;
    $fp = fopen ($src, "r");
    while ((!feof ($fp)) && (!$stop))
    {
        $line = fgets($fp, 4096);
        if (preg_match("/Nowcast as of/", $line)) { $start = 0; }
        if (preg_match("/Forecast for Washtenaw/", $line)) { $start = 0; }
        if (preg_match("/Air Pollution/", $line)) { $stop = 1; }
        if ( !$start && ( preg_match( "/smalltableheader/", $line )))
        { $stop = 1; }
        if (!$start)
        {
            if ( preg_match( "/<table /", $line ))
            { $stop = 1; }
            elseif ( preg_match( "/<\/?table[^>]*>/", $line )) {; }
            else
            {
                $line = preg_replace("/<img src[^>]*>/", '', $line);
                $line = preg_replace("/<(\/)?td[^>]*>/", "<$1p>", $line);
                $line = preg_replace("/<\/?(tr[^>]*|font|center)>/", '', $line);
                $line = preg_replace("/<p><\/p>/", '', $line);
                echo $line;
            }
        }
    }
    fclose($fp);
?>
    </td>
</tr>
</table>
</body>
</html>

I have been trying to work out what each bit does, but it's getting the better of me. Can anyone who knows explain or comment the important lines, please? My brain's melting, lol.

reportingsjr · April 30, 2007

Well, it defines a variable holding the URL of the web page to scrape, opens the page to read its HTML line by line, then uses PHP's Perl-compatible regular expression functions (preg_match and preg_replace) to find certain parts of the HTML/text. It stores each cleaned-up line in a variable and echoes that variable. That's it.

I would actually recommend Python for this. I made a web crawler for links in PHP and it didn't turn out so well; it worked fine, but it could only crawl about 800 sites before quitting. The one I made in Python took about one day to make, with almost no knowledge of Python beforehand (it's a lot like PHP). Python also has a lot of nice libraries that you can use too. The one I'm using right now (BeautifulSoup) does have a bug, though: after crawling 50 or 60 websites it crashes. Still works, though.
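For what it's worth, here is a rough Python sketch of the same line-by-line idea as the PHP above: wait for a start marker, clean each line with regular expressions, and stop at an end marker. The marker strings and patterns are taken from the posted code; the function names (clean_line, extract_section) and the SAMPLE page are made up for illustration. It is a simplified take, not an exact port: it stops before the end-marker line rather than after it, and it trims whitespace.

```python
import re

# Marker strings copied from the posted PHP; the real page layout is assumed.
START_MARKERS = ("Nowcast as of", "Forecast for Washtenaw")
STOP_MARKERS = ("Air Pollution", "smalltableheader")


def clean_line(line):
    """Strip layout markup from one line, turning table cells into paragraphs."""
    line = re.sub(r"<img src[^>]*>", "", line)
    # <td ...> becomes <p>, </td> becomes </p>
    line = re.sub(r"<(/)?td[^>]*>", lambda m: "<%sp>" % (m.group(1) or ""), line)
    line = re.sub(r"</?(tr[^>]*|font|center)>", "", line)
    line = re.sub(r"<p></p>", "", line)
    return line


def extract_section(lines):
    """Return the cleaned lines between a start marker and a stop marker."""
    capturing = False
    kept = []
    for line in lines:
        if any(marker in line for marker in START_MARKERS):
            capturing = True
        if not capturing:
            continue  # still in the part of the page we do not want
        if any(marker in line for marker in STOP_MARKERS) or "<table " in line:
            break  # reached the end of the interesting section
        if re.search(r"</?table[^>]*>", line):
            continue  # drop stray table tags, as the PHP does
        cleaned = clean_line(line).strip()
        if cleaned:
            kept.append(cleaned)
    return kept


# Hypothetical stand-in for a downloaded page, so the sketch runs offline.
SAMPLE = """<html><body>
<table class="nav"><tr><td>Home</td></tr></table>
<tr><td><b>Forecast for Washtenaw</b></td></tr>
<tr><td><img src="sun.gif">Sunny, high near 70.</td></tr>
</table>
<tr><td class="x"><font>Tonight: clear.</font></td></tr>
<div>Air Pollution</div>
<tr><td>Should not appear</td></tr>
</body></html>"""

if __name__ == "__main__":
    for line in extract_section(SAMPLE.splitlines()):
        print(line)
```

To run it against a live page you would read the lines from the site instead of SAMPLE, for example with urllib.request.urlopen from the standard library. Matching on fixed strings like this breaks as soon as the site changes its layout, which is the main weakness of the regex approach compared with an HTML parser.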
