Cod-nes Posted February 9, 2009
How do we identify web crawlers using PHP?

jlhaslip Posted February 9, 2009
One way is to use the $_SERVER array, in particular 'HTTP_USER_AGENT', but user agent strings can be spoofed.
http://ca3.php.net/reserved.variables.server
Another way is to look for hits on the robots.txt file, because "good" bots are expected to look in that file.

Cod-nes (Author) Posted February 9, 2009
So if I want to identify Google, will the code be like this?
<?php
if($_SERVER['HTTP_USER_AGENT] = 'Google'){
    echo 'Google bot';
}else{
    echo 'Human user';
}
?>

webtrix Posted February 9, 2009
I think that's Googlebot. Apart from Googlebot, Google has more robots (from the page below):
# Googlebot: crawl pages from our web index and our news index
# Googlebot-Mobile: crawls pages for our mobile index
# Googlebot-Image: crawls pages for our image index
# Mediapartners-Google: crawls pages to determine AdSense content. We only use this bot to crawl your site if AdSense ads are displayed on your site.
# Adsbot-Google
How do I block or allow Googlebot?: http://www.google.com/support/webmasters/b...mp;answer=40364
And also, find the difference:
bad >> if($_SERVER['HTTP_USER_AGENT] = 'Google')
good >> if($_SERVER['HTTP_USER_AGENT'] = 'Google')

zppblood Posted February 9, 2009
You are checking for equal values, not assigning values to variables. So change = to ==.

justsomeguy Posted February 9, 2009
Google doesn't set the user agent string to "Google" or "Googlebot". You can find the user agent string it uses here:
http://www.user-agents.org/
You can use the strpos function to test if the word "Google" or "Googlebot" shows up in the user agent string. Be careful not to identify a browser with the Google toolbar as the Googlebot.

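As a rough sketch of the strpos check suggested above: the user agent string below is hard-coded purely for illustration (on a live request it would come from $_SERVER['HTTP_USER_AGENT']), and the variable names are made up for this example. It only shows why an exact comparison misses the crawler while a substring test finds it.

```php
<?php
// Sample user agent string of the kind Googlebot sends (hard-coded for illustration).
$userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

// An exact comparison fails because the token is only part of the string.
$exactMatch = ($userAgent == 'Googlebot');

// strpos() returns the offset of the match, or false when there is no match.
$containsToken = (strpos($userAgent, 'Googlebot') !== false);

var_dump($exactMatch);    // bool(false)
var_dump($containsToken); // bool(true)
```

Since the header can be spoofed, a match like this is a hint about the visitor, not proof.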
Cod-nes (Author) Posted February 10, 2009
What are the crawler names for MSN, Yahoo, and archive.org?

justsomeguy Posted February 10, 2009
They're all listed on that site; it has a search feature.

Cod-nes (Author) Posted February 11, 2009
So would the code look like this?
if($_SERVER['HTTP_USER_AGENT'] == 'Googlebot' || 'msnbot' || 'YahooSeeker' || 'crawler@alexa.com' || 'archive.org_bot'){
    echo "Web crawler \"" . $_SERVER['HTTP_USER_AGENT'] . "\" is online";
}
Or would I have to do it like this:
if($_SERVER['HTTP_USER_AGENT'] == 'Googlebot'
    || $_SERVER['HTTP_USER_AGENT'] == 'msnbot'
    || $_SERVER['HTTP_USER_AGENT'] == 'YahooSeeker'
    || $_SERVER['HTTP_USER_AGENT'] == 'crawler@alexa.com'
    || $_SERVER['HTTP_USER_AGENT'] == 'archive.org_bot'){
    echo "Web crawler \"" . $_SERVER['HTTP_USER_AGENT'] . "\" is online";
}

justsomeguy Posted February 11, 2009
The second way is going to work, but it's not going to do any good if you just use == to compare. The entire user agent string is not "Googlebot" or "msnbot". The user agent string contains "Googlebot", but it also has other stuff in it. You need to check if the user agent string contains the text, not is the text. That's why I mentioned strpos above.
$bots = array(
    'Googlebot',
    'msnbot',
    'YahooSeeker',
    ...
);
$is_bot = false;
foreach ($bots as $bot)
    $is_bot |= strpos($_SERVER['HTTP_USER_AGENT'], $bot);
if ($is_bot)
    echo 'its a bot';

Cod-nes (Author) Posted February 12, 2009
Thanks. But a small question: what does the |= operator do on the line "$is_bot |= strpos($_SERVER['HTTP_USER_AGENT'], $bot);"?

Synook Posted February 12, 2009
$a |= $b; is the same as $a = $a | $b; that is, $a or $b. So in your case, any time strpos() returns true, $is_bot will be set to true, and it won't switch back to false even if a later strpos() returns false (because $true | $false == $true).

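A small sketch of how |= behaves, for anyone who wants to try it. The variable names are only illustrative. It assumes nothing beyond standard PHP operators, and it is meant to show that | works on bits and produces an integer, which differs from the logical || operator.

```php
<?php
// $a |= $b is shorthand for $a = $a | $b, where | is bitwise OR.
$is_bot = false;
$is_bot |= true;   // false | true: both operands become integers, so 0 | 1
$is_bot |= false;  // 1 | 0: a bit that has been set stays set

// With integer operands the bits are combined, which logical || never does.
$bits = 5 | 2;        // binary 101 | 010 = 111
$logical = (5 || 2);  // logical OR only yields a boolean

var_dump($is_bot);   // int(1)
var_dump($bits);     // int(7)
var_dump($logical);  // bool(true)
```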
Cod-nes (Author) Posted February 13, 2009
Thanks. :)
w3schools still needs to add this operator to the operators page. http://w3schools.com/php/php_operators.asp

justsomeguy Posted February 13, 2009
Yeah, they're missing a lot of operators. Here is the list:
http://www.php.net/manual/en/language.operators.php
That doesn't cover the combination operators though; the assignment operators page only mentions this:
In addition to the basic assignment operator, there are "combined operators" for all of the binary arithmetic, array union and string operators that allow you to use a value in an expression and then set its value to the result of that expression.
The | operator is a bitwise OR operator. Not to be confused with ||, which is logical OR.
After you run this:
foreach ($bots as $bot)
    $is_bot |= strpos($_SERVER['HTTP_USER_AGENT'], $bot);
$is_bot will be the number 0 if the bot wasn't found, or it will be a positive integer if it was found.
Actually, that's a problem. strpos will return 0 if the string is found at the beginning, but the loop won't catch that. You'd better change it to this:
foreach ($bots as $bot)
    $is_bot |= (strpos($_SERVER['HTTP_USER_AGENT'], $bot) !== false);
After that, $is_bot will be the integer 1 if it found a bot, or 0 if it did not.

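One possible way to package the corrected loop is as a small function that returns as soon as it finds a match. This is only a sketch: is_known_bot() is a hypothetical name, the token list is whatever you choose to pass in, and it swaps strpos for stripos so the match is case-insensitive, which is a choice made for this example, not something the thread requires.

```php
<?php
// Hypothetical helper: true when the user agent contains any of the tokens.
function is_known_bot(string $userAgent, array $tokens): bool
{
    foreach ($tokens as $token) {
        // Compare strictly against false: a match at offset 0 is still a match.
        if (stripos($userAgent, $token) !== false) {
            return true;
        }
    }
    return false;
}

$tokens = array('Googlebot', 'msnbot', 'YahooSeeker');

// The token sits at offset 0 here, the case the original loop missed.
var_dump(is_known_bot('Googlebot/2.1 (+http://www.google.com/bot.html)', $tokens)); // bool(true)
var_dump(is_known_bot('Mozilla/5.0 (Windows NT 10.0) Firefox/120.0', $tokens));     // bool(false)
```

On a real request you would pass in $_SERVER['HTTP_USER_AGENT'], after checking that it is set, since some clients send no user agent at all.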
Cod-nes (Author) Posted May 7, 2009
Would this work?
if(strpos($_SERVER['HTTP_CLIENT_IP'],'Googlebot')) {
    $ip='Googlebot';
} elseif(strpos($_SERVER['HTTP_CLIENT_IP'],'msnbot')) {
    $ip='MSN Bot';
} elseif(strpos($_SERVER['HTTP_CLIENT_IP'],'Slurp')) {
    $ip='Yahoo';
} elseif(strpos($_SERVER['HTTP_CLIENT_IP'],'ia_archiver')) {
    $ip='Alexa';
}
// Getting IP address from human
elseif (!empty($_SERVER['HTTP_CLIENT_IP'])) // Checking ip from share internet
{
    $ip=$_SERVER['HTTP_CLIENT_IP'];
} elseif (!empty($_SERVER['HTTP_X_FORWARDED_FOR'])) // Checking ip is pass from proxy
{
    $ip=$_SERVER['HTTP_X_FORWARDED_FOR'];
} else {
    $ip=$_SERVER['REMOTE_ADDR'];
}

justsomeguy Posted May 7, 2009
No, that won't work. I'm not aware of a $_SERVER property called "HTTP_CLIENT_IP". If you're looking for the IP address, it's in $_SERVER['REMOTE_ADDR']. For the bots, you don't care what the IP is; you're supposed to be comparing the user agent string instead, like the examples we had above.
Also, when you use strpos, it's not enough to just check whether the return value is true or false; you have to compare it strictly against false. strpos might find the substring at the beginning of the string, and if it does it will return 0, because the substring starts at position 0. A plain true/false check treats that 0 as false and reports that nothing was found, even though the match is at the start. To distinguish between 0 and false you need to compare strictly:
if(strpos($_SERVER['HTTP_USER_AGENT'],'Googlebot') !== false)

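Putting those two points together, the lookup from the question could be sketched roughly as follows. visitor_label() is a hypothetical helper name, the token and label pairs are simply the ones from the question, and the fallback uses $_SERVER['REMOTE_ADDR'] as suggested above. Treat it as one possible arrangement under those assumptions.

```php
<?php
// Hypothetical helper: returns a crawler label when the user agent contains a
// known token, otherwise falls back to the visitor's IP address.
function visitor_label(string $userAgent, string $remoteAddr): string
{
    // Token => label pairs taken from the question; extend as needed.
    $crawlers = array(
        'Googlebot'   => 'Googlebot',
        'msnbot'      => 'MSN Bot',
        'Slurp'       => 'Yahoo',
        'ia_archiver' => 'Alexa',
    );
    foreach ($crawlers as $token => $label) {
        // Strict comparison so that a match at position 0 still counts.
        if (strpos($userAgent, $token) !== false) {
            return $label;
        }
    }
    return $remoteAddr;
}

// Both keys can be missing (for example on the command line), so default to ''.
$ip = visitor_label(
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '',
    isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : ''
);
```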
Cod-nes (Author) Posted May 7, 2009
*facepalms* I can't believe that I was trying to find the web crawler's IP. >_< Anyway, thanks! ^^

Archived
This topic is now archived and is closed to further replies.