Identifying Web Crawlers


Cod-nes
One way is to use the $_SERVER array, in particular 'HTTP_USER_AGENT', but user agent strings can be spoofed.
http://ca3.php.net/reserved.variables.server

Another way is to look for hits on the robots.txt file, because "good" bots are expected to look in that file.


So if I want to identify Google, will the code be like this?

<?php
if($_SERVER['HTTP_USER_AGENT] = 'Google'){
  echo 'Google bot';
}else{
  echo 'Human user';
}
?>

I think that's Googlebot. Apart from Googlebot, Google has more robots (from the page below):

# Googlebot: crawl pages from our web index and our news index
# Googlebot-Mobile: crawls pages for our mobile index
# Googlebot-Image: crawls pages for our image index
# Mediapartners-Google: crawls pages to determine AdSense content. We only use this bot to crawl your site if AdSense ads are displayed on your site.
# Adsbot-Google

How do I block or allow Googlebot?: http://www.google.com/support/webmasters/b...mp;answer=40364

And also, find the difference:
bad >> if($_SERVER['HTTP_USER_AGENT] = 'Google')
good >> if($_SERVER['HTTP_USER_AGENT'] = 'Google')
:)

(Both lines still use =, which assigns a value; a comparison needs ==.)

Google doesn't set the user agent string to just "Google" or "Googlebot". You can find the user agent string it uses here:
http://www.user-agents.org/

You can use the strpos function to test whether the word "Google" or "Googlebot" shows up in the user agent string. Be careful not to identify a browser with the Google toolbar as the Googlebot.


So would the code look like this?

if($_SERVER['HTTP_USER_AGENT'] == 'Googlebot' || 'msnbot' || 'YahooSeeker' || 'crawler@alexa.com' || 'archive.org_bot'){
  echo "Web crawler \"" . $_SERVER['HTTP_USER_AGENT'] . "\" is online";
}

Or would I have to do it like this:

if($_SERVER['HTTP_USER_AGENT'] == 'Googlebot'
  || $_SERVER['HTTP_USER_AGENT'] == 'msnbot'
  || $_SERVER['HTTP_USER_AGENT'] == 'YahooSeeker'
  || $_SERVER['HTTP_USER_AGENT'] == 'crawler@alexa.com'
  || $_SERVER['HTTP_USER_AGENT'] == 'archive.org_bot'){
  echo "Web crawler \"" . $_SERVER['HTTP_USER_AGENT'] . "\" is online";
}


The second way is valid PHP, but it won't match anything if you just use == to compare. The entire user agent string is not "Googlebot" or "msnbot". The user agent string contains "Googlebot", but it also has other stuff in it. You need to check whether the user agent string contains the text, not whether it is the text. That's why I mentioned strpos above.

$bots = array(
  'Googlebot',
  'msnbot',
  'YahooSeeker',
  ...
);
$is_bot = false;
foreach ($bots as $bot)
  $is_bot |= strpos($_SERVER['HTTP_USER_AGENT'], $bot);
if ($is_bot)
  echo 'its a bot';


Thanks. But small question. What does this operator do on the line "$is_bot |= strpos($_SERVER['HTTP_USER_AGENT'], $bot);"?

$a |= $b; is the same as $a = $a | $b; where | is a bitwise OR. In your case, that means that once strpos() returns a non-zero position, $is_bot becomes non-zero (which counts as true), and it won't switch back even if a later strpos() returns false, because OR-ing with 0 leaves the value unchanged.
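
To make the behaviour of |= concrete, here is a minimal sketch that feeds it the kinds of values strpos() can return. The offsets 25 and 0 and the variable $at_start are made up for illustration; they are not taken from a real user agent string.

```php
<?php
// Bitwise OR-assignment applied to values strpos() can return.
$is_bot = false;
$is_bot |= 25;       // match at offset 25: false is cast to 0, and 0 | 25 is 25
var_dump($is_bot);

$is_bot |= false;    // a later miss: 25 | 0 is still 25, so the flag stays set
var_dump($is_bot);

$at_start = false;
$at_start |= 0;      // match at offset 0: 0 | 0 is 0, which counts as false
var_dump($at_start);
```

The last case shows why a match at the very start of the string goes unnoticed when the raw strpos() result is OR-ed in.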


Yeah, they're missing a lot of operators. Here is the list:
http://www.php.net/manual/en/language.operators.php

That doesn't cover the combination operators though; the assignment operators page only mentions this:

In addition to the basic assignment operator, there are "combined operators" for all of the binary arithmetic, array union and string operators that allow you to use a value in an expression and then set its value to the result of that expression.
The | operator is a bitwise OR operator. Not to be confused with ||, which is logical OR.

After you run this:

foreach ($bots as $bot)
  $is_bot |= strpos($_SERVER['HTTP_USER_AGENT'], $bot);

$is_bot will be the number 0 if the bot wasn't found, or it will be a positive integer if it was found.

Actually, that's a problem. strpos will return 0 if the string is found at the beginning, but the loop won't catch that. You'd better change it to this:
foreach ($bots as $bot)  $is_bot |= (strpos($_SERVER['HTTP_USER_AGENT'], $bot) !== false);

After that, $is_bot will be the integer 1 if a bot was found, or 0 if not.
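
Putting the pieces of this thread together, the check could be wrapped in a small function. This is only a sketch: the function name is_known_bot is hypothetical, the bot list is the short one from the earlier post, and stripos (case-insensitive) is swapped in for strpos as a design choice, not because the thread requires it.

```php
<?php
// Hypothetical helper: true if $userAgent contains any token listed in $bots.
function is_known_bot($userAgent, array $bots)
{
    foreach ($bots as $bot) {
        // stripos() returns an integer offset or false. Offset 0 is a valid
        // match, so the result is compared strictly against false.
        if (stripos($userAgent, $bot) !== false) {
            return true;
        }
    }
    return false;
}

$bots = array('Googlebot', 'msnbot', 'YahooSeeker');

// HTTP_USER_AGENT may be missing (for example on the command line).
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
echo is_known_bot($ua, $bots) ? "it's a bot" : 'probably not a bot';
```

Returning as soon as one token matches avoids the bitwise accumulation altogether, so the result is a plain boolean.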


  • 2 months later...

Would this work?

if(strpos($_SERVER['HTTP_CLIENT_IP'],'Googlebot')) {
  $ip='Googlebot';
} elseif(strpos($_SERVER['HTTP_CLIENT_IP'],'msnbot')) {
  $ip='MSN Bot';
} elseif(strpos($_SERVER['HTTP_CLIENT_IP'],'Slurp')) {
  $ip='Yahoo';
} elseif(strpos($_SERVER['HTTP_CLIENT_IP'],'ia_archiver')) {
  $ip='Alexa';
}
// Getting IP address from human
elseif (!empty($_SERVER['HTTP_CLIENT_IP']))   // Checking ip from share internet
{
  $ip=$_SERVER['HTTP_CLIENT_IP'];
} elseif (!empty($_SERVER['HTTP_X_FORWARDED_FOR']))   // Checking ip is pass from proxy
{
  $ip=$_SERVER['HTTP_X_FORWARDED_FOR'];
} else {
  $ip=$_SERVER['REMOTE_ADDR'];
}


No, that won't work. I'm not aware of a $_SERVER property called "HTTP_CLIENT_IP". If you're looking for the IP address, it's in $_SERVER['REMOTE_ADDR']. For the bots, you don't care what the IP is; you're supposed to be checking the user agent string instead, like in the examples we had above.

Also, when you use strpos, it's not enough to just do what you did; you have to compare the return value strictly against false. strpos might find the substring at the beginning of the string, and if it does it will return 0, because the substring starts at position 0. When you only check whether the return value is true or false, as you do above, a return value of 0 is treated as false, as if nothing was found (even though it was found at the start). To distinguish between 0 and false you need to compare strictly:

if(strpos($_SERVER['HTTP_USER_AGENT'],'Googlebot') !== false)
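
Applied to the code from the question, those two corrections might look like the sketch below. The function name visitor_label and the 'unknown' fallback are hypothetical; the tokens and labels are the ones from the question. Taking the $_SERVER array as a parameter is just a convenience so the function can be tried with sample data.

```php
<?php
// Hypothetical helper: returns a crawler label when the user agent matches
// a known token, otherwise the client IP address from REMOTE_ADDR.
function visitor_label(array $server)
{
    // User agent tokens from the question, mapped to display labels.
    $labels = array(
        'Googlebot'   => 'Googlebot',
        'msnbot'      => 'MSN Bot',
        'Slurp'       => 'Yahoo',
        'ia_archiver' => 'Alexa',
    );

    $ua = isset($server['HTTP_USER_AGENT']) ? $server['HTTP_USER_AGENT'] : '';
    foreach ($labels as $token => $label) {
        // Strict comparison, so a match at position 0 still counts.
        if (strpos($ua, $token) !== false) {
            return $label;
        }
    }

    // Not a known crawler: fall back to the IP address.
    return isset($server['REMOTE_ADDR']) ? $server['REMOTE_ADDR'] : 'unknown';
}

echo visitor_label($_SERVER);
```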

