Why is Verisign Inc downloading my entire website?


Guest So Called


Guest LH91325

I wonder if anybody else here is interested in monitoring what web crawlers are doing on our websites. I like it when Google, Yahoo, MSN and Bing crawl all over my site. They send me traffic, and I can see it coming in from the browser's referrer string. Yandex, Baidu, the rest? Meh. Maybe they'll send me some traffic some day, from Russia or China or something. (Yeah, right.) And what is Majestic-12 doing? Are they gonna send me traffic? But WTF is Verisign doing downloading my site? AFAIK they aren't even in any business that would benefit my traffic. (Yeah, Symantec just bought them. Must be nice to have the big bucks.) Maybe they're sniffing me for viruses. Or virii, whatever you call them.

I've been amusing myself watching the bots crawl my sites, taking names and keeping notes. There's a lot more here than is obvious from your server logs. I keep separate logs via PHP-MySQL and split the traffic out into, currently, about 100 'bots being tracked. Did you know that all of the major search engines (including Google, Yahoo, MSN, Bing and Yandex) have not only their obvious crawler fleets, identifiable by user-agent string and client domain, but stealth bots too? Most of the time they index your site and it's obvious it's them, but every now and then they send a stealth bot over to see if you're lying to them. Lying is easy to do if you track them: identify the bot and you can send it different results than normal guests see. For an example of a stealth bot, MSN uses the domain phx.gbl! That's not even an official TLD, but MSFT runs the DNS servers so it resolves if they want it to. (I've never understood how they can continue to get away with this!) Other bots don't even resolve DNS. You get only the remote IP number and have to manually figure out who owns the address block. I turn those into CIDR notation so I can track them in the future. They lie about their user-agent strings, pretending they're MSIE or Opera or Firefox. Sometimes they don't provide any user-agent string at all, just blank. (I send those folks packing, after of course figuring out who they are and adding them to my 'bot list.)

How do I know they're bots? Who else looks at your robots.txt file? That's a dead giveaway. The bots are like mice going for cheese! Even if they're planning on ignoring it they still wanna see what's in it. (Some of them, anyway. Normal visitors almost never look at your robots.txt file, except maybe web designers.) Another way I find bots is that I have a few links that are invisible to normal visitors. Bots don't know that the CSS is hiding the link, so they follow it anyway. Caught you! :)

Meanwhile, I've been on to Verisign for a while now, and I've already put them in my bot class that doesn't get to view my pages. They can view / and they can read my robots.txt, but while ordinary visitors are reading my pages Verisign is getting 404 Not Found. Like I said, when you know which visitors are bots you can send them different content, or no content at all if you like (a sketch of how that lookup can work is below). Maybe I'll let them see my content some day if I ever see any benefit to me or my website traffic.

BTW, AFAIK you could throw every last damned 'bot off your site except Google, Yahoo and MSN/Bing, and if you're like me it probably wouldn't affect your website traffic by more than 1 percent. I analyze my referrals too, so I know who is sending me traffic. Google alone is responsible for about 7/8 of my traffic.
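Something like this, schematically. The bots table, its columns and the class name are invented for illustration here, not my real schema:

<?php
// Minimal sketch of the "identified bots get a 404" idea above.

function ipInCidr(string $ip, string $cidr): bool {
    [$subnet, $bits] = explode('/', $cidr);
    $mask = -1 << (32 - (int)$bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$ua  = $_SERVER['HTTP_USER_AGENT'] ?? '';
$ip  = $_SERVER['REMOTE_ADDR'];

// Match the visitor against every tracked bot by UA fragment or CIDR.
foreach ($pdo->query('SELECT ua_fragment, cidr, class FROM bots') as $bot) {
    $uaHit   = $bot['ua_fragment'] !== '' && stripos($ua, $bot['ua_fragment']) !== false;
    $cidrHit = $bot['cidr'] !== '' && ipInCidr($ip, $bot['cidr']);
    if (($uaHit || $cidrHit) && $bot['class'] === 'banned') {
        http_response_code(404);  // the bot sees "page not found"
        exit;                     // while human guests get the real page
    }
}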
So my reason for posting this: is anybody else interested in watching the 'bots watch us? :) I'm talking about the 'bots that are watching us for reasons other than sending us traffic. Anybody interested?


Guest LH91325

Over the years I've developed my own content delivery system. I started out with a flat HTML site in the '90s, then started writing PHP code around 2000, putting my content in flat text files and having the PHP fetch the content and slap a header and footer on it before serving it to guests. I learned MySQL and saw some great examples of PHP-MySQL content delivery systems (e.g. WordPress), so as I grew, my code became more sophisticated, until today everything goes through one index.php and all the content is stored in the database. (The main script calls other scripts as necessary depending on request type. No request requires more than 4 scripts loaded.) I can even serve images out of the MySQL database, although my main site has a separate image directory.

I have several domains, some of them not really used except for email, so I decided to run my content delivery system on all of them--it took me less than an hour to copy the files and configure the settings. In the last year I decided it would be easier to keep just one copy of the PHP scripts and have all my sites execute the same code. After all, all the content is in MySQL and the configuration settings are there too. Every site can be different, yet use identical code. The first file to execute is index.php (of course), and that has not much more than the MySQL server configuration (because each site has a different MySQL server). The index then loads (require()) the main program, which is located outside the publicly accessible html area (a sketch is below). Each site has only the index.php file and an .htaccess file; all the rest of the code is stored in a common code directory. As I said, only one site has images stored in an image directory. The rest of the sites have so few images (and small ones at that) that they're just served from the MySQL database.

So what I have is one main production site and 4-5 other sites that aren't really used for anything. (One of them has my CV.) It amused me to have them all share one code base, but as I watched the traffic I started to see interesting things. I forgot to add: my content delivery system logs every transaction (viewing) in a MySQL log and categorizes things like ordinary guests, errors, good web crawlers, and what I like to call badbots. I began to amuse myself sorting out the good and bad web crawlers. Google, Yahoo, MSN, Bing: good--they send me traffic. But WTH are the rest of those bots doing?

It's an interesting experiment because I have the main site (think of it as a sort of blog, although mine runs custom software) with perhaps 100 pages of content, I have my CV site with only one page of content, and I have a few more sites left over with absolutely no purpose except that I use the domains for other reasons (like email). I realized I had the makings of an experiment! I look at the traffic on the main site (where I do pretty well in Google page ranking, even compared to much larger sites), I look at traffic on the CV site, which has very little demand, and I look at the traffic on the other sites, which have absolutely no content and no reason for anybody to visit them. On the nothing sites there is no traffic except web crawlers and other 'bots. I've made no effort to publicize or submit any of the test sites in any way. Whatever traffic found its way there got there by snooping. It's an interesting system because the nothing sites help me figure out what traffic comes from web crawlers and 'bots.
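Schematically, each site's index.php is little more than this. All the names here are invented for illustration, and serve_request() stands in for whatever the shared engine's real entry point is:

<?php
// index.php -- the only script in each site's public html directory.
// Per-site settings live here; the shared engine lives outside the
// web root so it can never be fetched directly over HTTP.

$config = [
    'db_host' => 'localhost',
    'db_name' => 'site_main',   // each site points at its own database
    'db_user' => 'user',
    'db_pass' => 'pass',
];

// Load the common code from outside the publicly accessible area.
require __DIR__ . '/../common-code/engine.php';

// The engine connects to MySQL, fetches the requested page from the
// database, wraps it in header/footer, logs the transaction, and loads
// at most a few more scripts depending on request type.
serve_request($config);  // hypothetical entry point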
I identify them by their behavior (knowing they're 'bots, because nobody else is going to visit those sites, or so rarely I can ignore it) and classify the 'bots by user-agent string or client domain; when that fails I track them down by IP address, and via ARIN, RIPE, APNIC etc. I figure out who owns that block of IP addresses. If they appear to be worth tracking I calculate the CIDR and put that in my web crawler database (which can handle several CIDRs if they use more than one). Once I've classified a web crawler I log them in separate log categories: goodbot and badbot. I also keep track of the number of visits and the most recent visit. As I discover new web crawlers I put the information in the database on my main website, then copy the table to the databases of the other sites (except for the number of visits and time of last visit). So I mostly find the 'bots from my nothing sites, and that lets me separate my main website traffic into human guests and 'bots. Ordinarily this would be very difficult to do on one website because of the amount of traffic; it would be hard to figure out which are 'bots and which are human guests. Or at least harder than it is when you see nothing but 'bot traffic on nothing sites. That's enough about my experimental setup for the time being.

I further classify 'bots into categories. There's a main category that most bots start out in--really not a category as much as not being in any other category. I call these goodbots and include SEs like Google, Yahoo, Bing, MSN, Yandex, Baidu: any web crawler that looks like its mission is to index my site and send me traffic. Whenever I notice 'bots doing things that don't appear to have any benefit to me, I flag them into other categories. One of those categories is what I call "link spammers." I first saw them appearing on my CV website from time to time, but as time went on they became more and more frequent. Their sole behavior is that they access my root page / but with a referrer from some site I've never heard of. The links don't have much in common except that many of them have RU or UA domain names, and not surprisingly most of the link spammers come out of IP addresses that I've tracked down to RU and UA (sometimes CZ, DK). Actually I don't even need to track IP addresses down to countries: I use a public-domain IP-to-country database (you can download several on the Internet), so my logs include the country of origin for all site accesses. It's interesting to see where my visitors and 'bots are coming from. Now all of my sites, including my main content site, are getting a fair amount of link spammer activity...

So anyway, when I discover link spammers I put them into the link spammer category, and then when they visit my site I detect them right off, immediately log the access, and then my script enters a 10 minute sleep state without sending any headers or content. I don't have any way of knowing what this does to them other than hanging them up until their TCP/IP link times out or they quit waiting. I see it as the 'bot equivalent of putting a phone call on hold with no background music. :)) My server (typical shared hosting LAMP setup) is multi-threaded, so my site goes right on working for other guests and web crawlers while the link spammer is in limbo. I wish there were some way of figuring out what they're doing. But the big question: why are they doing this? Their visits certainly aren't getting any content from me. All they get is a TCP/IP connect to port 80.
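The hold-with-no-music trick is only a few lines. A sketch, with an invented log table name (on some shared hosts set_time_limit() is disabled, in which case the nap gets cut short):

<?php
// Sketch of the link-spammer tarpit: log the visit, then go silent for
// ten minutes without sending any headers or content.

function tarpit(PDO $pdo, string $ip, string $referer): void {
    // Record the visit first (hypothetical log table).
    $stmt = $pdo->prepare(
        'INSERT INTO badbot_log (ip, referer, hit_at) VALUES (?, ?, NOW())'
    );
    $stmt->execute([$ip, $referer]);

    ignore_user_abort(true);  // keep sleeping even if they disconnect
    set_time_limit(0);        // lift PHP's execution time limit
    sleep(600);               // ten minutes of dead air...
    exit;                     // ...then end without sending anything
}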
Apache might be sending a header or two, but my site certainly isn't sending any headers or content. There's only one thing I can figure out. The only effect they're having on my site (other than annoying me) is that they appear in my logs. I can't believe that they want to get me personally to go visit their sites (and I haven't visited them either; I have no idea if they're hosting viruses and I don't want to find out). The only other thing I can figure is that maybe some "webmasters" (that always sounded like a silly word to me: "I am the master of the web!" ;)) might be posting their logs in publicly accessible folders, either intentionally or accidentally. (Nobody but me can see my MySQL logs, and nobody but my hosting service and I can see my server logs.) I have occasionally Googled something and ended up looking at somebody else's logs. Maybe the link spammers are counting on showing up in random server logs across the Internet, hoping that legitimate web crawlers like Google are indexing those logs. It's a known fact that Google's page ranking system (and presumably others) counts the number of links to a site in determining page ranking. Maybe my link spammers are casting their seeds to the wind, hoping the behavior will increase their page ranking. That's the only theory I have at present.

I hope this topic can turn into something more than a monologue. I started it with the hope that others could share their experiences with web crawlers that apparently have no beneficial purpose in visiting our websites, and may even have malevolent intent. If interest is shown I'll tell you about a few more categories, perhaps the one I call "domain sniffers."


Bots hardly ever remember what sites they visited. They just look for sites that allow them to post comments and put links to their websites trying to get visitors. Verisign might be visiting your site because you're using one of their SSL certificates. The only time I ever saw a bot that annoyed me on my site was one that visited about 20 times a day called "spinn3r." Their homepage says they're innocent, but looking it up I saw lots of people complaining about it. I just use an htaccess file to deny access to the spinn3r bot.
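For example, one common mod_rewrite recipe for that looks like this (a sketch; the exact user-agent string spinn3r sends may differ):

# .htaccess sketch: refuse any request whose user-agent mentions spinn3r.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} spinn3r [NC]
RewriteRule .* - [F,L]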


Guest LH91325

The link spammers I'm referring to visit my combined sites 3-4 dozen times a day, usually hit my / (index page) twice with some fake referrer, then come back a few hours later and do it again with a different fake referrer. (Fake = these sites don't have links to my site.) I have no SSL. My site is non-commercial. It is one-to-many; there's no way for visitors to post anything. People read what I wrote and then they go away. That's all there is. (Edited to add: Doubtless some of my 'bot visitors are comment spammers. They visit sites looking for specific site software that allows comments to be added by guests. My site doesn't have any way to post comments, so they went away looking for other sites to comment spam.)

There are lots of fake bots out there. That's another one of my categories: fake crawlers pretending to be well known crawlers. I usually take user-agent strings at face value unless I have reason to believe otherwise, but when a Google 'bot starts visiting my site from a RU or UA IP address, I know for sure that isn't Google. I've seen fake Yahoo 'bots too. They annoyed me, since I like to track the number of times the real search engine crawlers index my site, and the fakes screw up my statistics. I have a setting in my crawler table that I can switch on, bogus_ua_detect, and it requires more stringent criteria before believing the user-agent string: either coming from the correct domain (resolved from the IP address) or from the correct CIDR if the IP address doesn't resolve (there's a sketch of the check below). It's pretty difficult to fake your IP address. For example, Googlebot almost always resolves to *.googlebot.com (except for their stealth visits, which resolve to a different domain). It would be hard for a fake bot to visit my site and have its IP address rDNS point to *.googlebot.com. They'd have to poison the DNS system--not impossible, but not worth the trouble, since most administrators probably don't even look at their logs, let alone analyze who the liars are. Currently I'm verifying only Google and Yahoo as not being impostors. If I see MSN or Bing web crawlers showing up from the wrong country I'll add them to the list. The Internet has many fake bots, and you can put any user-agent string you like in your crawler script. I've seen the same bot visit my site several times in the space of a dozen or two dozen minutes with a different user-agent string each time. This would be difficult to spot on a production site but easy on my test sites, which have essentially zero human traffic. Whenever you have any complaint about web crawler behavior, you should verify that its IP address leads to an IP block (CIDR) that is owned by the crawler's owner or is generally used by their bots. There are lots of liar bots out there pretending to be somebody else's bots.

Another thing that annoys me is domain sniffers. Here are a few examples: domaintools.com, whois.sc, sitedossier.com. They visit your site and take pictures, compile all your ownership registration information (from your domain registrar's public record), and they take reservations for anybody who wants to register your domain if you give it up or forget to renew your registration. I can't see any way a domain owner benefits from their visits. I track domain sniffers too. When they visit my / (site main index) I send them nothing but <HTML></HTML>. Let them take a picture of that! :) If they try to access any of my content pages I just send them a 404 Not Found.
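Here's roughly what that stricter check does for Google. It's a forward-confirmed reverse DNS lookup, so a poisoned PTR record can't lie its way in (the function name is invented):

<?php
// Sketch: forward-confirmed reverse DNS check for a claimed Googlebot.

function isRealGooglebot(string $ip): bool {
    $host = gethostbyaddr($ip);       // reverse lookup (PTR record)
    if ($host === false || $host === $ip) {
        return false;                 // no reverse DNS at all
    }
    if (!preg_match('/\.googlebot\.com$/i', $host)) {
        return false;                 // wrong domain entirely
    }
    // Forward-confirm: the claimed hostname must resolve back to the
    // same IP address, otherwise the PTR record could be faked.
    return gethostbyname($host) === $ip;
}

If the check fails, the visitor gets reclassified as a fake crawler no matter what its user-agent string claims.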
Of course there's nothing I can do about the domain registration data--it's public record--except that I have an anonymous registration arrangement with my domain registrar that hides my personal information (real name, address) and provides an email proxy if somebody has an official reason to contact me regarding my website.

Another type of web crawler that I just hate I simply call 'badbots'--bad robots. They often run out of shared hosting services, where anybody can rent a hosting account for a small price per month and then run a crawler script out of their account. Many of these are run by "script kiddies" who download their scripts from hacker sites. The scripts crawl the web looking for known site vulnerabilities. For example, I can always tell when there's a new WordPress vulnerability or security issue, because accesses start showing up in my log looking for some signature showing that you run the software and have the vulnerability. Take the recent "TimThumb" WordPress exploit: they try to access several scripts and images you would have if you had TimThumb installed. (It's some kind of WordPress add-on.) Currently the script kiddies are looking for /admin/spaw/spacer.gif ... If you have that file then you have SPAW installed (whatever that is, some kind of editor I think), and the vuln can lead to privilege escalation and possibly the ability to upload their own files to your site. (If you can upload PHP scripts to somebody's site you can have full control of everything!) I call sites that send me too much of this garbage 'badbot' sites. I put their domain name and/or CIDR in my web crawler database, and then when they visit my site my script logs the visit (in my bad robot log) and just goes to sleep() for 10 minutes without sending any headers or content. I presume that sometime in that period--a few seconds or minutes--their script or connection times out and disconnects. (A sketch of the signature check is at the end of this post.)

So you're asking yourself: why am I doing this? It's an experiment. It's a hobby. It's amusement. It's interesting to see Internet things happening that not very many people know much about. It's also useful because I know how the legitimate web crawlers are indexing my site: how often, what pages, who indexes my site, what kind of traffic they're sending me. I can see my real human site visitors, and their referrer strings tell me which search engine sent me the traffic; for the big ones (Google, Yahoo, MSN, Bing, etc.) I can usually see the actual search terms they used to find my site and my content. Google usually even tells me what rank my content page got. In any case I can just visit whatever search engine and plug in the search terms to find my own listing in the results. This is pretty useful SEO information. Knowing the search terms and page rank, I can give some thought to why people visit my site, and why other competing websites rank higher than my pages. I can visit the competitors' sites and compare their content to mine, then change my content with the goal of increasing my page rank and getting more traffic. What I'm doing may sound crazy to some people (in fact it probably sounds crazy), but it's working! I'm getting more real human visitor traffic every month, and my page ranks are often higher than commercial sites with similar content. I've learned a lot from my experiment. I know which search engines are helping me, which crawlers are giving me no benefit, and which might even be trying to harm me.
I know more about why real people visit my site, about how they got there, and I've learned a lot about how search engines index websites and what it takes to increase your page ranking and hopefully get more traffic.
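Spotting the vulnerability scanners is mostly pattern matching on the request path. A sketch, with an illustrative signature list (not my full one):

<?php
// Sketch: flag requests probing for known exploit signatures, like the
// TimThumb and SPAW scans described above.

$probeSignatures = [
    '/admin/spaw/spacer.gif',  // SPAW editor probe
    'timthumb.php',            // TimThumb exploit scan
    '/wp-login.php',           // WordPress probing on a non-WordPress site
];

$uri = $_SERVER['REQUEST_URI'] ?? '';
foreach ($probeSignatures as $sig) {
    if (stripos($uri, $sig) !== false) {
        // Log as a badbot, then 404 them (or hand them to the tarpit).
        http_response_code(404);
        exit;
    }
}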


Guest LH91325

Upon further research I've discovered that it's possible, even likely, that some of these bad robots I've been tracking are involved in scams like "content scraping" and "page rank hijacking." (You can Google those for more information.) I'm still researching those topics because I've found the material difficult to understand, but it's extremely important if true. So maybe it turns out that my paranoia and obsessive-compulsive monitoring of web crawlers was the smart thing to do! One thing's for sure: I've already got my scripting in place, and I can shut down any web crawler's access to my site if I wish. I already have a few dozen I've kicked out...


Guest LH91325

Well, the problem with bad robots is that they ignore anything you say in your robots.txt file. In fact they might even use it as an indication of places to look for content you don't want indexed. You can block bad robots to some extent on an Apache server with the .htaccess file, but it takes a lot of manual maintenance to keep it updated. (Although I have written PHP code that modifies the .htaccess file automatically.) The very first thing you need is some kind of log analysis utility. You can't block bad robots if you don't know which ones they are, and going through the server logs manually would be impractical for any site with enough page ranking (traffic) to be worth scraping or hijacking. In effect I've built my own log analysis utility right into my site code, and I've included the means to not only identify and monitor web crawlers but to restrict or ban any of them I want. I can ban referrers, domains, user-agents, IP addresses, IP address blocks...

I've even got bad robot detection links: links that are invisible, so ordinary human site visitors can't see them. Only web crawlers (or people using HTML-to-voice software) can see them. My robots.txt file tells web crawlers not to index the directory those links point to. If a web crawler accesses a bad robot detection link, my site automatically bans the IP address and client domain, and sends me an email telling me it snagged another bad robot! :) I also have what I call flood detection. That's when some bad robot comes on the site and tries to download the entire thing in a minute or two. Polite robots don't do that. I've got the "crawl delay" set in my robots.txt file, and most polite robots respect it. I've got a detection threshold I can adjust to any number of hits within any number of seconds, and if they exceed it I throttle them, sending them either 404s or null HTML pages, <HTML></HTML>. (I can select either; a sketch of the threshold check follows at the end of this post.)

My website is a hobby site. It serves two of my hobbies: one is the content I write about, the other is writing PHP-MySQL code and putting in amusing features like the robot detection and traffic monitoring. BTW, my content has absolutely nothing to do with programming or the Internet. I'm getting seriously annoyed at some of the antics coming out of a couple of countries, and I'm thinking I might ban the entire countries. :D (I don't get much, if any, human traffic from those countries.)
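The flood detection boils down to one counting query. A sketch, with invented table and column names, and 30 hits in 60 seconds standing in for whatever threshold you pick:

<?php
// Sketch: has this IP exceeded the hit threshold inside the window?

function isFlooding(PDO $pdo, string $ip, int $maxHits = 30, int $window = 60): bool {
    $stmt = $pdo->prepare(
        'SELECT COUNT(*) FROM access_log
          WHERE ip = ? AND hit_at > NOW() - INTERVAL ? SECOND'
    );
    $stmt->execute([$ip, $window]);
    return (int)$stmt->fetchColumn() > $maxHits;
}

// Throttle with either response, as described above:
// if (isFlooding($pdo, $_SERVER['REMOTE_ADDR'])) {
//     echo '<HTML></HTML>';        // or: http_response_code(404);
//     exit;
// }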


What I was saying is that bad robots often scan your robots.txt file for places you don't want them to go, so you can have a particular file that robots.txt tells the robot not to go to, and that file will ban the robot from your site.


Guest LH91325

Funny... That's more or less how my code works. I've got a directory that my robots.txt file says not to index. The hidden link points to that directory. If a site visitor (bad robot) tries to access that link my script bans them. Bang! Caught another bad guy! :) I've got another directory that is also disallowed for indexing, but the only place it's ever mentioned is in my robots.txt file. I've caught a few bad robots peeking to see what's there (the only way they would have ever known is to read my robots.txt file and then do the opposite of what I said) and they get banned too. I've only caught a very few doing that, maybe less than 5%. Most of the bad robots don't even bother with the robots.txt file. They just try to download the entire site as fast as possible. Either my bad robot detection links get them or my download flooding detection gets them. Robots should be polite like Google or Yahoo. I really hate rude robots! I also dislike any web crawlers that don't send me traffic. Why should I let them index my site if they don't send real people to read my pages?
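The whole trap handler is only a few lines. A sketch, with the table name, file name and email address all invented:

<?php
// trap.php -- sketch of the hidden-link honeypot handler. robots.txt
// disallows this directory and CSS hides the link, so only a robot
// (or something reading the raw HTML) ever lands here.

$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$ip  = $_SERVER['REMOTE_ADDR'];
$ua  = $_SERVER['HTTP_USER_AGENT'] ?? '(blank)';

// Ban the visitor: future requests from this IP get refused.
$stmt = $pdo->prepare('INSERT INTO banned (ip, ua, banned_at) VALUES (?, ?, NOW())');
$stmt->execute([$ip, $ua]);

// Brag to the site owner.
mail('owner@example.com', 'Snagged another bad robot',
     "Banned $ip ($ua) via the honeypot link.");

http_response_code(404);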

