
Site Spider/Crawler


Skemcin


So I've been digging around trying to find a software program I can use to mine some basic information from a website. The short of it is, I've got a new job and will be managing several web sites. I need to get an idea of how each of these sites links back and forth to the others or to other areas. And I need to understand the myriad of server- and client-side technologies being used - a sort of detailed summary of how many times a particular file extension is used in a link or form.

I've downloaded these two applications - both falling short of my requirements:

  • GSiteCrawler
  • DRKSpider

Again, basic requirements for a report/export:

a.) a count and list of unique domains and sub-domains referenced
b.) a count and list of file extensions referenced throughout the site - a href, form action, document.location, etc.

Anyone have any suggestions?
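To give a rough idea of the kind of report I'm after, here is a minimal single-page audit sketch in Python - assuming the requests and BeautifulSoup libraries; the start URL, the attribute list, and the extension-matching regex are placeholders, not a finished tool:

```python
import re
from collections import Counter
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "http://www.example.com/"  # placeholder - your site here


def audit_page(url):
    """Fetch one page and tally referenced domains and file extensions."""
    domains = Counter()
    extensions = Counter()

    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Collect link-like attributes: a href, form action, img/script src, etc.
    refs = []
    for tag, attr in (("a", "href"), ("form", "action"),
                      ("img", "src"), ("script", "src"), ("link", "href")):
        refs += [el[attr] for el in soup.find_all(tag) if el.has_attr(attr)]

    # document.location only shows up inside script text, so fall back to a
    # crude regex over the raw HTML for those references.
    refs += re.findall(r"document\.location(?:\.href)?\s*=\s*['\"]([^'\"]+)", html)

    for ref in refs:
        absolute = urljoin(url, ref)
        parsed = urlparse(absolute)
        if parsed.netloc:
            domains[parsed.netloc] += 1
        ext = re.search(r"\.([A-Za-z0-9]{1,5})$", parsed.path)
        if ext:
            extensions[ext.group(1).lower()] += 1

    return domains, extensions


if __name__ == "__main__":
    doms, exts = audit_page(START_URL)
    print("Domains:", doms.most_common())
    print("Extensions:", exts.most_common())
```

Run against one page, that covers requirements a.) and b.); aggregating the two Counters across every page of a crawl would give the site-wide report.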


Don't tell me... the sites are not well-formed XHTML, right? If they were, writing your own bot would be almost trivial (except the document.location part... analyzing JavaScript-originating code would ###### big time).
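The "almost trivial" part would look roughly like this - a minimal breadth-first crawl sketch in Python (requests and BeautifulSoup assumed, start_url hypothetical), which only holds up when the markup actually parses cleanly:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_pages=500):
    """Breadth-first walk of same-domain links; returns the set of URLs seen."""
    root = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # broken link - worth tallying in a real report
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # drop fragments
            if urlparse(link).netloc == root and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

Feed each visited page through whatever per-page tally you like and you have the report. But links buried in JavaScript never make it into the queue, which is exactly the problem.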


That is funny - a corporate site done in well-formatted [anything]. lol

My analysis thus far has:

  • 2,328 pages (not that bad)
  • 523 web forms
  • 206 mailto references (which equals spam)
  • 75 server side mail functions
  • 2,484 orphaned files (lol - but understand that includes things like .htaccess, 'cause they're never linked to)
  • 1,108 broken links
  • 8 public application integration points (where we use a vendor with an associated domain)
  • 11 private application integration points (where we use a vendor with an associated domain)
  • 2 databases with 34 tables between them

Needless to say, I wasn't expecting to see well-formatted pages - lol.


