Site Spider/Crawler

So I've been digging around trying to find a software program that I can use to mine some basic information from a website. The short of it is, I've got a new job and will be managing several web sites. I need to get an idea of how each of these sites links back and forth to the others or to other areas. I also need to understand the myriad of server-side and client-side technologies being used - a sort of detailed summary of how many times a particular file extension is used in a link or form.

I've downloaded these two applications - both fall short of my requirements:

  • GSiteCrawler
  • DRKSpider

Again, basic requirements for a report/export:

a.) a count and list of unique domains and sub-domains referenced
b.) a count and list of file extensions referenced throughout the site - a href, form action, document.location, etc.

Anyone have any suggestions?
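For what it's worth, requirements a.) and b.) can be roughed out with nothing but the Python standard library. This is only a minimal sketch for a single page that has already been downloaded: the names `LinkCollector` and `summarize` and the example.com URLs are made up for illustration, it only looks at a href and form action, and it does not attempt document.location or any other JavaScript-generated link.

```python
from collections import Counter
from html.parser import HTMLParser
from posixpath import splitext
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects <a href> and <form action> values."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Map each tag of interest to the attribute that carries its URL.
        wanted = {"a": "href", "form": "action"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def summarize(html, base_url):
    """Return (host counts, extension counts) for one page of HTML."""
    collector = LinkCollector()
    collector.feed(html)
    hosts, extensions = Counter(), Counter()
    for raw in collector.urls:
        # Resolve relative links against the page's own URL.
        parts = urlparse(urljoin(base_url, raw))
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        hosts[parts.hostname] += 1
        ext = splitext(parts.path)[1].lower()
        extensions[ext or "(none)"] += 1
    return hosts, extensions


if __name__ == "__main__":
    page = (
        '<a href="/about.php">About</a>'
        '<a href="http://shop.example.com/cart.aspx?id=1">Cart</a>'
        '<form action="contact.php"></form>'
    )
    hosts, extensions = summarize(page, "http://www.example.com/index.html")
    print(hosts.most_common())
    print(extensions.most_common())
```

To cover a whole site you would still need to fetch every page and merge the counters, which is the part a crawler like the two above normally handles.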

Don't tell me... the sites aren't well-formed XHTML, right? If they were, writing your own bot would be almost trivial (except for document.location... analyzing links that originate in JavaScript would ###### big time).
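To illustrate the "almost trivial" part: with well-formed XHTML, a plain XML parser is enough to pull out the links. A rough sketch, assuming the pages declare the standard XHTML namespace and use no undefined entities; `extract_links` is a made-up name, and anything malformed simply raises `ParseError`.

```python
import xml.etree.ElementTree as ET

# Standard XHTML namespace, in ElementTree's {uri}tag notation.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_links(xhtml_text):
    """Return href and action values from a well-formed XHTML document."""
    root = ET.fromstring(xhtml_text)  # raises ET.ParseError if not well formed
    links = [a.get("href") for a in root.iter(XHTML_NS + "a") if a.get("href")]
    links += [
        form.get("action")
        for form in root.iter(XHTML_NS + "form")
        if form.get("action")
    ]
    return links


if __name__ == "__main__":
    doc = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
        '<a href="a.php">x</a><form action="send.asp"></form>'
        "</body></html>"
    )
    print(extract_links(doc))
```

The JavaScript side (document.location and friends) is not touched here at all.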

That is funny - a corporate site done in well-formatted [anything]. lol

My analysis thus far has:

  • 2,328 pages (not that bad)
  • 523 web forms
  • 206 mailto references (which equals spam)
  • 75 server side mail functions
  • 2,484 orphaned files (lol - but understand that includes things like .htaccess, 'cause nothing links to them)
  • 1,108 broken links
  • 8 public application integration points (where we use a vendor with an associated domain)
  • 11 private application integration points (where we use a vendor with an associated domain)
  • 2 databases with 34 tables between them

Needless to say, I wasn't expecting to see well-formatted pages - lol.
