web crawler


aspnetguy

Most are going to be programmed in C++, Python, or something similar. They basically download a page, pull out the links, and follow them, recording content and the relationships between pages in a database along the way. Most are set up to recursively search folders: if one found aspnetguy.com/1/2/3/code.asp, it would also look up aspnetguy.com/1/2/3/, aspnetguy.com/1/2/, aspnetguy.com/1/, and aspnetguy.com/ to look for hidden or "forgotten" pages.

All in all, you could make a pretty crude one in any server-side language, but a standalone executable is better for the sake of resource consumption. You're just downloading a file and parsing through it. The logic used for parsing and then databasing is where the genius comes in.
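To make the two steps above concrete, here is a minimal Python sketch, not anyone's actual crawler. It assumes the page has already been downloaded into a string. The names `LinkExtractor`, `extract_links`, and `parent_folders` are made up for illustration; only the standard library's `html.parser` and `urllib.parse` are used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin turns relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs linked from an already-downloaded page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def parent_folders(url):
    """List every folder above the page, deepest first, ending at the site root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    folders = []
    # Drop the last segment (the page itself), then peel off one folder at a time.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if not path.endswith("/"):
            path += "/"
        folders.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return folders
```

Calling `parent_folders("http://aspnetguy.com/1/2/3/code.asp")` gives the four folder URLs from the example, from `/1/2/3/` down to the site root. A real crawler would also need to handle other tags with URLs (frames, images, redirects), which this sketch ignores.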

OK, I got a good start on it. My crawler gets a response stream from the requested URL, and I can read that into HTML and display it in a textarea.

Now on to getting it to follow links, managing how many levels it goes into a domain, and handling transfers to outside domains.

I can see how this could grow exponentially very quickly without any limits on it. You would want a separate server or servers just to do your crawling... if you were big like Google or Yahoo...
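For the "how many levels" and "outside domains" part, one possible shape is a breadth-first crawl with a depth cap and a same-host check. This is only a sketch in Python rather than C#, and it makes some assumptions: `get_links` is a hypothetical callback that returns the absolute links on a page, and `FAKE_WEB` is a made-up in-memory site so the example runs without touching the network.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(start_url, get_links, max_depth=2, follow_external=False):
    """Breadth-first crawl; returns {url: depth at which it was first found}."""
    home = urlsplit(start_url).netloc
    found = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = found[url]
        if depth >= max_depth:
            continue  # deep enough: keep the page but do not expand its links
        for link in get_links(url):
            if link in found:
                continue  # already queued or visited
            if not follow_external and urlsplit(link).netloc != home:
                continue  # stay inside the starting domain
            found[link] = depth + 1
            queue.append(link)
    return found


# Made-up site used only for this demonstration.
FAKE_WEB = {
    "http://a.test/": ["http://a.test/1", "http://b.test/"],
    "http://a.test/1": ["http://a.test/2", "http://a.test/"],
    "http://a.test/2": ["http://a.test/3"],
}


def fake_get_links(url):
    return FAKE_WEB.get(url, [])


print(crawl("http://a.test/", fake_get_links, max_depth=2))
```

With `max_depth=2` the crawl stops before `http://a.test/3` and never leaves `a.test`. Passing `follow_external=True` lets it wander to `b.test` as well, which is where the growth gets out of hand.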

One thing you might want to do is put a kill timer on it to make sure it doesn't run indefinitely. With some of the search algorithms we covered in school, the searcher would keep a counter of how many times it recurses and would just stop itself when it reached a certain value. It would probably also be good to have a counter per page, so that if it sees a certain page more than a certain number of times it stops following links there, because it is probably in a loop. The limit for that might only be once.

As far as languages go, the ideal language would probably be C. The most difficult part there would be the HTTP wrapper to get information off the web, but it would be lightning fast. You wouldn't even necessarily need to use a database; you could just write to files. But that would probably eat up space at a shocking rate for plain text. The same goes for a database, though, which is why Google compresses everything.
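The three safeguards mentioned above (kill timer, recursion counter, per-page visit limit) could be bundled together roughly like this. It is a Python sketch under assumptions: `CrawlLimits` and `get_links` are names invented for illustration, and the default limits are arbitrary.

```python
import time
from collections import Counter


class CrawlLimits:
    """Bundle the three stop conditions: time, depth, and repeat visits."""

    def __init__(self, max_depth=5, max_seconds=30.0, max_visits_per_page=1):
        self.max_depth = max_depth
        self.deadline = time.monotonic() + max_seconds  # the kill timer
        self.max_visits_per_page = max_visits_per_page
        self.visits = Counter()

    def allow(self, url, depth):
        """Return True and count the visit if the page may be crawled."""
        if time.monotonic() >= self.deadline:
            return False  # ran out of time
        if depth > self.max_depth:
            return False  # recursed too many levels
        if self.visits[url] >= self.max_visits_per_page:
            return False  # seen too often, probably a loop
        self.visits[url] += 1
        return True


def crawl(url, get_links, limits, depth=0):
    """Recursive crawl that stops as soon as any limit says no."""
    if not limits.allow(url, depth):
        return
    for link in get_links(url):
        crawl(link, get_links, limits, depth + 1)
```

With `max_visits_per_page=1`, two pages that link to each other are each visited once and the recursion ends there, which is the "limit might only be once" case.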

My search engine is written in PHP... but I wasn't trying to make it large scale, fast, or space saving... just to learn more PHP.

I made my search engine so it only crawls pages on a list, and then it removes copies of the same URL... so then it can only rescan the same URL 100 levels in.
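Removing copies of the same URL works better if the URLs are tidied up first, since the same page can be written several ways. Here is a rough Python sketch of that idea (not the PHP code described above). The rules in `normalize` are assumptions about what should count as "the same URL", and both function names are made up.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to one canonical spelling (simplified rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()  # host names are case-insensitive
    path = parts.path or "/"     # treat "http://x" and "http://x/" alike
    # Drop the fragment: it names a spot inside the page, not a different page.
    return urlunsplit((scheme, host, path, parts.query, ""))


def dedupe(urls):
    """Return the normalized URLs with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

This keeps the query string, because on sites like forums `?id=1` and `?id=2` are usually different pages.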

I agree C would probably be the fastest language to use, but I don't know it that well, so I decided to use C# instead.
C# won't be that bad, and is probably more practical than C. If you were on the level of Google, there would be enormous savings to using a slightly more optimized application, but for all intents and purposes C# should suit you fine. There's a little overhead with the .NET framework, but it will still be a lot faster than a scripting language or other interpreted language.

Would it be worth it for a would-be big search engine to write its own programming language, made specifically for search engines, or would it be better to use a premade one like C?

It would be more work, but it would really be optimized!

In this day and age, it seems like most languages that would fill a particular niche have already been created. If you want a language that is optimized for quickly finding data, that language is SQL. C allows the programmer to optimize the code to create very fast, small applications. That's not to say that all C code is fast; a bad programmer can just as easily create a slow C program as a slow Java program. But C is less forgiving: a poorly designed program is much more prone to memory leaks or core dumps than one written in a higher-level language with things like garbage collection. Using things like pointers or process forking improperly can bring down an entire primitive OS pretty quickly.

Of course, if all you want is speed, at the expense of readability and being able to understand the code easily, then look no further than assembly.
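As a rough illustration of leaving the "finding data" part to SQL, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table layout (`pages`, `links`) and the helper names are assumptions made for this example, not a recommended schema for a real search engine.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (url TEXT PRIMARY KEY, body TEXT);
    CREATE TABLE links (src TEXT, dst TEXT, PRIMARY KEY (src, dst));
""")


def save_page(url, body, links):
    """Store a page's text and the pages it links to."""
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, body))
    conn.executemany(
        "INSERT OR IGNORE INTO links VALUES (?, ?)",
        [(url, dst) for dst in links],
    )
    conn.commit()


def pages_linking_to(url):
    """Let SQL find every page that links to the given URL."""
    rows = conn.execute(
        "SELECT src FROM links WHERE dst = ? ORDER BY src", (url,)
    )
    return [row[0] for row in rows]


def search(word):
    """Crude text search; a real engine would use a proper index."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE body LIKE ? ORDER BY url", (f"%{word}%",)
    )
    return [row[0] for row in rows]
```

The crawler only has to call `save_page` as it goes; questions like "who links here?" become one-line queries instead of hand-written loops.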

I actually looked at a beginner tutorial for assembly earlier today. I think C/C++ is as low a level as I am willing to go :). Overall my C# version will probably be faster than if I used C, since I am still really new to C and would have a zillion memory leaks.

That is my intention. I have been documenting my process a bit on this thread and will put all updates here.

The last post is a bit outdated, but I hope to update it today or tomorrow.

So far this has been a really good challenge and a fun project for me.

http://aspnetguy.com/forums/thread/19.aspx

Oh sweet, looking nice! I have one question though: I have never used Java or the C languages (or any compiled language) to actually create and run a real program from the web, so how would this be done exactly? Isn't the cgi-bin where you store these types of scripts?

Well, that is the beauty of .NET. I am writing a "class", not just a web page. The class, when it is done, will be generic enough that I can plug it into a web form and deploy it on the web, or drop it into a Windows project and make an exe (which I think I will do), so it will be multi-functional.

Isn't the cgi-bin where you store these types of scripts?
Just as a point of terminology, a script is a program written using a scripting language, such as PHP, Perl, or Python. Compiled programs are generally just referred to as applications.