
Search Engine


dcole.ath.cx

I think I must have visions of the future, or someone hacked my computer and deleted all reference to this post, then hacked invisionzone.com and deleted the posts from this forum, because I wrote this once and submitted it, then came back to it... Or it's been a long day.

Anyway, I'm making a search engine and I'm looking for tips on what to put in it. Like a "(Good/Bad)" next to every result, so users can have input, and other tricks to make good results rise to the top.

I know it's not going to be the next Google; my server could only hold 4 million results. The script is already halfway done, so spare me the "you can't make a search engine" replies. I already have the crawler done; I just have to make the search box, the term-fetching script, and the reindex script, which should be the easiest of the scripts I've been working on. I don't know how many pages it will support, because it doesn't store data like any other search engine that's open for anyone to see, but the number will be more than 10,000 for sure.

What I'm planning:
- Operators: - + ( ) OR || AND &&
- A "Didn't find what you're looking for?" link and "(Good/Bad)" voting
- Maybe a suggest tool (answers come up as you type)

What I'm not doing:
- cache, url:, define:, images, video
- A preview box (by "don't think", I mean not)

What else should I have or not have? Any tips at all would be good!


Google is based around the theory of backlinks, where a page's importance increases in proportion to how many sites link to it. Additionally, if the term you are searching for is in the linked text itself:

<a href="http://somesite.com">search keywords</a>

then that really boosts the rank. The more sites that link to a specific page with keywords in the link, the more likely it is that the page they link to has a lot to do with those keywords.

The original name of the project that Larry Page and Sergey Brin started was "BackRub", and it had a really creepy logo. Their server was built from red, yellow, blue, and green Lego blocks. Is that a feature? I can't seem to find a picture of the server; if I do, I'll post it.
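The anchor-text idea above can be sketched in a few lines of PHP. This is purely illustrative, not Google's actual algorithm: the function name and the weights are made up.

```php
<?php
// Hypothetical backlink scoring: every inbound link counts for something,
// and a link whose anchor text contains the keyword counts for much more.

function backlink_score(array $anchorTexts, string $keyword): int
{
    $score = 0;
    foreach ($anchorTexts as $anchorText) {
        $score += 1; // any backlink adds a little importance
        if (stripos($anchorText, $keyword) !== false) {
            $score += 5; // keyword in the link text adds a lot (weight is invented)
        }
    }
    return $score;
}

$links = ['click here', 'great search keywords resource', 'search keywords'];
echo backlink_score($links, 'search keywords'); // 13
```

Real PageRank also weights each backlink by the importance of the linking page itself, which this sketch ignores.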


Does your search engine just look for keywords in pages, or does it collect other information to try to determine which of the pages in the list of all possible matches is the most relevant? Does it just total up the number of keywords, or does it use information like links from other sites, or the placement of keywords in different elements (a keyword in an h1 should score higher than a keyword in a span)?

If all it does is look for and total up keywords, there's not much we can say about that. You haven't shown any source to critique, and you haven't explained how it works other than to say you don't store any data, and if that's what you're set on doing, then that's about it. But if you are trying to build something that determines the relevancy of a page, with keywords being one part of that, then I was trying to show how Google does it to give you an idea. Before the BackRub project, search engines just looked for keywords and totalled them up; the page with the most keywords won. You can see how Google has changed search, because they try to determine how relevant a page is, not just whether or not it has keywords.

I guess I was trying to suggest how to rank pages, not just search engine features. If the search engine doesn't give relevant results, it doesn't matter which features it has.
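The element-weighting idea (an h1 keyword scoring higher than a span keyword) could look something like this rough PHP sketch. The tag list and weights are invented, and a real indexer would use a proper HTML parser rather than regexes:

```php
<?php
// Illustrative placement scoring: a keyword is worth more in some
// elements than others. Weights here are arbitrary, not tuned values.

function placement_score(string $html, string $keyword): int
{
    $weights = ['h1' => 10, 'title' => 8, 'b' => 3, 'p' => 1, 'span' => 1];
    $score = 0;
    foreach ($weights as $tag => $weight) {
        if (preg_match_all("#<{$tag}[^>]*>(.*?)</{$tag}>#is", $html, $m)) {
            foreach ($m[1] as $text) {
                // each occurrence inside this element earns the element's weight
                $score += $weight * substr_count(strtolower($text), strtolower($keyword));
            }
        }
    }
    return $score;
}

$html = '<h1>php tips</h1><p>php is fun, php is easy</p>';
echo placement_score($html, 'php'); // 10 (h1) + 2 (p) = 12
```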


The search engine looks for keywords in special tags like title, meta, b, u, i, p... It also lets users submit whether a result was good or bad, but still resists spam!

I do include backlinking. Backlinking is something that is very easy to do if you think about it. When I first thought about it I was overthinking it and was going to enlarge the DB just to fit it in, but once I thought it through, it became much easier.

I thought a lot about bad searches and how spammers get in there, so I dock points for spam-like behavior. I don't do a lot of talking at school and I take in everything, even if it's not what's being taught, and I based the search engine on my teachers: it grades web pages like teachers grade assignments. It really grades like that blue and orange paper on all the walls; I can't remember right now exactly what it says, but I get the idea... I'll have to get back to you on that one. It has to do with how you write, where things go, how the formatting should look, how things flow...
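dcole doesn't say how the Good/Bad votes resist spam. One common approach, purely an assumption on my part and not his implementation, is to count each voter (for example, each IP) only once and then damp the net total logarithmically, so a flood of votes from one side moves the rank less and less:

```php
<?php
// Hypothetical vote damping: collapse duplicate voters via array keys,
// then shrink the net vote with a logarithm so piles of votes saturate.

function vote_boost(array $votesByVoter): float
{
    // $votesByVoter: voter id (e.g. IP) => +1 for good, -1 for bad.
    // Using the voter id as the key means repeat votes overwrite, not add.
    $net = array_sum($votesByVoter);
    $sign = $net <=> 0;                 // -1, 0, or +1
    return $sign * log(1 + abs($net));  // damped magnitude
}

$votes = ['1.2.3.4' => 1, '5.6.7.8' => 1, '9.9.9.9' => -1];
echo vote_boost($votes); // log(2) ~ 0.693
```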


Backlinking is easy enough to do. What I don't understand is your claim that you don't store any data. You have to be storing data, for example if a page was marked as good or bad.

Here is a paper by Sergey Brin and Larry Page from before 2000 about their work on the Google project: http://www-db.stanford.edu/~backrub/google.html

If you look at their diagram of how their search engine works, the data repository is the single largest piece. Apparently they had indexed 24 million pages for a total uncompressed size of around 150GB, compressed to just over 50GB.


I do store data... about 60-80 bytes per URL and term. (That number is an average guess; it could really range from 16 bytes (or 9 bytes, I can't remember) all the way up.) By "it doesn't store data like any other search engine that's open for anyone to see", I meant I was looking to compare my SE to someone else's small SE.

Could you show me how to do compression in PHP? I have tried zlib and other things but no luck. The DB where the data will be stored doesn't compress any of the data itself. If I could compress at a 3-to-1 ratio like Google does, then my server would hold 12,000,000 pages (if the search engine could stand that).

Also, what are barrels, and what is a repository? Sometimes people call something by a different name so people can't copy them. I thought the repository was a temporary cache and data holder, and the barrels were the DB.


Well, I'm finishing the crawler and ranker... all that will be left will be the user interface and the fetcher. I will have to debug my code... I hate debugging code! So I hope I'm perfect.

I believe I will have a good system with the user input and backlinking. I think people need to stop saying that making a search engine is hard; it really makes me overthink a lot of things... Can't wait to get the script done.


"Could you show me how to do compression in PHP?"
You should be able to use either gzcompress and gzuncompress, or gzdeflate and gzinflate; look those up on php.net. Is there a reason you can't use those?
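For example, a quick round-trip with gzcompress/gzuncompress (both part of PHP's zlib extension), compressing in PHP before the data ever reaches the database:

```php
<?php
// Compress a page body with zlib, then restore it and verify the round-trip.

$page = str_repeat('the quick brown fox ', 500); // 10,000 bytes of sample text

$packed   = gzcompress($page, 9);   // level 9 = best compression, slowest
$restored = gzuncompress($packed);

echo strlen($page) . ' -> ' . strlen($packed) . " bytes\n"; // much smaller
var_dump($restored === $page);                              // bool(true)
```

Highly repetitive text like this compresses far better than 3 to 1; real HTML typically lands somewhere between that extreme and Google's quoted ratio.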
"Also, what are barrels, and what is a repository? Sometimes people call something by a different name so people can't copy them."
Well, they aren't worried about people copying their technology; they have their PageRank algorithms patented. And the project they were working on was, I believe, for their thesis, so they wrote plenty of papers explaining how they do everything. Read that paper and see if you can pick anything up.

I have the crawler debugged... but I will have to debug the ranking script, and it's about 10 times larger than the crawler, so I'm not going to be happy about that. I'm going to work, so it's not going to get done today. But the good news is that I'm only making stupid mistakes like forgetting a space and other junk... nothing too hard.


Do you write everything first, and then test it? No wonder you hate debugging... it's much easier to test as you go. What if you run into a semantic bug that your application can't work around, and have to redesign everything?


You can set up a scheduled job on the server to execute a given program, such as a PHP script, periodically. As for deciding which pages: I would be interested to hear how dcole starts out, but you start the crawler somewhere, such as the Google news page, and start finding links on the page. You follow each link, decide whether to index that page, and look for more links there. Rinse and repeat. Or, if you already have an index built up, you can just start at different pages in your index and re-index everything. But you won't get a whole lot of new stuff unless the pages change a lot.
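That loop might look roughly like this in PHP. Everything here is illustrative, not dcole's code: the fetcher is passed in as a callable so the sketch stays testable, link extraction uses a crude regex, and a real crawler would also need politeness delays, robots.txt checks, and smarter URL normalization:

```php
<?php
// Minimal crawl loop sketch: pull a URL off the todo list, fetch it,
// harvest its links, and queue them. $fetch stands in for fopen/HTTP.

function extract_links(string $html): array
{
    preg_match_all('#href="(https?://[^"]+)"#i', $html, $m);
    return array_unique($m[1]);
}

function crawl(array $todo, callable $fetch, int $limit): array
{
    $seen = [];
    while ($todo && count($seen) < $limit) {
        $url = array_shift($todo);
        if (isset($seen[$url])) {
            continue;              // already visited this URL
        }
        $seen[$url] = true;
        $html = $fetch($url);
        if ($html === false) {
            continue;              // fetch failed; move on
        }
        // index_page($url, $html); // hypothetical hook for the ranker
        foreach (extract_links($html) as $link) {
            $todo[] = $link;       // rinse and repeat on discovered links
        }
    }
    return array_keys($seen);      // the URLs that were crawled
}
```

With a real fetcher you would pass something like `function ($u) { return @file_get_contents($u); }` as `$fetch`.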


Well, my search engine starts out with a todo list; it takes some URLs and gets a cached copy of each page using fopen. Then another script takes the cache, ranks it, pulls out all the URLs to other pages, and adds them to the todo list. If the todo list ever runs low, it goes to a redo script that starts reindexing the DB.

I have a system set up so that if the script fails during ranking or reindexing, it can start again from the last page that was fully processed. It also allows duplicate scripts to work on the same thing, and it allows the DB to be reindexed evenly.

I was just thinking, and my script does a good job of stopping "Google bombing". I mean, you can't just bomb any page with any term.
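The "start again from the last page that was fully processed" idea could be sketched like this. A file-based checkpoint is used here purely for illustration; dcole presumably keeps this state in the DB, and the function names are made up:

```php
<?php
// Hypothetical checkpoint scheme: record the id of the last fully ranked
// page so a crashed (or duplicate) worker resumes instead of restarting.

function next_unprocessed(array $todoIds, string $checkpointFile): ?int
{
    $last = is_file($checkpointFile) ? (int) file_get_contents($checkpointFile) : 0;
    foreach ($todoIds as $id) {
        if ($id > $last) {
            return $id;        // first page not yet fully processed
        }
    }
    return null;               // todo list exhausted: time to reindex
}

function mark_done(int $id, string $checkpointFile): void
{
    // LOCK_EX keeps two workers from interleaving their writes
    file_put_contents($checkpointFile, (string) $id, LOCK_EX);
}
```

Only writing the checkpoint *after* a page is fully ranked is what makes a mid-page crash safe: the worst case is re-ranking one page, never skipping one.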


I found out why small search engines use the bad method they do... GRR... Their method is slow but easy; my method is faster but harder. The way I'm doing it means I get to go through 64 TIMES less data than the other, more popular way. (Note: this is an average guess. The number will change with the DB size and the frequency of the term on the internet.) So if we're looking in a 150GB DB, they look through all 150GB, but I would look in 2GB of that 150GB... or, if the words were unpopular on the internet, I might look in only 0.06GB of that 150GB DB.

So now you're wondering why everyone doesn't use my method... well, as you search for more words, mine becomes exponentially harder on my part. Right now my search engine will max out at 4 words. I may post that part of the code and let people come up with the next word if they want to... you have to come up with roughly (n^2) - ((1/2)*n) ways of matching each term. But I think a math formula I learned a couple of weeks ago could make this easier... maybe. It's the formula for writing out the longhand way of finding nCr or nPr, or whatever it's called.
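What dcole seems to be describing sounds like an inverted index: instead of scanning every stored page for every query, you fetch only the posting list for each search term and intersect the lists. That is why the data touched scales with term frequency rather than DB size. A sketch with made-up data:

```php
<?php
// Toy inverted index: each term maps to a sorted list of page ids that
// contain it. A multi-word query only touches these short lists.

$index = [
    'php'    => [1, 2, 5, 9],
    'search' => [2, 3, 5, 8],
    'engine' => [2, 5, 7],
];

function match_all(array $index, array $terms): array
{
    $result = null;
    foreach ($terms as $term) {
        $postings = isset($index[$term]) ? $index[$term] : [];
        $result = ($result === null)
            ? $postings                                       // first term seeds the result
            : array_values(array_intersect($result, $postings)); // shrink it per term
    }
    return $result === null ? [] : $result;
}

print_r(match_all($index, ['php', 'search', 'engine'])); // pages 2 and 5
```

Intersecting list-by-list like this handles n query terms in n passes, rather than pairing every term against every other, which may be a way around the roughly n-squared matching cases mentioned above.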

