Jump to content

Building a searchable database?


Man In Tan

Recommended Posts

I'm trying to add search functionality to my website. To do this, I have my spider crawl my site, and make a list of how many times each word appears on each page. It's a bit crude, but I'm not an expert on search-ability, so it's the best I can come up with for now. However, when it comes to saving the crawl data to the database, I have a bit of a problem. My original plan was to use an SQL table, with a column for every word, and a row for every page. However, after looking at the data collected, I see that is impractical. After converting everything to lower case, to emulate case-insensitivity, the spider found 5639 unique words, most of which don't appear on any given page. Multiply that by over 100 pages and counting, and we have a lot of zeros in the database, especially now that I've started adding a weblog entry each day, on a separate page. What would be a good way to store this kind of data, in a way that can be easily searched over for user queries? I'm sure there's a good answer, I'm just missing it. Thanks for the help!

Link to comment
Share on other sites

Collapse your thinking into fewer categories (you obviously don't need a page number for each word) and prioritize words (some words need page numbers even if most of them are zeros). That's just for starters. Accounting for phrases will be next. The challenge will be to associate the frequencies with user behavior connected to dollar signs. I've been the the direct marketing business for thirty years. We've been doing similar kinds of things for decades. It never ends. EDIT: Keep checking back. You're heading in the right direction.

Edited by niche
  • Like 1
Link to comment
Share on other sites

Guest So Called

Just put all of your pages in your database. Each row has information about one page including title, URL, contents... Then write your search using SQL. That's how my website works, and additionally I serve the pages out of the SQL table rather than having flat HTML files.

Link to comment
Share on other sites

Collapse your thinking into fewer categories (you obviously don't need a page number for each word) and prioritize words (some words need page numbers even if most of them are zeros). That's just for starters. Accounting for phrases will be next. The challenge will be to associate the frequencies with user behavior connected to dollar signs. I've been the the direct marketing business for thirty years. We've been doing similar kinds of things for decades. It never ends. EDIT: Keep checking back. You're heading in the right direction.
It sounds like I have quite the task ahead of me! I was hoping to reprogram my sitemap auto-builder to set up the search index as well, but it looks like I still need to do a lot of work and planning by hand. I'll have to see how much I can get done between classes.
Just put all of your pages in your database. Each row has information about one page including title, URL, contents... Then write your search using SQL. That's how my website works, and additionally I serve the pages out of the SQL table rather than having flat HTML files.
I'm a bit of a small site right now, so I don't think I have the space to store all the pages in the database and in files. I can't serve pages strictly from the database, because I use dynamic pages, so they need to be rebuilt when they load. This seems like a great option for some websites, but not the one I'm working on at the moment. Also, while I'm sure there's a way, I don't know how to search for words in a database that contains the whole page, without loading every row, which would clearly not be an efficient way to do things. EDIT: Oh, yeah. Where are my manners? Thank you both for your time and ideas! Edited by Man In Tan
Link to comment
Share on other sites

Guest So Called
I'm a bit of a small site right now, so I don't think I have the space to store all the pages in the database and in files. I can't serve pages strictly from the database, because I use dynamic pages, so they need to be rebuilt when they load. This seems like a great option for some websites, but not the one I'm working on at the moment. Also, while I'm sure there's a way, I don't know how to search for words in a database that contains the whole page, without loading every row, which would clearly not be an efficient way to do things.
You don't store all the pates in the database AND in files. Just put the content in the database. Serve the pages out of that. It works particularly well with smaller sites. You have the added benefit of being able to have a single header and footer (also stored in the database) and store your content pages without header/footer, dynamically add them together when a page is requested. Otherwise have an exclusion list (you don't want to index words like "a, the, in, and, or, ...") and process each page into an entry in your database after excluding these words. You can have a count in the database, word and number of appearances, like this: "store=3, header=2, footer=2, database=4, content=3, requested=1, dyamically=1"). (I'm generating that from a paragraph above.) Then do SQL search for search terms, you'll have a list of pages with hits, and you can look up each search term in the rows with hits and count the target word to calculate relevance. I used to do my site search sort of like that when I was back on PHP and flat files. Now everything is in the database and I just search all the pages. So what if it takes 3-4 seconds of SQL accesses?
Link to comment
Share on other sites

I just said I need the file version to exist. The pages are dynamic. They need to be rebuilt when they are accessed. Having them in the database, but not in the files does not work in my case. For static XHTML that may be a good idea, but these aren't static. Also, my pages load in less than one second right now, so slowing down to three or four seconds doesn't seem to be a good option.

Link to comment
Share on other sites

I don't think 3 or 4 secs for great content is a deal killer.

Link to comment
Share on other sites

Guest So Called

My content pages dynamically generated from content stored in MySQL database serve up typically in 100-200 ms. Searches don't take noticeably longer.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...