How to provide ‘search this site’ functionality?

By: on July 17, 2006

We wanted to add a ‘search this site’ function to a client’s website but did not have the time to study the 200+ existing ways of doing this. Perhaps we could use the “Microsoft Indexing Service” (or “Index Server”, IS), which fits well with the software running the existing site (IIS) and can easily be extended to search within MS Office and PDF documents?

But there is a problem with using IS for this: IS can only index files on a local or remote file system; it does not crawl a website. In our case that is not good enough because the content lives in a database, and we have to follow links like ``. Moreover, we wanted to make sure that exactly the content exported through HTTP is indexed, no more, no less.

The solution we came up with works like this:

1. Use a standard webcrawler to download a copy of the site through HTTP and store it on the local filesystem of the server.
2. Use Indexing Service to index the local copy of the site.
3. Use a small hashtable for mapping the filenames returned by a query back into URLs.

This cleanly separates the webcrawl from the indexing, and the search is entirely ignorant of the (possibly heterogeneous and complicated) software architecture of the site.
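
To make steps 1 and 3 a bit more concrete, here is a minimal sketch in Python. It is not the prototype's actual code: the names `crawl`, `local_name` and `url_for_hit` are made up for illustration, and hashing the URL to get a file name is just one assumed way to cope with query strings that are not valid file names. Step 2 (pointing Indexing Service at the output directory and querying it) is left out.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from pathlib import Path, PureWindowsPath
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Default fetcher: download a page and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def local_name(url):
    """Assumed naming scheme: a stable, filesystem-safe name per URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"


def crawl(start_url, out_dir, fetch=fetch_over_http, max_pages=100):
    """Breadth-first crawl of a single host.

    Saves every fetched page under out_dir and returns the hashtable
    {local file name: original URL} needed for step 3.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    name_to_url = {}
    while queue and len(name_to_url) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        name = local_name(url)
        (out_dir / name).write_text(html, encoding="utf-8")
        name_to_url[name] = url
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return name_to_url


def url_for_hit(name_to_url, hit_path):
    """Map a file path returned by the index back to its URL (or None).

    PureWindowsPath splits on both backslashes and forward slashes, so
    this also handles Windows-style paths such as C:\\mirror\\ab12.html.
    """
    return name_to_url.get(PureWindowsPath(hit_path).name)
```

A search page would then run its query against the index, call something like `url_for_hit` on each file path in the result set, and show the URL instead of the path. In a real setup the table would also have to be persisted between the crawl and the search, for example as a small file kept outside the indexed directory.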

So far it is just a prototype, but it seems to work fine.



  1. Apache’s Lucene is also a good tool for indexing and effective for web searching.

  2. mikeb says:

    Well that’s kind of true — at least the last time I looked in-depth at Lucene, web-crawling and even extracting text from HTML were example code rather than in the core.

    Nutch (which, confusingly, uses Lucene as a library but is also a subproject of it) is a web crawler; perhaps that’s what you were referring to, Bala.

