We wanted to add a ‘search this site’ function to a client’s website but did not have the time to study the 200+ existing ways of doing this. One candidate was the “Microsoft Indexing Service” (also known as “Index Server”, IS): it fits well with the software already running the site (IIS) and can easily be extended to search within MS Office and PDF documents.
But there is a problem with using IS for this: IS can only index files on a local or remote file system; it does not crawl a website. In our case that is not good enough, because the content lives in a database and we have to follow links like `http://mysite.com?page=42`. Moreover, we wanted to make sure that exactly the content exported through HTTP is indexed, no more and no less.
The solution we came up with works like this:
1. Use a standard webcrawler to download a copy of the site through HTTP and store it in the local filesystem of the server.
2. Use Indexing Service to index the local copy of the site.
3. Use a small hashtable for mapping the filenames returned by a query back into URLs.
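To make step 3 concrete, here is a minimal Python sketch of the hashtable idea. All names (the crawler’s local-file naming scheme, the function names) are illustrative assumptions, not the actual implementation: the map is filled in as the crawler saves each page, and later used to translate the local filenames returned by an Indexing Service query back into the original URLs.

```python
def make_local_name(url, index):
    # Hypothetical naming scheme: the crawler stores each downloaded
    # page under a synthetic local filename. The real scheme depends
    # on the crawler used.
    return f"page{index:04d}.html"

def build_url_map(urls):
    """Record local-filename -> URL pairs as the crawl saves each page."""
    return {make_local_name(u, i): u for i, u in enumerate(urls)}

def to_urls(result_filenames, url_map):
    """Translate query hits (local filenames) back into site URLs."""
    return [url_map[name] for name in result_filenames if name in url_map]

# Example: two crawled pages, then a query that matched the second one.
crawled = ["http://mysite.com?page=42", "http://mysite.com?page=7"]
url_map = build_url_map(crawled)
print(to_urls(["page0001.html"], url_map))
# → ['http://mysite.com?page=7']
```

Because the map is rebuilt on every crawl, stale filenames from a previous crawl simply fail to resolve and are dropped from the results.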
This cleanly separates the webcrawl from the indexing, and the search remains entirely ignorant of the (possibly heterogeneous and complicated) software architecture of the site.
So far it is just a prototype, but it seems to work fine.