Monday, June 12, 2006

The Google-Less Search Engine

Naturally, anyone who did what I have done over the last few weeks would be a candidate for the padded cell. Why, you may ask, would I build yet another search engine when there are already plenty of them out there?

The answer is partly "just because I can", but in truth I would never do this just for the heck of it. My current project at work implicitly requires this kind of web software, because its main thrust is to analyze content on other web sites.

I tried Google Desktop search -- it has some good aspects but it wasn't really what I wanted and was way more overhead than I wanted to expend.

Anyhoo, it has come along quite well. I can do a fuzzy search with "nonsense" clues and get a sizeable hit list back from the central DELL server (affectionately named Goomba), for which I wrote the Fuzzy Indexer, the Fuzzy Searcher, a web-server CGI interface via Apache, and an image converter (for displaying .TIF as .JPG without any specialty software on the client side).

Although I looked into server-side scripting languages such as PHP and Perl, I felt that they did not handle the problem of "fuzziness" well enough, nor the fact that some text pages originated from an OCR process and some didn't. If you know what OCR (Optical Character Recognition) does to text files, you might know what I mean. So I wrote everything in plain C.
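To give a flavor of what "fuzziness" plus OCR damage means in C, here is a minimal sketch, not the actual Fuzzy Searcher code. It assumes a plain Levenshtein edit distance computed after folding a few characters that OCR commonly confuses (0/o, 1/l, 5/s). The function names, the confusion table, and the max_edits threshold are all hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Fold a character to a canonical form: lowercase it, and map digits that
   OCR often confuses with letters onto those letters.
   This confusion table is an illustrative assumption. */
static char ocr_fold(char c)
{
    switch (c) {
    case '0': return 'o';
    case '1': return 'l';
    case '5': return 's';
    default:  return (char)tolower((unsigned char)c);
    }
}

/* Levenshtein edit distance between a and b, compared after OCR folding.
   Uses a single rolling row. Returns -1 if memory allocation fails. */
int fuzzy_distance(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);
    int *row = malloc((lb + 1) * sizeof *row);
    if (row == NULL)
        return -1;

    for (size_t j = 0; j <= lb; j++)
        row[j] = (int)j;

    for (size_t i = 1; i <= la; i++) {
        int diag = row[0];              /* previous row, column j-1 */
        row[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int up = row[j];            /* previous row, column j */
            int cost = (ocr_fold(a[i - 1]) == ocr_fold(b[j - 1])) ? 0 : 1;
            int best = diag + cost;     /* substitution or match */
            if (up + 1 < best)
                best = up + 1;          /* deletion */
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;  /* insertion */
            row[j] = best;
            diag = up;
        }
    }

    int d = row[lb];
    free(row);
    return d;
}

/* Hypothetical match rule: accept when at most max_edits edits are needed. */
int fuzzy_match(const char *a, const char *b, int max_edits)
{
    int d = fuzzy_distance(a, b);
    return d >= 0 && d <= max_edits;
}
```

With this folding, an OCR-mangled "c0mpi1er" is at distance 0 from "Compiler", while fuzzy_match("dangerous", "dangerus", 1) accepts a single dropped letter. A real indexer would need something much faster than comparing every word pair, which is one reason an index is built ahead of time.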

I indexed my entire desktop computer's collection of such data, gigabytes of text and even larger amounts of image data, including temporary internet storage of cached web pages and so forth. A typical search takes only a couple of seconds to grunt through the index and to create HTML web pages from the results, and displaying any particular .TIF image takes about another second, including converting it to a .JPG before creating another HTML page around the image.
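The "create HTML web pages from the results" step can be sketched roughly as follows. This is an assumption-laden illustration rather than the real CGI program: html_escape and write_results_page are hypothetical names, the page layout is invented, and the hit list is assumed to arrive as an array of strings. The only fixed part is the CGI convention of printing a Content-Type header followed by a blank line before the body.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Escape &, <, > and " so arbitrary text can be embedded in HTML.
   Writes a NUL-terminated result into dst.
   Returns 0 on success, -1 if dst is too small. */
int html_escape(const char *src, char *dst, size_t dstsize)
{
    assert(src != NULL && dst != NULL);
    size_t used = 0;
    if (dstsize == 0)
        return -1;

    for (; *src != '\0'; src++) {
        char one[2] = { *src, '\0' };
        const char *rep;
        switch (*src) {
        case '&': rep = "&amp;";  break;
        case '<': rep = "&lt;";   break;
        case '>': rep = "&gt;";   break;
        case '"': rep = "&quot;"; break;
        default:  rep = one;      break;
        }
        size_t n = strlen(rep);
        if (used + n >= dstsize) {      /* no room left for rep plus NUL */
            dst[used] = '\0';
            return -1;
        }
        memcpy(dst + used, rep, n);
        used += n;
    }
    dst[used] = '\0';
    return 0;
}

/* Write a complete CGI response (header, blank line, HTML body) to out.
   Returns 0 on success, -1 if the query is too long to escape. */
int write_results_page(FILE *out, const char *query,
                       const char *const hits[], size_t nhits)
{
    char buf[1024];

    if (html_escape(query, buf, sizeof buf) != 0)
        return -1;

    fputs("Content-Type: text/html\r\n\r\n", out);
    fprintf(out, "<html><head><title>Results for %s</title></head><body>\n", buf);
    fprintf(out, "<h1>Results for %s</h1>\n<ol>\n", buf);
    for (size_t i = 0; i < nhits; i++) {
        if (html_escape(hits[i], buf, sizeof buf) != 0)
            continue;                   /* skip entries too long for buf */
        fprintf(out, "<li>%s</li>\n", buf);
    }
    fputs("</ol>\n</body></html>\n", out);
    return 0;
}
```

A CGI program under Apache would call write_results_page(stdout, query, hits, nhits) once the search has produced its hit list. Escaping matters here because the query and the file names come from outside the program and end up inside markup.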

Whether or not this has any commercial use, it turned out to be quite interesting for me, since I found C source code files and various writings that I'd done through the years and forgotten about. For instance, looking for "water scorpions" came up with "warrior corps", and a search for "dangerous" came back with huge numbers of warnings in GNU C++ compiler source code, as well as a story about kangaroos.
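One plausible reason a query like "water scorpions" can land on "warrior corps" is that the two phrases share a lot of adjacent letter pairs. The sketch below scores that overlap with a Dice coefficient over character bigrams. This scoring method is an assumption for illustration only; it is not necessarily how the Fuzzy Searcher ranks hits, and bigram_dice is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Dice coefficient over character bigrams:
       2 * shared / (bigrams in a + bigrams in b)
   Each bigram of b can be matched at most once. The comparison is
   case-sensitive. Returns a value in [0, 1], or 0 if either string has
   fewer than two characters or if allocation fails. */
double bigram_dice(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);
    if (la < 2 || lb < 2)
        return 0.0;

    size_t na = la - 1, nb = lb - 1;    /* number of bigrams in each */
    char *used = calloc(nb, 1);         /* marks bigrams of b already matched */
    if (used == NULL)
        return 0.0;

    size_t shared = 0;
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            if (!used[j] && a[i] == b[j] && a[i + 1] == b[j + 1]) {
                used[j] = 1;
                shared++;
                break;
            }
        }
    }

    free(used);
    return (2.0 * (double)shared) / (double)(na + nb);
}
```

Under this measure "water scorpions" and "warrior corps" share six bigrams ("wa", "io", "or", "r ", "co", "rp") out of 14 and 12, for a score of 12/26, or about 0.46. Against "dangerous" the same query shares only "er" and scores about 0.09.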

So now I can sit in the patio with my dumb little laptop, Thunk, and peruse the infinite archives of my own accumulated electronic library -- plus I can also Google the world as I was doing already.

Onward and Upward.
