Wednesday, October 1, 2008

Integrating Notes content with Google Enterprise Search

PROGRAMMING POWER

By Colin Neale

I've been playing around with search now for a number of years. Just over a year ago, I was talking to someone about search appliances. An appliance is an out-of-the-box search solution that comes with all the hardware, OS and software preconfigured and ready to go. We were discussing the Google Search Appliance (or GSA, shown in Figure A), as I was interested to find out how it handled Lotus Notes content.

FIGURE A

The entry-level Google Search Appliance GB-1001 is quite yellow.

Like many systems, the GSA will crawl Notes content over HTTP. However, not all Notes databases are Web enabled, and even where they are, crawling over HTTP can be a hit-and-miss way of getting at the underlying raw content of value (i.e., the Notes documents themselves), especially for more complex applications.


"At this point (as every developer knows), the first time we run our software, it all works exactly as it was designed."

On the Google developer site, I found that the GSA accepts content from an external source through a feed submission process. Once indexed, this content is made available for discovery alongside all the other types of data held on the appliance.
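
To make the feed idea concrete, here is a minimal Python sketch of building a feed document for submission. It is an illustration only, not the C-Search code: the element names (gsafeed, header, datasource, feedtype, group, record, content) follow my recollection of Google's published feed format and should be checked against the feed documentation, while the build_feed function, the "notes_demo" data source name and the notes:// URL are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_feed(datasource, records, feedtype="incremental"):
    """Return a feed document as a string.

    records: iterable of dicts with 'url', 'mimetype' and 'content' keys.
    Element names are assumed from Google's feed format; verify them
    against the official documentation before relying on this.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype

    group = ET.SubElement(root, "group")
    for rec in records:
        node = ET.SubElement(
            group, "record", url=rec["url"], mimetype=rec["mimetype"]
        )
        # ElementTree escapes characters such as & and < on output.
        ET.SubElement(node, "content").text = rec["content"]

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical record standing in for one Notes document.
    demo = [{
        "url": "notes://server/db.nsf/0/ABC123",
        "mimetype": "text/plain",
        "content": "Minutes of the Q3 planning meeting",
    }]
    print(build_feed("notes_demo", demo))
```

The sketch stops at producing the XML. Submitting it to the appliance is a separate HTTP step that is left out here.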

So I thought, why don't I adapt "C-Search", our own crawler for Notes, into a crawl-and-feed system for the GSA? It can't be that difficult, can it? After all, we had already done the hard part when we built the crawler.

I contacted Google and was lucky enough to be able to show them a mockup of how the system would work. Our system would:

  • allow administrators to choose which databases to index
  • select document sets for the GSA from template profiles built for each type of database to be indexed
  • choose which fields to index
  • keep a record of everything sent to the GSA to support incremental updates
  • handle search results authorization through integration with the GSA's own authentication and authorization SPIs
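
The record-keeping item in the list above is what makes incremental updates possible. As a rough sketch of one way to do it, assuming each document is reduced to a dictionary of its indexed fields, a fingerprint of those fields can be stored for everything sent and compared on the next run. The names fingerprint, plan_updates and sent_log are hypothetical and are not taken from C-Search.

```python
import hashlib


def fingerprint(fields):
    """Hash the selected field values of one document in a stable order."""
    digest = hashlib.sha256()
    for name in sorted(fields):
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")  # separator so "ab"+"c" differs from "a"+"bc"
        digest.update(str(fields[name]).encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


def plan_updates(sent_log, current_docs):
    """Compare the log of what was sent with what the database holds now.

    sent_log: dict of document id -> fingerprint recorded at the last feed.
    current_docs: dict of document id -> dict of indexed field values.
    Returns (to_send, to_delete) as sorted lists of document ids.
    """
    to_send = sorted(
        doc_id for doc_id, fields in current_docs.items()
        if sent_log.get(doc_id) != fingerprint(fields)
    )
    to_delete = sorted(
        doc_id for doc_id in sent_log if doc_id not in current_docs
    )
    return to_send, to_delete
```

Only the fields chosen for indexing feed into the fingerprint, so a change to a field that is not indexed does not trigger a resend.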

Understanding the challenges

Google seemed to like my idea, so with some confidence I set about the development. There were some challenges along the way. Here are just a few:

  • How do we handle Notes documents with multiple attachments and monitor changes to individual attachments on these documents?
  • What happens if the feed process fails halfway through?
  • How do we break the XML files into manageable chunks for optimum performance?
  • How do we ensure Notes document-level security is respected?
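
Two of these questions, chunking and mid-feed failure, are closely related. The Python sketch below shows one possible approach, not necessarily the one used in the connector: group records into batches under a size budget, and log a record as sent only after its whole batch has gone through, so a failed run can pick up where it stopped. The function names, the max_bytes budget and the send callback are all hypothetical, and the size measure counts record content only, ignoring XML overhead.

```python
def chunk_records(records, max_bytes):
    """Group records into batches whose combined content size does not
    exceed max_bytes. A single record larger than max_bytes gets a
    batch of its own.
    """
    batch, size = [], 0
    for rec in records:
        rec_size = len(rec["content"].encode("utf-8"))
        if batch and size + rec_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(rec)
        size += rec_size
    if batch:
        yield batch


def feed_in_chunks(records, max_bytes, send, sent_log):
    """Send batches one at a time, logging ids only after a batch succeeds.

    send: callable taking a list of records; assumed to raise on failure.
    sent_log: set of record ids that have been delivered.
    """
    for batch in chunk_records(records, max_bytes):
        send(batch)
        for rec in batch:
            sent_log.add(rec["id"])
```

If send raises partway through, sent_log holds exactly the records from the batches that completed, and the next run can skip those.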

I did all of the development from the Google API documentation alone, before my brand-spanking-new appliance arrived at my door. When it did arrive, I was keen to see just how good (or bad) the documentation really was. It turned out to be near perfect. I had already installed our connector software, so I was able to focus on the appliance as soon as it was delivered.