Search-building: custom or Google

Until earlier this week, I had a lousy site search in place. It was one of Google’s Custom Search Engines, barely configured and only on its own page, due to its hefty (and blocking!) JavaScript. I’d long since disabled WordPress’s search since my stories aren’t being run in WordPress, and I didn’t feel like trying to chew on the internal search mechanisms to include the stories.

Last week, I started playing around with a project to create my own (Python) site search, including a crawler and Whoosh-based search. I’d seen the implementation of a Lucene search in Zend go fairly easy-peasy, and liked the idea of a self-hosted search.

One of the problems is that the crawl time for a site with 1200 posts, most of them low-priority, is a deal-breaker on a shared hosting provider. It takes far longer than 5 minutes just to collect the links, even with multiple threads. Add the parse time to get indexable content for 1200 pages, and I was stuck contemplating how to crawl and index the site in parts.
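For what it’s worth, “crawl the site in parts” could look something like the sketch below. This is not the crawler I actually wrote; it’s a minimal illustration that assumes a hypothetical `fetch_links` callable (URL in, same-site links out) and a time budget per run. Each run stops when the budget is spent and leaves the queue behind for the next one.

```python
import time
from collections import deque


def crawl_in_parts(pending, seen, fetch_links, budget_seconds, clock=time.monotonic):
    """Crawl until the time budget runs out, then return so a later run can resume.

    pending: deque of URLs still to visit (mutated in place)
    seen: set of URLs already queued or visited (mutated in place)
    fetch_links: hypothetical callable, url -> iterable of same-site links
    clock: injectable time source, mainly so the loop can be tested
    """
    deadline = clock() + budget_seconds
    visited = []
    while pending and clock() < deadline:
        url = pending.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return visited
```

Between runs, `pending` and `seen` would need to be saved somewhere (a JSON file would do) so that a scheduled job can pick up where the last one stopped.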

This sounds like a great, fun project. …Except that it’s already been done and I have other things I’d rather be doing. Google did it; their index for my site updates surprisingly quickly and doesn’t make me afraid that Dreamhost will smite me. (I’ve been with Dreamhost for several years now, and while I’ve learned how to properly deploy a site since moving here from… Brinkster, was it?, I don’t relish the idea of learning a new environment for all the stuff I run here.)

So instead of adding to the 4-5 hours I’d already spent screwing with the Ikea-esque assembly of a site crawler and search, I spent two hours this week really making Google’s Custom Search Engine (CSE) work for me. Yes, there are ads. Yes, it’s not a solution that I own. (Then again, neither is my email, in that sense.)

What’d you want, Willis?

  • A full search page that can give smartly-relevant search results and search quickly.
  • A search box on every page that doesn’t jeopardize page load times with blocking JavaScript.
  • The sidebar search box to take the user to the search page, rather than doing some AJAX-y search stuff as Google wants to do (in a 200-damn-pixel sidebar, no less).
  • Up-to-date search results, so that new entries will appear in the index by the time a new post is made.
  • Search results that include the Django-driven stories on the site.

I didn’t go as far as uploading custom XML to define the engine for this, but I’ve done some good things here lately to improve page load times, and didn’t want to wreck that by using Google’s standard in-page JavaScript and CSS code snippet.

What it do?

Here’s what I did with my preexisting search engine:

  1. Set it to search both “sites” https://irrsinn.net and http://irrsinn.net/fiction/*. I found that the stories got buried when I didn’t.
  2. Set a few “keywords” for the search engine. I’m not using a Google-hosted search page or results page, so I’m not sure if this matters one fig.
  3. Made sure that it’s picking up my sitemap. I use a Python sitemap generator that runs weekly and pushes out my sitemap. Rather than actively crawling my site, it uses my Apache logs to know what to crawl. If a page hasn’t been visited since the logs were last cleaned out, then it’s pretty low priority anyway.
  4. I set up Refinements. This is kind of slick, and I have no idea if it’ll end up used or not. On my search page now, you’ll see tabs for “Stories” and “Character Sheets” when you’ve searched. Each of those is a search refinement, where I’ve said, “if you’re looking for stories, then I should further filter the results by Transhuman Congress OR Witches of Ming Ung (etc.)”. I see it as a way to guide users either by just reminding them of what they may be looking for or by actively helping them get there. If you search my site for “Vampire” when you’re actually interested in whether I’m doing a Vampire character sheet in my initial release, then you can filter away the posts about Greg being a vampire and just see what’s relevant.
  5. Chewed on the CSS and JS. More details of this below.
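
The log-driven sitemap idea from step 3 can be sketched roughly as follows. This is not the generator’s actual code, just a minimal illustration under a few assumptions: the logs are in Apache’s common or combined format, only successful GET requests count, and the function names (`paths_from_log`, `build_sitemap`) are made up for the example.

```python
import re
from xml.sax.saxutils import escape

# Matches the request and status portion of an Apache common/combined log line.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+" (\d{3})')


def paths_from_log(lines):
    """Return the sorted, de-duplicated paths that were fetched successfully."""
    paths = set()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) == "200":
            # Drop the query string so /search/?q=x and /search/ collapse.
            paths.add(match.group(1).split("?", 1)[0])
    return sorted(paths)


def build_sitemap(base_url, paths):
    """Render a minimal sitemaps.org urlset for the given paths."""
    entries = "\n".join(
        "  <url><loc>{}</loc></url>".format(escape(base_url.rstrip("/") + path))
        for path in paths
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries + "\n</urlset>\n"
    )
```

Anything nobody has requested since the logs were last rotated simply never makes it into the set, which is the “low priority anyway” behavior described above.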

Was the CSS Tasty?

Why yes, it was. I had a couple of beefs here. First, the config panel for search engines doesn’t let you tweak a whole lot of visual options. Second, when you get the code for the engine, all the HTML, CSS, and JavaScript is in one big block. *shudder*

I went ahead and tweaked as much as I could through the config panel, then grabbed the code Google gave me and tossed it into a text file. The CSS went into my template’s style.css, and the JavaScript into my template’s footer. That JavaScript really needs to be in a separate include, so that it can be spliced, minified, and S3-stored with the rest of my JavaScript. The WordPress “Page” that is “/search/” contains one tidbit:

<div id="cse">Loading</div>

The JavaScript (as provided from Google) hooks onto that #cse, so I didn’t have to tweak anything there.

For the CSS itself, I had to use Firebug to grab IDs and classes and style stuff manually. The config panel wasn’t enough for getting rid of the fugly drop-shadows and too-light grey text. When I started making tweaks on the Minimal default style, it created a custom style for me, so I can always reset to the Minimal style if I want to. As I mentioned, all my custom CSS went into my template’s style.css, which is minified and stored on S3.

But Is it Handy?

It is now. The reason I’d previously (and unhappily) removed the sidebar search, even though not having one is a deal-breaker in blog usability, is that Google’s small search box wanted to show the results right there in the sidebar. Like… um. Have you seen how long my post titles are? That was an unreadable mess. To add insult to injury, the full Search page wouldn’t read queries out of the querystring, so I couldn’t (by default) send requests there, either.

What I ended up using was this example of a two-page search, which allows me to put a no-JavaScript-needed search form in my sidebar. Searching there will take you to the search results page.
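
The mechanics of the two-page setup are simple enough to sketch. The snippet below is not Google’s code; it only shows the round trip that a plain GET form relies on: the sidebar builds a URL for the results page, and the results page pulls the query back out of the querystring. The parameter name `q` is an assumption for illustration; `/search/` is the results page mentioned earlier.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(query, results_page="/search/", param="q"):
    """URL a plain GET form in the sidebar would request on submit."""
    return results_page + "?" + urlencode({param: query})


def query_from_url(url, param="q"):
    """What the results page needs to pull back out of the querystring."""
    values = parse_qs(urlsplit(url).query).get(param, [])
    return values[0] if values else ""
```

Because the sidebar half is nothing more than a form submission, no script has to load on every page; only the results page carries the heavy JavaScript.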

There’s nothing new in all of this, but there’s enough assembly involved that it’s worth sharing. Two hours of time from top to bottom, which ain’t bad.