Last week, I started playing around with a project to create my own (Python) site search, including a crawler and Whoosh-based search. I’d seen the implementation of a Lucene search in Zend go fairly easy-peasy, and liked the idea of a self-hosted search.
Problem is–well, one of the problems is–the crawl time for a site with 1200 posts (most of which are low-priority) is a deal-breaker on a shared hosting provider. It takes far longer than 5 minutes just to collect the links, even with multiple threads. Add the parse time to get indexable content for 1200 pages, and I was stuck contemplating how to crawl and index the site in parts.
This sounds like a great, fun project. …Except that it’s already been done and I have other things I’d rather be doing. Google did it; their index for my site updates surprisingly quickly and doesn’t make me afraid that Dreamhost will smite me. (I’ve been with Dreamhost for several years now, and while I’ve learned how to properly deploy a site since moving here from… Brinkster, was it?, I don’t relish the idea of learning a new environment for all the stuff I run here.)
So instead of piling more time on top of the 4-5 hours I’d already spent screwing with the Ikea-esque assembly of a site crawler and search, I spent two hours this week really making Google’s Custom Search Engine (CSE) work for me. Yes, there are ads. Yes, it’s not a solution that I own. (Then again, neither is my email, in that sense.)
What’d you want, Willis?
- A full search page that can give smartly-relevant search results and search quickly.
- The sidebar search box to take the user to the search page, rather than doing some AJAX-y search stuff as Google wants to do (in a 200-damn-pixel sidebar, no less).
- Up-to-date search results, so that new entries will appear in the index by the time a new post is made.
- Search results that include the Django-driven stories on the site.
What it do?
Here’s what I did with my preexisting search engine:
- Set it to search both “sites”: https://irrsinn.net and http://irrsinn.net/fiction/*. I found that the stories got buried when I didn’t.
- Set a few “keywords” for the search engine. I’m not using a Google-hosted search page or results page, so I’m not sure if this matters one fig.
- Made sure that it’s picking up my sitemap. I use a Python sitemap generator that runs weekly and pushes out my sitemap. Rather than actively crawling my site, the generator uses my Apache logs to know which pages to list. If a page hasn’t been visited since the logs were last cleaned out, then it’s pretty low priority anyway.
- Set up Refinements. This is kind of slick, and I have no idea if it’ll end up used or not. On my search page now, you’ll see tabs for “Stories” and “Character Sheets” when you’ve searched. Each of those is a search refinement, where I’ve said, “if you’re looking for stories, then I should further filter the results by Transhuman Congress OR Witches of Ming Ung (etc.)”. I see it as a way to guide users, either by just reminding them of what they may be looking for or by actively helping them get there. If you search my site for “Vampire” when you’re actually interested in whether I’m doing a Vampire character sheet in my initial release, then you can filter away the posts about Greg being a vampire and just see what’s relevant.
- Chewed on the CSS and JS. More details of this below.
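For the curious, the logs-to-sitemap idea from the list above can be sketched in a few lines of Python. This is not the generator's actual code, just a minimal illustration. It assumes the Apache logs are in the usual Common/Combined Log Format, and the function names (`visited_paths`, `build_sitemap`) and the `example.com` base URL are made up for the example.

```python
import re
from xml.sax.saxutils import escape

# Matches the request portion of a Common/Combined Log Format line,
# e.g. "GET /some/path HTTP/1.1" 200 (assumed log format).
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')


def visited_paths(log_lines):
    """Return the sorted, de-duplicated paths that were served with HTTP 200."""
    paths = set()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group("status") == "200":
            # Drop the querystring so /post/?page=2 and /post/ count once.
            paths.add(match.group("path").split("?", 1)[0])
    return sorted(paths)


def build_sitemap(base_url, paths):
    """Render a minimal sitemap.xml document listing the given paths."""
    root = base_url.rstrip("/")
    entries = "\n".join(
        f"  <url><loc>{escape(root + path)}</loc></url>" for path in paths
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [10/Oct/2010:13:55:36 -0700] "GET /fiction/ HTTP/1.1" 200 2326',
        '1.2.3.4 - - [10/Oct/2010:13:56:01 -0700] "GET /nope HTTP/1.1" 404 312',
    ]
    print(build_sitemap("https://example.com", visited_paths(sample)))
```

Anything that nobody requested during the log window simply never makes it into the sitemap, which is the "low priority anyway" trade-off.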
Was the CSS Tasty?
For the CSS itself, I had to use Firebug to grab IDs and classes and style stuff manually. The config panel wasn’t enough for getting rid of the fugly drop-shadows and too-light grey text. When I started making tweaks on the Minimal default style, it created a custom style for me, so I can always reset to the Minimal style if I want to. All my custom CSS went into my template’s style.css, which is minified and stored on S3.
But Is it Handy?
It is now. The reason I’d previously (and unhappily) removed the sidebar search–not having it is a deal-breaker in blog usability–is that Google’s small search box wanted to show the results right there in the sidebar. Like… um. Have you seen how long my post titles are? That was an unreadable mess. To add insult to injury, the full Search page wouldn’t read queries out of the querystring, so I couldn’t (by default) send requests there, either.
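The general shape of the sidebar-to-search-page handoff is simple: the sidebar box builds a URL with the query in the querystring, and the search page pulls the query back out and runs it. On a real page that second half would be JavaScript in the browser; the sketch below is Python purely to show the logic. The search page URL and the `q` parameter name are assumptions for illustration, not the actual setup on this site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_PAGE = "https://example.com/search/"  # hypothetical search page URL
QUERY_PARAM = "q"  # assumed parameter name


def search_url(terms):
    """Build the URL the sidebar box would send the user to."""
    return f"{SEARCH_PAGE}?{urlencode({QUERY_PARAM: terms})}"


def query_from_url(url):
    """Pull the search terms back out of the querystring ('' if absent)."""
    values = parse_qs(urlsplit(url).query).get(QUERY_PARAM, [])
    return values[0].strip() if values else ""


if __name__ == "__main__":
    url = search_url("vampire character sheet")
    print(url)
    print(query_from_url(url))
```

The round trip matters more than the details: whatever the sidebar encodes, the search page has to decode to the same string before handing it to the search engine.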
There’s nothing new in all of this, but it’s enough of a pieced-together solution that it’s worth sharing. Two hours of time from top to bottom, which ain’t bad.