Solr Search — Enterprise Software that Does Not Suck

Posted by feydr | Posted in Uncategorized | Posted on 12-11-2009

The past couple of weeks I’ve been asked to go find some article that talks about so-and-so or to go look at a forum topic that mentions XYZ.
This is not the biggest pain in the ass as I can pop into sql and fairly quickly find what I’m looking for but DAMN I would like to just type something into a search box and hit enter instead.

  • Search for a title on one table — pretty easy — BAM — it’s done.
  • Search within the title and the body — ok… we can do that fairly easily.
  • Search for title/body on two tables — doable but now we are getting kinda convoluted in our controller logic.
  • Fuck everything — search for any fucking text anywhere on the website — oh, and we want this query, which is most assuredly going to be different, accessible to all of our users — this is where the shit starts to hit the fan and you start spending more time extending and refactoring your search code than closing out this ridiculous feature and getting on with the more important shit.

I did NOT go down this route as I’ve already tried to make something that does this and I know it’s a bitch and a half. So I looked into the ‘Enterprise’ software available. I must admit — every time I hear the word enterprise I think of an army of monkeys with shit-stained fingers tapping away on a 486 making java classes that are composed of ten thousand 2 line functions and the only REAL function is that it counts from 1 to 10.

Enter Solr

Solr is a super fast full-text ‘enterprise search solution’. What the fuck is ‘enterprise search’ you may ask? Simply put, it is a solution to search on whatever you might want to index on your site without having to do shit tons of crazy custom code knowing full well that the engineers that came before you made it the best fucking piece of shit around. Chances are if you are a website that does any sort of traffic at all you either have or want enterprise search. Now, lucene is the core engine that solr uses but if you want to talk to lucene directly you might as well take the time to write your own goddamn search app.

All right, let’s stop fucking around and …

let’s get it installed

wget http://www.bizdirusa.com/mirrors/apache/tomcat/tomcat-6/v6.0.20/bin/apache-tomcat-6.0.20.tar.gz
wget http://people.apache.org/builds/lucene/solr/nightly/solr-2009-11-10.tgz
 
tar xzf apache-tomcat*
tar xzf solr*
 
sudo mv apache-tomcat-6.0.20/ /usr/local/tomcat6
sudo cp apache-solr-1.5-dev/dist/apache-solr-1.5-dev.war /usr/local/tomcat6/webapps/solr.war
sudo cp -r apache-solr-1.5-dev/example/solr/ /usr/local/tomcat6/solr/
 
sudo mkdir -p /usr/local/tomcat6/conf/Catalina/localhost/
 
sudo gem sources -a http://gemcutter.org
sudo gem install rsolr
 
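# register the init script at boot (create /etc/init.d/tomcat6 first, see below)
sudo chmod +x /etc/init.d/tomcat6
sudo update-rc.d tomcat6 start 91 2 3 4 5 . stop 20 0 1 6 .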

that’ll get us installed but let’s go ahead and throw up an init script for tomcat as manually restarting it is just dumb

put this in your /etc/init.d/tomcat6 and smoke it

#!/bin/sh
# Tomcat auto-start
#
# description: Auto-starts tomcat
# processname: tomcat
# pidfile: /var/run/tomcat.pid
 
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# tell solr where its home directory (conf/, data/) lives
export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/usr/local/tomcat6/solr"
 
case "$1" in
start)
   sh /usr/local/tomcat6/bin/startup.sh
   ;;
stop)
   sh /usr/local/tomcat6/bin/shutdown.sh
   ;;
restart)
   sh /usr/local/tomcat6/bin/shutdown.sh
   sh /usr/local/tomcat6/bin/startup.sh
   ;;
*)
   # anything else: print usage and bail
   echo "Usage: $0 {start|stop|restart}"
   exit 1
   ;;
esac
exit 0

There are two ruby libraries that look more or less the same to me:
rsolr and solr-ruby

erikhatcher, who wrote solr-ruby, told me to use the competition’s stuff:

solr-ruby is my baby, but rsolr is inspired by it and took away some great lessons….rsolr has some ideas and refactorings i’d love to get into solr-ruby…..but i’d say rsolr is probably the most agile way to go right now

I am using rsolr but for no real reason.

I first decided to index the blogposts on my site to get a feel for how everything works. I put this in a rake task just to make it really easy to develop the functionality as I learned. As you can see I’m using merb but it would work fine for rails as well.

desc 'add solr indexes for blogposts'
task :populate_index => :merb_env do
  require 'rsolr'
  require 'cgi' # needed for CGI.escapeHTML below
  require 'lib/Colorify.rb'
  include Colorify
 
  solr = RSolr.connect :url => 'http://127.0.0.1:8080/solr'
 
  puts colorGreen("clearing index")
 
  # clear our index
  solr.delete_by_query '*:*'
 
  puts colorGreen("adding blogposts")
 
  if Merb.environment.eql? 'development' then
    host = "127.0.0.1"
    dbname = "my_dev"
  elsif Merb.environment.eql? 'staging' then
    host = "my_staging_host" # placeholder, use your staging db host
    dbname = "my_staging"
  else
    host = "my_production_host"
    dbname = "my_production"
  end
 
  DataMapper.logger.level = :error
  DataMapper::setup(:default, "mysql://myuser:mypassword@#{host}/#{dbname}")
 
  Blogpost.all.each do |bp|
    begin
      solr.add :id => bp.id, :type => 'blogpost', :body => CGI.escapeHTML(bp.content), :title => bp.title,
                :anchor => bp.anchor, :description => bp.description, :slug => bp.slug
    rescue
      puts colorRed($!)
    end
  end
 
  puts colorGreen("adding forum posts")
 
  Post.all.each do |post|
    begin
      if(!post.parent.nil?) then
        solr.add :id => post.id, :type => 'forumpost', :body => CGI.escapeHTML(post.body), :title => post.parent.title,
                  :slug => post.parent.slug
      end
    rescue
      puts colorRed($!)
    end
  end
 
 
  solr.commit
end
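
One thing to watch with the task above: blogposts and forum posts both go in under their database id, so if solr’s uniqueKey is id (it is in the stock example schema), forum post 1 will overwrite blogpost 1. Here is a minimal sketch of one way around it, prefixing the id with the type. The helper names (solr_id, blogpost_doc, forumpost_doc) and the Structs standing in for the models are made up for illustration, so treat this as a starting point and not as how rsolr wants it done:

```ruby
require 'cgi'

# Hypothetical helper: namespaces the database id with the document type
# so a blogpost and a forum post with the same id can't clobber each other.
def solr_id(type, id)
  "#{type}-#{id}"
end

# Builds the hash handed to solr.add for a blogpost-like object.
def blogpost_doc(bp)
  { :id => solr_id('blogpost', bp.id), :type => 'blogpost',
    :body => CGI.escapeHTML(bp.content), :title => bp.title,
    :anchor => bp.anchor, :description => bp.description, :slug => bp.slug }
end

# Builds the hash for a forum post; returns nil when there is no parent
# topic, mirroring the nil check in the rake task.
def forumpost_doc(post)
  return nil if post.parent.nil?
  { :id => solr_id('forumpost', post.id), :type => 'forumpost',
    :body => CGI.escapeHTML(post.body), :title => post.parent.title,
    :slug => post.parent.slug }
end

# Stand-ins for the real models, only here so the sketch runs on its own.
FakeBlogpost = Struct.new(:id, :content, :title, :anchor, :description, :slug)
FakeTopic    = Struct.new(:title, :slug)
FakePost     = Struct.new(:id, :body, :parent)

bp   = FakeBlogpost.new(1, '<p>hi</p>', 'Hello', 'hello', 'a post', 'hello')
post = FakePost.new(1, 'first!', FakeTopic.new('Some topic', 'some-topic'))

puts blogpost_doc(bp)[:id]
puts forumpost_doc(post)[:id]
```

In the rake task you would then call solr.add blogpost_doc(bp) inside the loop. Keep in mind the id field has to be a string type in schema.xml for this to work.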

my corresponding controller/action pair to look this up:

class Search < Application
  before :ensure_authenticated
  before :admin_login
 
  def index
    require 'rsolr'
    solr = RSolr.connect :url => 'http://127.0.0.1:8080/solr'
 
    # parenthesize so multi-word queries stay scoped to each field
    q = params[:query]
    response = solr.select :q => "body:(#{q}) title:(#{q})"
 
    @nresults = response["response"]["numFound"]
    @docs = response["response"]["docs"]
    render
  end
 
end
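
The action above drops params[:query] straight into the query string, so a user typing a colon or an unbalanced quote can change or break the query. Here is a minimal sketch of escaping Lucene’s special characters by hand before interpolating. escape_lucene and build_query are hypothetical helpers, not part of rsolr, and the character list is my reading of the Lucene query syntax docs. Your rsolr version may ship its own escape helper, so check before rolling your own:

```ruby
# Characters that mean something to the Lucene query parser.
LUCENE_SPECIAL = /([+\-&|!(){}\[\]^"~*?:\\\/])/

# Backslash-escapes every special character in the user's input.
def escape_lucene(text)
  text.to_s.gsub(LUCENE_SPECIAL) { "\\#{$1}" }
end

# Searches body and title for the escaped input; the parentheses keep
# multi-word input scoped to each field.
def build_query(user_input)
  q = escape_lucene(user_input)
  "body:(#{q}) title:(#{q})"
end

puts build_query('solr: fast (really)')
```

In the controller this would replace the hand-built string: solr.select :q => build_query(params[:query]).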

k… let’s tell ruby to fuck off…
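as with every fucktard java project out there, XML is the preferred method of setting up your configuration files.
to properly import all your data whenever you want (like with a cronjob) you’ll want to make a data-config.xml
that belongs in your /usr/local/tomcat6/solr/conf/ directory. you’ll also need the /dataimport request handler registered in solrconfig.xml and pointed at that file, plus the mysql jdbc driver jar somewhere solr can load it.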

my data-config.xml looks a little bit like this:

<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mycoolassfuckingdb_dev" 
              user="luser"
              password="assword"/>
  <document>
 
    <!-- blogposts -->
    <entity name="blogpost" 
            query="select id, content, title, anchor, description, slug from blogposts"
            transformer="TemplateTransformer">
      <field column="content" name="body" />
      <field column="type" template="blogpost" />
    </entity>
 
    <!-- forum topics -->
    <entity name="forumpost" 
            query="select a.id as id, a.body as body, b.slug as slug, b.title as title from posts as a, topics as b where b.id = a.parent_id" 
            transformer="TemplateTransformer">
          <field column="type" template="forumpost" />
    </entity>
 
  </document>
</dataConfig>

you’ll note that there is a transformer attribute that lets you do all sorts of crazy shit to your data as you are importing it, like finding data that ‘sounds’ like what the user is trying to spell. You’ll also note that in this example the plain field mapping renames my column ‘content’ to ‘body’ (I could have just aliased it in the sql instead), while the TemplateTransformer is what stamps a literal ‘type’ value on every row.

Now, you’ll need to edit your schema.xml, located in the same directory, to add fields that match what you want to import, as the example schema is built for a product catalog and I’m sure the majority of you people don’t have SKUs on your website — but if you do — congratulations — you are already set!

How about importing this data?
Easy, hit up this url: http://127.0.0.1:8080/solr/dataimport?command=full-import
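
If you’d rather kick the import off from a script or a cronjob than from a browser, something like this should do it. It is a minimal sketch using only the ruby stdlib; dataimport_uri and run_import are hypothetical names, and the base url is the one used throughout this post:

```ruby
require 'net/http'
require 'uri'

SOLR_BASE = 'http://127.0.0.1:8080/solr'

# Builds the dataimport url for a given command.
def dataimport_uri(command, base = SOLR_BASE)
  URI.parse("#{base}/dataimport?command=#{URI.encode_www_form_component(command)}")
end

# Fires the request and returns the raw response body.
def run_import(command = 'full-import')
  Net::HTTP.get(dataimport_uri(command))
end

puts dataimport_uri('full-import')
# run_import   # uncomment once tomcat is actually up
```

Passing 'status' instead of 'full-import' should tell you how far along a running import is.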

having trouble with the logs? like where the fuck they are located???
try this (swap in today’s date):

tail -f /usr/local/tomcat6/logs/catalina.2009-11-12.log

Schema Shit

The key here is to make everything about as homogeneous as possible — I know… wtf??!? No seriously, it’s cool — cause every field typically will have an index, and the idiomatic way is to simply put a type field on documents that share field names (like the ‘blogpost’ and ‘forumpost’ types above) and filter on it.
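
With a type value stored on every document, narrowing a search to one kind of document is just a filter query. Here is a rough sketch of building the params hash you’d hand to solr.select. search_params is a made-up helper, fq is solr’s standard filter query parameter, and the type values are the ones indexed earlier in this post:

```ruby
# Builds select params; pass a type to restrict results to that kind of doc.
def search_params(query, type = nil)
  params = { :q => query }
  params[:fq] = "type:#{type}" if type
  params
end

p search_params('title:solr')
p search_params('title:solr', 'forumpost')
```

So solr.select search_params('title:solr', 'blogpost') would only come back with blogposts, while leaving the type off searches everything.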

For more information on this please visit the pros:

Schema Design
Using Multiple Indexes

I’m kinda interested in how sql-like injection works with solr and, time permitting, I’ll have a new article on it in the future. Needless to say, a cursory scan of several popular hosting platforms revealed VERY OPEN solr installations.

Anyways, I’m drunk so this post will probably be revised in the future but I wanted to get it out there.

Go solr!