Solr vs ElasticSearch

Posted on July 10, 2013 by Ryan Tabora

Background

The first thing you should know about Solr and ElasticSearch is that they are competing search servers. Both are built on top of Lucene, so many of their core features are identical. If you are unfamiliar, Lucene is a search library packaged as a set of jar files. Many custom applications embed the Lucene jar files directly and create and search their Lucene index manually through the Lucene APIs.

Solr and ES take those Lucene APIs, add features on top of them, and make the APIs accessible through an easy-to-deploy web server (such as Tomcat or Jetty). Instead of coding against the Lucene Java API, developers can send HTTP requests to the search server and index or search that way.
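
To make the "HTTP instead of Java API" point concrete, here is a minimal sketch that only builds (and does not send) one search request for each server. The hosts, ports, and the collection1 and people names are assumptions for illustration; the /select handler and the _search endpoint are the standard search entry points for Solr and ElasticSearch respectively.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical local servers; host, port, and collection/index names are assumptions.
SOLR_BASE = "http://localhost:8983/solr/collection1"
ES_BASE = "http://localhost:9200/people"


def solr_search_request(query: str, rows: int = 10) -> Request:
    """Build a GET request against Solr's /select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return Request(f"{SOLR_BASE}/select?{params}", method="GET")


def es_search_request(field: str, value: str) -> Request:
    """Build a POST request against ElasticSearch's _search endpoint."""
    body = json.dumps({"query": {"term": {field: value}}}).encode("utf-8")
    return Request(
        f"{ES_BASE}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


solr_req = solr_search_request("name:ryan")
es_req = es_search_request("name", "ryan")
print(solr_req.get_method(), solr_req.full_url)
print(es_req.get_method(), es_req.full_url, es_req.data.decode("utf-8"))
```

Passing either request to urllib.request.urlopen would send it, assuming a server is actually listening at those addresses.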

Distributed Search

Foundations

Solr has been open source since 2006, when it was donated to the Apache Software Foundation. The Solr committers focused on building new search features. Later, it became obvious that distributed search was a highly desired feature. In October of 2012 Solr released the SolrCloud feature set, which was supposed to make distributed search easy. People like to say that Solr brought distributed search on as an afterthought. ElasticSearch, on the other hand, was released in 2010 and was designed specifically to make up for the distributed features Solr lacked. For this reason, you may find it easier and more intuitive to start up an ElasticSearch cluster than a SolrCloud cluster.

Winner: ElasticSearch

Coordination

ElasticSearch uses its own internal coordination mechanism to handle cluster state, while Solr uses ZooKeeper. This means that in order to run SolrCloud, you have to have a ZooKeeper quorum set up. For a lot of folks using different components in the Hadoop ecosystem, this isn’t a problem since they will most likely already have a ZooKeeper quorum running. In addition, by using ZooKeeper, Solr can avoid a split brain scenario that ElasticSearch is vulnerable to. I’ll mark this section as a toss up.
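
The reason a quorum helps with split brain comes down to simple arithmetic: ZooKeeper only makes progress while a strict majority of the ensemble can talk to each other, and two sides of a network partition cannot both hold a majority. The sketch below illustrates just that rule; the function names are made up for this example, and it models nothing about ElasticSearch’s own coordination.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(partition_size: int, ensemble_size: int) -> bool:
    """True if one side of a network partition can still form a majority."""
    return partition_size >= quorum_size(ensemble_size)


# A 5-node ensemble split 3/2 by a network partition:
# only the 3-node side can keep making decisions.
ensemble = 5
sides = (3, 2)
print([has_quorum(side, ensemble) for side in sides])
```

This is also why ensembles are usually sized with an odd number of nodes: going from 3 to 4 nodes raises the quorum to 3 without letting the cluster survive any additional failures.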

Winner: Toss Up

Shard Splitting

Shards are the partitioning unit for the Lucene index, and both Solr and ElasticSearch have them. You can distribute your index by placing shards on different machines in a cluster. Until April 2013, neither Solr nor ElasticSearch allowed you to change the number of shards in your index. So if you decided to split your index into 10 shards on day one, and two years later you wanted to add another 5 shards, you could not do that without completely starting over (reindexing everything). As of April 2013 Solr supports shard splitting, which allows you to create more shards by splitting existing shards. ElasticSearch still does not support this.
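
One way to see why changing the shard count is painful, and why splitting is the friendlier operation, is a simplified model where each document is routed by a hash of its id modulo the number of shards. This is an illustrative assumption, not the exact routing code of either product, and the doc-N ids are made up.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document by a stable hash of its id, modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Going from 10 to 15 shards changes the target shard for many documents,
# which is why the whole index would have to be rebuilt.
moved = sum(1 for d in doc_ids if shard_for(d, 10) != shard_for(d, 15))
print(f"{moved} of {len(doc_ids)} documents would land on a different shard")

# Splitting is gentler: doubling 10 -> 20 sends every document that lived
# on shard k to either shard k or shard k + 10, so each shard can be split
# on its own without touching the others.
children_of_3 = {shard_for(d, 20) for d in doc_ids if shard_for(d, 10) == 3}
print(sorted(children_of_3))
```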

Winner: Solr

Automatic Shard Rebalancing

Let’s say you’re in charge of capacity planning for your ElasticSearch index. Today you have 5 machines, but you know you will have budget for 20 machines by the end of this year. To make the best use of those 20 machines, you decide to split your index into 10 shards with 1 replica of each shard (10 shards + 10 replica shards = 20 total shards). Each machine in the eventual cluster would then hold exactly one shard or one replica shard. Since you only have 5 machines today, multiple shards will have to share the same machine. As you add new machines, ElasticSearch will automatically rebalance and move shards to the new nodes in the cluster. This automatic shard rebalancing behavior does not exist in Solr.
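
The capacity arithmetic in this example is easy to check with a few lines of code. The helper names below are made up for illustration, and the sketch assumes shard copies are spread as evenly as possible across machines.

```python
import math


def shard_copies(primaries: int, replicas_per_primary: int) -> int:
    """Total shard copies in the cluster: each primary plus its replicas."""
    return primaries * (1 + replicas_per_primary)


def copies_per_machine(total_copies: int, machines: int) -> tuple:
    """(fewest, most) copies any machine holds under an even spread."""
    return total_copies // machines, math.ceil(total_copies / machines)


total = shard_copies(primaries=10, replicas_per_primary=1)
for machines in (5, 20):
    fewest, most = copies_per_machine(total, machines)
    print(f"{machines} machines: {fewest} to {most} shard copies each")
```

With 5 machines each node carries 4 shard copies; at 20 machines each node carries exactly 1.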

Winner: ElasticSearch

Schema

Schema-less?

To be 100% clear, both Solr and ElasticSearch provide dynamic typing so that you can index new fields on the fly (after you have already defined your schema).

Winner: Users

Schema Creation

ElasticSearch will automagically create your schema based on the data you are indexing. Solr on the other hand requires you to define a schema before you index anything. In production for either Solr or ElasticSearch, you’ll want to define your schema before you index anything. This is because there are many advanced analyzers/filters you will want to apply on the data before you index it.

Winner: Both

Nested Typing

ElasticSearch supports complex nested types. For example, you could have an address field that contains a home field and a work field. Each of those fields would have street, city, state, and zip fields. These nested types only work for 1 (parent) to many (child) relationships. There are also a lot of “gotchas” here. For example, with parent fields, all members of a relationship must fit onto one shard in your index. And for nested fields, an update to any field in the nest may be extremely slow. Solr does not support nested typing; the document structure must be flat. The fact that these options exist in ElasticSearch is very cool, but you have to be very careful with how you use them.

http://www.elasticsearch.org/guide/reference/mapping/nested-type/
http://www.elasticsearch.org/guide/reference/mapping/object-type/
http://www.elasticsearch.org/guide/reference/mapping/parent-field/
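
To picture the address example, here is a small sketch of such a document as a Python dict, plus one possible way to flatten it into the single-level structure Solr expects. The sample values and the underscore naming scheme are assumptions made up for this illustration, not a convention required by either product.

```python
import json

# Sample data only; the names and addresses are invented.
person = {
    "name": "ryan",
    "address": {
        "home": {"street": "1 Main St", "city": "Springfield", "state": "IL", "zip": "62701"},
        "work": {"street": "200 Market St", "city": "Chicago", "state": "IL", "zip": "60601"},
    },
}


def flatten(doc: dict, prefix: str = "") -> dict:
    """Collapse nested dicts into one level, joining key names with '_'."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


print(json.dumps(person, indent=2))           # nested form
print(json.dumps(flatten(person), indent=2))  # flat form
```

The flat form has fields like address_home_city and address_work_zip, which keeps the data searchable but loses the grouping that the nested form expresses directly.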

Winner: ElasticSearch

Queries

Query Syntax

Solr’s query syntax is based on key/value pairs (field:value), using boolean operators and () to combine and nest queries. For example:

q=((name:ryan* AND haircolor:brown) OR interest:zombies) OR (job:engineer*)

ElasticSearch uses JSON. For example, here is an ElasticSearch query:

{
    "bool" : {
        "must" : {
            "term" : { "user" : "kimchy" }
        },
        "must_not" : {
            "range" : {
                "age" : { "from" : 10, "to" : 20 }
            }
        },
        "should" : [
            {
                "term" : { "tag" : "wow" }
            },
            {
                "term" : { "tag" : "elasticsearch" }
            }
        ],
        "minimum_should_match" : 1
    }
}
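
In application code, the two styles above end up looking quite different: the Solr query is a string that has to be URL-encoded, while the ElasticSearch query is a data structure that gets serialized to JSON. The sketch below shows both, reusing the queries from this section; it only builds the payloads and does not contact a server.

```python
import json
from urllib.parse import parse_qs, urlencode

# Solr: the whole query is one string passed as the "q" parameter.
solr_q = "((name:ryan* AND haircolor:brown) OR interest:zombies) OR (job:engineer*)"
solr_params = urlencode({"q": solr_q})

# ElasticSearch: the query is a nested structure serialized as JSON.
es_query = {
    "bool": {
        "must": {"term": {"user": "kimchy"}},
        "must_not": {"range": {"age": {"from": 10, "to": 20}}},
        "should": [
            {"term": {"tag": "wow"}},
            {"term": {"tag": "elasticsearch"}},
        ],
        "minimum_should_match": 1,
    }
}
es_body = json.dumps(es_query, indent=2)

print(solr_params)
print(es_body)
```

Building the JSON from a dict rather than by hand avoids the easy mistakes of hand-written JSON, such as trailing commas and unbalanced braces.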

Winner: Users

Distributed Group By

Solr supports distributed group by (including grouped sorting, filtering, faceting, etc.); ElasticSearch does not. This feature seems like a no-brainer for almost any search application, which is why I call it out specifically here.

Winner: Solr

Percolation Queries

ElasticSearch allows you to register queries that generate notifications when newly indexed documents match them. Each document that is indexed is run against every registered (percolated) query, and if the document matches one of them, an alert is sent out. This is really great for things like alerts, but it may cause performance issues if you have too many percolated queries.
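
Percolation inverts the usual search flow: queries are stored, and documents are run against them. The toy model below shows that idea in plain Python. It is not the ElasticSearch percolator API; the ToyPercolator class, the query names, and the sample documents are all made up for illustration.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Query = Callable[[Document], bool]


class ToyPercolator:
    """Stores named queries and reports which ones match each new document."""

    def __init__(self) -> None:
        self._queries: Dict[str, Query] = {}

    def register(self, name: str, query: Query) -> None:
        self._queries[name] = query

    def percolate(self, doc: Document) -> List[str]:
        # Every registered query is evaluated for every document, which is
        # where the cost comes from when many queries are registered.
        return [name for name, query in self._queries.items() if query(doc)]


percolator = ToyPercolator()
percolator.register("zombie-alert", lambda d: "zombies" in d.get("interests", []))
percolator.register("engineers", lambda d: str(d.get("job", "")).startswith("engineer"))

matches = percolator.percolate({"name": "ryan", "interests": ["zombies"], "job": "engineer"})
no_matches = percolator.percolate({"name": "sam", "job": "chef"})
print(matches, no_matches)
```

The work per indexed document grows with the number of registered queries, which is the performance concern mentioned above.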

Winner: ElasticSearch

Community

Users

ElasticSearch is still fairly new but its community is growing very quickly. Solr has been around for much longer and therefore has a larger user base.

Winner: Solr

Vendor Support

MapR, Cloudera, and DataStax have all chosen Solr for their search technology. InfoChimps is using ElasticSearch. I haven’t heard any word on whether HortonWorks is even looking into search at this point. LucidWorks employs many of the Solr committers and provides an enterprise Solr product with more features, while the company behind ElasticSearch provides most of the support for its own product. Think Big also supports Solr and ElasticSearch, especially when it comes to integrating these technologies with big data. I see DataStax and Cloudera as thought leaders in this area, which is why I give the win to Solr.

Winner: Solr

Conclusion

So ElasticSearch won four categories and Solr won four. Regardless of how the counts ended up, I never wanted to say that ElasticSearch is better than Solr or that Solr is better than ElasticSearch. At the end of the day Solr and ElasticSearch are very close to each other in feature sets, and it would be really difficult to choose one over the other without knowing the exact requirements your organization has.

Contact me if you have any questions!

ryan.tabora@thinkbiganalytics.com

 

Additional Reference Material

http://www.elasticsearch.org

http://lucene.apache.org/solr/

http://solr-vs-elasticsearch.com 

12 Comments
  1. Nice post, Ryan. There are obviously several tough, close calls here. For example, one might argue that ES+Zen for coordination don’t work as well as Solr+ZooKeeper. At Sematext we track both ES and SolrCloud and help our customers with both of them, and today we see more split brain issues with ES than with SolrCloud.

  2. Ryan Tabora says:

    Thanks Otis! I think there is definitely a lot of wiggle room here. The split brain issue is definitely an area of concern and probably should have been mentioned in that first ‘foundations’ paragraph.

  3. Hey Ryan! Thanks for the good writeup.

    Can you please expand on the distributed group by not being possible in elasticsearch. I’m having trouble grasping what that means, it has faceted search (which are sort of group by) and they work across all shards, as far as I know. Am I missing something?

  4. Dmitry says:

    ES does support distributed faceting (which is, basically, group-by functionality):
    http://www.elasticsearch.org/guide/reference/api/search/facets/
    It is especially powerful when combined with ES nested documents

  5. arduino says:

    Thanks for the marvelous posting! I quite enjoyed reading it, you happen to be a great author.

    I will make sure to bookmark your blog and will often come
    back in the future. I want to encourage one to continue your
    great posts, have a nice evening!

  6. acne says:

    Does your website have a contact page? I’m having trouble locating
    it but, I’d like to send you an e-mail. I’ve got some suggestions for your blog you might be interested in hearing.
    Either way, great blog and I look forward to seeing it develop over time.

  7. Apparently HortonWorks has thrown its hat in the ring with elasticsearch:
    http://hortonworks.com/partner/elasticsearch/

  8. Sirap Limau says:

    Both have advantages. It depends on the architecture and design I guess.

  9. Awesome post.. exactly what i needed and that too in a concise way

  10. Jay says:

    One other feature that Solr supports is the ability to add New Shards (or DELETE existing shards) to your index (whenever you want) via the “implicit router” configuration (CREATE COLLECTION API).

    Lets say – you have to index all “Audit Trail” data of your application into Solr. New Data gets added every day. You might most probably want to shard by year.

    You could do something like the below during the initial setup of your collection:

    admin/collections?
    action=CREATE&
    name=AuditTrailIndex&
    router.name=implicit&
    shards=2010,2011,2012,2013,2014&
    router.field=year

    The above command:
    a) Creates 5 shards – one each for the current and the last 4 years 2010,2011,2012,2013,2014
    b) Routes data to the correct shard based on the value of the “year” field (specified as router.field)

    In December 2014, you might add a new shard in preparation for 2015 using the CREATESHARD API (part of the Collections API) – Do something like:

    /admin/collections?
    action=CREATESHARD&
    shard=2015&
    collection=AuditTrailIndex

    The above command creates a new shard on the same collection.

    When it’s 2015, all data will automatically be indexed into the “2015” shard, assuming your data has the “year” field populated correctly with 2015.

    In 2015, if you think you don’t need the 2010 shard (based on your data retention requirements) – you could always use the DELETESHARD API to do so:

    /admin/collections?
    action=DELETESHARD&
    shard=2010&
    collection=AuditTrailIndex

    P.S. This solution only works if you used the “implicit router” when creating your collection. It does NOT work when you use the default “compositeId router”, i.e. collections created with the numShards parameter.

    This feature is truly a gamechanger – allows shards to be added dynamically based on growing demands of your business.

    Is this feature available in ElasticSearch? If not, I am sure it will be in time.

  11. sheepdog says:

    Great post. Super concise. Love your work.

  12. AG says:

    Thanks for the article!

    A picture is worth a thousand words to reflect on what others are thinking:
    http://www.google.com/trends/explore#q=solr%2C%20elasticsearch%2C%20lucene&cmpt=q

    In July the current direction was visible; now it is obvious, and in 2 months ES will be on top. Trends are just one kind of metric and not by any means the full picture, but you can see the same thing all around the industry and among the people I have talked to recently: the best case for Solr I have heard is that it was also tried, but not picked over ES. At least Lucene wins either way :)
