Search

Solr

Solr is a Java-based, open source Apache search server built on top of Lucene. Solr stores its index in Lucene format and builds additional features on top of it. Solr is generally run as a web application in a web container such as Tomcat, but it can also be embedded inside an application. Originally created at CNET in 2004 and donated to the Apache Software Foundation in 2006, Solr has become a top-level Apache project with many followers. Solr provides a wide range of features, including automatic sharding, replication, and SQL-like queries. Its query capabilities include faceting, grouping, highlighting, spellchecking, autocomplete, and geospatial search. Solr is also becoming the go-to search integration point for many NoSQL databases, such as HBase and Cassandra, to provide real-time, SQL-like analytics on big data. One of the most advanced products in this space is DataStax Enterprise, which integrates Solr search indices with data stored in Cassandra.
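As a rough illustration of the faceting capability mentioned above, the following sketch builds the URL for a faceted query against Solr's HTTP select handler. The parameters q, rows, facet, facet.field, and wt are standard Solr request parameters; the host, the collection name "products", and the field name "brand" are assumptions made up for the example, and no request is actually sent.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, collection, text, facet_field, rows=10):
    """Return a Solr /select URL that searches for `text` and facets on one field."""
    params = [
        ("q", text),                  # main query string
        ("rows", rows),               # number of documents to return
        ("facet", "true"),            # turn faceting on
        ("facet.field", facet_field), # count matching documents per value of this field
        ("wt", "json"),               # ask for a JSON response
    ]
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and field names, for illustration only.
url = build_facet_query("http://localhost:8983", "products", "laptop", "brand")
print(url)
```

In a response to such a query, Solr returns the matching documents together with a count of matches per value of the facet field.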

See also Lucene, SolrCloud, and DataStax Enterprise

Lucene

Lucene is an open source Apache indexing library written in Java. Lucene was originally written in 1999 by Doug Cutting, the creator of Hadoop. Lucene is built around the inverted index and provides many features that enable advanced indexing and querying of many different data types. Many features of Solr, which is built on top of Lucene, are in fact provided entirely by the Lucene query and index libraries. People generally use Lucene directly when they do not need the features added by Solr, or when they need to change the core way that Lucene indexes or queries data.
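The inverted index idea can be sketched in a few lines. This is a toy illustration only, not Lucene's actual data structures or API: it maps each term to the set of document IDs that contain it and answers an AND query by intersecting those sets. The function names and the whitespace tokenizer are assumptions chosen for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())  # intersect posting sets
    return result


docs = {
    1: "Solr is built on Lucene",
    2: "Lucene is an index library",
    3: "Splunk indexes machine data",
}
index = build_inverted_index(docs)
print(search_all(index, "lucene is"))  # documents 1 and 2
```

Because lookups go from term to documents rather than scanning every document, query cost depends on the number of matching postings instead of the size of the whole collection.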

See also ElasticSearch, Solr, SolrCloud, and DataStax Enterprise

ElasticSearch

ElasticSearch is a distributed, RESTful, open source search server written in Java and based on Apache Lucene. It exposes its features through JSON and Java APIs. It supports faceting and percolation; with percolation, queries are registered in advance so that a client can be notified when a new document matches one of them.
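Percolation reverses the usual search direction: queries are stored, and each incoming document is checked against them. The sketch below is a toy model of that idea and does not use the ElasticSearch API; the function name, the query names, and the rule that a query matches when all of its terms appear in the document are assumptions for illustration.

```python
def percolate(registered_queries, document):
    """Return the names of the registered queries that the document matches.

    A query is modelled as a set of terms and matches when every one of
    its terms occurs in the document.
    """
    words = set(document.lower().split())
    return sorted(
        name for name, terms in registered_queries.items() if terms <= words
    )


# Hypothetical registered queries, keyed by name.
registered = {
    "solr-news": {"solr", "release"},
    "outages": {"outage"},
}

matches = percolate(registered, "New Solr release announced today")
print(matches)  # ['solr-news']
```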

ElasticSearch can be used to search all kinds of documents. It provides a scalable search solution with near real-time search and support for multitenancy. "ElasticSearch is distributed, which means that indices can be divided into shards and each shard can have zero or more replicas. Each node hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically [...]".
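The routing step described in the quotation can be sketched as a hash of the document ID taken modulo the number of shards. This is a simplified illustration, not ElasticSearch's implementation: zlib.crc32 stands in for whatever hash function the server uses, and the function names and document IDs are made up for the example.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id, num_shards):
    """Pick a shard for a document ID: stable hash modulo the shard count.

    crc32 is only a stand-in hash chosen because it is deterministic
    across runs; a real server may use a different function.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route_documents(doc_ids, num_shards):
    """Group document IDs by the shard a coordinator would delegate them to."""
    assignment = defaultdict(list)
    for doc_id in doc_ids:
        assignment[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(assignment)


doc_ids = ["user-1", "user-2", "order-17", "order-18", "log-42"]
assignment = route_documents(doc_ids, num_shards=3)
for shard, ids in sorted(assignment.items()):
    print(shard, ids)
```

Because the shard is derived from the ID alone, any node can act as coordinator and compute where a document lives without consulting the others. The same property explains why the shard count in this scheme cannot change without moving documents.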

SolrCloud

SolrCloud is the set of distributed, 'cloud based' features introduced in the Apache Solr 4.x releases. By introducing functionality such as automatic sharding, automatic failover, query/index partitioning, automatic replication, and write durability, SolrCloud is quickly becoming very popular in the open source search community. SolrCloud achieves these distributed capabilities by integrating Solr with ZooKeeper. By storing cluster configuration and state in ZooKeeper, Solr can track the cluster state in order to route queries and index updates and to manage node failures. SolrCloud is based on a master/slave architecture (leaders and replicas, in SolrCloud terms) that favors the Consistency and Partition Tolerance properties of the CAP theorem.

See also Lucene, Solr, and DataStax Enterprise

DataStax Enterprise

DataStax Enterprise is a commercial product that integrates Solr with Cassandra to provide real-time, SQL-like query capabilities on top of a proven scalable database. Unlike SolrCloud, DataStax Enterprise is based on a peer-to-peer architecture that provides Availability, Partition Tolerance, and tunable Consistency. DataStax Enterprise stores the raw indexed values in Cassandra to enable automatic node failover and reindexing in case of schema changes or load balancing. DataStax Enterprise also integrates with Hadoop so that users can run Map/Reduce jobs on data stored in Cassandra.
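"Tunable" consistency is commonly explained with a simple rule from Cassandra-style replication: if the number of replicas that must acknowledge a write plus the number consulted on a read exceeds the replication factor, every read overlaps at least one replica holding the latest write. The sketch below only encodes that arithmetic. The function names are hypothetical, and the rule is stated here as an assumption about Cassandra-style systems in general, not as a description of DataStax Enterprise internals.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_acks, read_acks):
    """True when the read set and write set must share at least one replica."""
    return write_acks + read_acks > replication_factor


rf = 3
print(quorum(rf))                                          # 2
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))  # True: 2 + 2 > 3
print(reads_see_latest_write(rf, 1, 1))                    # False: 1 + 1 <= 3
```

Lower acknowledgement counts trade consistency for latency and availability, which is the tuning the paragraph above refers to.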

See also Lucene, Solr, and SolrCloud

Splunk

Splunk is software for searching, monitoring, and analyzing machine-generated data from applications, systems, and IT infrastructure at scale through a web-style interface. Splunk captures, indexes, and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards, and visualizations. Splunk aims to make machine data accessible across an organization: it identifies data patterns, provides metrics, diagnoses problems, and provides intelligence for business operations. Splunk is a horizontal technology used for application management, security, and compliance, as well as business and web analytics.