Top 15 Solr vs. Elasticsearch Differences

Solr vs. Elasticsearch. Elasticsearch vs. Solr.  Which one is better? How are they different? Which one should you use?

Before we start, check out two useful Cheat Sheets to guide you through both Solr and Elasticsearch, boost your productivity, and save time when you're working with either of these two open-source search engines.

These two are the leading, competing open-source search engines, known to anyone who has ever looked into (open-source) search. Both are built around the same core underlying search library – Lucene – but they are different. Like everything, each has its own set of strengths and weaknesses, and each may be a better or worse fit depending on your needs and expectations. In the past, we've covered Solr and Elasticsearch differences in our Solr Elasticsearch Comparison and in various conference talks, such as Side by Side with Elasticsearch and Solr: Performance and Scalability, given at Berlin Buzzwords. Both Solr and Elasticsearch are evolving rapidly, so, without further ado, here is up-to-date information about their top differences.

| Feature | Solr/SolrCloud | Elasticsearch |
|---|---|---|
| Community & Developers | Apache Software Foundation and community support | Single commercial entity and its employees |
| Node Discovery | Apache ZooKeeper, mature and battle-tested in a large number of projects | Zen, built into Elasticsearch itself; requires dedicated master nodes to be split-brain proof |
| Shard Placement | Static in nature; requires manual work to migrate shards | Dynamic; shards can be moved on demand depending on the cluster state |
| Caches | Global, invalidated with each segment change | Per segment, better for dynamically changing data |
| Analytics Engine | Facets and powerful streaming aggregations | Sophisticated and highly flexible aggregations |
| Optimized Query Execution | Currently none | Faster range queries, depending on the context |
| Search Speed | Best for static data, because of caches and the uninverted reader | Very good for rapidly changing data, because of per-segment caches |
| Analysis Engine Performance | Great for static data, with exact calculations | Exactness of the results depends on data placement |
| Full Text Search Features | Language analysis based on Lucene, multiple suggesters, spell checkers, rich highlighting support | Language analysis based on Lucene, single suggest API implementation, highlighting rescoring |
| DevOps Friendliness | Not fully there yet, but coming | Very good APIs |
| Non-flat Data Handling | Nested documents and parent-child support | Natural support with nested and object types allowing virtually endless nesting, plus parent-child support |
| Query DSL | JSON (limited), XML (limited), or URL parameters | JSON |
| Index/Collection Leader Control | Leader placement control and leader rebalancing to even out the load on the nodes | Not possible |
| Machine Learning | Built-in, on top of streaming aggregations (focused on logistic regression) and a learning-to-rank contrib module | Commercial feature, focused on anomalies, outliers, and time-series data |
| Ecosystem | Modest – Banana, Zeppelin, with community support | Rich – Kibana, Grafana, with large-entity backing and a big user base |

Now that we know what the top 15 differences are, let's discuss each of them in greater detail.

Community and Developers

The first major difference between Solr and Elasticsearch is how they are developed, maintained, and supported. Solr, being a project of the Apache Software Foundation, is developed with the ASF philosophy in mind: community over code. Solr code is not always beautiful, but once a feature is there it usually stays and is not removed from the code base. The committers also come from different companies, and no single company controls the code base. You can become a committer if you show your interest and continued support for the project. On the other hand, we have Elasticsearch, backed by a single entity: the Elastic company. The code is available under the Apache 2.0 software license and is open and available on GitHub, so you can take part in the development by submitting pull requests, but the community is not the one that decides what will get into the code base and what will not. Also, to become a committer, you have to be part of the Elastic company itself.

Node Discovery

Another major difference between these two great products is node discovery. When the cluster is initially formed, when a new node joins, or when something bad happens to a node, something in the cluster has to decide, based on given criteria, what should be done. This is one of the responsibilities of so-called node discovery. Elasticsearch uses its own discovery implementation, called Zen, which, for full fault tolerance (i.e., not being affected by network splits), requires three dedicated master nodes. Solr uses Apache ZooKeeper for discovery and leader election. This requires an external ZooKeeper ensemble, which, for a fault-tolerant and fully available SolrCloud cluster, means at least three ZooKeeper instances.
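
As a rough sketch, assuming hypothetical host names, a minimal split-brain-safe setup for each engine might look like this (the Elasticsearch settings shown are the pre-7.x Zen ones):

```yaml
# elasticsearch.yml — three master-eligible nodes; a quorum of two must
# agree before a master is elected, which protects against split brain.
discovery.zen.ping.unicast.hosts: ["es-master-1", "es-master-2", "es-master-3"]
discovery.zen.minimum_master_nodes: 2   # quorum = (master_eligible / 2) + 1

# Solr, by contrast, delegates discovery to an external ZooKeeper ensemble,
# pointed at on startup:
#   bin/solr start -cloud -z "zk1:2181,zk2:2181,zk3:2181"
```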

Shard Placement

Generally speaking, Elasticsearch is very dynamic when it comes to the placement of indices and the shards they are built of. It can move shards around the cluster when a certain action happens – for example, when a new node joins or a node is removed from the cluster. We can control where shards should and shouldn't be placed using awareness tags, and we can tell Elasticsearch to move shards around on demand using an API call. Solr, on the other hand, is a bit more static. When a Solr node joins or leaves the cluster, Solr doesn't do anything on its own – it is up to us to rebalance the data. Of course, we can move shards, but it involves several steps: we need to create a replica, wait for it to synchronize the data, and then remove the one we no longer need. There is one thing that allows us to automate this a bit – removing or replacing a node in SolrCloud using the Collections API, which is a quick way of removing all of a node's shards or replicating them to another node. This still requires a manual API call, though; it is not something done automatically.
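
To make the contrast concrete, here is a sketch of what each call looks like; index, collection, and node names are hypothetical, and only the request payloads are shown:

```python
import json

# Elasticsearch: explicitly move a shard via the cluster reroute API.
# POST this body to /_cluster/reroute on any node.
reroute = {
    "commands": [
        {"move": {"index": "logs", "shard": 0,
                  "from_node": "node-1", "to_node": "node-2"}}
    ]
}
print(json.dumps(reroute))

# Solr: the Collections API call (REPLACENODE, available since Solr 6.2)
# that replicates all shards from one node to another in one step.
solr_url = ("/admin/collections?action=REPLACENODE"
            "&sourceNode=node-1:8983_solr&targetNode=node-2:8983_solr")
print(solr_url)
```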

Caches

Yet another big difference is the architecture of the two search engines. Without getting deep into how the caches work in both products, we will point out just the major difference between them. Let's start with what a segment is. A segment is a piece of the Lucene index that is built of various files, is mostly immutable, and contains data. When you index data, Lucene produces segments and can also merge multiple smaller, already existing ones into larger ones during a process called segment merging. The caches in Solr are global: there is a single cache instance of a given type per shard, covering all its segments. When a single segment changes, the whole cache needs to be invalidated and refreshed. That takes time and consumes hardware resources. In Elasticsearch, caches are per segment, which means that if only a single segment changed, then only a small portion of the cached data needs to be invalidated and refreshed. We will get to the pros and cons of each approach soon.
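
A toy model (not actual engine code) illustrates why this matters when a single segment changes:

```python
# Three segments, each with one cached entry for the same query "q".
global_cache = {("seg1", "q"): 1, ("seg2", "q"): 2, ("seg3", "q"): 3}
per_segment = {"seg1": {"q": 1}, "seg2": {"q": 2}, "seg3": {"q": 3}}

# Now seg2 changes (e.g., a commit produced a new version of it):
global_cache.clear()       # Solr-style global cache: everything is invalidated
per_segment["seg2"] = {}   # Elasticsearch-style: only seg2's entries are dropped

surviving = sum(len(c) for c in per_segment.values())
print(len(global_cache), surviving)  # 0 entries left vs 2 entries kept
```

With rapidly changing data, the per-segment approach keeps most of the cached work; with mostly static data, the global cache rarely needs invalidation, so its cost is amortized.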

Analytics Engine

Solr is large and has a lot of data analysis capabilities. We can start with the good, old facets – the first implementation that allowed users to slice and dice through the data to understand it. Then came the JSON facets, with similar features but faster and less memory-demanding, and finally the stream-based expressions called streaming expressions, which can combine data from multiple sources (like SQL, Solr, facets) and decorate it using various expressions (sort, extract, count significant terms, etc.). Elasticsearch provides a powerful aggregations engine that not only can do one-level data analysis, like most of the legacy Solr facets, but can also nest data analysis (e.g., calculate the average price for each product category in each shop division) and run analysis on top of aggregation results, which leads to functionality like moving-average calculation. Finally, though marked as experimental, Elasticsearch provides support for matrix aggregations, which can compute statistics over a set of fields.
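
The nested analysis mentioned above – average price per product category per shop division – can be expressed as an Elasticsearch aggregation request body roughly like this (field names are hypothetical):

```python
import json

# Two levels of terms buckets with a metric aggregation at the bottom:
# divisions -> categories -> avg price. "size": 0 skips returning hits.
aggs = {
    "size": 0,
    "aggs": {
        "divisions": {
            "terms": {"field": "division"},
            "aggs": {
                "categories": {
                    "terms": {"field": "category"},
                    "aggs": {"avg_price": {"avg": {"field": "price"}}}
                }
            }
        }
    }
}
print(json.dumps(aggs, indent=2))
```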

Optimized Query Execution

When dealing with time-based data, range queries are very common and can become a bottleneck because of the amount of data they need to process to match the given search criteria. With recent releases of Elasticsearch, for fields that have doc values enabled (like numeric fields), Elasticsearch is able to choose whether to iterate over all the documents or only match a particular set of documents. With this logic inside the search engine, Elasticsearch can provide very efficient range queries without any data modifications. Hopefully, we will see similar functionality in Solr as well.
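
For illustration, this is the kind of time-based range filter such optimizations target; the field and value names here are hypothetical:

```python
import json

# Count recent errors: a range filter on a timestamp field combined with a
# term filter. Filters skip scoring, so the engine is free to pick the
# cheapest execution strategy for the range clause.
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-1h", "lt": "now"}}},
                {"term": {"status": 500}}
            ]
        }
    }
}
print(json.dumps(query))
```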

Search Speed

Some time ago we did a few comparisons of Solr and Elasticsearch, and the results were pretty clear. Solr is awesome when it comes to static data, because of its caches and the ability to use an uninverted reader for faceting and sorting – e-commerce, for example. Elasticsearch is great in rapidly changing environments, like log analysis use cases. If you want to learn more, check out the video of two of our engineers, Radu and Rafał, giving the Side by Side with Elasticsearch & Solr Part 2 – Performance and Scalability talk at Berlin Buzzwords 2015.

Analysis Engine Performance & Precision

If you have mostly static data and need full precision for data analysis along with blazingly fast performance, you should look at Solr. In the tests we did for some conference talks (like the above-mentioned Side by Side with Elasticsearch & Solr Part 2 – Performance and Scalability talk at Berlin Buzzwords 2015), we saw that on static data Solr was awesome. What's more, facets in Solr are exact and do not lose precision, which is not always true with Elasticsearch. In certain edge cases, you may find Elasticsearch aggregation results to be imprecise, because of how data is placed across the shards.

Full Text Search Features

The richness of full text search features, and of the features close to full text searching, is enormous when you look into the Solr code base. Our Solr training classes are chock-full of this stuff! It starts with a wide selection of request parsers, goes through various suggester implementations, and extends to the ability to correct user spelling mistakes using spell checkers and extensive, highly configurable highlighting support. In Elasticsearch, we have a dedicated suggesters API that hides the implementation details from the user, giving us an easier way of implementing suggestions at the cost of reduced flexibility, and of course highlighting, which is less configurable than highlighting in Solr (though both are based on Lucene highlighting functionality).
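
As a sketch of the two styles, here is the same "did you mean" lookup in both engines; the field and suggestion names are hypothetical:

```python
import json

# Elasticsearch: a single suggest API, body-based (term suggester shown):
es_suggest = {
    "suggest": {
        "my-suggestion": {
            "text": "elasticsaerch",
            "term": {"field": "title"}
        }
    }
}
print(json.dumps(es_suggest))

# Solr: one of several spell-checking components, driven by URL parameters:
solr_params = "/select?q=elasticsaerch&spellcheck=true&spellcheck.collate=true"
print(solr_params)
```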

DevOps Friendliness

If you were to ask a devops person what (s)he loves about Elasticsearch, the answer would be the API, manageability, and ease of installation. When it comes to troubleshooting, Elasticsearch just makes it easy to get information about its state – from disk usage, through memory and garbage collection statistics, to Elasticsearch internals like cache, buffer, and thread pool utilization. Solr is not there yet – you can get some of this information via JMX MBeans and the new Solr Metrics API, but this means there are a few places one must look, and not everything is there, though it's getting there.

Non-flat Data Handling

You have non-flat data, with lots of nested objects inside nested objects inside yet more nested objects, and you don't want to flatten the data, but just index your beautiful MongoDB JSON objects and have them ready for full text searching? Elasticsearch will be a perfect tool for that, with its support for objects, nested documents, and parent-child relationships. Solr may not be the best fit here, but remember that it also supports parent-child and nested documents, both when indexing XML documents and JSON. There is also one more very important thing – Solr supports query-time joins both inside and across different collections, so you are not limited to index-time parent-child handling.
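
A sketch of both approaches, with hypothetical index, type, and field names:

```python
import json

# Elasticsearch: a mapping using the nested type, so each comment is matched
# as its own sub-document instead of being flattened into the parent.
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                "title": {"type": "text"},
                "comments": {
                    "type": "nested",
                    "properties": {
                        "author": {"type": "keyword"},
                        "body": {"type": "text"}
                    }
                }
            }
        }
    }
}
print(json.dumps(mapping))

# Solr: a query-time join relates documents at search time instead –
# here, find documents joined through a manufacturer id field.
solr_join = "/select?q={!join from=manu_id to=id}category:electronics"
print(solr_join)
```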

Query DSL

Let's say it out loud – the query language of Elasticsearch is really great. If you love JSON, that is. It lets you structure the query using JSON, so it will be well structured and give you control over the whole logic. You can mix different kinds of queries to write very sophisticated matching logic. Of course, full text search is not everything; you can include aggregations, results collapsing, and so on – basically, everything you need from your data can be expressed in the query language. Solr, on the other hand, still uses URI search, at least in its most widely used API (there is also a limited JSON API and an XML query parser available). All the parameters go into the URI, which can lead to long and complicated queries. Both approaches have their pros and cons, and novice users tend to need help with queries in both search engines.
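
The same search expressed both ways, with hypothetical field names, makes the difference obvious:

```python
import json
from urllib.parse import urlencode

# Elasticsearch: a structured JSON body — a full text match plus a filter.
es_query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "solr elasticsearch"}}],
            "filter": [{"term": {"published": True}}]
        }
    }
}
print(json.dumps(es_query))

# Solr: the widely used URI form — everything goes into URL parameters.
solr_query = "/select?" + urlencode({
    "q": "title:(solr elasticsearch)",
    "fq": "published:true",
    "rows": 10,
})
print(solr_query)
```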

Index/Collection Leader Control

While Elasticsearch is dynamic in nature when it comes to shard placement around the cluster, it doesn't give us much control over which shards will take the role of primaries and which ones will be replicas – that is beyond our control. In Solr you have that control, which is a very good thing when you consider that, during indexing, the leaders do more work, because they forward the data to all their replicas. With the ability to rebalance the leaders, or to explicitly say where they should be put, we can evenly balance the load across the cluster.
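
A sketch of the relevant Solr Collections API calls (the collection, shard, and replica names are hypothetical): first mark a replica as the preferred leader, then ask Solr to rebalance leaders toward the preferred replicas.

```python
# Step 1: set the preferredLeader property on a chosen replica.
prefer = ("/admin/collections?action=ADDREPLICAPROP&collection=products"
          "&shard=shard1&replica=core_node2"
          "&property=preferredLeader&property.value=true")
print(prefer)

# Step 2: rebalance leadership so preferred replicas become leaders.
rebalance = "/admin/collections?action=REBALANCELEADERS&collection=products"
print(rebalance)
```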

Machine Learning

Machine learning is a trending topic about which you will hear even more in the coming months and years. In Solr it comes for free, in the form of a contrib module and on top of the streaming aggregations framework. With the additional libraries in the contrib module, you can use machine-learned ranking models and feature extraction on top of Solr, while the streaming-aggregations-based machine learning is focused on text classification using logistic regression. On the other hand, we have Elasticsearch and its commercial X-Pack plugin, which comes with a Kibana plugin that supports machine learning algorithms focused on anomaly and outlier detection in time-series data. It's a nice package of tools bundled with professional services, but quite pricey. Thus, we unpacked the X-Pack and listed available X-Pack alternatives: open-source tools, commercial alternatives, and cloud services.

Ecosystem

When it comes to the ecosystem, the tools that come with Solr are nice, but they feel modest. We have a Kibana port called Banana, which went its own way, and tools like the Apache Zeppelin integration, which allows running SQL on top of Apache Solr. Of course, there are other tools that can read data from Solr, send data to Solr, or use Solr as a data source – Flume, for example. Most of these tools are developed and supported by a wide variety of enthusiasts. The ecosystem around Elasticsearch, on the other hand, is very modern and well organized. You have a new version of Kibana with new features popping up every month. If you don't like Kibana, you have Grafana, which is now a product of its own providing a wide variety of features, and there is a long list of data shippers and tools that can use Elasticsearch as a data source. Finally, those products are backed not only by enthusiasts but also by large, commercial entities.

This is obviously not an exhaustive list of Solr and Elasticsearch differences. We could go on for several blog posts and make a book out of it, but hopefully the list above gave you an idea of what to expect from one and the other.

Want to learn more about Solr or Elasticsearch?

Don’t forget to download the Cheat Sheet you need. Here they are:

Then, subscribe to our blog or follow @sematext. If you need any help with Solr / SolrCloud or Elasticsearch – don’t forget that we provide Solr & Elasticsearch Consulting, Solr & Elasticsearch Production Support, and offer both Solr & Elasticsearch Training.


More Stories By Sematext Blog

Sematext is a globally distributed organization that builds innovative Cloud and On Premises solutions for performance monitoring, alerting and anomaly detection (SPM), log management and analytics (Logsene), and search analytics (SSA). We also provide Search and Big Data consulting services and offer 24/7 production support for Solr and Elasticsearch.
