Solr, the Other In-Memory NoSQL Database?

Using Solr for general purpose analysis

If you ask well-informed technical people what Apache Solr is used for, the most likely answer is that Apache Solr + Lucene is an open source text search engine: documents are indexed into Solr and, once indexed, can be searched with free-form queries in much the same way you would query Google. Others might add that Solr has very capable geo-location indexing that supports radius, bounding-box, and defined-area searches. Both answers are well informed and correct.

What may be less well known is that Apache Solr (+ Lucene) can also be used effectively as a query engine for structured, indexed data, and it delivers lightning-fast response times too. By leveraging Solr in this way you can extend your current use of Solr, or add Solr to your existing cluster, to get more value out of your existing data assets.

This article shows how Solr can be leveraged to provide exceptional response times for a wide variety of business-style queries. It provides guidance on how to index documents into a Solr cluster and how to issue complex queries against the indexed documents. After the nuts and bolts are covered, it discusses important considerations for using Solr in this way, and it finishes with a review of Solr's capabilities compared to other in-memory NoSQL engines such as MongoDB.

In short, this article provides a great overview of how to leverage Solr as a NoSQL in-memory database.

Let's Get Some Data

In searching for data to index into Solr I had a few criteria. I wanted the number of fields to be small so that the data set could be easily understood. I also wanted a data set that isn't a typical text-based corpus but rather a more business-oriented data set. Lastly, I wanted a data set with some numerical values so that Solr's comparison and range filtering could be easily demonstrated and understood.

After a little searching online I found the following data set, which I believe meets all of my criteria:



The above data set is a simple listing of electricity rates by zip code for 2011. The data set contains the following fields and types:

Field Name      Field Type   Field Description
zip             int          Rate zip code
eiaid           int          Energy provider id
utility_name    string       Company name
state           string       State abbreviation
service_type    string       Service type (e.g., Bundled)
ownership       string       Ownership type (e.g., Investor Owned)
comm_rate       double       Commercial rate in $ / KWH
ind_rate        double       Industrial rate in $ / KWH
res_rate        double       Residential rate in $ / KWH

The CSV file from the above URL has been downloaded and, for brevity, renamed to rates.csv. The first few lines of that file are shown below:


35218,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065
35219,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065
35214,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065
35215,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065
35216,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065
35210,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065
35211,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065
35212,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065
35213,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

Let's Create and Load a Schema


Solr can infer a schema from indexed data, but doing so leaves it up to Solr to determine the fields and their types. To be assured that we get the appropriate indexing and type semantics, defining a schema is recommended. In our example we will query certain fields and apply comparison and range filters to them, so we must make sure those fields are indexed and defined with the proper field types before we index data into Solr. We also take care not to index fields that will not be searched or faceted, which minimizes the memory needed to fulfill our business needs.

First we instruct Solr to create a default configuration set on the local file system. To do this we issue the following command, where /tmp/electric_rates is the local directory where Solr will place our default configuration set:

solrctl --zk localhost:2181/solr instancedir --generate /tmp/electric_rates

In the /tmp/electric_rates directory there will now be a file named schema.xml. This is a rather large XML file containing definitions that are referenced elsewhere in the configuration; the main area of concern for us is the field definitions. All of the example field definitions can be removed. Listed below are the field definitions we will use for our example electric rate data set:

<field name="zip" type="int" indexed="true" stored="true" required="true"/>

<field name="eiaid" type="int" indexed="false" stored="true"/>

<field name="utility_name" type="string" indexed="true" stored="true" omitNorms="true"/>

<field name="state" type="string" indexed="true" stored="true" omitNorms="true"/>

<field name="service_type" type="string" indexed="false" stored="true" omitNorms="true"/>

<field name="ownership" type="string" indexed="false" stored="true" omitNorms="true"/>

<field name="comm_rate" type="double" indexed="true" stored="true"/>

<field name="ind_rate" type="double" indexed="true" stored="true"/>

<field name="res_rate" type="double" indexed="true" stored="true"/>

You will note that there are a few "int" fields, a few "string" fields, and a few "double" fields. Also note that only some fields are designated indexed="true"; these are the fields we will query on or apply grouping functions to. The omitNorms setting informs Solr that we will NOT be using these fields in any form of boosted search. Boosting is an advanced way to instruct Solr that a specific field is more or less important in certain "boosted" queries.
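For context, here is a hypothetical example of what a query-time boost looks like; it is not used anywhere in this walkthrough and the query values are made up for illustration. It weights an exact match on utility_name twice as heavily as a match on state when scoring results, with the parameters sent as a POST body via curl's --data-urlencode (which Solr's select handler accepts):

curl "http://localhost:8983/solr/electric_collection_shard1_replica1/select" \
  --data-urlencode 'q=state:AL OR utility_name:"Alabama Power Co"^2' \
  --data-urlencode 'wt=json'

Since our queries only filter, group, and compute statistics, relevance boosting is not needed here, which is why omitNorms="true" is a safe choice for these fields.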

After the schema.xml file has been edited, the instance directory must be registered in ZooKeeper using the following command:

solrctl --zk localhost:2181/solr instancedir --create electric_collection /tmp/electric_rates

Next we instruct Solr to create a new collection with the following command:

solrctl --zk localhost:2181/solr collection --create electric_collection

Finally, to index the data into Solr we will use the already-configured CSV request handler, which makes it easy to index this CSV file. It should be noted that this is an excellent utility for small data sets, but it is not really recommended for larger ones; for larger data sets you might want to consider the MapReduceIndexerTool, though I leave investigating that to the reader. The following command will get our data indexed:

curl "http://localhost:8983/solr/electric_collection_shard1_replica1/update/csv?header=true& \ rowid=id&stream.file=/tmp/rates.csv&stream.contentType=text/csv;charset=utf-8"

Upon completion you will note that 37,791 documents were indexed into Solr. Obviously this is not a large data set, but the intention is to demonstrate query capabilities first; response times are secondary information.
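As a quick sanity check, a simple match-all query against the same collection URL returns a numFound value that should equal the number of indexed documents:

curl "http://localhost:8983/solr/electric_collection_shard1_replica1/select?q=*:*&rows=0&wt=json"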

Now Let's Get Some Business Answers and Make it Fast!


To demonstrate Solr's query capabilities, let's answer some business-style questions against our newly indexed data set. For each business question I will provide the query along with a breakdown of each query element. To keep the article short I will not list the full Solr response, but only provide the answers in very short form.

How many utility companies serve the state of Maryland (MD)?

To answer this question we need to apply a filter to the state field, selecting only results from "MD". To determine how many utility companies exist in MD, we ask Solr to group the results on the utility_name field, limiting each group to just one result since we only care how many groups there are in total.
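One way to express this query, using standard Solr request parameters and the collection URL from the indexing step (a sketch; the exact parameters can vary), is:

curl "http://localhost:8983/solr/electric_collection_shard1_replica1/select?q=state:MD&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1"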



Listed below are the query elements decomposed for better understanding:

- Solr collection select URL
- State = "MD" filter
- Results in JSON format
- Indent the results
- Group results
- Group by utility_name
- Limit the number of groups to 10
- Only 1 result per group


The number of groups returned is 4 and the result was returned in 23 milliseconds!

Which Maryland utility has the cheapest residential rates?

To answer this question we only need two small changes to the prior query: instruct Solr to sort the groups by res_rate in ascending order, which places the cheapest residential rate at the top, and limit the number of groups returned to just 1.
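Building on the previous query, one way to express this (again a sketch; note the URL-encoded space in the sort parameter) is:

curl "http://localhost:8983/solr/electric_collection_shard1_replica1/select?q=state:MD&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate%20asc"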


Listed below are the new or modified query elements decomposed for better understanding:

- Limit the number of groups to 1
- Sort groups by res_rate (ascending)


The cheapest utility in MD is "The Potomac Edison Company" at $0.03079 / KWH, and the result was returned in 4 milliseconds!

What are the minimum and maximum residential power rates excluding missing data elements?

To fulfill this query we need to filter out rows where res_rate = 0.0, as these represent missing data. We accomplish this with an "frange" query that excludes the lower bound of 0.0. To get the minimum and maximum res_rate, we instruct Solr to generate statistics for the res_rate field.
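One way to express this query (a sketch using curl's --data-urlencode option, which sends the parameters as a POST body and avoids hand-encoding the frange local-params syntax) is:

curl "http://localhost:8983/solr/electric_collection_shard1_replica1/select" \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq={!frange l=0.0 incl=false}res_rate' \
  --data-urlencode 'wt=json' \
  --data-urlencode 'indent=true' \
  --data-urlencode 'rows=0' \
  --data-urlencode 'stats=true' \
  --data-urlencode 'stats.field=res_rate'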


Listed below are the query elements decomposed for better understanding:

- Solr collection select URL
- Consider all documents
- Range query excluding the lower bound of 0.0 on the res_rate field; restated without URL-encoded characters: fq={!frange l=0.0 incl=false}res_rate
- Results in JSON format
- Indent the results
- No document results to be returned
- Generate statistics
- Return stats on the res_rate field


The res_rate minimum is 0.0260022258659 and the res_rate maximum is 0.849872773537. Results were returned in 5 milliseconds.

What is the state and zip code with the highest res_rate?

To fulfill the above business request we take the maximum res_rate returned by the prior query and use it as a filter in the next query.
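One way to express this (a sketch, filtering on the maximum res_rate value found above) is:

curl "http://localhost:8983/solr/electric_collection_shard1_replica1/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1"

The matching document's stored fields include the state and zip values we are after.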


Listed below are the query elements decomposed for better understanding:

- Solr collection select URL
- Select the documents with the target res_rate
- Results in JSON format
- Indent the results
- Return only 1 result if found


The highest residential electric rates are found in Alaska in zip code 99634. The results were returned in 1 millisecond!


Guidelines for Using Solr to Meet Your Analysis Needs


It is worth pointing out that Solr should not be thought of as a general-purpose in-memory NoSQL engine. With that in mind, here are some guidelines to help you decide when it might be appropriate to leverage Solr's query capabilities:

1. Your use case requires very fast query response times.
2. The data you need to analyze is already stored in Hadoop.
3. You can easily define a schema for the data to be indexed.
4. You need to query (filter) on many fields.
5. The amount of data to be indexed into Solr will not exceed your Solr cluster's capabilities.

If many or all of the above criteria apply then using Solr for your data analysis might just be a great fit.

Comparing Solr to MongoDB


MongoDB is one of several NoSQL database engines in existence today, and it often gets consideration when someone is investigating fast, scalable, general-purpose databases. For comparison purposes, here are the features compared between Solr and MongoDB:




- Supports in-memory analysis
- Requires schema definition (Solr: yes, highly recommended)
- Supports dynamic addition of new indexes on existing fields (Solr: no, requires re-indexing documents)
- Scales to support more data
- Supports SQL syntax
- General-purpose in-memory database
- Supports the HDFS file system






As you can see, Solr provides lightning-fast query response times for a wide variety of business-style queries. Its query language is not nearly as well known as SQL, but Solr has some excellent capabilities that can be leveraged with some thought and practice.

To get the answers above we leveraged grouping, group sorting, field selection (filtering), statistics generation, and range selection. While Solr should not be considered a general-purpose NoSQL in-memory database system, it can still be leveraged to yield some very capable analysis results with awesome response times. As such, it should be viewed as another tool in the toolbox that, when used correctly, can simplify the life of the Hadoop ecosystem architect!

System Specifications


All of the above queries were issued against a single Solr instance running in a virtual machine with the following configuration:

- CentOS 6.6
- CDH 5.0.0
- Solr memory available: 5.84 GB






Related Links


Solr quick start:
Solr reference guide:
Solr in Action (book)



More Stories By Pete Whitney

Pete Whitney is a Solutions Architect for Cloudera. His primary role at Cloudera is guiding and assisting Cloudera's clients through successful adoption of Cloudera's Enterprise Data Hub and surrounding technologies.

Previously Pete served as VP of Cloud Development for FireScope Inc. In the advertising industry Pete designed and delivered DG Fastchannel’s internet-based advertising distribution architecture. Pete also excelled in other areas including design enhancements in robotic machine vision systems for FSI International Inc. These enhancements included mathematical changes for improved accuracy, improved speed, and automated calibration. He also designed a narrow spectrum light source, and a narrow spectrum band pass camera filter for controlled machine vision imaging.

Pete graduated Cum Laude from the University of Texas at Dallas, and holds a BS in Computer Science. Pete can be contacted via Email at [email protected]
