
Solr, the Other In-Memory NoSQL Database?

Using Solr for general purpose analysis

If you ask well-informed technical people what Apache Solr is used for, the most likely response is that Apache Solr + Lucene is an open source text search engine. Documents are indexed into Solr, and once indexed, those same documents can easily be searched using free-form queries in much the same way you would query Google. Others might add that Solr has some very capable geo-location indexing capabilities that support radius, bounded-box, and defined-area searches. Both of these answers are well informed and correct.

What may be less well known is that Apache Solr (+ Lucene) can also be used effectively for certain indexed data queries, and it provides lightning-fast response times too. Leveraging Solr in this way lets you extend your current use of Solr, or add Solr to your existing cluster, to get more value from your existing data assets.

This article will show how Solr can be leveraged to provide exceptional response times for a wide variety of business-style queries. Guidance will be provided on how to index documents into a Solr cluster and how to issue complex queries against the indexed documents. After the nuts and bolts are covered, important considerations for using Solr in this way will be discussed. The article finishes with a review of Solr's capabilities compared to other in-memory NoSQL engines such as MongoDB.

In short, this article provides a great overview of how to leverage Solr as a NoSQL in-memory database.

Let's Get Some Data

In searching for data to index into Solr I had a few criteria. I wanted the number of fields to be small, so that the data set could be easily understood. I also wanted a data set that is not a typical text-based data set, but rather more of a business data set. Lastly, I wanted a data set with some numerical values so that Solr's comparison and range filtering capabilities could be easily demonstrated and understood.

After a little searching online I found the following data set, which meets all of my criteria:

 

http://catalog.data.gov/dataset/u-s-electric-utility-companies-and-rates-look-up-by-zipcode-feb-2011-57a7c

The above data set is a simple listing of electricity rates by zip code for 2011. The data set contains the following fields and types:

Field Name      Field Type   Field Description
zip             Numeric      Rate zip code
eiaid           Numeric      Energy provider id
utility_name    String       Company name
state           String       Two-letter state abbreviation
service_type    String       Type of service (e.g., Bundled)
ownership       String       Ownership type (e.g., Investor Owned)
comm_rate       double       Commercial rate, $ / KWH
ind_rate        double       Industrial rate, $ / KWH
res_rate        double       Residential rate, $ / KWH

 

The csv file from the above URL has been downloaded and, for brevity, renamed to rates.csv. Here are the first few lines from that csv file:

zip,eiaid,utility_name,state,service_type,ownership,comm_rate,ind_rate,res_rate

35218,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35219,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35214,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35215,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35216,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35210,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35211,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35212,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065

35213,195,Alabama Power Co,AL,Bundled,Investor Owned,0.105761195393,0.0602924366735,0.114943267065


Let's Create and Load a Schema

 

Solr can infer a schema from indexed data, but doing so leaves it up to Solr to determine the fields and types. To be assured of appropriate indexing and type semantics, defining a schema is recommended. In our example we will query certain fields and apply comparison and range filters to others, so those fields must be indexed and defined with the proper field type before we index data into Solr. We also take care not to index fields that will not be searched or faceted, which minimizes the memory needed to fulfill the business requirements.

First we instruct Solr to create a default configuration set on the local file system. To do this we issue the following command, where /tmp/electric_rates is the local directory where Solr will place our default configuration set:

solrctl --zk localhost:2181/solr instancedir --generate /tmp/electric_rates

In the /tmp/electric_rates directory there will now be a file named schema.xml. This is a rather large XML file and contains definitions that are leveraged in other areas; the main area of concern is the field definitions. All of the example field definitions can be removed (do keep the default id field and the <uniqueKey>id</uniqueKey> declaration if present, since the CSV indexing command below uses rowid=id to populate it). Listed below are the field definitions we will use for our example electric rate data set:

<field name="zip" type="int" indexed="true" stored="true" required="true"/>

<field name="eiaid" type="int" indexed="false" stored="true"/>

<field name="utility_name" type="string" indexed="true" stored="true" omitNorms="true"/>

<field name="state" type="string" indexed="true" stored="true" omitNorms="true"/>

<field name="service_type" type="string" indexed="false" stored="true" omitNorms="true"/>

<field name="ownership" type="string" indexed="false" stored="true" omitNorms="true"/>

<field name="comm_rate" type="double" indexed="true" stored="true"/>

<field name="ind_rate" type="double" indexed="true" stored="true"/>

<field name="res_rate" type="double" indexed="true" stored="true"/>

You will note that there are a few "int" fields, a few "string" fields, and a few "double" fields. Also note that only some fields are designated indexed="true"; these are the fields we will query on or apply grouping functions to. The omitNorms setting informs Solr that we will NOT be using these fields in any form of boosted search. Boosting is an advanced way to instruct Solr that a specific field is more or less important in certain "boosted" queries.

After the schema.xml file has been edited, the configuration must be uploaded to ZooKeeper as a Solr instance directory using the following command:

solrctl --zk localhost:2181/solr instancedir --create electric_collection /tmp/electric_rates

Next we instruct Solr to create a new collection with the following command:

solrctl --zk localhost:2181/solr collection --create electric_collection
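If you want to confirm that the instance directory and collection were registered, solrctl can list both. This is an optional sanity check, assuming your solrctl version supports the --list option:

solrctl --zk localhost:2181/solr instancedir --list

solrctl --zk localhost:2181/solr collection --list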

Finally, to index the data into Solr we will use the already configured CSV request handler, which makes it easy to index this csv file. This is an excellent utility for small data sets, but it is not really recommended for larger ones; for larger data sets consider the MapReduceIndexerTool, which I leave to the reader to investigate. The following command will get our data indexed:

curl "http://localhost:8983/solr/electric_collection_shard1_replica1/update/csv?header=true& \ rowid=id&stream.file=/tmp/rates.csv&stream.contentType=text/csv;charset=utf-8"

Upon completion you will note that 37,791 documents were indexed into Solr. Obviously this is not a large data set, but the intention is to demonstrate query capabilities first, with response times as secondary information.


Now Let's Get Some Business Answers and Make it Fast!

 

To demonstrate Solr's query capabilities on our newly indexed data set, let's answer some business-style questions against it. For each business question I will provide the query along with a breakdown of each query element. To keep the article short I will not list the full Solr responses, only the answers in very short form.

How many utility companies serve the state of Maryland (MD)?

To answer this question we apply a filter to the state field, selecting only results for 'MD'. To determine how many utility companies exist in MD, we ask Solr to group the results by the utility_name field, limiting each group to a single result since we only care how many total groups there are. The following query fulfills the request:

http://localhost:8983/solr/electric_collection_shard1_replica1/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1

 

Listed below are the query elements decomposed for better understanding:

Query Element                                                            Description
http://localhost:8983/solr/electric_collection_shard1_replica1/select   Solr collection select URL
q=state%3A%22MD%22                                                       State="MD" filter (URL-encoded state:"MD")
wt=json                                                                  Results in JSON format
indent=true                                                              Indent the results
group=true                                                               Group results
group.field=utility_name                                                 Group by utility_name
rows=10                                                                  Limit # of groups to 10
group.limit=1                                                            Only 1 result per group

 

The number of groups returned is 4 and the result was returned in 23 milliseconds!
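As a side note, the same query can be issued with curl and the --data-urlencode option, letting curl handle the URL encoding instead of encoding parameters by hand. This is simply a sketch of the request decomposed above:

curl -G "http://localhost:8983/solr/electric_collection_shard1_replica1/select" \
  --data-urlencode 'q=state:"MD"' \
  --data-urlencode 'wt=json' \
  --data-urlencode 'indent=true' \
  --data-urlencode 'group=true' \
  --data-urlencode 'group.field=utility_name' \
  --data-urlencode 'rows=10' \
  --data-urlencode 'group.limit=1'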

Which Maryland utility has the cheapest residential rates?

To answer this question we only need to add one element to the prior query, instructing Solr to sort the groups in ascending order by residential rate; this places the cheapest rate at the top, and we can also limit the number of groups returned to just 1.

http://localhost:8983/solr/electric_collection_shard1_replica1/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate+asc

Listed below are the new or modified query elements decomposed for better understanding:

Query Element        Description
rows=1               Limit # of groups to 1
sort=res_rate+asc    Sort groups by res_rate ascending

 

The cheapest utility in MD is "The Potomac Edison Company" at $0.03079 / KWH, and the result was returned in 4 milliseconds!
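The curl form of this query is shown below; the fl parameter is an optional addition (not part of the original query) that trims the grouped documents down to just the utility name and residential rate:

curl -G "http://localhost:8983/solr/electric_collection_shard1_replica1/select" \
  --data-urlencode 'q=state:"MD"' \
  --data-urlencode 'group=true' \
  --data-urlencode 'group.field=utility_name' \
  --data-urlencode 'group.limit=1' \
  --data-urlencode 'rows=1' \
  --data-urlencode 'sort=res_rate asc' \
  --data-urlencode 'fl=utility_name,res_rate' \
  --data-urlencode 'wt=json'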

What are the minimum and maximum residential power rates excluding missing data elements?

To fulfill this query we need to filter out rows where res_rate = 0.0, as these are missing data elements. We accomplish this using an "frange" filter query that excludes the lower bound of 0.0. To get the minimum and maximum res_rate we instruct Solr to generate statistics for the res_rate indexed field. The query to answer the above business question is listed below:

http://localhost:8983/solr/electric_collection_shard1_replica1/select?q=*:*&fq={!frange+l%3D0.0+incl%3Dfalse}res_rate&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate

Listed below are the query elements decomposed for better understanding:

Query Element                                                            Description
http://localhost:8983/solr/electric_collection_shard1_replica1/select   Solr collection select URL
q=*:*                                                                    Consider all documents
fq={!frange+l%3D0.0+incl%3Dfalse}res_rate                                Range filter on res_rate excluding the lower bound of 0.0 (unencoded: fq={!frange l=0.0 incl=false}res_rate)
wt=json                                                                  Results in JSON format
indent=true                                                              Indent the results
rows=0                                                                   Return no document results
stats=true                                                               Generate statistics
stats.field=res_rate                                                     Return stats on the res_rate field

 

The res_rate minimum is 0.0260022258659 and the res_rate maximum is 0.849872773537. Results were returned in 5 milliseconds.
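Because the local-params syntax used in the frange filter is awkward to URL-encode by hand, here is the same statistics query expressed with curl doing the encoding; this is a sketch equivalent to the encoded URL above:

curl -G "http://localhost:8983/solr/electric_collection_shard1_replica1/select" \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq={!frange l=0.0 incl=false}res_rate' \
  --data-urlencode 'rows=0' \
  --data-urlencode 'stats=true' \
  --data-urlencode 'stats.field=res_rate' \
  --data-urlencode 'wt=json' \
  --data-urlencode 'indent=true'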

What is the state and zip code with the highest res_rate?

To fulfill the above business request we take the maximum res_rate returned by the prior query and use it as a filter in the following query:

http://localhost:8983/solr/electric_collection_shard1_replica1/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1

Listed below are the query elements decomposed for better understanding:

Query Element                                                            Description
http://localhost:8983/solr/electric_collection_shard1_replica1/select   Solr collection select URL
q=res_rate:0.849872773537                                                Select documents with the target res_rate
wt=json                                                                  Results in JSON format
indent=true                                                              Indent the results
rows=1                                                                   Return only 1 result if found

 

The highest residential electric rates are found in Alaska in zip code 99634. The results were returned in 1 millisecond!
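As a side note, the last two questions could also be answered with a single request by sorting the non-zero residential rates in descending order and returning only the top document. The sketch below does exactly that; the fl parameter is an optional addition that limits the returned fields to the state, zip, and rate:

curl -G "http://localhost:8983/solr/electric_collection_shard1_replica1/select" \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq={!frange l=0.0 incl=false}res_rate' \
  --data-urlencode 'sort=res_rate desc' \
  --data-urlencode 'rows=1' \
  --data-urlencode 'fl=state,zip,res_rate' \
  --data-urlencode 'wt=json'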

 

Guidelines for Using Solr to Meet Your Analysis Needs

 

It is worth pointing out that Solr should not be thought of as a general purpose in-memory NoSQL engine. With that in mind, here are some guidelines to help determine when it might be appropriate to leverage Solr's query capabilities:

1. Your use case requires very fast query response times
2. The data you need to analyze is already stored in Hadoop
3. You can easily define a schema for the data to be indexed
4. You need to query (filter) on many fields
5. The amount of data to be indexed into Solr will not exceed your Solr cluster capabilities

If many or all of the above criteria apply then using Solr for your data analysis might just be a great fit.


Comparing Solr to MongoDB

 

MongoDB is one of several NoSQL database engines in existence today, and it often gets consideration when someone is investigating fast, scalable, general purpose databases. For comparison purposes, the table below details which of the listed features each engine supports.

Consideration                                                  Solr                                  MongoDB
Supports in-memory analysis                                    Yes                                   Yes
Requires schema definition                                     Yes (highly recommended)              No
Supports dynamic addition of new indexes on existing fields    No (requires re-indexing documents)   Yes
Scales to support more data                                    Yes                                   Yes
Supports SQL syntax                                            No                                    No
General purpose in-memory database                             No                                    Yes
Supports the HDFS file system                                  Yes                                   No

 

Summary

 

As you can see, Solr provides lightning-fast query response times for a wide variety of business-style queries. The query language is not nearly as well known as SQL, but Solr has some excellent capabilities that can be leveraged with some thought and practice.

To get the answers above we leveraged grouping, group sorting, field selection (filtering), statistics generation, and range selection. While Solr should not be considered a general purpose NoSQL in-memory database system, it can still be leveraged to yield some very capable analysis results with awesome response times. As such, it should be viewed as another tool in the toolbox that, when used correctly, can simplify the life of the Hadoop ecosystem architect!


System Specifications

 

All of the above queries were issued against a single Solr instance running in a virtual machine.

OS                        CentOS 6.6
Hadoop                    CDH 5.0.0
Solr                      4.10.3
Solr memory available     5.84 GB
Java                      1.7.0_67
Processors                1

 


Related Links

 

Solr quick start:

http://lucene.apache.org/solr/quickstart.html

Solr reference guide:

https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

Solr in Action (book)

https://www.manning.com/books/solr-in-action

 


About the Author

Pete Whitney is a Solutions Architect for Cloudera. His primary role at Cloudera is guiding and assisting Cloudera's clients through successful adoption of Cloudera's Enterprise Data Hub and surrounding technologies.

Previously, Pete served as VP of Cloud Development for FireScope Inc. In the advertising industry, Pete designed and delivered DG Fastchannel's internet-based advertising distribution architecture. Pete also excelled in other areas, including design enhancements to robotic machine vision systems for FSI International Inc. These enhancements included mathematical changes for improved accuracy, improved speed, and automated calibration. He also designed a narrow-spectrum light source and a narrow-spectrum band-pass camera filter for controlled machine vision imaging.

Pete graduated Cum Laude from the University of Texas at Dallas, and holds a BS in Computer Science. Pete can be contacted via Email at [email protected]
