Solr Query Segmenter: How to Provide Better Search Experience

One way to create a better search experience is to understand user intent.  One of the phases in that process is query understanding, and one simple step in that direction is query segmentation. In this post, we’ll cover what query segmentation is and when it is useful. We will also introduce Solr Query Segmenter, an open-source Solr component that we developed to make the search experience better.

What is query segmentation and when is it useful?

Query segmentation refers to processing a query in order to break it down into meaningful segments.  Such segments may be single tokens or sequences of multiple tokens.  Once such meaningful segments are discovered, one can use them to enhance the search experience in various ways.  One use of query segmentation is to rewrite the query in order to make it more precise.  Below are two examples everyone will be able to relate to.

Think about searching for people on LinkedIn.  Sometimes you search for a specific person using their first and last name.  If that name is fairly unique, it’s easy to locate that person (e.g. at the moment, there is just one Otis Gospodnetic on the planet, so it’s easy to find his LinkedIn profile).  However, when the name alone is not enough, people add criteria to make their query more precise.  For example, there are over 35,000 people named Satya on LinkedIn, but if you search for Satya Microsoft there is only one match.  While I do not know exactly how LinkedIn handles queries like “Satya Microsoft”, it could be using query segmentation.  Query segmentation could determine that Satya is a first name (or at least part of a personal name) and that Microsoft is the name of an organization.  Using this knowledge, the query could be rewritten into the equivalent of firstName:Satya AND organization:Microsoft, which is more precise than a generic version of the query such as keywords:(Satya Microsoft).
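
To make the idea concrete, here is a minimal Python sketch of this kind of rewrite. The dictionaries, field names, and query syntax below are purely illustrative assumptions, not LinkedIn’s actual implementation:

```python
# Hypothetical dictionaries; a real system would use much larger ones.
FIRST_NAMES = {"satya", "otis"}
ORGANIZATIONS = {"microsoft", "linkedin"}

def rewrite(query: str) -> str:
    """Rewrite each recognized token into a fielded clause."""
    parts = []
    for token in query.split():
        t = token.lower()
        if t in FIRST_NAMES:
            parts.append(f"firstName:{token}")
        elif t in ORGANIZATIONS:
            parts.append(f"organization:{token}")
        else:
            parts.append(f"keywords:{token}")  # fall back to a generic field
    return " AND ".join(parts)

print(rewrite("Satya Microsoft"))  # firstName:Satya AND organization:Microsoft
```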

Another query segmentation use case can be found in retail, where people may search for things like “red dress” or “toaster stainless steel Braun”.  Using query segmentation, one could rewrite the query with the understanding that red is a color, dress and toaster are the actual items, stainless steel is a material, and Braun is the name of a company.  A query rewritten using this knowledge can yield more precise results, helping people find what they are looking for faster instead of wading through hundreds of items that are really just loose query matches.

Query segmentation can also be used to extract location or point-of-interest information from a query and turn it into a geospatial query, as you can see in the Solr Query Segmenter README.

Setup Solr Example

We’ll assume you already have Solr running.  In the example below we’ll use Solr 6.0.1, but other versions should work, as long as there is a version of Solr Query Segmenter built against them (if you don’t find one, send us a PR!).  Solr ships with several examples.  We’ll use the “techproducts” example to show how Solr Query Segmenter works and what is in it for you.

Let’s first run the techproducts example:

bin/solr start -e techproducts

Just to make sure all is working, you should be able to visit http://localhost:8983/solr/#/techproducts and see the Solr web admin interface.

If all is OK, we can stop Solr for now:

bin/solr stop -all

Setup Query Segmenter

Query Segmenter setup has two parts:

  1. Downloading / installing the required JAR files
  2. Configuration

Setup library

Download the Query Segmenter JARs from the public Maven repository:

mkdir example/techproducts/solr/techproducts/lib
cd example/techproducts/solr/techproducts/lib
wget https://oss.sonatype.org/content/repositories/releases/com/sematext/querysegmenter/st-QuerySegmenter-core/1.3.6.3.0/st-QuerySegmenter-core-1.3.6.3.0.jar
wget https://oss.sonatype.org/content/repositories/releases/com/sematext/querysegmenter/st-QuerySegmenter-solr/1.3.6.3.0/st-QuerySegmenter-solr-1.3.6.3.0.jar

Configuration

The Query Segmenter Solr library includes Solr components that work together with QueryComponent, the Solr SearchComponent that handles queries. The library currently contains three components: QuerySegmenterQParser, QuerySegmenterComponent, and CentroidComponent. Let’s have a look at each of them.

Dictionary-based Segmentation

Query segmentation is based on matching dictionary elements against queries.  Dictionary elements are specified in dictionary files, which are plain text files that contain segments to look for when parsing and segmenting a query.  A few dictionary files used for unit tests can be found under https://github.com/sematext/query-segmenter/tree/master/core/src/test/resources.

Dictionary Structure

There are 3 types of dictionaries:

Segment Dictionary

This is used with QuerySegmenterQParser and QuerySegmenterComponent and is nothing more than a text file with a set of keywords, one keyword per line. For example:

electronics
currency
memory
wireless mouse

Centroid Dictionary

This is used with QuerySegmenterQParser & CentroidComponent.  It contains a set of points, one point per line. Points have the format of name|lat|lon.  For example:

Aaronsburg|40.9068|-77.4081

Area Dictionary

This is another type of location dictionary for QuerySegmenterQParser & CentroidComponent.  Instead of having a point per line it contains an area per line, specified using the name|maxlat|maxlon|minlat|minlon format.  For example:

Northeast|61.235009|-149.703891|61.195252|-149.778423

If there is a segment in the user query that matches an element of the dictionary (built from the dictionary file), the query is rewritten using either the field specified in the segmenter configuration or the location (only when an area dictionary is used, shown later in this article). For example, for the query “pizza brooklyn”, if “brooklyn” is an area in the dictionary, the query may be rewritten to “pizza neighborhood:brooklyn”, or perhaps “pizza location:[minlat,minlon TO maxlat,maxlon]”. The field to use and whether to use the label or the location are configurable.
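
A rough Python sketch of this rewriting logic (the area name, coordinates, and field names below are made-up examples, not the component’s actual code):

```python
# Area dictionary: name -> (maxlat, maxlon, minlat, minlon); coordinates are illustrative.
AREAS = {
    "brooklyn": (40.7394, -73.8333, 40.5707, -74.0419),
}

def rewrite(query: str, field: str = "neighborhood", use_lat_lon: bool = False) -> str:
    out = []
    for tok in query.split():
        area = AREAS.get(tok.lower())
        if area is None:
            out.append(tok)  # not a known segment; keep the raw token
        elif use_lat_lon:
            maxlat, maxlon, minlat, minlon = area
            out.append(f"location:[{minlat},{minlon} TO {maxlat},{maxlon}]")
        else:
            out.append(f"{field}:{tok}")
    return " ".join(out)

print(rewrite("pizza brooklyn"))                    # pizza neighborhood:brooklyn
print(rewrite("pizza brooklyn", use_lat_lon=True))  # pizza location:[40.5707,-74.0419 TO 40.7394,-73.8333]
```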

                          Segment Dictionary   Centroid Dictionary   Area Dictionary
QuerySegmenterQParser             x                     x                   x
QuerySegmenterComponent           x
CentroidComponent                                       x                   x

QuerySegmenterQParser

This QParser is used to parse the query, extract segments from the query, and then rewrite it before letting Solr execute it.

Configuration

Configure QuerySegmenterQParser in the solrconfig.xml (example/techproducts/solr/techproducts/conf/) file:

<queryParser name="seg"
   class="com.sematext.querysegmenter.solr.QuerySegmenterQParserPlugin">
   <lst name="segments">
     <lst name="cats">
       <str name="field">cat</str>
       <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
       <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/categories.txt</str>
       <bool name="useLatLon">false</bool>
     </lst>
   </lst>
 </queryParser>

Create dictionary file:

mkdir example/techproducts/solr/techproducts/conf/segmenter
cat <<EOF > example/techproducts/solr/techproducts/conf/segmenter/categories.txt
electronics
currency
memory
currency
software
camera
copier
music
printer
scanner
EOF

Usage

Let’s start Solr again after adding the above segmenter component configuration.

bin/solr start -e techproducts

To use the QParser directly, use LocalParams syntax:

http://localhost:8983/solr/techproducts/select/?q={!seg}electronics%20device

Note that the “seg” part in the {!seg} local parameter matches the “seg” name of the config section above.

In the above example, the Query Segmenter will first spot the “electronics” segment, because that was one of the dictionary elements we provided.  Thus, it will rewrite the query to cat:"electronics". Why does it use the “cat” field? Because that is the field we specified in the config earlier. Once the query is rewritten like this, it is handled by the eDismax parser, which then uses just the remaining “device” part with the fields defined in its qf. The cat:"electronics" portion of the query is not expanded with qf because of its field-specific prefix.
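
The rewrite behavior can be approximated in a few lines of Python. This is an assumed model of the behavior, not the component’s source; note how preferring the longest match lets multi-token dictionary entries like “wireless mouse” win over single tokens:

```python
DICTIONARY = {"electronics", "memory", "wireless mouse"}  # sample segment dictionary
FIELD = "cat"

def segment(query: str) -> str:
    tokens = query.split()
    out, i = [], 0
    while i < len(tokens):
        # try the longest candidate segment starting at position i
        for j in range(len(tokens), i, -1):
            candidate = " ".join(tokens[i:j]).lower()
            if candidate in DICTIONARY:
                out.append(f'{FIELD}:"{candidate}"')
                i = j
                break
        else:
            out.append(tokens[i])  # no dictionary match; keep the raw token
            i += 1
    return " ".join(out)

print(segment("electronics device"))    # cat:"electronics" device
print(segment("cheap wireless mouse"))  # cheap cat:"wireless mouse"
```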

Using our “techproducts” Solr example, such a segmented query returns 12 docs, all of which are in the “electronics” category. That is the key here!

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">2</int>
        <lst name="params">
            <str name="q">{!seg}electronics device</str>
            <str name="fl">cat</str>
        </lst>
    </lst>
    <result name="response" numFound="12" start="0">
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>connector</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>connector</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>memory</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>memory</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>memory</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>music</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>graphics card</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>graphics card</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>multifunction printer</str>
                <str>printer</str>
                <str>scanner</str>
                <str>copier</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>camera</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>hard drive</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>hard drive</str>
            </arr>
        </doc>
    </result>
</response>

Now compare that to the results of a query without segmenter component:

http://localhost:8983/solr/techproducts/select/?q=electronics%20device
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
  <lst name="params">
    <str name="q">electronics device</str>
    <str name="fl">cat</str>
  </lst>
<result name="response" numFound="14" start="0">

Note something different?  We got 14 hits, not just 12.  Let’s see those 2 extra hits:

<doc>
 <arr name="cat">
  <str>electronics and computer1</str>
 </arr>
</doc>
<doc>
 <arr name="cat">
  <str>electronics and stuff2</str>
 </arr>
</doc>

See the problem?  Our “electronics” query picked up matches in other categories.  Sometimes that is what you want, but sometimes it isn’t, and the Solr Query Segmenter helps you avoid it and return more precise results.

QuerySegmenterComponent

A component that works like the QParser described above, but implemented as a Solr SearchComponent instead of a QParser. Using QuerySegmenterComponent lets us configure each individual request handler to include or exclude query segmentation.  One could also configure multiple QuerySegmenterComponents, perhaps with different dictionaries and/or different fields.

Using this component also means you don’t need to add the {!seg} prefix to every user query, as in q={!seg}electronics%20device.

Note that you should put this component before the standard query component (or simply define it as a first-component), because it needs to rewrite the query before the query is run against Solr.

Configuration

<searchComponent name="segmenter"
  class="com.sematext.querysegmenter.solr.QuerySegmenterComponent">   
  <lst name="segments">
    <lst name="cats">
      <str name="field">cat</str>
      <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
      <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/categories.txt</str>
      <bool name="useLatLon">false</bool>
    </lst>
  </lst>
</searchComponent>

<requestHandler name="/qs" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">
        name^1.2 id^10.0 features^1.0 manu^1.1 cat^1.4
    </str>
  </lst>
  <arr name="first-components">
    <str>segmenter</str>
  </arr>
</requestHandler>

Usage

http://localhost:8983/solr/techproducts/qs?q=electronics%20device

CentroidComponent

This SearchComponent is used to rewrite queries by segmenting them, looking for segments that match a centroid in the provided centroid dictionary, and then centering queries on that centroid. It must be used within a request handler that uses a location filter (bbox or geofilt). If a match is found, the user location (the required pt request param) is changed to the location of the matched centroid. The effect is that the location filter uses the centroid location instead of the user location. If multiple centroid segments are found in the user query, the centroid closest to the original user location is used.

For example, if a user searches for “pizza Aaronsburg”, the segment “Aaronsburg” might be returned as a centroid with location 40.9068,-77.4081. This location would then be used instead of the user’s original location (think of a person sitting in front of a computer in Cleveland, Ohio and looking for a place to eat pizza in Aaronsburg, Pennsylvania). This would filter results and return only matches within some radius around the centroid location.  This radius is specified in the configuration, as shown below.
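
The closest-centroid rule can be sketched as follows. This is an illustrative Python model (the component itself is Java), and the Cleveland coordinates are approximate:

```python
import math

# Sample centroid dictionary entries: name -> (lat, lon).
CENTROIDS = {
    "aaronsburg": (40.9068, -77.4081),
    "adelphia": (40.2295, -74.2954),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def pick_centroid(query: str, user_pt):
    matches = [CENTROIDS[t] for t in query.lower().split() if t in CENTROIDS]
    if not matches:
        return user_pt  # no centroid segment found: keep the user's location
    return min(matches, key=lambda c: haversine_km(user_pt, c))

# User sits in Cleveland, OH; the query mentions Adelphia, so pt is replaced.
print(pick_centroid("pizza adelphia", (41.4993, -81.6944)))  # (40.2295, -74.2954)
```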

Configuration

We’ll define the SearchComponent in solrconfig.xml:

<searchComponent name="centroidcomp"
   class="com.sematext.querysegmenter.solr.CentroidComponent">
  <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/centroid.csv</str>
  <str name="separator">|</str>
</searchComponent>

Note how we’ve specified a dictionary file with centroid information in the name|lat|lon format described earlier (the separator is configurable, as shown above).  You can see an example centroid.csv at https://github.com/sematext/query-segmenter/tree/master/core/src/test/resources.

Next, we need to add this component to a request handler:

<requestHandler name="/centroid" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="sfield">store</str>
    <str name="fq">{!geofilt}</str>
    <str name="q.alt">*:*</str>
    <str name="d">75</str> <!-- radius from location, in kilometers by default -->
    </lst>
    <arr name="first-components">
      <str>centroidcomp</str>
    </arr>
</requestHandler>

The “sfield” needs to specify a location field.  In this example that field is “store”.  The “d” setting specifies the radius from the location, in kilometers.  Any point outside that radius will be filtered out.
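
To see what that radius filter does, here is a small Python approximation of a geofilt-style check (illustrative only; Solr’s actual distance computation lives in its spatial field types). The first store point is taken from a techproducts sample doc; the second is made up:

```python
import math

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

ADELPHIA = (40.2295, -74.2954)  # centroid from the dictionary
stores = {
    "nyc_store": (40.7143, -74.006),         # "store" point from a techproducts sample doc
    "cleveland_store": (41.4993, -81.6944),  # hypothetical faraway store
}

# Keep only stores within d=75 km of the centroid, as configured above.
within = {name for name, pt in stores.items() if haversine_km(ADELPHIA, pt) <= 75}
print(within)  # {'nyc_store'}
```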

Usage

We can use it with the /centroid request handler defined above. Let’s search for adelphia radeon:

http://localhost:8983/solr/techproducts/centroid?q=adelphia%20radeon

This query returns the following:

<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
   <str name="q">adelphia radeon</str>
 </lst>
</lst>
<result name="response" numFound="1" start="0">
 <doc>
  <str name="id">100-435805</str>
  <str name="name">ATI Radeon X1900 XTX 512 MB PCIE Video Card</str>
  <str name="manu">ATI Technologies</str>
  <str name="manu_id_s">ati</str>
  <arr name="cat">
   <str>electronics</str>
   <str>graphics card</str>
  </arr>
  <arr name="features">
   <str>ATI RADEON X1900 GPU/VPU clocked at 650MHz</str>
   <str>512MB GDDR3 SDRAM clocked at 1.55GHz</str>
   <str>PCI Express x16</str>
   <str>dual DVI, HDTV, svideo, composite out</str>
   <str>OpenGL 2.0, DirectX 9.0</str>
  </arr>
  <float name="weight">48.0</float>
  <float name="price">649.99</float>
  <str name="price_c">649.99,USD</str>
   <int name="popularity">7</int>
   <bool name="inStock">false</bool>
   <date name="manufacturedate_dt">2006-02-13T00:00:00Z</date>
   <str name="store">40.7143,-74.006</str>
   <long name="_version_">1538980276785381376</long>
  </doc>
 </result>
</response>

What happened here?  One of the centroid dictionary entries is this:

Adelphia|40.2295|-74.2954

Thus, the Solr Query Segmenter matched adelphia in the dictionary and rewrote that part of the query to use Adelphia’s lat,lon.  It limited the query to stores within a 75 km radius around that point and then looked for the keyword radeon in documents from that filtered set.

As a result, it found the ATI Radeon X1900 XTX 512 MB PCIE Video Card, which is being sold in a store in or near Adelphia.

Want to learn more about Solr? Subscribe to our blog or follow @sematext. If you need any help with Solr / SolrCloud, don’t forget that we offer Solr Consulting, Production Support, and Solr Training!


More Stories By Sematext Blog

Sematext is a globally distributed organization that builds innovative Cloud and On Premises solutions for performance monitoring, alerting and anomaly detection (SPM), log management and analytics (Logsene), and search analytics (SSA). We also provide Search and Big Data consulting services and offer 24/7 production support for Solr and Elasticsearch.
