Solr Query Segmenter: How to Provide Better Search Experience

One way to create a better search experience is to understand user intent. One of the phases in that process is query understanding, and one simple step in that direction is query segmentation. In this post, we’ll cover what query segmentation is and when it is useful. We will also introduce you to Solr Query Segmenter, an open-source Solr component that we developed to make the search experience better.

What is query segmentation and when is it useful?

Query segmentation refers to processing a query in order to break it down into meaningful segments. Such segments may be single tokens or sequences of multiple tokens. Once such meaningful segments are discovered, one can use them to enhance the search experience in various ways. One use of query segmentation is to rewrite the query to make it more precise. Below are two such examples everyone will be able to relate to.

Think about searching for people on LinkedIn. Sometimes you search for a specific person using their first and last name. If that name is fairly unique, it’s easy to locate that person (e.g. at the moment, there is just one Otis Gospodnetic on the planet, so it’s easy to find his LinkedIn profile). However, when the name is not enough, people use additional criteria to make their query more precise. For example, there are over 35,000 people named Satya on LinkedIn, but if you search for Satya Microsoft there is only one match. While I do not know exactly how LinkedIn handles queries like “Satya Microsoft”, they could be using query segmentation for it. By using query segmentation, they could determine that Satya is a first name (or at least part of a personal name) and Microsoft is the name of an organization. Using this knowledge, they could rewrite the query into the equivalent of firstName:Satya AND organization:Microsoft, which would be more precise than a generic version of this query such as keywords:(Satya, Microsoft).
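To make the rewrite concrete, here is a minimal Python sketch of the idea. It is not how LinkedIn or Solr Query Segmenter works internally; the field names come from the example above, while the tiny dictionaries and the `rewrite_people_query` helper are hypothetical and exist only for illustration.

```python
# Hypothetical per-field dictionaries; a real system would load far larger lists.
FIELD_DICTIONARIES = {
    "firstName": {"satya", "otis"},
    "organization": {"microsoft", "sematext"},
}


def rewrite_people_query(query: str) -> str:
    """Tag each token with the first field whose dictionary contains it."""
    clauses = []
    for token in query.split():
        for field, entries in FIELD_DICTIONARIES.items():
            if token.lower() in entries:
                clauses.append(f"{field}:{token}")
                break
        else:
            # No dictionary matched: fall back to a generic keyword clause.
            clauses.append(f"keywords:{token}")
    return " AND ".join(clauses)


print(rewrite_people_query("Satya Microsoft"))
```

Tokens that no dictionary recognizes stay generic keyword clauses, so the rewrite only tightens the parts of the query it actually understands.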

Another query segmentation use case can be found in retail, where people may search for things like “red dress” or “toaster stainless steel Braun”. Using query segmentation, one could rewrite the query with the understanding that red is a color, dress and toaster are the actual items, stainless steel is a material, and Braun is the name of the company. A query rewritten using this knowledge can yield more precise results, helping people find what they are looking for faster instead of wading through hundreds of items that are really just loose query matches.

Query segmentation can also be used to extract locations or points of interest from the query and turn them into geospatial queries, as you can see in the Solr Query Segmenter README.

Setup Solr Example

We’ll assume you already have Solr running. In the example below we’ll use Solr 6.0.1, but other versions should work, as long as there is a version of Solr Query Segmenter that is based on it (if you don’t find one, send us a PR!). Solr ships with several examples. We’ll use the “techproducts” example to show how Solr Query Segmenter works and what is in it for you.

Let’s first run the techproducts example:

bin/solr start -e techproducts

Just to make sure all is working, you should be able to visit http://localhost:8983/solr/#/techproducts and see the Solr web admin interface.

If all is OK, we can stop Solr for now:

bin/solr stop -all

Setup Query Segmenter

Query Segmenter setup has two parts:

  1. Download / install of the required Jar files
  2. Configuration

Setup library

Download the Query Segmenter jars from the Maven repository:

mkdir -p example/techproducts/solr/techproducts/lib
cd example/techproducts/solr/techproducts/lib
wget https://oss.sonatype.org/content/repositories/releases/com/sematext/querysegmenter/st-QuerySegmenter-core/1.3.6.3.0/st-QuerySegmenter-core-1.3.6.3.0.jar
wget https://oss.sonatype.org/content/repositories/releases/com/sematext/querysegmenter/st-QuerySegmenter-solr/1.3.6.3.0/st-QuerySegmenter-solr-1.3.6.3.0.jar
cd -

Configuration

The Query Segmenter Solr library includes Solr components that use QueryComponent, the Solr SearchComponent that handles queries. The library currently contains 3 components: QuerySegmenterQParser, QuerySegmenterComponent, and CentroidComponent. Let’s have a look at each of them.

Dictionary-based Segmentation

Query segmentation is based on matching of dictionary elements against queries.  Dictionary elements are specified in dictionary files. Dictionary files are plain text files that contain segments to look for when parsing and segmenting a query.  A few dictionary files used for unit tests can be found under https://github.com/sematext/query-segmenter/tree/master/core/src/test/resources.
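The matching step can be sketched in a few lines of Python. This is an illustration only: the greedy longest-match strategy and the `load_dictionary` and `segment` helpers are assumptions made for the example, not the library's actual implementation.

```python
def load_dictionary(lines):
    """Turn dictionary lines (one segment per line) into a set of token tuples."""
    return {tuple(line.lower().split()) for line in lines if line.strip()}


def segment(query, dictionary):
    """Greedy longest-match segmentation; returns (segments, leftover tokens)."""
    tokens = query.lower().split()
    longest = max((len(entry) for entry in dictionary), default=0)
    segments, leftovers = [], []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "wireless mouse" beats "wireless".
        for size in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in dictionary:
                segments.append(" ".join(candidate))
                i += size
                break
        else:
            # Nothing in the dictionary starts at this token.
            leftovers.append(tokens[i])
            i += 1
    return segments, leftovers


dictionary = load_dictionary(["electronics", "memory", "wireless mouse"])
print(segment("cheap wireless mouse electronics", dictionary))
```

The leftover tokens are what a regular query parser would still have to handle after the recognized segments have been pulled out.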

Dictionary Structure

There are 3 types of dictionaries:

Segment Dictionary

This is used with QuerySegmenterQParser and QuerySegmenterComponent and is nothing more than a text file with a set of keywords, one keyword per line. For example:

electronics
currency
memory
wireless mouse

Centroid Dictionary

This is used with QuerySegmenterQParser & CentroidComponent.  It contains a set of points, one point per line. Points have the format of name|lat|lon.  For example:

Aaronsburg|40.9068|-77.4081
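If you generate centroid dictionaries from other data, it can help to validate each line before handing the file to Solr. The Python sketch below is one way to do that; the `Centroid` type and `parse_centroid` function are hypothetical helpers, not part of Query Segmenter.

```python
from typing import NamedTuple


class Centroid(NamedTuple):
    name: str
    lat: float
    lon: float


def parse_centroid(line: str, separator: str = "|") -> Centroid:
    """Parse one 'name|lat|lon' dictionary line into a typed record."""
    name, lat, lon = line.strip().split(separator)
    lat_value, lon_value = float(lat), float(lon)
    if not (-90 <= lat_value <= 90 and -180 <= lon_value <= 180):
        raise ValueError(f"coordinates out of range: {line!r}")
    return Centroid(name, lat_value, lon_value)


print(parse_centroid("Aaronsburg|40.9068|-77.4081"))
```

A line with the wrong number of columns or a non-numeric coordinate raises a ValueError, which makes broken entries easy to spot before deployment.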

Area Dictionary

This is another type of location dictionary for QuerySegmenterQParser & CentroidComponent.  Instead of having a point per line it contains an area per line, specified using the name|maxlat|maxlon|minlat|minlon format.  For example:

Northeast|61.235009|-149.703891|61.195252|-149.778423

If there is a segment in the user query that matches an element of the dictionary (built from the dictionary file), the query is rewritten using either the field specified in the segmenter configuration or the location (only when an area dictionary is used, as shown later in this article). For example, for the query “pizza brooklyn”, if “brooklyn” is an area in the dictionary, the query may be rewritten to “pizza neighborhood:brooklyn”, or perhaps “pizza location:[minlat,minlon TO maxlat,maxlon]”. The field to use, and whether to use the label or the location, is configurable.
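The two rewrite styles for the “pizza brooklyn” example can be sketched as follows. This is a simplified Python illustration, not the library's code: the `rewrite_with_area` helper is hypothetical, it only handles single-token area names, and the Brooklyn bounding box is a rough made-up value.

```python
def rewrite_with_area(query, areas, field, use_lat_lon):
    """Replace tokens that name a known area with a field clause.

    areas maps a lowercase name to (maxlat, maxlon, minlat, minlon),
    mirroring the column order of the area dictionary.
    """
    rewritten = []
    for token in query.split():
        box = areas.get(token.lower())
        if box is None:
            rewritten.append(token)  # not an area, keep the token as typed
            continue
        maxlat, maxlon, minlat, minlon = box
        if use_lat_lon:
            # Location-style rewrite: a range over the bounding box corners.
            rewritten.append(f"{field}:[{minlat},{minlon} TO {maxlat},{maxlon}]")
        else:
            # Label-style rewrite: match the area name in the given field.
            rewritten.append(f"{field}:{token}")
    return " ".join(rewritten)


# Rough, illustrative bounding box only; not taken from a real dictionary file.
AREAS = {"brooklyn": (40.74, -73.83, 40.57, -74.05)}

print(rewrite_with_area("pizza brooklyn", AREAS, "neighborhood", use_lat_lon=False))
print(rewrite_with_area("pizza brooklyn", AREAS, "location", use_lat_lon=True))
```

The `use_lat_lon` flag plays the role that the useLatLon setting plays in the segmenter configuration: it decides whether the label or the location ends up in the rewritten query.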

                          Segment Dictionary   Centroid Dictionary   Area Dictionary
QuerySegmenterQParser     x                    x                     x
QuerySegmenterComponent   x
CentroidComponent                              x                     x

QuerySegmenterQParser

This QParser is used to parse the query, extract segments from the query, and then rewrite it before letting Solr execute it.

Configuration

Configure QuerySegmenterQParser in the solrconfig.xml file (in example/techproducts/solr/techproducts/conf/):

<queryParser name="seg"
   class="com.sematext.querysegmenter.solr.QuerySegmenterQParserPlugin">
   <lst name="segments">
     <lst name="cats">
       <str name="field">cat</str>
       <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
       <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/categories.txt</str>
       <bool name="useLatLon">false</bool>
     </lst>
   </lst>
 </queryParser>

Create dictionary file:

mkdir example/techproducts/solr/techproducts/conf/segmenter
cat <<EOF > example/techproducts/solr/techproducts/conf/segmenter/categories.txt
electronics
currency
memory
software
camera
copier
music
printer
scanner
EOF

Usage

Let’s start Solr again after adding the above segmenter component configuration.

bin/solr start -e techproducts

To use the QParser directly, use LocalParams syntax:

http://localhost:8983/solr/techproducts/select/?q={!seg}electronics%20device

Note that the “seg” part in {!seg} local parameter matches the “seg” name of the config section above.
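When the request is built in code rather than typed into a browser, the {!seg} prefix has to be URL-encoded along with the rest of the query. A small Python sketch, assuming the host, port, and core name used in this post; the `segmented_query_url` helper is hypothetical:

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/techproducts/select"


def segmented_query_url(user_query, parser="seg", fields=None):
    """Build a /select URL that routes user_query through the named QParser."""
    params = {"q": f"{{!{parser}}}{user_query}"}
    if fields:
        params["fl"] = ",".join(fields)
    # urlencode percent-encodes the braces and "!" and turns spaces into "+".
    return f"{SOLR_SELECT}?{urlencode(params)}"


print(segmented_query_url("electronics device", fields=["cat"]))
```

The `parser` argument must match the name attribute of the queryParser element in solrconfig.xml.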

In the above example, the Query Segmenter will first spot the “electronics” segment because that was one of the dictionary elements we provided. Thus, it will rewrite that part of the query to cat:"electronics". Why does it use the “cat” field? Because that is the field we specified in the config earlier. Once the query is rewritten like this, it is handled by the eDismax parser, which then uses just the remaining “device” part with the fields defined in its qf. The cat:"electronics" portion of the query is not used with qf because of the field-specific prefix.

Using our “techproducts” Solr example, such a segmented query returns 12 docs, all of which are in the “electronics” category. This is the key here!

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">2</int>
        <lst name="params">
            <str name="q">{!seg}electronics device</str>
            <str name="fl">cat</str>
        </lst>
    </lst>
    <result name="response" numFound="12" start="0">
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>connector</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>connector</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>memory</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>memory</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>memory</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>music</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>graphics card</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>graphics card</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>multifunction printer</str>
                <str>printer</str>
                <str>scanner</str>
                <str>copier</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>camera</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>hard drive</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>hard drive</str>
            </arr>
        </doc>
    </result>
</response>

Now compare that to the results of a query without the segmenter component:

http://localhost:8983/solr/techproducts/select/?q=electronics%20device

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="q">electronics device</str>
      <str name="fl">cat</str>
    </lst>
  </lst>
  <result name="response" numFound="14" start="0">

Notice something different? We got 14 hits, not just 12. Here are the 2 extra hits, shown alongside one of the regular matches:

<doc>
 <arr name="cat">
  <str>electronics and computer1</str>
 </arr>
</doc>
<doc>
 <arr name="cat">
  <str>electronics</str>
  <str>memory</str>
 </arr>
</doc>
<doc>
 <arr name="cat">
  <str>electronics and stuff2</str>
 </arr>
</doc>

You see the problem?  Our “electronics” query picked up matches in other categories.  Sometimes that is what you want, but sometimes you really don’t want that, and the Solr Query Segmenter helps you avoid that and return more precise results.
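The difference between the two result sets can be modeled with a toy example in Python. This is a deliberately simplified picture, assuming the plain keyword query matches individual words while the rewritten cat:"electronics" clause matches whole category values; it does not reproduce Solr's analysis or scoring.

```python
# Category values copied from the responses above.
DOCS = [
    ["electronics", "memory"],
    ["electronics and computer1"],
    ["electronics and stuff2"],
]


def keyword_match(categories, term):
    """Loose match: the term appears as a word inside any category value."""
    return any(term in value.split() for value in categories)


def field_match(categories, term):
    """Precise match: one category value equals the term exactly."""
    return term in categories


loose = [doc for doc in DOCS if keyword_match(doc, "electronics")]
precise = [doc for doc in DOCS if field_match(doc, "electronics")]
print(len(loose), len(precise))
```

Under this model the loose match keeps all three documents, while the precise match keeps only the one whose category is exactly “electronics”.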

QuerySegmenterComponent

This component works like the QParser described above, but is implemented as a Solr SearchComponent instead of a QParser. Using QuerySegmenterComponent lets us configure each individual Request Handler to include or not include query segmentation. One could also configure multiple QuerySegmenterComponents, perhaps with different dictionaries and/or different fields.

Using this component also means you don’t need to add the {!seg} prefix to every user query, as in q={!seg}electronics%20device.

Note that you should put this component before the standard query component (or simply define it to be the first component), because it needs to rewrite the query before the query is made against Solr.

Configuration

<searchComponent name="segmenter"
  class="com.sematext.querysegmenter.solr.QuerySegmenterComponent">   
  <lst name="segments">
    <lst name="cats">
      <str name="field">cat</str>
      <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
      <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/categories.txt</str>
      <bool name="useLatLon">false</bool>
    </lst>
  </lst>
</searchComponent>

<requestHandler name="/qs" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">
        name^1.2 id^10.0 features^1.0 manu^1.1 cat^1.4
    </str>
  </lst>
  <arr name="first-components">
    <str>segmenter</str>
  </arr>
</requestHandler>

Usage

http://localhost:8983/solr/techproducts/qs?q=electronics%20device

CentroidComponent

This SearchComponent is used to rewrite queries by segmenting them, looking for segments that match a centroid in the provided area dictionary, and then centering queries using that centroid. It must be used within a RequestHandler that uses a location filter (bbox or geofilt). If a match is found, the user location (the required pt request param) is changed to the center location of the centroid. The effect is that instead of using the user location for the location filter, it will use the centroid location. If multiple centroid segments are returned from the user query, the closest centroid to the original user location is used.

For example, if a user searches for “pizza Aaronsburg”, the segment “Aaronsburg” might be returned as a centroid with location 40.9068, -77.4081. This location would then be used instead of the user’s original location (think of a person sitting in front of a computer in Cleveland, Ohio and looking for where to eat pizza in Aaronsburg, Pennsylvania). This would filter results and return only matches within some radius around the centroid location. This radius is specified in the configuration, as shown below.
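The centering logic described above can be sketched in Python. This is an illustration under stated assumptions, not the component's actual code: it uses the haversine formula for distance, the Cleveland coordinates are approximate, and the second centroid and the store locations are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def pick_center(user_location, matched_centroids):
    """Return the matched centroid nearest the user, or the user location if none matched."""
    if not matched_centroids:
        return user_location
    return min(matched_centroids, key=lambda c: haversine_km(user_location, c))


def within_radius(center, points, radius_km):
    """Keep only the points that fall inside the filter radius."""
    return [p for p in points if haversine_km(center, p) <= radius_km]


cleveland = (41.4993, -81.6944)    # approximate user location
aaronsburg = (40.9068, -77.4081)   # centroid from the dictionary example
far_away = (34.0522, -118.2437)    # hypothetical second centroid match

center = pick_center(cleveland, [far_away, aaronsburg])
stores = [(40.7934, -77.86), cleveland]  # hypothetical store locations
print(center, within_radius(center, stores, 75))
```

With no centroid match the user location is kept as the center, which mirrors the fallback to the original pt value.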

Configuration

We’ll define the SearchComponent in solrconfig.xml:

<searchComponent name="centroidcomp"
   class="com.sematext.querysegmenter.solr.CentroidComponent">
  <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/centroid.csv</str>
  <str name="separator">|</str>
</searchComponent>

Note how we’ve specified a dictionary file with centroid information, and that it uses the CSV-style format described earlier, with | as the separator. You can see an example centroid.csv in https://github.com/sematext/query-segmenter/tree/master/core/src/test/resources.

Next, we need to add this component to a request handler:

<requestHandler name="/centroid" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="sfield">store</str>
    <str name="fq">{!geofilt}</str>
    <str name="q.alt">*:*</str>
    <str name="d">75</str> <!-- radius from location, in kilometers by default -->
  </lst>
  <arr name="first-components">
    <str>centroidcomp</str>
  </arr>
</requestHandler>

The “sfield” needs to specify a location field.  In this example that field is “store”.  The “d” setting specifies the radius from the location, in kilometers.  Any point outside that radius will be filtered out.

Usage

We can use it with the /centroid request handler defined above. Let’s search for adelphia radeon:

http://localhost:8983/solr/techproducts/centroid?q=adelphia%20radeon

Searching for adelphia radeon will return the following:

<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
   <str name="q">adelphia radeon</str>
 </lst>
</lst>
<result name="response" numFound="1" start="0">
 <doc>
  <str name="id">100-435805</str>
  <str name="name">ATI Radeon X1900 XTX 512 MB PCIE Video Card</str>
  <str name="manu">ATI Technologies</str>
  <str name="manu_id_s">ati</str>
  <arr name="cat">
   <str>electronics</str>
   <str>graphics card</str>
  </arr>
  <arr name="features">
   <str>ATI RADEON X1900 GPU/VPU clocked at 650MHz</str>
   <str>512MB GDDR3 SDRAM clocked at 1.55GHz</str>
   <str>PCI Express x16</str>
   <str>dual DVI, HDTV, svideo, composite out</str>
   <str>OpenGL 2.0, DirectX 9.0</str>
  </arr>
  <float name="weight">48.0</float>
  <float name="price">649.99</float>
  <str name="price_c">649.99,USD</str>
   <int name="popularity">7</int>
   <bool name="inStock">false</bool>
   <date name="manufacturedate_dt">2006-02-13T00:00:00Z</date>
   <str name="store">40.7143,-74.006</str>
   <long name="_version_">1538980276785381376</long>
  </doc>
 </result>
</response>

 

What happened here?  One of the centroid dictionary entries is this:

Adelphia|40.2295|-74.2954

Thus, the Solr Query Segmenter matched adelphia in the dictionary and rewrote that part of the query to use the Adelphia lat,lon. It limited the query to stores within a 75 km radius around that point, and then also looked for the keyword radeon in documents from that filtered set.

As a result, it found the ATI Radeon X1900 XTX 512 MB PCIE Video Card that is being sold in a store in or near Adelphia.

Want to learn more about Solr? Subscribe to our blog or follow @sematext. If you need any help with Solr / SolrCloud – don’t forget that we provide Solr Consulting, Production Support, and offer Solr Training!


More Stories By Sematext Blog

Sematext is a globally distributed organization that builds innovative Cloud and On Premises solutions for performance monitoring, alerting and anomaly detection (SPM), log management and analytics (Logsene), and search analytics (SSA). We also provide Search and Big Data consulting services and offer 24/7 production support for Solr and Elasticsearch.
