Solr Query Segmenter: How to Provide Better Search Experience

One way to create a better search experience is to understand user intent.  One of the phases in that process is query understanding, and one simple step in that direction is query segmentation.  In this post, we'll cover what query segmentation is and when it is useful.  We'll also introduce the Solr Query Segmenter, an open-source Solr component we developed to make the search experience better.

What is query segmentation and when is it useful?

Query segmentation refers to processing of the query in order to break it down into meaningful segments.  Such segments may be single tokens, or sequences of multiple tokens.  Once such meaningful segments are discovered one can use them to enhance the search experience in various ways.  One use of query segmentation is to rewrite the query in order to make it more precise.  Below are two such examples everyone will be able to relate to.

Think about searching for people on LinkedIn.  Sometimes you search for a specific person using their first and last name.  If that name is fairly unique, it's easy to locate that person (e.g. at the moment, there is just one Otis Gospodnetic on the planet, so it's easy to find his LinkedIn profile).  However, when the name alone is not enough, people add criteria to make their query more precise.  For example, there are over 35,000 people named Satya on LinkedIn, but if you search for Satya Microsoft there is only one match.  While I do not know exactly how LinkedIn handles queries like "Satya Microsoft", they could be using query segmentation for it.  By using query segmentation they could determine that Satya is a first name (or at least part of a personal name) and Microsoft is the name of an organization.  Using this knowledge they could rewrite the query into the equivalent of firstName:Satya AND organization:Microsoft, which would be more precise than a generic version of this query such as keywords:(Satya Microsoft).

Another query segmentation use case can be found in retail, where people may search for things like "red dress" or "toaster stainless steel Braun".  Using query segmentation one could rewrite the query with the understanding that red is a color, dress and toaster are the actual items, stainless steel is a material, and Braun is the name of the company.  A query rewritten using this knowledge can yield more precise results, thus helping people find what they are looking for faster instead of wading through hundreds of items that are really just loose query matches.
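
Just as an illustration (the field names color, item, material, and brand are hypothetical, not fields from any particular schema), the rewritten queries might look something like this:

color:red AND item:dress
item:toaster AND material:"stainless steel" AND brand:Braun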

Query segmentation can also be used to extract location or point-of-interest information from the query and turn it into geospatial queries, as you can see in the Solr Query Segmenter README.
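
For instance, a query like "pizza Aaronsburg" could be rewritten so that the place name becomes a geospatial filter.  A rough sketch of what the resulting request parameters might look like, assuming a hypothetical location field and using the Aaronsburg coordinates that appear later in this post (d is the radius in kilometers):

q=pizza&fq={!geofilt sfield=location pt=40.9068,-77.4081 d=10}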

Solr Example Setup

We'll assume you already have Solr installed.  In the example below we'll use Solr 6.0.1, but other versions should work, as long as there is a version of the Solr Query Segmenter built against them (if you don't find one, send us a PR!).  Solr ships with several examples.  We'll use the "techproducts" example to show how the Solr Query Segmenter works and what's in it for you.

Let’s first run the techproducts example:

bin/solr start -e techproducts

Just to make sure all is working, you should be able to visit http://localhost:8983/solr/#/techproducts and see the Solr web admin interface:
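
If you prefer the command line, a quick sanity check could be a match-all query that just returns the hit count (8983 is Solr's default port):

curl "http://localhost:8983/solr/techproducts/select?q=*:*&rows=0"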

If all is OK, we can stop Solr for now:

bin/solr stop -all

Query Segmenter Setup

Query Segmenter setup has two parts:

  1. Downloading / installing the required jar files
  2. Configuration

Library Setup

Download the Query Segmenter jars from the Sonatype/Maven repository into the core's lib directory:

mkdir example/techproducts/solr/techproducts/lib
cd example/techproducts/solr/techproducts/lib
wget https://oss.sonatype.org/content/repositories/releases/com/sematext/querysegmenter/st-QuerySegmenter-core/1.3.6.3.0/st-QuerySegmenter-core-1.3.6.3.0.jar
wget https://oss.sonatype.org/content/repositories/releases/com/sematext/querysegmenter/st-QuerySegmenter-solr/1.3.6.3.0/st-QuerySegmenter-solr-1.3.6.3.0.jar
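
Solr automatically picks up jars placed in the core's lib directory, so the above is enough.  If you'd rather keep the jars elsewhere, you could instead point to them with a <lib> directive in solrconfig.xml; a sketch (the dir value is just an example path, not something the project ships):

<lib dir="/path/to/querysegmenter/jars" regex="st-QuerySegmenter-.*\.jar" />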

Configuration

The Query Segmenter Solr library includes pieces that plug into query handling alongside QueryComponent – the Solr SearchComponent that handles queries. The library currently contains three of them – QuerySegmenterQParser, QuerySegmenterComponent, and CentroidComponent. Let's have a look at each.

Dictionary-based Segmentation

Query segmentation is based on matching dictionary elements against queries.  Dictionary elements are specified in dictionary files – plain text files that contain the segments to look for when parsing and segmenting a query.  A few dictionary files used for unit tests can be found under https://github.com/sematext/query-segmenter/tree/master/core/src/test/resources.

Dictionary Structure

There are 3 types of dictionaries:

Segment Dictionary

This is used with QuerySegmenterQParser and QuerySegmenterComponent and is nothing more than a text file with a set of keywords, one keyword per line. For example:

electronics
currency
memory
wireless mouse

Centroid Dictionary

This is used with QuerySegmenterQParser & CentroidComponent.  It contains a set of points, one point per line. Points have the format of name|lat|lon.  For example:

Aaronsburg|40.9068|-77.4081

Area Dictionary

This is another type of location dictionary for QuerySegmenterQParser & CentroidComponent.  Instead of having a point per line it contains an area per line, specified using the name|maxlat|maxlon|minlat|minlon format (the separator is configurable, as you'll see in the CentroidComponent configuration later).  For example:

Northeast|61.235009|-149.703891|61.195252|-149.778423

If there is a segment in the user query that matches an element of the dictionary (built from the dictionary file), the query is rewritten using either the field specified in the segmenter configuration or the location (only when an area segment dictionary is used, shown later in this article). For example, for the query "pizza brooklyn", if "brooklyn" is an area in the dictionary, the query may be rewritten to "pizza neighborhood:brooklyn", or perhaps "pizza location:[minlat,minlon TO maxlat,maxlon]". The field to use and whether we should use the label or the location is configurable.
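
As a rough sketch, using the Northeast area entry from above and assuming hypothetical neighborhood and location fields, the two rewrite styles would look roughly like this:

pizza AND neighborhood:"Northeast"
pizza AND location:[61.195252,-149.778423 TO 61.235009,-149.703891]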

                         | Segment Dictionary | Centroid Dictionary | Area Dictionary
QuerySegmenterQParser    |         x          |          x          |        x
QuerySegmenterComponent  |         x          |                     |
CentroidComponent        |                    |          x          |        x

QuerySegmenterQParser

This QParser is used to parse the query, extract segments from the query, and then rewrite it before letting Solr execute it.

Configuration

Configure QuerySegmenterQParser in the solrconfig.xml (example/techproducts/solr/techproducts/conf/) file:

<queryParser name="seg"
   class="com.sematext.querysegmenter.solr.QuerySegmenterQParserPlugin">
   <lst name="segments">
     <lst name="cats">
       <str name="field">cat</str>
       <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
       <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/categories.txt</str>
       <bool name="useLatLon">false</bool>
     </lst>
   </lst>
 </queryParser>

Create dictionary file:

mkdir example/techproducts/solr/techproducts/conf/segmenter
cat <<EOF > example/techproducts/solr/techproducts/conf/segmenter/categories.txt
electronics
currency
memory
software
camera
copier
music
printer
scanner
EOF

Usage

Let's start Solr again after adding the above segmenter configuration:

bin/solr start -e techproducts

To use the QParser directly, use LocalParams syntax:

http://localhost:8983/solr/techproducts/select/?q={!seg}electronics%20device

Note that the "seg" in the {!seg} local params syntax matches the "seg" name of the config section above.

In the above example, the Query Segmenter will first spot the "electronics" segment, because that was one of the dictionary elements we provided.  Thus, it will rewrite the query to cat:"electronics". Why does it use the "cat" field? Because that is the field we specified in the config earlier. Once the query is rewritten like this it is handled by the eDismax parser, which then applies just the remaining "device" part to the fields defined in its qf. The cat:"electronics" portion of the query is not affected by qf because of the field-specific prefix.
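
To keep the output short, the response below was requested with fl=cat so that only the category field is returned; you could also append debugQuery=true to inspect how the query was rewritten:

http://localhost:8983/solr/techproducts/select/?q={!seg}electronics%20device&fl=cat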

Using our "techproducts" Solr example, such a segmented query returns 12 docs, all of which are in the "electronics" category. That is the key here!

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">2</int>
        <lst name="params">
            <str name="q">{!seg}electronics device</str>
            <str name="fl">cat</str>
        </lst>
    </lst>
    <result name="response" numFound="12" start="0">
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>connector</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>connector</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>memory</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>memory</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>memory</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>music</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>graphics card</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>graphics card</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>multifunction printer</str>
                <str>printer</str>
                <str>scanner</str>
                <str>copier</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>camera</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>hard drive</str>
            </arr>
        </doc>
        <doc>
            <arr name="cat">
                <str>electronics</str>
                <str>hard drive</str>
            </arr>
        </doc>
    </result>
</response>

Now compare that to the results of the same query without the segmenter:

http://localhost:8983/solr/techproducts/select/?q=electronics%20device
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="q">electronics device</str>
      <str name="fl">cat</str>
    </lst>
  </lst>
  <result name="response" numFound="14" start="0">
  ...

Notice something different?  We got 14 hits, not just 12.  Let's look at the extra hits – the documents whose cat values merely contain the word "electronics":

<doc>
 <arr name="cat">
  <str>electronics and computer1</str>
 </arr>
</doc>
<doc>
 <arr name="cat">
  <str>electronics</str>
  <str>memory</str>
 </arr>
</doc>
<doc>
 <arr name="cat">
  <str>electronics and stuff2</str>
 </arr>
</doc>

You see the problem?  Our "electronics" query picked up matches from other categories that merely contain the word "electronics".  Sometimes that is what you want, but sometimes you really don't want that, and the Solr Query Segmenter helps you avoid it and return more precise results.

QuerySegmenterComponent

A component that works like the QParser described above, but implemented as a Solr SearchComponent instead of a QParser. Using QuerySegmenterComponent lets us configure each individual Request Handler to include or not include query segmentation.  One could also configure multiple QuerySegmenterComponents, perhaps with different dictionaries and/or different fields.
Using this component also means you don't need to add the {!seg} prefix to every user query, as in q={!seg}electronics%20device.
Note that you should put this component before the standard query component (or simply define it as a first-component), because it needs to rewrite the query before the query is executed against Solr.

Configuration

<searchComponent name="segmenter"
  class="com.sematext.querysegmenter.solr.QuerySegmenterComponent">   
  <lst name="segments">
    <lst name="cats">
      <str name="field">cat</str>
      <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
      <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/categories.txt</str>
      <bool name="useLatLon">false</bool>
    </lst>
  </lst>
</searchComponent>

<requestHandler name="/qs" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">
        name^1.2 id^10.0 features^1.0 manu^1.1 cat^1.4
    </str>
  </lst>
  <arr name="first-components">
    <str>segmenter</str>
  </arr>
</requestHandler>

Usage

http://localhost:8983/solr/techproducts/qs?q=electronics%20device
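
Or, equivalently, from the command line (fl=cat,name is optional and just limits which fields are returned):

curl "http://localhost:8983/solr/techproducts/qs?q=electronics%20device&fl=cat,name"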

CentroidComponent

This SearchComponent is used to rewrite queries by segmenting them, looking for segments that match a centroid in the provided centroid (or area) dictionary, and then centering queries on that centroid. It must be used within a RequestHandler that uses a location filter (bbox or geofilt). If a match is found, the user location (the required pt request param) is replaced with the centroid's location. The effect is that instead of using the user location for the location filter, it will use the centroid location. If multiple centroid segments are found in the user query, the centroid closest to the original user location is used.

For example, if a user searches for "pizza Aaronsburg", the segment "Aaronsburg" might be returned as a centroid with location 40.9068, -77.4081. This location would then be used instead of the original user's location (think of a person sitting in front of a computer in Cleveland, Ohio and looking for a place to eat pizza in Aaronsburg, Pennsylvania). This would filter results and return only matches within some radius around the centroid location.  This radius is specified in the configuration, as shown below.
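
With the request handler shown below, such a request could look roughly like this (pt is the user's location; the Cleveland coordinates are approximate, and the techproducts sample data has no pizza documents, so this only illustrates the parameters):

http://localhost:8983/solr/techproducts/centroid?q=pizza%20Aaronsburg&pt=41.4993,-81.6944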

Configuration

We’ll define the SearchComponent in solrconfig.xml:

<searchComponent name="centroidcomp"
   class="com.sematext.querysegmenter.solr.CentroidComponent">
  <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/centroid.csv</str>
  <str name="separator">|</str>
</searchComponent>

Note how we've specified a dictionary file with centroid information, using the name|lat|lon format described earlier (the separator is set to "|" here even though the file is named centroid.csv).  You can see an example centroid.csv in https://github.com/sematext/query-segmenter/tree/master/core/src/test/resources.
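
For this walkthrough to work you also need that file to exist.  A minimal version using the two centroids referenced in this post (the segmenter directory was already created for the segment dictionary above):

cat <<EOF > example/techproducts/solr/techproducts/conf/segmenter/centroid.csv
Aaronsburg|40.9068|-77.4081
Adelphia|40.2295|-74.2954
EOF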

Next, we need to add this component to a request handler:

<requestHandler name="/centroid" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="sfield">store</str>
    <str name="fq">{!geofilt}</str>
    <str name="q.alt">*:*</str>
    <str name="d">75</str> <!-- radius from location, in kilometers by default -->
    </lst>
    <arr name="first-components">
      <str>centroidcomp</str>
    </arr>
</requestHandler>

The “sfield” needs to specify a location field.  In this example that field is “store”.  The “d” setting specifies the radius from the location, in kilometers.  Any point outside that radius will be filtered out.
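
Conceptually, once the component swaps in a centroid, the filter behaves as if you had written something like this by hand (a sketch using the Adelphia entry shown further down):

fq={!geofilt sfield=store pt=40.2295,-74.2954 d=75}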

Usage

We can use it with the /centroid request handler defined above. Let's search for adelphia radeon:

http://localhost:8983/solr/techproducts/centroid?q=adelphia%20radeon

Searching for adelphia radeon returns the following:

<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
   <str name="q">adelphia radeon</str>
 </lst>
</lst>
<result name="response" numFound="1" start="0">
 <doc>
  <str name="id">100-435805</str>
  <str name="name">ATI Radeon X1900 XTX 512 MB PCIE Video Card</str>
  <str name="manu">ATI Technologies</str>
  <str name="manu_id_s">ati</str>
  <arr name="cat">
   <str>electronics</str>
   <str>graphics card</str>
  </arr>
  <arr name="features">
   <str>ATI RADEON X1900 GPU/VPU clocked at 650MHz</str>
   <str>512MB GDDR3 SDRAM clocked at 1.55GHz</str>
   <str>PCI Express x16</str>
   <str>dual DVI, HDTV, svideo, composite out</str>
   <str>OpenGL 2.0, DirectX 9.0</str>
  </arr>
  <float name="weight">48.0</float>
  <float name="price">649.99</float>
  <str name="price_c">649.99,USD</str>
   <int name="popularity">7</int>
   <bool name="inStock">false</bool>
   <date name="manufacturedate_dt">2006-02-13T00:00:00Z</date>
   <str name="store">40.7143,-74.006</str>
   <long name="_version_">1538980276785381376</long>
  </doc>
 </result>
</response>

What happened here?  One of the centroid dictionary entries is this:

Adelphia|40.2295|-74.2954

Thus, the Solr Query Segmenter matched adelphia in the dictionary and rewrote that part of the query to use the Adelphia lat,lon.  It limited the query to stores within a 75 km radius around that point, and then looked for the keyword radeon in documents from that filtered set.

As a result, it found the ATI Radeon X1900 XTX 512 MB PCIE Video Card, which is sold in a store in or near Adelphia.

Want to learn more about Solr? Subscribe to our blog or follow @sematext. If you need any help with Solr / SolrCloud – don’t forget that we provide Solr Consulting, Production Support, and offer Solr Training!
