Doing Full-Text Searches in .NET with Sphinx - Part 3: Searching

by Dennis 9. July 2013 16:38

Now that we’ve successfully imported our data into Sphinx in the previous part of this series, we can finally start to implement searching! As before, the current state of the sample is available for download. Before we start though, I want to make an addendum to the last post.

Word Forms

In the last post I forgot to set up a file containing word forms. With the help of word forms, you can remap one or more words to another. This feature can be used to normalize different forms of a word to one normal form, or to map words to commonly used abbreviations. Text such as that from Stack Overflow practically screams for a list of word forms that maps words to their abbreviations, because many abbreviations are in common use in the programming world, e.g. VS for Visual Studio, NH for NHibernate, HG for Mercurial and so on. I’ve therefore added a small word forms file to the sample that contains a few of these abbreviations.
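To give an idea of what such a file looks like, here is a small sketch. Sphinx expects one mapping per line in the form "source > destination". The entries below are illustrative and not necessarily the ones shipped with the sample:

```
visual studio > vs
nhibernate > nh
mercurial > hg
```

The file is referenced from the index definition in sphinx.conf via the wordforms directive, e.g. wordforms = /path/to/wordforms.txt (the path is a placeholder). Since word forms are applied at indexing time as well as at query time, the index has to be rebuilt after the file changes.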

Searching

To create and execute our search queries, we'll use SphinxConnector.NET's fluent query API, which is similar to LINQ. It translates the queries you create in code to SphinxQL and maps the query results back to objects of a type of your choice.

Query Preprocessing and Match Clause Creation

Upon receiving a search string from the client, you usually don’t want to send it to Sphinx as is. The very least you should do is escape the characters in the search string that would otherwise be interpreted as operators or modifiers of Sphinx’s extended query syntax (instead of escaping them, you might also choose to remove them from incoming search strings completely). To escape the incoming query string, we’ll use SphinxConnector.NET’s SphinxHelper.EscapeString() method.
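To illustrate what escaping means here, the following is a minimal hand-rolled sketch. It is not SphinxConnector.NET's implementation of SphinxHelper.EscapeString(); the class name QueryEscaper and the list of special characters are my assumptions, based on the operators of the extended query syntax. In the sample, simply call the library method.

```csharp
using System;
using System.Text;

public static class QueryEscaper
{
    // Assumed list of characters that act as operators or modifiers
    // in Sphinx's extended query syntax.
    private const string SpecialCharacters = "\\()|-!@~\"&/^$=<";

    // Prefixes every special character with a backslash so that Sphinx
    // treats it as literal text instead of as an operator.
    public static string Escape(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        var builder = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (SpecialCharacters.IndexOf(c) >= 0)
            {
                builder.Append('\\');
            }
            builder.Append(c);
        }
        return builder.ToString();
    }
}
```

With this sketch, a search string such as "linq (c#)" has both parentheses prefixed with a backslash, while a plain string like "hello world" passes through unchanged.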

What your match clause will look like depends on how you want Sphinx to determine which results are the most relevant to your users. As the definition of a good result can vary greatly between use cases, there is no one-size-fits-all solution, and we’re only going to do some basic stuff here. But this should be enough to get you started. To see a complete list of what the Sphinx query syntax offers, please refer to the relevant section in the Sphinx manual.

The first thing we do after escaping is split the incoming string into an array of strings, using a space as the separator. If we only have one search term, we’ll just send it to Sphinx as is. If, on the other hand, there is more than one search term, we tell Sphinx to:

  1. match the terms as a phrase (by wrapping them in ") OR
  2. match the terms as is (with an implicit AND, so documents have to contain all terms to match) OR
  3. match at least one term (by wrapping the query in " again and specifying a quorum of 1)

This way, documents that contain the given keywords as a phrase, or at least contain all keywords somewhere in the text, will be ranked higher than documents that only contain one of the keywords.

private static string BuildMatchClause(string escapedQuery)
{
    var terms = escapedQuery.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    string matchClause = String.Empty;

    if (terms.Length > 1)
    {
        matchClause = escapedQuery.ToPhrase() + " | ";
    }

    matchClause += "(" + escapedQuery + ")";

    if (terms.Length > 1)
    {
        matchClause += " | " + escapedQuery.ToPhrase(1);
    }

    return matchClause;
}

ToPhrase is an extension method for String that I've defined to make the code more readable. It also takes the quorum count as an optional parameter.
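The extension method itself is not shown here, so the following is only a sketch of what it could look like. The class name StringExtensions and the nullable quorum parameter are my assumptions; the quorum operator is written as the quoted phrase followed by a slash and the count.

```csharp
public static class StringExtensions
{
    // Wraps the value in double quotes to form a phrase query.
    // If a quorum is given, appends "/N" so that Sphinx only requires
    // N of the quoted keywords to match.
    public static string ToPhrase(this string value, int? quorum = null)
    {
        string phrase = "\"" + value + "\"";
        return quorum.HasValue ? phrase + "/" + quorum.Value : phrase;
    }
}
```

Under these assumptions, BuildMatchClause turns the escaped query hello world into the match clause "hello world" | (hello world) | "hello world"/1.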

Now that we’ve created our match clause, we’re ready to send our query to Sphinx. The complete query looks like this:

Lazy<QueryMetadata> searchMetaData;
var searchResults = FulltextSession.Query<Post>().
                                    Match(matchClause).
                                    Where(x => x.PostType == PostType.Question).
                                    Select(x => new SearchResult {
                                        Id = x.Id,
                                        Title = x.Title,
                                        CreationDate = x.CreationDate,
                                        Tags = x.Tags,
                                        Snippet = x.Body.GetSnippet(matchClause),
                                        // Boost the weight of recent posts
                                        Weight = x.CreationDate >= DateTime.Today.AddDays(-90) ?
                                                 x.Weight * 1.1 : x.Weight
                                    }).
                                    Page(page).
                                    Options(options => options.Ranker(SphinxRankMode.SPH04)).
                                    ToFutureList(out searchMetaData);

Let's walk through it: we first create a query by invoking the Query method of the IFulltextSession interface that SphinxConnector.NET provides, and specify our Post document model (see part 2 for details on this) as the document type parameter. We then pass the previously created match clause to the Match method. Next, we add a filter on PostType to tell Sphinx to only search through documents that are questions. We then use the Select method to project the results into a class named SearchResult. This is a view model class that provides our search result view with the data it needs to display the results. Note how we’re boosting the weight of results that aren’t older than 90 days. Recent posts aren’t necessarily more relevant than older ones, but I wanted to demonstrate how this can be done. As not all search results are displayed on a single page, I’ve defined an extension method named Page() as a wrapper around Limit() to improve readability. Last, we set the ranker to SPH04, which uses phrase proximity and BM25 to rank the results, but also boosts the rank of a result if it contains a search term at the very beginning or end of a text field.
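The arithmetic behind a method like Page() is simple: a page number is translated into the number of results to skip. The sketch below shows only that calculation, assuming 1-based page numbers and a fixed page size. Paging, GetOffset and DefaultPageSize are names I made up for illustration; the actual Page() method in the sample hands values like these on to Limit().

```csharp
using System;

public static class Paging
{
    // Hypothetical default; the sample may use a different page size.
    public const int DefaultPageSize = 10;

    // Returns the number of results to skip for a 1-based page number.
    public static int GetOffset(int page, int pageSize = DefaultPageSize)
    {
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));

        return (page - 1) * pageSize;
    }
}
```

With a page size of 10, the first page starts at offset 0 and the third page at offset 20.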

Finally, we invoke ToFutureList with an out parameter to retrieve the query metadata, which, among other data, contains the time Sphinx needed to execute the search. We will display this on our search result page to brag about our fast full-text search ;-).

In case you’re wondering what the call to ToFutureList means (compared to calling ToList), here’s an explanation: when we invoke this method, SphinxConnector.NET will not execute the query immediately, but will wait until the results are actually accessed. This enables SphinxConnector.NET to send more than one query to Sphinx at once, thus saving some network overhead. In the next section we’ll be creating two queries for facets that can be sent along with our first query.

Facets

Now that we’ve created our full-text query, we can move on to creating facets that are displayed along with the search results. We will be showing two facets of our query: the number of questions asked per month and the number of questions with an accepted answer.

These are probably not the most interesting facets one could display, but they are good enough to demonstrate the creation of facets with Sphinx and SphinxConnector.NET:

var resultsPerMonth = FulltextSession.Query<Post>().
                                      Match(matchClause).
                                      Where(x => x.PostType == PostType.Question).
                                      Select(x => new {
                                          Count = Projection.Count(),
                                          x.CreationDate.Month
                                      }).
                                      GroupBy(x => x.Month).
                                      ToFutureList();

var acceptedAnswers = FulltextSession.Query<Post>().
                                      Match(matchClause).
                                      Where(x => x.PostType == PostType.Question).
                                      Select(x => new {
                                          Count = Projection.Count(),
                                          HasAcceptedAnswer = x.AcceptedAnswerId > 0
                                      }).
                                      GroupBy(x => x.HasAcceptedAnswer).
                                      ToFutureList();

Conclusion

In this post we looked at how to create a basic Sphinx full-text query with SphinxConnector.NET. We did some preprocessing on the search term(s) we received from the client and created a match clause for our query. We then created queries for two facets, namely the number of questions with an accepted answer and the number of questions per month for our query. Finally, we also made use of SphinxConnector.NET’s future query capabilities to execute the queries as efficiently as possible. The complete working example is available for download here. In the next post, we'll be looking at how to implement autocomplete and correction suggestions.

Tags:

Tutorial