Automatic Generation of Document Ids With the Fluent API

by Dennis 15. August 2013 09:48

Sometimes you’ll want to insert documents into a Sphinx real-time index that don’t come with an id assigned, for instance because they don’t originate from a database. As a matter of fact, the search functionality for this website is implemented without first going through a database: the content is both scraped by a crawler and read from the corresponding files, and then stored directly in a Sphinx real-time index.

So, wouldn’t it be nice if SphinxConnector.NET could automatically assign an id to these documents before saving, so you don’t have to do it yourself? This is what a user thought and what we implemented for the next release (version 3.7).

Two new members have been added to IFulltextStore.Conventions:

public Func<object, bool> DocumentNeedsIdAssigned { get; set; }

public Func<object, object> DocumentIdGenerator { get; set; }

The first, DocumentNeedsIdAssigned, tells SphinxConnector.NET whether a document that is about to be saved needs an id assigned. Its default implementation returns false if the document in question already has an id with a value greater than 0, and true otherwise.
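To make the contract of such a predicate more concrete, here is a minimal sketch of a function with the same Func<object, bool> shape. It is not SphinxConnector.NET's built-in default; the class name IdConventions and the reflection lookup of a property called "Id" are assumptions made for illustration.

```csharp
using System;
using System.Reflection;

public static class IdConventions
{
    // Illustrative only: returns true when the document exposes a public
    // "Id" property whose value is missing or not greater than 0.
    public static bool NeedsIdAssigned(object document)
    {
        if (document == null) throw new ArgumentNullException("document");

        PropertyInfo idProperty = document.GetType().GetProperty("Id");
        if (idProperty == null)
            return false; // no id property, so there is nothing to assign

        object value = idProperty.GetValue(document, null);
        return value == null || Convert.ToInt64(value) <= 0;
    }
}
```

A method like this could presumably be assigned to Conventions.DocumentNeedsIdAssigned as a method group, since its signature matches the property type shown above.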

The second, DocumentIdGenerator, is responsible for generating the id. It receives the document that needs to have an id generated (in case the generator has to inspect it before creating an id) and should return the generated id. There is no default generator, so one needs to be provided by the developer for this functionality to work. Let’s have a look at some id generation algorithms you could use:

Id Generation Algorithms

For an algorithm to be suitable for use with Sphinx, it has to generate a unique, positive 32-bit or 64-bit integer for each document. There are a couple of id generation algorithms that fulfill this requirement: for example, a simple sequential generation algorithm could be used, or the hi/lo algorithm, which some of you might know from NHibernate. Both need some kind of persistent storage that provides transactional semantics to avoid generating duplicate ids. If you are only performing a one-time import, though, you could use a simple sequential id generator that increments id values in memory.
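For the one-time import case, an in-memory sequential generator can be as small as the following sketch. The class name SequentialIdGenerator is hypothetical; the ids are only unique within a single process run, so it is not a replacement for a generator backed by persistent storage.

```csharp
using System.Threading;

public sealed class SequentialIdGenerator
{
    private long lastId;

    // lastUsedId lets an import continue after the highest id already in use.
    public SequentialIdGenerator(long lastUsedId = 0)
    {
        lastId = lastUsedId;
    }

    // Thread-safe: Interlocked.Increment returns the incremented value atomically.
    public long NextId()
    {
        return Interlocked.Increment(ref lastId);
    }
}
```

Starting from 0, the first id handed out is 1, which satisfies the requirement that ids be positive.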

Alternatively, you could use an id generator such as Snowflake (created by Twitter) or Flake. Snowflake has a .NET port, and RustFlakes is a derivative of Flake for .NET that is also available via NuGet. Both can generate unique, positive 64-bit integers without the need for persistent storage.
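To give an idea of how such generators avoid persistent storage, here is a simplified sketch loosely modelled on the bit layout Snowflake is known for (a millisecond timestamp, a 10-bit worker id and a 12-bit sequence). It is not the code of the Snowflake .NET port or of RustFlakes; the class name, the custom epoch and the error handling are assumptions, and for real use one of the existing libraries is the better choice.

```csharp
using System;

public sealed class FlakeStyleIdGenerator
{
    private const int WorkerIdBits = 10;
    private const int SequenceBits = 12;
    private const long MaxWorkerId = (1L << WorkerIdBits) - 1;
    private const long SequenceMask = (1L << SequenceBits) - 1;

    // Assumed custom epoch; timestamps are milliseconds elapsed since this date.
    private static readonly DateTime Epoch =
        new DateTime(2013, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    private readonly object sync = new object();
    private readonly long workerId;
    private long lastTimestamp = -1;
    private long sequence;

    public FlakeStyleIdGenerator(long workerId)
    {
        if (workerId < 0 || workerId > MaxWorkerId)
            throw new ArgumentOutOfRangeException("workerId");
        this.workerId = workerId;
    }

    public long NextId()
    {
        lock (sync)
        {
            long timestamp = CurrentMilliseconds();
            if (timestamp < lastTimestamp)
                throw new InvalidOperationException("Clock moved backwards.");

            if (timestamp == lastTimestamp)
            {
                sequence = (sequence + 1) & SequenceMask;
                if (sequence == 0)
                {
                    // Sequence exhausted for this millisecond: wait for the next one.
                    while (timestamp <= lastTimestamp)
                        timestamp = CurrentMilliseconds();
                }
            }
            else
            {
                sequence = 0;
            }

            lastTimestamp = timestamp;

            // Layout: [timestamp][10 bits worker id][12 bits sequence]
            return (timestamp << (WorkerIdBits + SequenceBits))
                 | (workerId << SequenceBits)
                 | sequence;
        }
    }

    private static long CurrentMilliseconds()
    {
        return (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;
    }
}
```

Uniqueness across machines depends on every process using a different worker id, which is the one piece of configuration this approach still needs.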

Example

As stated above, this site’s search functionality is based on Sphinx with the documents being stored in a real-time index. Indexing is done by crawling the site and directly storing the documents in the index, which requires us to generate an id.

Before, we generated and assigned the id manually prior to each call to Save. Now, we assign the id generation method to IFulltextStore.Conventions.DocumentIdGenerator and SphinxConnector.NET will take care of both invoking this method and assigning the id on each call to Save:

var idGenerator = new IdGenerator();
var fulltextStore = new FulltextStore().Initialize();
fulltextStore.Conventions.DocumentIdGenerator = _ => idGenerator.NextId();

using (var session = fulltextStore.StartSession())
{
    var webpage = new Webpage //<-- Our document, note that we don’t assign an id
    {
        Title = "SphinxConnector.NET",
        Url = "http://www.sphinxconnector.net",
        Content = "Sphinx .NET API"
    };

    session.Save(webpage); //<-- Id will be generated and assigned to document here
    session.FlushChanges();
}

Conclusion

This new functionality allows you to delegate the task of generating an id and assigning it to a document to SphinxConnector.NET. By using existing id generators such as Snowflake or RustFlakes, you can create your setup within a couple of minutes and just a few lines of code.

Tags:

How-to

Doing Full-Text Searches in .NET with Sphinx - Part 3: Searching

by Dennis 9. July 2013 16:38

Now that we’ve successfully imported our data into Sphinx in the previous part of this series, we can finally start to implement searching! As before, the current state of the sample is available for download. Before we start though, I want to make an addendum to the last post.

Word Forms

 

In the last post I forgot to set up a file containing word forms. With the help of word forms, you can remap one or more words to another. This feature can be used to normalize different word forms to one normal form or to map words to commonly used abbreviations. If you index text such as that from Stack Overflow, it practically screams for a list of word forms that maps words to their abbreviations, because there are many that are commonly used in the programming world, e.g. VS for Visual Studio, NH for NHibernate, HG for Mercurial and so on. I’ve therefore added a small word forms file to the sample that contains a few of these abbreviations.
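As an illustration, a word forms file for the abbreviations mentioned above might look like the following. The entries are hypothetical and not necessarily those in the sample's file; each line maps a source form to a destination form using Sphinx' "source > destination" notation.

```
visual studio > vs
nhibernate > nh
mercurial > hg
```

The file is referenced from the index configuration through the wordforms directive, for example wordforms = wordforms.txt (the file name is an assumption).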

Searching

 

To create and execute our search queries we'll use SphinxConnector.NET's fluent query API which is similar to LINQ. It translates the queries you create in code to SphinxQL and maps the query results back to an object of your choice.

Query Preprocessing and Match Clause Creation

Upon receiving a search string from the client, you usually don’t want to send it to Sphinx as is. At the very least, you should escape the characters in the search string that would otherwise be interpreted as operators or modifiers of Sphinx’ extended query syntax (instead of escaping them, you might also choose to remove them from incoming search strings completely). To escape the incoming query string, we’ll use SphinxConnector.NET’s SphinxHelper.EscapeString() method.

What your match clause will look like depends on how you want Sphinx to determine which results are the most relevant to your users. As the definition of a good result can vary greatly between use cases, there is no one-size-fits-all solution, and we’re only going to cover the basics here. But this should be enough to get you started. For a complete list of what the Sphinx query syntax offers, please refer to the relevant section in the Sphinx manual.
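To show what escaping means in practice, here is a minimal hand-rolled sketch that prefixes each special character with a backslash. It is not the implementation of SphinxHelper.EscapeString(); the class name QueryEscaper is hypothetical, and the character list is an assumption based on the operators of the extended query syntax that may differ between Sphinx versions.

```csharp
using System.Text;

public static class QueryEscaper
{
    // Assumed set of characters with special meaning in the extended query syntax.
    private const string SpecialChars = "\\()|-!@~\"&/^$=<";

    public static string Escape(string query)
    {
        var result = new StringBuilder(query.Length * 2);

        foreach (char c in query)
        {
            // Put a backslash in front of every special character.
            if (SpecialChars.IndexOf(c) >= 0)
                result.Append('\\');

            result.Append(c);
        }

        return result.ToString();
    }
}
```

In the sample itself the library method is the better choice, since it is maintained alongside the rest of the API.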

The first thing we do after escaping, is to split our incoming string into an array of strings by using ‘space’ as the separator. If we only have one search term, we’ll just send it to Sphinx as is. If on the other hand, there is more than one search term, we tell Sphinx to:

  1. match the terms as a phrase (by wrapping them in ") OR
  2. match the terms as is (with an implicit AND, so documents have to contain all terms to match)  OR
  3. match at least one term, by wrapping the query in " again and specifying a quorum of 1

This way, documents that contain the given keywords as a phrase or at least all keywords somewhere in the text, will be ranked higher than documents that only contain one of the keywords.

private static string BuildMatchClause(string escapedQuery)
{
    var terms = escapedQuery.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    string matchClause = String.Empty;

    if (terms.Length > 1)
    {
        matchClause = escapedQuery.ToPhrase() + " | ";
    }

    matchClause += "(" + escapedQuery + ")";

    if (terms.Length > 1)
    {
        matchClause += " | " + escapedQuery.ToPhrase(1);
    }

    return matchClause;
}

ToPhrase is an extension method for String, that I've defined to make the code more readable. It also takes the quorum count as an optional parameter.
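The post does not list ToPhrase itself, so here is a sketch of what such an extension method could look like. The body is an assumption; it relies on Sphinx' quorum syntax, in which a quoted list of keywords is followed by a slash and the required number of matches.

```csharp
public static class StringExtensions
{
    // Wraps the terms in double quotes; with a quorum, appends "/N" so that
    // at least N of the quoted keywords have to match.
    public static string ToPhrase(this string terms, int? quorum = null)
    {
        string phrase = "\"" + terms + "\"";
        return quorum.HasValue ? phrase + "/" + quorum.Value : phrase;
    }
}
```

With this definition, the search string sphinx net api would yield the match clause "sphinx net api" | (sphinx net api) | "sphinx net api"/1.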

Now that we’ve created our match clause we’re ready to send our query to Sphinx. The complete query looks like this:

Lazy<QueryMetadata> searchMetaData;
var searchResults = FulltextSession.Query<Post>().
                                    Match(matchClause).
                                    Where(x => x.PostType == PostType.Question).
                                    Select(x => new SearchResult {
                                        Id = x.Id,
                                        Title = x.Title,
                                        CreationDate = x.CreationDate,
                                        Tags = x.Tags,
                                        Snippet = x.Body.GetSnippet(matchClause),
                                        //Boost weight of recent posts
                                        Weight = x.CreationDate >= DateTime.Today.AddDays(-90) ?
                                            x.Weight * 1.1 : x.Weight
                                    }).
                                    Page(page).
                                    Options(options => options.Ranker(SphinxRankMode.SPH04)).
                                    ToFutureList(out searchMetaData);

Let's walk through it: we first create a query by invoking the Query method of the IFulltextSession interface that SphinxConnector.NET provides and specify our Post document model (see part 2 for details on this) as the document type parameter. We then pass the previously created match clause to the Match method. Next, we add a filter on PostType and tell Sphinx to only search through documents that are questions. We now use the Select method to project the results into a class named SearchResult. This is a view model class that we use to provide our search result view with the necessary data to display the results. Note how we’re boosting the weight of results that aren’t older than 90 days (though these aren’t necessarily more relevant than older posts, but I wanted to demonstrate how this can be done). As not all search results are displayed on a single page, I’ve defined an extension method named Page(), as a wrapper around Limit() to improve readability. Last, we set the ranker to SPH04, which uses phrase proximity and BM25 to rank the results, but also boosts the rank of a result if it contains a search term at the very beginning or end of a text field.

Finally, we invoke ToFutureList with an out parameter to retrieve the query meta data which, among other data, contains the time Sphinx needed to execute the search. We will display this on our search result page, to brag about our fast full-text search ;-)

In case you’re wondering what the call to ToFutureList means (compared to calling ToList), here’s an explanation: when we invoke this method, SphinxConnector.NET will not immediately execute the query, but rather wait until the results are actually accessed. This enables SphinxConnector.NET to send more than one query to Sphinx, thus saving some network overhead. In the next section we’ll be creating two queries for facets that can be sent along with our first query.
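The Page() wrapper mentioned in the walk-through above essentially has to turn a page number into an offset for Limit(). The arithmetic can be sketched as follows; the class name, the 1-based page numbering and the page size of 10 are assumptions, and the exact signature of Limit() is not shown in this post.

```csharp
using System;

public static class Paging
{
    public const int DefaultPageSize = 10; // assumed page size

    // Returns the number of results to skip for a 1-based page number.
    public static int Offset(int page, int pageSize = DefaultPageSize)
    {
        if (page < 1) throw new ArgumentOutOfRangeException("page");
        if (pageSize < 1) throw new ArgumentOutOfRangeException("pageSize");

        return (page - 1) * pageSize;
    }
}
```

The resulting offset and the page size would then be handed to Limit() inside the extension method.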

Facets

Now that we’ve created our full-text query, we can move on to creating facets that are displayed along with the search result. We will be showing two facets of our query: the number of questions asked per month and the number of questions with an accepted answer.

These are probably not the most interesting facets one could display, but they are good enough to demonstrate the creation of facets with Sphinx and SphinxConnector.NET:

var resultsPerMonth = FulltextSession.Query<Post>().
                                      Match(matchClause).
                                      Where(x => x.PostType == PostType.Question).
                                      Select(x => new {
                                          Count = Projection.Count(),
                                          x.CreationDate.Month
                                      }).
                                      GroupBy(x => x.Month).
                                      ToFutureList();

var acceptedAnswers = FulltextSession.Query<Post>().
                                      Match(matchClause).
                                      Where(x => x.PostType == PostType.Question).
                                      Select(x => new {
                                          Count = Projection.Count(),
                                          HasAcceptedAnswer = x.AcceptedAnswerId > 0
                                      }).
                                      GroupBy(x => x.HasAcceptedAnswer).
                                      ToFutureList();

Conclusion

 

In this post we looked at how to create a basic Sphinx full-text query with SphinxConnector.NET. We did some preprocessing on the search term(s) we received from the client and created a match clause for our query. We then created queries for two facets, namely the number of questions with an accepted answer and the number of questions per month for our query. Finally, we also made use of SphinxConnector.NET’s future query capabilities to execute the queries as efficiently as possible. The complete working example is available for download here. In the next post, we'll be looking at how to implement autocomplete and correction suggestions.

Tags:

Tutorial

SphinxConnector.NET 3.6 has been released

by Dennis 13. June 2013 11:02

We're pleased to announce that SphinxConnector.NET 3.6 is available for download and via NuGet!

This release adds support for features that have been added to the current Sphinx development branch (Sphinx 2.2.1). This includes the ability to filter on JSON strings with the native API and support for the new Tf-Idf option, which has been added to both fluent and native API.

The fluent API now also supports the max_predicted_time option (Sphinx 2.1.1) and retrieving predicted metadata values (Sphinx 2.2.1). Additionally a regression in the FlushChanges method has been fixed, where internal exceptions weren't correctly wrapped in a FulltextException.

As always, the complete list of changes is available in the version history.

Tags:

Announcements

SphinxConnector.NET 3.5 has been released

by Dennis 14. May 2013 09:35

We're pleased to announce that SphinxConnector.NET 3.5 is available for download and via NuGet!

The new version introduces support for subqueries with the fluent query API. Additionally, supported DateTime properties can now be used directly in a group-by clause. And speaking of group-by, we've added support for selecting the n best results of a group, for those of you that are already experimenting with Sphinx 2.2.1!

We've also fixed two bugs and made some more optimizations to the SphinxQL and fluent API.

Tags:

Announcements

Doing Full-Text Searches in .NET with Sphinx - Part 2: Data Import

by Dennis 18. April 2013 11:52

This is the second post in a series that is intended as an introduction to Sphinx for .NET developers who have not yet heard of Sphinx and are looking for a powerful full-text search engine for their websites or applications. The first part of this series served as an introduction to Sphinx; in this part we’ll get our hands dirty and start working on our ASP.NET MVC based sample application.

We’ll be creating a simple website that will allow us to search through Stack Overflow data (or rather the posts made by Stack Overflow users). The Stack Overflow team kindly provides this data under a Creative Commons license in an XML format. As the dataset is pretty large (the posts alone are 7 GB in size, uncompressed), we’ll only use about 10,000 posts for demonstration purposes. A file with only these posts is included in the download, so you won’t have to download the whole archive.

Having an existing set of data that needs to be imported into Sphinx is probably the most common scenario when people start to work with Sphinx. We will begin by creating the document model class and the Sphinx configuration file. We will be storing the documents in a real-time index, so we also need to create a small program to read the input data from the XML file.

The current state of this sample application is available for download. It uses .NET 4.0, ASP.NET MVC3, Sphinx 2.1.1 and SphinxConnector.NET, and contains a Visual Studio 2010 solution with the console application for importing data and the web application, which will be described in detail in the next part. Please note that the layout of the website is simple and might have some display quirks, as I don’t want to spend too much time on making it look pretty.

Creating the Document Model


The first step is to determine which part of the data should be stored in the index. One approach is to store only the bare minimum that is needed for full-text searches. Often, this requires that after the search the original data is retrieved from the data source, most likely a database, to display a meaningful search result to the user. As this requires one or more additional network round trips, I personally like to store as much data in the index as is needed (and feasible) to avoid this overhead.

In this example we’ll be storing everything in the index, because it reduces the complexity of the example. For instance, we use Sphinx to display the questions on the front page, which does not involve searching at all. But storing your complete dataset in Sphinx is usually not a good idea, because it can drastically increase the resources needed by Sphinx. Also, Sphinx isn’t a database and shouldn’t be treated as such; for instance, it doesn’t execute certain types of queries as efficiently as a database would (though there are some tricks for that). We are doing this here for the sake of simplicity, and it might even be feasible in some cases, but it should be done with care!

Let’s take a look at our input data. The Stack Overflow data archive contains a readme file with a description of the input data. Here is the relevant part for the posts.xml file which we are interested in:

posts.xml
       - Id
       - PostTypeId
          - 1: Question
          - 2: Answer
       - ParentID (only present if PostTypeId is 2)
       - AcceptedAnswerId (only present if PostTypeId is 1)
       - CreationDate
       - Score
       - ViewCount
       - Body
       - OwnerUserId
       - LastEditorUserId
       - LastEditorDisplayName="Jeff Atwood"
       - LastEditDate="2009-03-05T22:28:34.823"
       - LastActivityDate="2009-03-11T12:51:01.480"
       - CommunityOwnedDate="2009-03-11T12:51:01.480"
       - ClosedDate="2009-03-11T12:51:01.480"
       - Title=
       - Tags=
       - AnswerCount
       - CommentCount
       - FavoriteCount

We’ll leave out some attributes that are not needed in our example which leads to the following model class:

public class Post
{
    public int Id { get; set; }

    public PostType PostType { get; set; }
    public long ParentId { get; set; }
    public long AcceptedAnswerId { get; set; }
    public string Body { get; set; }
    public string Title { get; set; }
    public string Tags { get; set; }
    public long Score { get; set; }
    public long ViewCount { get; set; }
    public long AnswerCount { get; set; }
    public DateTime CreationDate { get; set; }
    
    public int Weight { get; set; }
}

You might have already noticed that we’ve added a property named ‘Weight’ to our document class which is not present in the data source. When Sphinx searches the index for results it assigns each match a weight depending on its relevancy to the search query. The higher the weight, the more relevant it is for the query, and because we might want to use the weight in our own relevancy calculations, we need to add a property for it to our model.

Configuring Sphinx


Next we need to create the configuration for the Sphinx server. By default Sphinx will look for a file named sphinx.conf in the directory where the executable resides so we’ll name it just that. The configuration file consists of several sections: one for the server itself, one for the indexer program which is not relevant for our sample, and the index configuration. For our documents the index configuration looks like this:

index posts
{
    type                    = rt
    path                    = posts
    
    rt_field                = body 
    rt_attr_string          = body 
    rt_field                = title     
    rt_attr_string          = title
    rt_field                = tags          
    rt_attr_string          = tags 

    rt_attr_uint            = parentid
    rt_attr_uint            = score 
    rt_attr_uint            = viewcount
    rt_attr_uint            = posttype
    rt_attr_uint            = answercount
    rt_attr_timestamp       = creationdate    
    rt_attr_uint            = acceptedanswerid
    rt_attr_uint            = votecount
    
    charset_type            = utf-8
    min_word_len            = 1    
    dict                    = keywords                    
    expand_keywords         = 1                            
    rt_mem_limit            = 1024M
    
    charset_table           = 0..9, a..z, A..Z->a..z, U+DF, \
                              U+FC->u, U+DC->u, U+FC,U+DC, \
                              U+F6->o, U+D6->o, U+F6,U+D6, \
                              U+E4->a, U+C4->a, U+C4,U+E4, \
                              U+E1->a, U+C1->a, U+E9->e, U+C9->e, \
                              U+410..U+42F->U+430..U+44F, U+430..U+44F, U+00E6, \
                              U+00C6->U+00E6, U+01E2->U+00E6, U+01E3->U+00E6, \
                              U+01FC->U+00E6, U+01FD->U+00E6, U+1D01->U+00E6, \
                              U+1D02->U+00E6, U+1D2D->U+00E6, U+1D46->U+00E6
    
    blend_chars             = U+23, U+2B, -, ., @, &

    stopwords               = stopwords.txt
}

Let’s look through the different configuration options: first we specify the index type and the path where the index files should be stored. Next we declare the fields and attributes; it is not necessary to explicitly declare the id attribute, Sphinx does that automatically for us.

The next thing to note is the declaration of the full-text fields which are the fields that are searched by Sphinx when it processes a full-text query. In order to be able to retrieve the contents of these fields, so that we can display them to the user, we also declare string attributes for each of them. Why is it necessary to declare both? When the keywords are inserted into the index, they are hashed. This implies that there is no way to retrieve the original data that was inserted. This is where string attributes come into play. They enable us to store and retrieve the original content, which we then use to display the search result to the user, which allows us to avoid accessing the database altogether.

Next we specify the charset to use for our documents, which is utf-8 in our case. We use a minimum word length of one and set the dictionary type to ‘keywords’. We also enable Sphinx’ internal keyword expansion to allow for partial word matches, and set the memory limit for RT index RAM chunks to 1024 MB.

The charset table is up next. It defines which characters are recognized by Sphinx and also allows remapping one character to another. It is important to understand that if a character is not in the charset table, Sphinx will treat that character as a separator during indexing. To configure the charset table we can use both the character itself and/or its Unicode number. In our charset table we first add numbers from 0 to 9, lower case chars and remap upper case chars to lower case chars. Additionally we define some more mappings, e.g. German umlauts are mapped to their non-umlaut equivalents, like ü -> u. When defining your own charset table you can refer to the Sphinx wiki, which contains a page with charset tables.

Lastly, we define the blend chars and configure a file with stop words. Blend chars are characters that are treated by Sphinx as both regular characters and separators. Suppose that you’re indexing email addresses and configure the ‘@’ symbol as a blend char: an email address such as ‘foo@bar.com’ will then be indexed both as a whole and as ‘foo’ and ‘bar.com’. This enables users to find an email address even if they don’t remember the domain, or to search for addresses that belong to a particular domain.

The stop words file contains a list of words that should be ignored by Sphinx during indexing. While any word can be configured as a stop word, most of the time frequently occurring words such as ‘the’, ‘is’, etc. are configured as stop words because it reduces the size of the full-text index and thereby improves query times.

Importing the Documents


To import the Stack Overflow data we create a simple console application that reads the input data from the XML file and uses SphinxConnector.NET’s fluent API to save the data into the index. Here’s the ImportPosts method:

public static void ImportPosts()
{
    IFulltextStore fulltextStore = new FulltextStore().Initialize();

    using (IFulltextSession session = fulltextStore.StartSession())
    {
        session.Advanced.TruncateIndex<Post>();

        int count = 0;

        foreach (var post in GetPosts())
        {
            session.Save(post);

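            // Flush once every SaveBatchSize documents
            if (++count % fulltextStore.Settings.SaveBatchSize == 0)
                session.FlushChanges();
        }

        // Flush the remaining documents of the last, partial batch
        session.FlushChanges();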
    }
}

The FulltextStore class is the entry point to the fluent API. It is used to configure the connection string, mapping conventions and other settings. It also serves as a factory for creating instances of the IFulltextSession interface which provides the methods we need to execute full-text queries and to save and delete documents from RT-indexes. In the above code we’re creating an instance of the IFulltextSession interface by calling StartSession and then proceed to insert the documents. The GetPosts method is responsible for reading the documents from the XML file.
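GetPosts is not listed in this post. As a rough idea of what such a reader could look like, the following sketch streams the file with XmlReader, so that even a multi-gigabyte dump never has to be loaded into memory at once. The PostRow and PostReader names are hypothetical, only a few attributes are mapped, and the assumption that each post is a row element carrying its values as XML attributes is based on the publicly available dump format.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

// Simplified stand-in for the Post model, to keep the sketch self-contained.
public class PostRow
{
    public int Id { get; set; }
    public int PostTypeId { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }
    public DateTime CreationDate { get; set; }
}

public static class PostReader
{
    // Lazily yields one PostRow per <row> element found by the reader.
    public static IEnumerable<PostRow> ReadPosts(XmlReader reader)
    {
        while (reader.Read())
        {
            if (reader.NodeType != XmlNodeType.Element || reader.Name != "row")
                continue;

            yield return new PostRow
            {
                Id = int.Parse(reader.GetAttribute("Id"), CultureInfo.InvariantCulture),
                PostTypeId = int.Parse(reader.GetAttribute("PostTypeId"), CultureInfo.InvariantCulture),
                // Title is missing on answers, so fall back to an empty string.
                Title = reader.GetAttribute("Title") ?? String.Empty,
                Body = reader.GetAttribute("Body") ?? String.Empty,
                CreationDate = DateTime.Parse(
                    reader.GetAttribute("CreationDate"),
                    CultureInfo.InvariantCulture,
                    DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal)
            };
        }
    }
}
```

A caller would open the file with XmlReader.Create("posts.xml") inside a using block and iterate over the result, much like the foreach loop in ImportPosts does.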

Conclusion


In the second installment of this series we created our document model and a Sphinx real-time index setup for these documents. We then created a small console application that performs the import of the documents from the source XML file. In the next part, we’ll create a website that’ll allow us to search the index we just created.

Tags:

Tutorial