Performance Improvements in SphinxConnector.NET 4.0

by Dennis 23. March 2018 14:02

As we're preparing the release of SphinxConnector.NET 4.0 I'd like to share some improvements we have made since V3. Lets start with performance: as many of you might be aware, there currently is a strong focus in the .NET ecosystem on improving performance, often done by reducing memory consumption, especially in low-level code. Case in point: .NET Core. The development of .NET Core is driven by a strong focus on providing great performance. This is accomplished, among other things, by using less memory to eliminate unnecessary garbage that would need to be collected by the GC. This is especially true for the aforementioned low-level stuff that might be called several hundred or even thousand times a second,e.g. in a very busy web application or API.

During the development of SphinxConnector.NET we took it upon ourselves to reduce the memory that is being used by the fluent API and SphinxQL. Following are two benchmarks (created with the great BenchmarkDotNet) based on our Stackoverflow with Sphinx sample web app. The benchmarked methods are based on the search method from the sample, which executes three queries: the search itself and two facet queries, so it resembles a scenario which is akin to what is done in a real-world application. The same queries are executed via plain SphinxQL and the fluent API. The benchmark is run against Manticore Search 2.6.2 on a Ubuntu Server 16 VM with 1 GB RAM and 1 CPU core, the search term used is ‘c#’.

BenchmarkDotNet=v0.10.12, OS=Windows 7 SP1 (6.1.7601.0)
Intel Core i5-3570 CPU 3.40GHz (Ivy Bridge), 1 CPU, 4 logical cores and 4 physical cores
Frequency=3330156 Hz, Resolution=300.2862 ns, Timer=TSC
  [Host]     : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2558.0
  Job-MZUNZG : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2558.0
Jit=RyuJit  Platform=X64  Force=False  
MethodMeanErrorStdDevOp/sGen 0Allocated
SearchSphinxQL 7.421 ms 0.1472 ms 0.1305 ms 134.76 23.4375 72.07 KB
SearchFluentAPI 10.555 ms 0.2626 ms 0.2579 ms 94.75 93.7500 314.95 KB

 

As you can see, there’s significantly less memory being allocated when queries are executed directly via SphinxQL compared to the fluent API. Some of this is of course expected, as the fluent API does the heavy lifting of creating the queries and the result objects for you, which comes with a cost. But there’s certainly room for improvement as you can see from the following results with SphinxConnector.NET 4.0:

MethodMeanErrorStdDevOp/sGen 0Allocated
SearchSphinxQL 7.552 ms 0.1415 ms 0.2441 ms 132.4 7.8125 45.99 KB
SearchFluentAPI 9.656 ms 0.1931 ms 0.4030 ms 103.6 31.2500 131.81 KB

 

Memory usage with SphinxQL has been reduced by about a third and Gen0 collections are down by over 60%! The reduction seen with the fluent API is even higher with more than 50% of allocations now gone and Gen0 collections reduced to a third of their previous value!

So what did we do to achieve these numbers? In this release we focused on addressing all the obvious issues that popped up under profiling and manual code inspections: avoid unnecessary copying of data, avoid or defer the creation of objects when possible, reuse buffers etc. This mostly applies to the SphinxQL infrastructure which needed a major overhaul anyway, to provide async implementations of query methods.

One thing that had a major impact on the fluent API, is the use of the awesome FastExpressionCompiler to compile expressions. It not only compiles expression trees faster than Expression.Compile() but also often produces faster code, which both results in an increase of Op/s.

To conclude, SphinxConnector.NET 4.0 comes with major improvements in this area, but there’s still room for more optimizations, which will be done over the next releases, so stay tuned!

Tags: ,

Doing Full-Text Searches in .NET with Sphinx - Part 2: Data Import

by Dennis 18. April 2013 11:52

This is the second post in a series that is intended as an introduction to Sphinx for .NET developers who have not yet heard of Sphinx and are looking for a powerful full-text search engine for their websites or applications. The first part of this series served as an introduction to Sphinx, in this part we’ll get our hands dirty and start working on our ASP.NET MVC based sample application.

We’ll be creating a simple website that will allow us to search through Stackoverflow data (or rather the posts made by Stackoverflow users). The Stackoverflow team kindly provides this data under a Creative Commons license in an XML format. As the dataset is pretty large, just the posts are 7 GB in size (uncompressed), we’ll only use about 10,000 posts for demonstration purposes. A file with only these posts is included in the download so you won’t have to download the whole archive.

Having an existing set of data that needs to be imported into Sphinx is probably the most common scenario when people start to work with Sphinx. We will begin by creating the document model class and the Sphinx configuration file. We will be storing the documents in a real-time index, so we also need to create a small program to read the input data from the XML file.

The current state of this sample application is available for download, it uses .NET 4.0, ASP.NET MVC3, Sphinx 2.1.1, SphinxConnector.NET and contains a Visual Studio 2010 solution. It contains the console application for importing data and the web application which will be described in detail in the next part. Please note that the layout of the website is simple and might have some display quirks as I don’t want to spend too much time on making it look pretty. 

Creating the Document Model


The first step is to determine which part of the data should be stored in the index. One approach is to store only the bare minimum that is needed for full-text searches. Often, this requires that after the search the original data is retrieved from the data source, most likely a database, to display a meaningful search result to the user. As this requires one or more additional network round trips, I personally like to store as much data in the index as is needed (and feasible) to avoid this overhead.

In this example we’ll be storing everything in the index, because it reduces the complexity of the example. For instance, we use Sphinx to display the questions on the front page, which does not involve searching at all. But, storing your complete dataset in Sphinx is usually not a good idea because it can drastically increase the resources needed by Sphinx. Also Sphinx isn’t a database and shouldn’t be treated as such, for instance it doesn’t execute certain types of queries as efficient as a database would (though there are a some tricks for that). We are doing this here for the sake of simplicity, and it might even be feasible in some cases, but should be done with care!

Let’s take a look at our input data. The Stackoverflow data archive contains a readme file with a description of the input data. Here is the relevant part for the posts.xml file which we are interested in:

**posts**.xml
       - Id
       - PostTypeId
          - 1: Question
          - 2: Answer
       - ParentID (only present if PostTypeId is 2)
       - AcceptedAnswerId (only present if PostTypeId is 1)
       - CreationDate
       - Score
       - ViewCount
       - Body
       - OwnerUserId
       - LastEditorUserId
       - LastEditorDisplayName="Jeff Atwood"
       - LastEditDate="2009-03-05T22:28:34.823"
       - LastActivityDate="2009-03-11T12:51:01.480"
       - CommunityOwnedDate="2009-03-11T12:51:01.480"
       - ClosedDate="2009-03-11T12:51:01.480"
       - Title=
       - Tags=
       - AnswerCount
       - CommentCount
       - FavoriteCount

We’ll leave out some attributes that are not needed in our example which leads to the following model class:

public class Post
{
    public int Id { get; set; }

    public PostType PostType { get; set; }
    public long ParentId { get; set; }
    public long AcceptedAnswerId { get; set; }
    public string Body { get; set; }
    public string Title { get; set; }
    public string Tags { get; set; }
    public long Score { get; set; }
    public long ViewCount { get; set; }
    public long AnswerCount { get; set; }
    public DateTime CreationDate { get; set; }
    
    public int Weight { get; set; }
}

You might have already noticed that we’ve added a property named ‘Weight’ to our document class which is not present in the data source. When Sphinx searches the index for results it assigns each match a weight depending on its relevancy to the search query. The higher the weight, the more relevant it is for the query, and because we might want to use the weight in our own relevancy calculations, we need to add a property for it to our model.

Configuring Sphinx


Next we need to create the configuration for the Sphinx server. By default Sphinx will look for a file named sphinx.conf in the directory where the executable resides so we’ll name it just that. The configuration file consists of several sections: one for the server itself, one for the indexer program which is not relevant for our sample, and the index configuration. For our documents the index configuration looks like this:

index posts
{
    type                    = rt
    path                    = posts
    
    rt_field                = body 
    rt_attr_string          = body 
    rt_field                = title     
    rt_attr_string          = title
    rt_field                = tags          
    rt_attr_string          = tags 

    rt_attr_uint            = parentid
    rt_attr_uint            = score 
    rt_attr_uint            = viewcount
    rt_attr_uint            = posttype
    rt_attr_uint            = answercount
    rt_attr_timestamp       = creationdate    
    rt_attr_uint            = acceptedanswerid
    rt_attr_uint            = votecount
    
    charset_type            = utf-8
    min_word_len            = 1    
    dict                    = keywords                    
    expand_keywords         = 1                            
    rt_mem_limit            = 1024M
    
    charset_table           = 0..9, a..z, A..Z->a..z, U+DF, \
                              U+FC->u, U+DC->u, U+FC,U+DC, \
                              U+F6->o, U+D6->o, U+F6,U+D6, \
                              U+E4->a, U+C4->a, U+C4,U+E4, \
                              U+E1->a, U+C9->a, U+E9->e, U+C9->e, \
                              U+410..U+42F->U+430..U+44F, U+430..U+44F, U+00E6, \
                              U+00C6->U+00E6, U+01E2->U+00E6, U+01E3->U+00E6, \
                              U+01FC->U+00E6, U+01FD->U+00E6, U+1D01->U+00E6, \
                              U+1D02->U+00E6, U+1D2D->U+00E6, U+1D46->U+00E6
    
    blend_chars             = U+23, U+2B, -, ., @, &

stopwords               = stopwords.txt }

Let’s look through the different configuration options: first we specify the index type and the path where the index files should be stored. Next we declare the fields and attributes; it is not necessary to explicitly declare the id attribute, Sphinx does that automatically for us.

The next thing to note is the declaration of the full-text fields which are the fields that are searched by Sphinx when it processes a full-text query. In order to be able to retrieve the contents of these fields, so that we can display them to the user, we also declare string attributes for each of them. Why is it necessary to declare both? When the keywords are inserted into the index, they are hashed. This implies that there is no way to retrieve the original data that was inserted. This is where string attributes come into play. They enable us to store and retrieve the original content, which we then use to display the search result to the user, which allows us to avoid accessing the database altogether.

Next we specify the charset to use for our documents which is utf8 in our case. For the minimum word length we use one, the keywords dictionary type is set to ‘keywords’, we also enable Sphinx’ internal keyword expansion to allow for partial word matches and set the memory limit for RT index RAM chunks to 1024MB.

The charset table is up next. It defines which characters are recognized by Sphinx and also allows to remap a character to another. It is important to understand that if a character is not in the charset table, Sphinx will treat that character as a separator during indexing. To configure the charset table we can use both the character itself and/or its Unicode number. In our charset table we first add numbers from 0 to 9, lower case chars and remap upper case chars to lower case chars. Additionally we define some more mappings e.g. German umlauts are mapped to their non-umlaut equivalent like ü –> u. When defining your own charset table you can refer to the Sphinx wiki which contains a page with charset tables.

Lastly, we define the blend chars and configure a file with stop words. Blend chars are characters that are treated by Sphinx as both regular characters and separators. Suppose that you’re indexing email addresses and configure the ‘@’ symbol as a blend char, an email address such as ‘foo@bar.com’ will be indexed as a whole and as ‘foo’ and ‘bar.com’. This enables users to find an email address even if they don’t remember the domain or search for addresses that belong to a particular domain.

The stop words file contains a list of words that should be ignored by Sphinx during indexing. While any word can be configured as a stop word, most of the time frequently occurring words such as ‘the’, ‘is’, etc. are configured as stop words because it reduces the size of the full-text index and thereby improves query times.

Importing the Documents


To import the Stackoverflow data we create a simple console application that reads the input data from the XML file and uses SphinxConnector.NET’s fluent API to save the data into the index. Here’s the ImportPosts method:

public static void ImportPosts()
{
    IFulltextStore fulltextStore = new FulltextStore().Initialize();

    using (IFulltextSession session = fulltextStore.StartSession())
    {
        session.Advanced.TruncateIndex<Post>();

        int count = 0;

        foreach (var post in GetPosts())
        {
            session.Save(post);

            if (count++ % fulltextStore.Settings.SaveBatchSize == 0)
                session.FlushChanges();
        }

	session.FlushChanges();
    }
}

The FulltextStore class is the entry point to the fluent API. It is used to configure the connection string, mapping conventions and other settings. It also serves as a factory for creating instances of the IFulltextSession interface which provides the methods we need to execute full-text queries and to save and delete documents from RT-indexes. In the above code we’re creating an instance of the IFulltextSession interface by calling StartSession and then proceed to insert the documents. The GetPosts method is responsible for reading the documents from the XML file.

Conclusion


In the second installment of this series we created our document model and a Sphinx real-time index setup for these documents. We then created a small console application that performs the import of the documents from the source XML file. In the next part, we’ll create a website that’ll allow us to search the index we just created.

Tags: , ,

Tutorial

Optimized Attribute Filtering with SphinxConnector.NET’s Fluent API

by Dennis 1. February 2013 12:11

An interesting article over at the MySQL Performance Blog was recently published about optimizing Sphinx queries that only filter by an attribute (i.e. do not contain a full-text query). I recommend reading the article first and then coming back here, but here’s a quick summary: a Sphinx query that only filters by an attribute may be relatively slow compared to an equivalent query in a regular DBMS. The reason for this is the fact that one cannot create indexes (as in B-tree indexes) for attributes in Sphinx as one would do in a DBMS. So to retrieve the results of such a query, Sphinx has to perform a full-scan of the index which is relatively costly depending on the size of the index.

The article describes a neat trick to get around this limitation: by adding a full-text indexed field for an attribute and querying that, one can achieve a greatly improved query time. In this post I’d like to demonstrate how this technique can be used with SphinxConnector.NET’s fluent API in conjunction with a real-time index.

The index in the articles example contains data about books, so I’ll be using that here as well. These documents have an integer attribute for a user id that we’d like to store as a full-text indexed field. Let’s take a look at what the document model should look like and which additional settings need to be applied.

To add the user id attribute to the full-text index it needs to be converted to a string. We’ll also add a prefix to each value to avoid it being included in the results of a “regular” full-text query. To do this, we add a string property to the document model that returns the converted and prefixed value:

public class CatalogItem
{
    public int Id { get; set; }

    public int UserId { get; set; }

    public string Title { get; set; }

    public string UserIdKey
    {
        get { return "userkey_" + UserId; }
    }
}

As of Version 3.2, SphinxConnector.NET will automatically exclude any read-only property when selecting the results of a query, so no further setup is required here (it will of course still be inserted into the index during a save).

In previous versions of SphinxConnector.NET the UserIdKey property would have to be configured as follows:

fulltextStore.Conventions.IsFulltextFieldOnly = memberInfo => memberInfo.Name == "UserIdKey";

A query that uses the new attribute would then look this:

IList<CatalogItem> results = session.Query<CatalogItem>().
                                     Match("@UserIdKey userkey_42").
                                     ToList();

For the sake of completeness, here’s the corresponding Sphinx configuration:

index catalog
{
    type = rt
    path = catalog
rt_field = title rt_field = useridkey rt_attr_string = title rt_attr_uint = userid }

Tags: , ,

How-to

Using SphinxConnector.NET with ASP.NET MVC

by Dennis 21. December 2012 11:41

As using Sphinx from a web application is probably the most common use case,  I thought I’d post some guidelines and examples on how to use the fluent API in an ASP.NET MVC application with regards to setup and proper handling of IFulltextStore and IFulltextSession. The documentation already mentions that there should (usually) be one instance of the FulltextStore per application and one IFulltextSession per thread/(web-) request. Let’s take a look at a few different approaches to this:

Using Lazy

 

This approach makes use of the Lazy<T> class that was introduced with .NET 4.0. We create a base controller that holds the IFulltextStore instance which will be initialized upon the first access. Lazy<T> will make sure that the FulltextStore is created only once in a thread-safe way.

public abstract class SearchController : Controller
{
    private static readonly Lazy<IFulltextStore> Store = new Lazy<IFulltextStore>(() =>
    {
        IFulltextStore fulltextStore = new FulltextStore().Initialize();
        fulltextStore.ConnectionString.IsThis("pooling=true");

        return fulltextStore;
    });

    protected static IFulltextStore FulltextStore
    {
        get { return Store.Value; }
    }

    protected IFulltextSession FulltextSession { get; private set; }

    protected override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        FulltextSession = FulltextStore.StartSession();
    }

    protected override void OnActionExecuted(ActionExecutedContext filterContext)
    {
        if (filterContext.IsChildAction || FulltextSession == null)
            return;

        using (FulltextSession)
        {
            if (filterContext.Exception != null)
                return;

            FulltextSession.FlushChanges();
        }
    }
}

The IFulltextSession for every request is created in an override of OnActionExecuting by assigning the result of StartSession to the FulltextSession property. This way, every controller that inherits from SearchController automatically gets an open session that is ready for use. In the override of OnActionExecuted we tell the FulltextSession to flush all pending changes. The using statement ensures that it is properly disposed of.

Using an IoC-Container

 

Following is an example installer for Castle Windsor:

public class SphinxConnectorInstaller : IWindsorInstaller
{
    public void Install(IWindsorContainer container, IConfigurationStore store)
    {
        container.Register(Component.For<IFulltextStore>().
                                     Instance(new FulltextStore().Initialize()).
                                     LifestyleSingleton(),
                           Component.For<IFulltextSession>().
                                     UsingFactoryMethod(kernel =>
                                         kernel.Resolve<IFulltextStore>().StartSession()).
                                     LifestylePerWebRequest());
    }
}

In this example, we setup Castle Windsor so that it can create both IFulltextStore and IFulltextSession. If you wanted to create IFulltextSession yourself (by injecting IFulltextStore into your classes and calling StartSession), you could remove the corresponding code from the installer.

We instruct Windsor to use the Singleton lifestyle for IFulltextStore, which means that Windsor will create one instance per container. In fact, Windsor uses Singleton is the default lifestyle, but in cases like this I’d like to make that explicit, so that developers that are not familiar with Windsor immediately see what’s going on. For IFulltextSession we set LifestylePerWebRequest so that Windsor will create an instance for each request; it will also automatically call Dispose at the end of each request, so we don’t have to worry about that. If you wanted Windsor to also call FlushChanges, you could do so with the help of Windsor’s OnDestroy method.

Initialization at Application Startup

 

Like with the first approach, we create a base controller, this time with a static property hat holds the IFulltextStore instance. The instance is initialized in the Global.asax.cs file in Application_Start:

public abstract class SearchController : Controller 
{
    public static IFulltextStore FulltextStore { get; set; }
protected IFulltextSession FulltextSession { get; private set; }
//Overrides of OnActionExecuting and OnActionExecuted omitted }
protected void Application_Start()
{
    AreaRegistration.RegisterAllAreas();

    RegisterGlobalFilters(GlobalFilters.Filters);
    RegisterRoutes(RouteTable.Routes);

    InitFulltextStore();
}

private static void InitFulltextStore()
{
    IFulltextStore fulltextStore = new FulltextStore().Initialize();
    fulltextStore.ConnectionString.IsThis("pooling=true");

    SearchController.FulltextStore = fulltextStore;
}

Tags: , ,

How-to

Importing Data into Sphinx RT-Indexes with SphinxConnector.NET’s Fluent API

by Dennis 19. November 2012 12:04

If you are facing the task of importing data into a Sphinx RT-index, SphinxConnector.NET’s fluent API makes this really easy with just a couple lines of code (the document model class is omitted for brevity):

void Import()
{
    IFulltextStore fulltextStore = new FulltextStore().Initialize();
    fulltextStore.ConnectionString.IsThis("pooling=true");

    int count = 0;

    using (IFulltextSession session = fulltextStore.StartSession())
    {
        foreach (var document in GetDocuments())
        {
            session.Save(document);

            if (++count % fulltextStore.Settings.SaveBatchSize == 0)
                session.FlushChanges();
        }

        session.FlushChanges();   
    }
}   

The important part is the call to FlushChanges each time a batch of documents has been passed to Save. This avoids high memory usage when importing many documents, because SphinxConnector.NET has to keep each document in memory until FlushChanges is called (though for smaller datasets it might be acceptable to flush all changes at the end of the import process).

Not only is this much simpler than writing SphinxQL by hand, it’s also faster because of SphinxConnector.NET’s automatic batching. The default value for SaveBatchSize is 16, which provides good performance, but can of course be adjusted for environments where a higher batch size leads to even more performance.

Tags: , ,

Tutorial | How-to