Doing Full-Text Searches in .NET with Sphinx - Part 2: Data Import

by Dennis 18. April 2013 11:52

This is the second post in a series intended as an introduction to Sphinx for .NET developers who have not yet heard of Sphinx and are looking for a powerful full-text search engine for their websites or applications. The first part served as an introduction to Sphinx; in this part we’ll get our hands dirty and start working on our ASP.NET MVC based sample application.

We’ll be creating a simple website that will allow us to search through Stackoverflow data (or rather the posts made by Stackoverflow users). The Stackoverflow team kindly provides this data under a Creative Commons license in an XML format. The dataset is pretty large (the posts alone are 7 GB uncompressed), so we’ll only use about 10,000 posts for demonstration purposes. A file with only these posts is included in the download so you won’t have to download the whole archive.

Having an existing set of data that needs to be imported into Sphinx is probably the most common scenario when people start to work with Sphinx. We will begin by creating the document model class and the Sphinx configuration file. We will be storing the documents in a real-time index, so we also need to create a small program to read the input data from the XML file.

The current state of this sample application is available for download. It uses .NET 4.0, ASP.NET MVC3, Sphinx 2.1.1 and SphinxConnector.NET, and comes as a Visual Studio 2010 solution. The solution contains the console application for importing data and the web application, which will be described in detail in the next part. Please note that the layout of the website is simple and might have some display quirks, as I don’t want to spend too much time on making it look pretty.

Creating the Document Model

The first step is to determine which part of the data should be stored in the index. One approach is to store only the bare minimum that is needed for full-text searches. Often, this requires that after the search the original data is retrieved from the data source, most likely a database, to display a meaningful search result to the user. As this requires one or more additional network round trips, I personally like to store as much data in the index as is needed (and feasible) to avoid this overhead.

In this example we’ll be storing everything in the index, because it reduces the complexity of the example. For instance, we use Sphinx to display the questions on the front page, which does not involve searching at all. However, storing your complete dataset in Sphinx is usually not a good idea because it can drastically increase the resources needed by Sphinx. Also, Sphinx isn’t a database and shouldn’t be treated as such; for instance, it doesn’t execute certain types of queries as efficiently as a database would (though there are some tricks for that). We are doing this here for the sake of simplicity, and it might even be feasible in some cases, but it should be done with care!

Let’s take a look at our input data. The Stackoverflow data archive contains a readme file with a description of the input data. Here is the relevant part for the posts.xml file which we are interested in:

posts.xml
       - Id
       - PostTypeId
          - 1: Question
          - 2: Answer
       - ParentID (only present if PostTypeId is 2)
       - AcceptedAnswerId (only present if PostTypeId is 1)
       - CreationDate
       - Score
       - ViewCount
       - Body
       - OwnerUserId
       - LastEditorUserId
       - LastEditorDisplayName="Jeff Atwood"
       - LastEditDate="2009-03-05T22:28:34.823"
       - LastActivityDate="2009-03-11T12:51:01.480"
       - CommunityOwnedDate="2009-03-11T12:51:01.480"
       - ClosedDate="2009-03-11T12:51:01.480"
       - Title=
       - Tags=
       - AnswerCount
       - CommentCount
       - FavoriteCount

We’ll leave out some attributes that are not needed in our example which leads to the following model class:

public class Post
{
    public int Id { get; set; }

    public PostType PostType { get; set; }
    public long ParentId { get; set; }
    public long AcceptedAnswerId { get; set; }
    public string Body { get; set; }
    public string Title { get; set; }
    public string Tags { get; set; }
    public long Score { get; set; }
    public long ViewCount { get; set; }
    public long AnswerCount { get; set; }
    public DateTime CreationDate { get; set; }
    
    public int Weight { get; set; }
}

You might have already noticed that we’ve added a property named ‘Weight’ to our document class which is not present in the data source. When Sphinx searches the index for results it assigns each match a weight depending on its relevancy to the search query. The higher the weight, the more relevant it is for the query, and because we might want to use the weight in our own relevancy calculations, we need to add a property for it to our model.
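The model class refers to a PostType type that is not shown above. A minimal sketch of what it could look like, assuming it simply mirrors the two PostTypeId values from the readme (the PostTypeMapper helper and its strict error handling are assumptions made for illustration, not part of the sample download):

```csharp
using System;

// Mirrors the PostTypeId values documented in the readme excerpt.
public enum PostType
{
    Question = 1,
    Answer = 2
}

// Hypothetical helper: converts the raw PostTypeId from the XML into the enum.
public static class PostTypeMapper
{
    public static PostType FromPostTypeId(int postTypeId)
    {
        // Reject ids that the readme excerpt does not describe.
        if (!Enum.IsDefined(typeof(PostType), postTypeId))
            throw new ArgumentOutOfRangeException("postTypeId", postTypeId, "Unknown PostTypeId.");

        return (PostType)postTypeId;
    }
}
```

Because the enum values match the ids in the input data, the value can be stored directly in the posttype attribute of the index.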

Configuring Sphinx

Next we need to create the configuration for the Sphinx server. By default Sphinx will look for a file named sphinx.conf in the directory where the executable resides so we’ll name it just that. The configuration file consists of several sections: one for the server itself, one for the indexer program which is not relevant for our sample, and the index configuration. For our documents the index configuration looks like this:

index posts
{
    type                    = rt
    path                    = posts
    
    rt_field                = body 
    rt_attr_string          = body 
    rt_field                = title     
    rt_attr_string          = title
    rt_field                = tags          
    rt_attr_string          = tags 

    rt_attr_uint            = parentid
    rt_attr_uint            = score 
    rt_attr_uint            = viewcount
    rt_attr_uint            = posttype
    rt_attr_uint            = answercount
    rt_attr_timestamp       = creationdate    
    rt_attr_uint            = acceptedanswerid
    rt_attr_uint            = votecount
    
    charset_type            = utf-8
    min_word_len            = 1    
    dict                    = keywords                    
    expand_keywords         = 1                            
    rt_mem_limit            = 1024M
    
    charset_table           = 0..9, a..z, A..Z->a..z, U+DF, \
                              U+FC->u, U+DC->u, U+FC,U+DC, \
                              U+F6->o, U+D6->o, U+F6,U+D6, \
                              U+E4->a, U+C4->a, U+C4,U+E4, \
                              U+E1->a, U+C1->a, U+E9->e, U+C9->e, \
                              U+410..U+42F->U+430..U+44F, U+430..U+44F, U+00E6, \
                              U+00C6->U+00E6, U+01E2->U+00E6, U+01E3->U+00E6, \
                              U+01FC->U+00E6, U+01FD->U+00E6, U+1D01->U+00E6, \
                              U+1D02->U+00E6, U+1D2D->U+00E6, U+1D46->U+00E6
    
    blend_chars             = U+23, U+2B, -, ., @, &

    stopwords               = stopwords.txt
}

Let’s look through the different configuration options. First we specify the index type and the path where the index files should be stored. Next we declare the fields and attributes; it is not necessary to explicitly declare the id attribute, because Sphinx does that automatically for us.

The next thing to note is the declaration of the full-text fields, which are the fields that Sphinx searches when it processes a full-text query. In order to be able to retrieve the contents of these fields, so that we can display them to the user, we also declare a string attribute for each of them. Why is it necessary to declare both? When a document is inserted, its text is split into keywords, and the full-text index stores only these keywords (hashed, depending on the dictionary type), not the original text. This implies that there is no way to retrieve the original data from a full-text field. This is where string attributes come into play. They enable us to store and retrieve the original content, which we then use to display the search result to the user. This allows us to avoid accessing the database altogether.

Next we specify the charset to use for our documents, which is UTF-8 in our case. We use a minimum word length of one and set the dictionary type to ‘keywords’. We also enable Sphinx’ internal keyword expansion to allow for partial word matches and set the memory limit for RT index RAM chunks to 1024 MB.

The charset table is up next. It defines which characters are recognized by Sphinx and also allows remapping one character to another. It is important to understand that if a character is not in the charset table, Sphinx will treat that character as a separator during indexing. To configure the charset table we can use the character itself, its Unicode number, or both. In our charset table we first add the digits from 0 to 9 and the lower case chars, and remap upper case chars to lower case chars. Additionally we define some more mappings, e.g. German umlauts are mapped to their non-umlaut equivalent like ü –> u. When defining your own charset table you can refer to the Sphinx wiki, which contains a page with charset tables.
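To make the effect of the charset table more tangible, here is a small sketch that imitates it in plain C#. It is not how Sphinx is implemented; the class and method names are made up for this example, and the table only covers digits, ASCII letters and ü/Ü. It shows the two rules described above: mapped characters are folded to their target, and every character that is not in the table acts as a separator.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharsetTableSketch
{
    // Builds a tiny table: 0..9, a..z, A..Z->a..z, U+FC->u, U+DC->u.
    public static Dictionary<char, char> BuildTable()
    {
        var table = new Dictionary<char, char>();
        for (char c = '0'; c <= '9'; c++) table[c] = c;
        for (char c = 'a'; c <= 'z'; c++) table[c] = c;
        for (char c = 'A'; c <= 'Z'; c++) table[c] = char.ToLowerInvariant(c);
        table['\u00FC'] = 'u'; // ü
        table['\u00DC'] = 'u'; // Ü
        return table;
    }

    // Splits text into keywords: characters in the table are remapped,
    // all other characters end the current keyword.
    public static List<string> Tokenize(string text, IDictionary<char, char> table)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        foreach (char c in text)
        {
            char mapped;
            if (table.TryGetValue(c, out mapped))
            {
                current.Append(mapped);
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }

        if (current.Length > 0)
            tokens.Add(current.ToString());

        return tokens;
    }
}
```

With this table, the underscore in "cool_42" is not a known character, so the text is split into the two keywords "cool" and "42".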

Lastly, we define the blend chars and configure a file with stop words. Blend chars are characters that are treated by Sphinx as both regular characters and separators. Suppose that you’re indexing email addresses and configure the ‘@’ symbol as a blend char, an email address such as ‘foo@bar.com’ will be indexed as a whole and as ‘foo’ and ‘bar.com’. This enables users to find an email address even if they don’t remember the domain or search for addresses that belong to a particular domain.
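The email example can be sketched in a few lines of C#. This is a simplified illustration of the idea only (the BlendCharsSketch class is hypothetical and Sphinx’ real tokenizer has more rules): a token that contains a blend char is kept as a whole and additionally split at the blend char.

```csharp
using System;
using System.Collections.Generic;

public static class BlendCharsSketch
{
    // Returns the keywords that would be indexed for a single token:
    // the token itself plus the parts obtained by treating blend chars as separators.
    public static List<string> Expand(string token, char[] blendChars)
    {
        var keywords = new List<string> { token };

        string[] parts = token.Split(blendChars, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            // Avoid adding the token twice when it contains no blend char.
            if (part != token)
                keywords.Add(part);
        }

        return keywords;
    }
}
```

Calling Expand("foo@bar.com", new[] { '@' }) yields the three keywords from the example above, while a token without a blend char is returned unchanged.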

The stop words file contains a list of words that should be ignored by Sphinx during indexing. While any word can be configured as a stop word, most of the time frequently occurring words such as ‘the’, ‘is’, etc. are configured as stop words because it reduces the size of the full-text index and thereby improves query times.
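Conceptually, stop word handling is just a filter that runs over the keywords before they reach the index. A minimal sketch in C#, with a hypothetical StopWordSketch class and the stop word list passed in as a set rather than read from stopwords.txt:

```csharp
using System;
using System.Collections.Generic;

public static class StopWordSketch
{
    // Returns only those keywords that are not configured as stop words.
    public static List<string> RemoveStopWords(IEnumerable<string> keywords, ISet<string> stopWords)
    {
        var result = new List<string>();

        foreach (string keyword in keywords)
        {
            if (!stopWords.Contains(keyword))
                result.Add(keyword);
        }

        return result;
    }
}
```

For the keywords "the", "sphinx", "is", "fast" and the stop words "the" and "is", only "sphinx" and "fast" would end up in the index.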

Importing the Documents

To import the Stackoverflow data we create a simple console application that reads the input data from the XML file and uses SphinxConnector.NET’s fluent API to save the data into the index. Here’s the ImportPosts method:

public static void ImportPosts()
{
    IFulltextStore fulltextStore = new FulltextStore().Initialize();

    using (IFulltextSession session = fulltextStore.StartSession())
    {
        session.Advanced.TruncateIndex<Post>();

        int count = 0;

        foreach (var post in GetPosts())
        {
            session.Save(post);

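            // Pre-increment so that we flush after every full batch,
            // not right after the very first document.
            if (++count % fulltextStore.Settings.SaveBatchSize == 0)
                session.FlushChanges();
        }

        // Flush the documents of the last, possibly incomplete batch.
        session.FlushChanges();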
    }
}

The FulltextStore class is the entry point to the fluent API. It is used to configure the connection string, mapping conventions and other settings. It also serves as a factory for creating instances of the IFulltextSession interface which provides the methods we need to execute full-text queries and to save and delete documents from RT-indexes. In the above code we’re creating an instance of the IFulltextSession interface by calling StartSession and then proceed to insert the documents. The GetPosts method is responsible for reading the documents from the XML file.
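The GetPosts method is not shown here. As a rough idea of how such a reader could look, here is a sketch that streams the file with XmlReader, so that even a very large input never has to be loaded into memory at once. It rests on several assumptions: that each post is a row element with the fields from the readme as XML attributes, that a trimmed-down PostRow class is enough for the illustration, and that the names PostRow and PostReader are free to choose; the code in the download may well differ.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Xml;

// Trimmed-down stand-in for the Post model, used only in this sketch.
public class PostRow
{
    public int Id { get; set; }
    public int PostTypeId { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }
    public DateTime CreationDate { get; set; }
}

public static class PostReader
{
    // Lazily yields one PostRow per <row .../> element (assumed input format).
    public static IEnumerable<PostRow> ReadPosts(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != "row")
                    continue;

                yield return new PostRow
                {
                    Id = int.Parse(reader.GetAttribute("Id"), CultureInfo.InvariantCulture),
                    PostTypeId = int.Parse(reader.GetAttribute("PostTypeId"), CultureInfo.InvariantCulture),
                    // Answers have no title, so fall back to an empty string.
                    Title = reader.GetAttribute("Title") ?? string.Empty,
                    Body = reader.GetAttribute("Body") ?? string.Empty,
                    CreationDate = DateTime.Parse(reader.GetAttribute("CreationDate"), CultureInfo.InvariantCulture)
                };
            }
        }
    }
}
```

A call such as PostReader.ReadPosts(File.OpenText("posts.xml")) could then feed the import loop one document at a time.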

Conclusion

In the second installment of this series we created our document model and a Sphinx real-time index setup for these documents. We then created a small console application that performs the import of the documents from the source XML file. In the next part, we’ll create a website that’ll allow us to search the index we just created.


Tags: c#, Sphinxconnector.NET, ASP.NET MVC

Tutorial

Importing Data into Sphinx RT-Indexes with SphinxConnector.NET’s Fluent API

by Dennis 19. November 2012 12:04

If you are facing the task of importing data into a Sphinx RT-index, SphinxConnector.NET’s fluent API makes this really easy with just a couple lines of code (the document model class is omitted for brevity):

void Import()
{
    IFulltextStore fulltextStore = new FulltextStore().Initialize();
    fulltextStore.ConnectionString.IsThis("pooling=true");

    int count = 0;

    using (IFulltextSession session = fulltextStore.StartSession())
    {
        foreach (var document in GetDocuments())
        {
            session.Save(document);

            if (++count % fulltextStore.Settings.SaveBatchSize == 0)
                session.FlushChanges();
        }

        session.FlushChanges();   
    }
}

The important part is the call to FlushChanges each time a batch of documents has been passed to Save. This avoids high memory usage when importing many documents, because SphinxConnector.NET has to keep each document in memory until FlushChanges is called (though for smaller datasets it might be acceptable to flush all changes at the end of the import process).
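The flush-per-batch pattern itself is independent of SphinxConnector.NET. The following sketch shows the same idea as a generic helper that groups any sequence into batches of a fixed size; BatchSketch is a name made up for this example and is not part of the library.

```csharp
using System;
using System.Collections.Generic;

public static class BatchSketch
{
    // Groups the source into lists of at most batchSize items.
    // Only one batch is held in memory at a time.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException("batchSize");

        var batch = new List<T>(batchSize);

        foreach (T item in source)
        {
            batch.Add(item);

            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }

        // Hand out the last, possibly incomplete batch.
        if (batch.Count > 0)
            yield return batch;
    }
}
```

Five documents with a batch size of two result in three batches of two, two and one documents, which corresponds to two flushes inside the loop and the final flush after it.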

Not only is this much simpler than writing SphinxQL by hand, it’s also faster because of SphinxConnector.NET’s automatic batching. The default value for SaveBatchSize is 16, which provides good performance, but it can of course be adjusted for environments where a larger batch size yields better performance.


Tags: Sphinxconnector.NET, sphinx, c#

Tutorial | How-to

Doing Full-Text Searches in .NET with Sphinx - Part 1: Introduction

by Dennis 25. October 2012 11:16

This is the first post in a series (part 2, part 3) that is intended as an introduction to Sphinx for .NET developers who have not yet heard of Sphinx and are looking for a powerful full-text search engine for their websites or applications. The first part of this series serves as an introduction to Sphinx, and over the next parts we’ll build an ASP.NET MVC site that uses Sphinx.

What is Sphinx?

Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It's written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems.
Source: http://sphinxsearch.com/about/sphinx/

As we can see, Sphinx runs as a stand-alone server (as opposed to embedded solutions) on pretty much any important operating system. Its key benefits are high search and indexing performance, scalability and high quality search results. It is used by a variety of sites from different domains like news, ecommerce and social media, e.g. Slashdot, Craigslist and tumblr. Before you can use Sphinx for full-text searches you need to import your data into Sphinx.

How to get your data into Sphinx

First, let’s talk about some terminology: Sphinx stores our data in an index and each entry in an index is called a document. An index is comparable to a table in a database and a document is analogous to a row in a database table. A document can contain text and attributes and must have a unique id. The text is the data that we want to search through, while attributes contain additional data that is stored along with our text. Attributes can be used to hold data like dates, numbers, booleans and strings. The latter is especially interesting and useful, as we’ll see in the next part.

So, how do we get our documents into Sphinx? There are several answers to that and it can sometimes be confusing for the beginner but once you get behind it, it’s not that difficult.

Index Types

As we already mentioned, Sphinx stores documents in an index. Let’s first take a look at the index types that Sphinx provides. Sphinx comes with two different index types: disk-based indexes (also referred to as batch indexes) and real-time (RT) indexes. The name disk-based index can be a bit confusing, because RT indexes are of course also stored on disk. RT indexes consist of so called disk-chunks which in turn are just regular disk-based indexes. The difference between these two index types is the way they are filled with data.

Disk-Based or Batch Indexes

With disk-based indexes, there are two ways to add documents: from a database or from an XML pipe. A disk-based index always needs to have a data source configured. As stated before, this source can be either a database or a stream of XML data. Sphinx can talk to many databases directly; it supports MSSQL, MySQL and PostgreSQL natively, and other databases can be accessed via ODBC. If a database isn’t supported or the data is not stored in one, it can be exported to XML in a schema that Sphinx understands, and be processed from there.

The indexing for disk-based indexes is not done by the Sphinx search server itself, but by a separate program aptly called indexer. It fetches the data from a data source and creates the index files which are then served to clients by a program called searchd (the actual search server). Once a disk-based index is created, no new documents can be added to it, though attributes of existing documents can be updated. So how does one add new documents to an index? The easiest way is to just run the indexer again and re-create the index from the ground up. Schedule an indexer run every 20 minutes or so and new documents are available in the index with a maximum delay of 20 minutes plus the time it takes to create the index. While this seems crude, it can be a valid approach for smaller datasets because of its simplicity.

The approach commonly used for larger datasets is a so-called main-delta scheme, where one has a big index that is created once (main) and a smaller one that holds new documents (delta). Since the delta index is small, it can be rebuilt regularly. As the delta index grows, it can be merged with the main index to keep indexing time short. Next, let’s take a look at how RT-indexes work:

Real-Time Indexes

The most notable difference between disk-based indexes and RT-indexes is the fact that the latter are not created by the indexer program. Then how do my documents get in there, you ask. Simple, we just INSERT them. With RT-indexes we can insert, update and delete documents in real-time (hence the name), and all changes to an RT-index are made nearly instantaneously. These operations are done via Sphinx’ own query language named SphinxQL, which supports INSERT, UPDATE, and DELETE statements similar to SQL. To reiterate (because sometimes new Sphinx users are confused by this): unlike disk-based indexes, real-time indexes are independent of the data source! It does not matter where your data comes from: you retrieve it from any source and then insert it into a real-time index via an INSERT INTO… statement, just as you would insert data into a database.
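To give an impression of what such a statement looks like, the following sketch builds the text of an INSERT statement in C#. The index name and the title and body columns are placeholders chosen for this example, and the Quote helper assumes that backslash escaping of single quotes is sufficient; in real code you would rather use parameterized commands or a library that takes care of escaping for you.

```csharp
using System;
using System.Globalization;

public static class SphinxQlSketch
{
    // Wraps a value in single quotes and escapes backslashes and single quotes.
    // Assumption: backslash escaping is all that is needed for this illustration.
    public static string Quote(string value)
    {
        return "'" + value.Replace("\\", "\\\\").Replace("'", "\\'") + "'";
    }

    // Builds an INSERT statement for a hypothetical index with a title and a body column.
    public static string BuildInsert(string index, long id, string title, string body)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "INSERT INTO {0} (id, title, body) VALUES ({1}, {2}, {3})",
            index, id, Quote(title), Quote(body));
    }
}
```

The resulting string is what would be sent to searchd; note that the document id is supplied by us, because every document must have a unique id.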

So which index type should I use?

Unfortunately, that is not an easy question to answer. It really depends on each project’s specific requirements but here are some general guidelines:

- RT-indexes are missing some features (infix indexing for example) that disk-based indexes have. So if you can’t live without that, disk-based indexes are the way to go. (Update: Infixes are supported as of Sphinx 2.1.1.)
- If you need instantaneous index updates, RT-indexes are the way to go (not entirely true, see below).
- The indexing speed of disk-based indexes is often higher than that of RT-indexes.

It is also possible to use both index types, e.g. disk-based indexes for existing data and RT-indexes for new documents. Also, if you start out with disk-based indexes and decide to switch to RT-indexes later, Sphinx provides an easy way to convert your existing indexes via the ATTACH INDEX statement.

Conclusion

In the first part of this series we took a high-level look at Sphinx - what index types it provides, how indexing works and different strategies for indexing data. In the next part we’ll get our hands dirty and do some actual coding!


Tags: sphinx, c#, ASP.NET MVC

Tutorial

SphinxConnector.NET 3.0.4 released

by Dennis 11. September 2012 10:13

This is just a small maintenance release that contains a few bugfixes. A list of resolved issues is available in the version history. NuGet users can update to the latest version via the package manager; a ZIP package can be downloaded from the download page.


Tags: Sphinxconnector.NET, .NET, sphinx, c#

Announcements

SphinxConnector.NET 3.0 has been released

by Dennis 3. September 2012 10:03

We are pleased to announce that the new major version of SphinxConnector.NET is now available for download!

Those of you who have been following the blog already know about the big new feature coming with this release: the fluent query API. The fluent API provides you with a LINQ-like query API to design your full-text queries. It operates directly on your document models and also lets you comfortably save and delete documents from real-time indexes.

A description with much more details is available on the features page.

Another highlight of this release is the newly added support for the Mono runtime. Additionally, we've upgraded Common.Logging to version 2, which provides support for recent releases of the supported logging frameworks. We've also added support for running SphinxConnector.NET in medium-trust environments. There are a bunch of other improvements which are listed in the version history.

SphinxConnector.NET is now available as a NuGet package, which we know many of you have been waiting for!

Licensing and Upgrading

With the new release we're switching to a subscription based licensing system. All new purchases and upgrades come with a 1 year upgrade subscription which gives you access to all major and minor releases made during the subscription period. At the end of the subscription period you can renew your license for just 40% of the then current price.

If you bought your license in 2012, you will receive SphinxConnector.NET 3.0 and all other releases made this year for free! Afterwards you can renew your licenses at the conditions outlined above.

If you bought your license before 2012, you can also renew your license for just 40% of the current price!

We are also introducing a new license type, the 'Large Team License' for up to eight developers, to make up for the fact that we had to raise the price for the site license quite a bit more than we wished. If you have purchased a Site License you can downgrade to a Large Team License if you're eligible.

You can now also purchase a premium support subscription along with your license or license renewal. All details can be found on our purchase page.

If you would like to send us feedback about the new version, you can use the contact form or send us an e-mail to contact@sphinxconnector.net.


Tags: Sphinxconnector.NET, .NET, sphinx, c#

Announcements
