SphinxConnector.NET 3.1 has been released

by Dennis 25. October 2012 13:51

We're pleased to announce the immediate availability of SphinxConnector.NET 3.1!

This version introduces support for future queries with the fluent query API. This feature allows to defer the execution of queries until their results are needed, enabling SphinxConnector.NET to send them in batches. This functionality is available with .NET 4.0 or higher and Mono.

Additional improvements to the fluent query API:

  • IFulltextQuery.Where can now be called multiple times for a query.
  • The generic type of collections for multi-value attributes e.g. IList<T> etc. can now be of any type that is large enough to hold the result values. Previously only 'long' was supported.
  • A workaround to the query generator has been added, to account for the fact that Sphinx currently doesn't support using 'weight()' in an expression, e.g. 'SELECT weight()*2 AS c1 FROM example'. The query generator will now emit a separate, aliased weight() column and use the alias in the expression, e.g. 'SELECT weight() AS c2, c2*2 AS c1 FROM example'.

Last but not least, a bug that caused the RecordsAffected property of the SphinxQLDataReader class not being set correctly when the reader was used to read results from multiple SELECT or SHOW statements, has been fixed.

Tags:

Announcements

Doing Full-Text Searches in .NET with Sphinx - Part 1: Introduction

by Dennis 25. October 2012 11:16

This is the first post in a series (part 2, part 3) that is intended as an introduction to Sphinx for .NET developers who have not yet heard of Sphinx and are looking for a powerful full-text search engine for their websites or applications. The first part of this series serves as an introduction to Sphinx, and over the next parts we’ll build an ASP.NET MVC site that uses Sphinx.

What is Sphinx?

 

Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It's written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems.

Source: http://sphinxsearch.com/about/sphinx/

As we can see, Sphinx runs a stand-alone server (as opposed to embedded solutions) on pretty much any important operating system. Its key benefits are high search and indexing performance, scalability and providing high quality search results. It is used by a variety of sites from different domains like news, ecommerce and social media e.g. Slashdot, Craigslist and tumblr. Before you can use Sphinx for full-text searches you need to import your data into Sphinx.

How to get your data into Sphinx

 

First, let’s talk about some terminology: Sphinx stores our data in an index and each entry in an index is called a document. An index is comparable to a table in a database and a document is analogous to a row in database table. A document can contain text and attributes and must have a unique id. The text is the data that we want to search through while attributes contain additional data that is stored along with our text. Attributes can be used to hold data like dates, numbers, booleans and strings. The latter is especially interesting and useful as we’ll see in the next part.

So, how do we get our documents into Sphinx? There are several answers to that and it can sometimes be confusing for the beginner but once you get behind it, it’s not that difficult.

Index Types

As we already mentioned, Sphinx stores documents in an index. Let’s first take a look at the index types that Sphinx provides. Sphinx comes with two different index types: disk-based indexes (also referred to as batch indexes) and real-time (RT) indexes. The name disk-based index can be a bit confusing, because RT indexes are of course also stored on disk. RT indexes consist of so called disk-chunks which in turn are just regular disk-based indexes. The difference between these two index types is the way they are filled with data.

Disk-Based or Batch Indexes

With disk-based indexes, there are two ways to add documents: from a database or from a XML pipe. A disk-based index always needs to have a data source configured. As stated before, this source can be either a database or a stream of XML data. Sphinx can talk to many databases directly; it supports MSSQL, MySQL and PostgreSQL natively, other databases can be accessed via ODBC. If a database isn’t supported or the data is not stored in one, it can be exported to XML in a schema that Sphinx understands, and be processed from there.

The indexing for disk-based indexes is not done by the Sphinx search server itself, but by a separate program aptly called indexer. It fetches the data from a data source and creates the index files which are then served to clients by a program called searchd (the actual search server). Once a disk-based index is created, no new documents can be added to it, though attributes of existing documents can be updated. So how does one add new documents to an index? The easiest way is to just run the indexer again and re-create the index from the ground up. Schedule an indexer run every 20 minutes or so and new documents are available in the index with a maximum delay of 20 minutes plus the time it takes to create the index. While this seems crude, it can be a valid approach for smaller datasets because of its simplicity.

The approach commonly used for larger datasets, is a so called main-delta scheme, where one has a big index that is created once (main) and a smaller one that holds new documents (delta). Since the delta index is small it can re-build regularly. As the delta index grows it can be merged with the main index to keep indexing time short. Next, let’s take a look at how RT-indexes work:

Real-Time Indexes

The most notable difference between disk-based indexes and RT-indexes is the fact that the latter are not created by the indexer program. Then how do my documents get in there, you ask. Simple, we just INSERT them. With RT-indexes we can insert, update and delete documents in real-time (hence the name), all changes to an RT-index are made nearly instantaneously. These operations are done via Sphinx’ own query language named SphinxQL which supports INSERT, UPDATE, and DELETE statements similar to SQL. To reiterate (because sometimes new Sphinx users are confused by this): unlike disk-based indexes, real-time indexes are independent of the data source! It does not matter where your data comes from, you retrieve it from any source and then insert it into a real-time index via an INSERT INTO… statement, just as you would insert data into a database.

So which index type should I use?

Unfortunately, that is not an easy question to answer. It really depends on each project’s specific requirements but here are some general guidelines:

  • RT-indexes are missing some features (infix indexing for example) that disk-based indexes have. So if you can’t live without that, disk-based indexes are the way to go (Update: Infixes are supported as of Sphinx 2.1.1)
  • If you need instantaneous index updates, RT-indexes are the way to go (not entirely true, see below)
  • The indexing speed of disk-based indexes is often higher than that of RT-indexes

It is also possible to use both index types e.g. disk-based indexes for existing data and RT-indexes for new documents. Also, if you start out with disk-based indexes and decide to switch to RT indexes later, Sphinx provides an easy way to convert your existing indexes via the ATTACH INDEX statement.

Conclusion

 

In the first part of this series we took a high-level look at Sphinx - what index types it provides, how indexing works and different strategies for indexing data. In the next part we’ll get our hands dirty and do some actual coding!

Tags: , ,

Tutorial

SphinxConnector.NET 3.0.9 and 2.8.4 released

by Dennis 5. October 2012 11:21

A new version of SphinxConnector.NET 3.0 has been released. This version fixes a race condition that could occur when closing a pooled connection. Additionally a new version of the 2.x branch is available. This is a bugfix release that contains fixes that were backported from the 3.x branch.

Tags:

Announcements