Indexing Office and PDF Files With Sphinx and .NET

by Dennis 6. February 2013 13:19

Sphinx is a great full-text search engine with many amazing features, but there is one feature missing that would make it even better: the ability to directly index Word, Excel, PowerPoint, and PDF files. How to index these kinds of documents with Sphinx is a question that comes up regularly in the Sphinx forum. Today I’d like to show you an easy way to extract text from these document types and store it in a Sphinx real-time index from your .NET application.

There are a bunch of tools and libraries out there that claim to be able to extract text from various document formats. Because supporting many formats and extracting text reliably is a hard task, the quality of these libraries varies greatly. One tool that stands out is the Apache Tika™ toolkit. It is a Java library that

“detects and extracts metadata and structured text content from various documents using existing parser libraries.”

And it is really very good at it. Amongst others, it supports Microsoft Office, Open Document (ODF), and PDF files. But wait, I said Java library, didn’t I? “What am I supposed to do with a Java library in my .NET application?”, you might ask. Well, we’ll just convert it from Java to .NET using IKVM.NET. IKVM.NET is a .NET implementation of a Java Virtual Machine (JVM) which can be used as a drop-in replacement for Java. It comes with a Java-bytecode-to-CIL translator named ikvmc that we can use to build a .NET version of Apache Tika. In the next section, I’ll walk through the steps required to do this. At the end of this article you can download a complete sample application that uses Tika to extract text from some files and store it in a Sphinx real-time index via SphinxConnector.NET.

Creating a .NET Version of Apache Tika


To create your own .NET version of Apache Tika you need to:

  1. Download IKVM.NET
  2. Download the .jar file from the Tika project page (the current version at the time of writing is 1.3)
  3. Extract IKVM.NET to a folder of your choice
  4. Optional: Add the bin folder of IKVM.NET to your %PATH% variable
  5. Execute the following command (add/change the paths to ikvmc.exe and tika-app-1.3.jar if needed):
ikvmc.exe -target:library -version:1.3.0.0 -out:Apache.Tika.dll tika-app-1.3.jar

Let’s walk through the command line parameters: with -target:library we tell ikvmc to convert the jar to a class library. This is needed because the jar file is also usable as a standalone console/GUI application, i.e. it contains a main() method, which by default would cause ikvmc to generate an exe file. Next, we specify the version for our output DLL, because otherwise ikvmc would set it to 0.0.0.0. Finally, we specify the output file name via -out: and the path to the Tika jar file.

After hitting Enter, ikvmc starts to translate the Java library to .NET. It’ll output a lot of warning messages, but will eventually finish and produce a working DLL. Note that if you want to sign the output assembly you can do so by specifying a key file via the -keyfile: command line option.
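For example, a signed build could be produced with a command along these lines (the key file name is just a placeholder):

ikvmc.exe -target:library -version:1.3.0.0 -keyfile:MyKeyFile.snk -out:Apache.Tika.dll tika-app-1.3.jar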

Extracting Text from Documents


Now that we’ve created a .NET library of Tika, we can start extracting text from documents. I’ve created a small wrapper that provides methods to perform the extraction. To build the wrapper DLL we need to add references to a couple of IKVM.NET libraries:

(Screenshot: the IKVM.NET assemblies referenced by the wrapper project)

Note that an application that uses Tika needs to reference more of IKVM.NET’s DLLs; these are just the files required to compile the wrapper project.

The AutoTextExtractor class, which handles the extraction of text from files and binary data (useful if your documents are stored in a DB), and the TextExtractionResult class are based on the ones presented by Kevin Miller:

// Using directives added for completeness; IKVM.NET preserves the Java package
// names of the converted Tika assembly as .NET namespaces.
using System.Collections.Generic;
using System.Linq;
using java.io;
using javax.xml.transform;
using javax.xml.transform.sax;
using javax.xml.transform.stream;
using org.apache.tika.io;
using org.apache.tika.metadata;
using org.apache.tika.parser;

public class AutoTextExtractor
{
  public TextExtractionResult Extract(string filePath, OutputType outputType = OutputType.Text)
  {
      return Extract(System.IO.File.ReadAllBytes(filePath), outputType);
  }

  public TextExtractionResult Extract(byte[] data, OutputType outputType = OutputType.Text)
  {
      var parser = new AutoDetectParser();
      var metadata = new Metadata();

      using (Writer outputWriter = new StringWriter())
      using (InputStream inputStream = TikaInputStream.get(data, metadata))
      {
          parser.parse(inputStream, GetTransformerHandler(outputWriter, outputType), 
              metadata, new ParseContext());
          
          return AssembleExtractionResult(outputWriter.toString(), metadata);
      }
  }

  private static TextExtractionResult AssembleExtractionResult(string text, Metadata metadata)
  {
      Dictionary<string, string> metaDataResult = metadata.names().
          ToDictionary(name => name, name => String.Join(", ", metadata.getValues(name)));

      string contentType = metaDataResult["Content-Type"];

      return new TextExtractionResult
      {
          Text = text,
          ContentType = contentType,
          Metadata = metaDataResult
      };
  }

  private TransformerHandler GetTransformerHandler(Writer outputWriter, OutputType outputType)
  {
      var factory = (SAXTransformerFactory)TransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();
      // Note: XSLT output method names are lowercase ("text", "html", "xml"), hence ToLowerInvariant()
      handler.getTransformer().setOutputProperty(OutputKeys.METHOD, outputType.ToString().ToLowerInvariant());
      handler.setResult(new StreamResult(outputWriter));

      return handler;
  }
}

Here’s the TextExtractionResult class:

public class TextExtractionResult
{
    public string Text { get; set; }
    
    public string ContentType { get; set; }
    
    public IDictionary<string, string> Metadata { get; set; }
}

And the OutputType enumeration:

public enum OutputType
{
    Text,
    Html,
    Xml
}
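With these classes in place, extracting text is just a method call. Here’s a minimal usage sketch (the file path is a placeholder):

var extractor = new AutoTextExtractor();

TextExtractionResult result = extractor.Extract(@"C:\temp\example.pdf");

Console.WriteLine(result.ContentType); // e.g. application/pdf
Console.WriteLine(result.Text);        // the extracted plain text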

Demo Application


I’ve created a small demo application that contains a DLL that wraps Tika with the help of the aforementioned classes, and a console application that demonstrates how to extract and store the contents of some files in a Sphinx real-time index with SphinxConnector.NET. The code that does the extraction is pretty simple:

private static SphinxDocumentModel[] GetDocuments()
{
    AutoTextExtractor textExtractor = new AutoTextExtractor();

    int id = 1;

    return (from filePath in Directory.EnumerateFiles(@"..\..\..\..\testfiles")
            select new SphinxDocumentModel
            {
                Id = id++,
                FileContent = textExtractor.Extract(filePath).Text,
                FilePath = Path.GetFullPath(filePath)
            }).ToArray();
}

I’ll omit the code that saves the documents to the index, as it is straightforward. Tip: if you are working with big files you might have to increase Sphinx’s max_packet_size setting.
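For reference, a minimal sketch of what that save code could look like with SphinxConnector.NET’s fluent API (the connection string and session handling are assumptions based on the other examples on this page):

IFulltextStore fulltextStore = new FulltextStore().Initialize();
fulltextStore.ConnectionString.IsThis("pooling=true");

using (IFulltextSession session = fulltextStore.StartSession())
{
    foreach (SphinxDocumentModel document in GetDocuments())
        session.Save(document);

    session.FlushChanges();
}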

The archive contains all required libraries, so it’s pretty big (IKVM.NET and Tika alone take up about 50 MB of space).

Downloads:

[1] Zip-Package (22 MB)

[2] 7z-Package (18 MB)


Optimized Attribute Filtering with SphinxConnector.NET’s Fluent API

by Dennis 1. February 2013 12:11

An interesting article about optimizing Sphinx queries that only filter by an attribute (i.e. that do not contain a full-text query) was recently published over at the MySQL Performance Blog. I recommend reading the article first and then coming back here, but here’s a quick summary: a Sphinx query that only filters by an attribute may be relatively slow compared to an equivalent query in a regular DBMS. The reason is that in Sphinx one cannot create indexes (as in B-tree indexes) for attributes as one would in a DBMS. So to retrieve the results of such a query, Sphinx has to perform a full scan of the index, which can be costly depending on the size of the index.

The article describes a neat trick to get around this limitation: by adding a full-text indexed field for an attribute and querying that, one can achieve a greatly improved query time. In this post I’d like to demonstrate how this technique can be used with SphinxConnector.NET’s fluent API in conjunction with a real-time index.

The index in the article’s example contains data about books, so I’ll be using that here as well. These documents have an integer attribute for a user id that we’d like to store as a full-text indexed field. Let’s take a look at what the document model should look like and which additional settings need to be applied.

To add the user id attribute to the full-text index it needs to be converted to a string. We’ll also add a prefix to each value to avoid it being included in the results of a “regular” full-text query. To do this, we add a string property to the document model that returns the converted and prefixed value:

public class CatalogItem
{
    public int Id { get; set; }

    public int UserId { get; set; }

    public string Title { get; set; }

    public string UserIdKey
    {
        get { return "userkey_" + UserId; }
    }
}

As of Version 3.2, SphinxConnector.NET will automatically exclude any read-only property when selecting the results of a query, so no further setup is required here (it will of course still be inserted into the index during a save).

In previous versions of SphinxConnector.NET the UserIdKey property would have to be configured as follows:

fulltextStore.Conventions.IsFulltextFieldOnly = memberInfo => memberInfo.Name == "UserIdKey";

A query that uses the new attribute would then look like this:

IList<CatalogItem> results = session.Query<CatalogItem>().
                                     Match("@UserIdKey userkey_42").
                                     ToList();
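For reference, the SphinxQL that such a query roughly translates to would be along these lines (a sketch, not necessarily the exact statement SphinxConnector.NET generates):

SELECT * FROM catalog WHERE MATCH('@useridkey userkey_42')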

For the sake of completeness, here’s the corresponding Sphinx configuration:

index catalog
{
    type = rt
    path = catalog
    rt_field = title
    rt_field = useridkey

    rt_attr_string = title
    rt_attr_uint = userid
}


Using SphinxConnector.NET with ASP.NET MVC

by Dennis 21. December 2012 11:41

As using Sphinx from a web application is probably the most common use case, I thought I’d post some guidelines and examples on how to use the fluent API in an ASP.NET MVC application, with regard to the setup and proper handling of IFulltextStore and IFulltextSession. The documentation already mentions that there should (usually) be one instance of the FulltextStore per application and one IFulltextSession per thread/(web-)request. Let’s take a look at a few different approaches to this:

Using Lazy

 

This approach makes use of the Lazy<T> class that was introduced with .NET 4.0. We create a base controller that holds the IFulltextStore instance which will be initialized upon the first access. Lazy<T> will make sure that the FulltextStore is created only once in a thread-safe way.

public abstract class SearchController : Controller
{
    private static readonly Lazy<IFulltextStore> Store = new Lazy<IFulltextStore>(() =>
    {
        IFulltextStore fulltextStore = new FulltextStore().Initialize();
        fulltextStore.ConnectionString.IsThis("pooling=true");

        return fulltextStore;
    });

    protected static IFulltextStore FulltextStore
    {
        get { return Store.Value; }
    }

    protected IFulltextSession FulltextSession { get; private set; }

    protected override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        FulltextSession = FulltextStore.StartSession();
    }

    protected override void OnActionExecuted(ActionExecutedContext filterContext)
    {
        if (filterContext.IsChildAction || FulltextSession == null)
            return;

        using (FulltextSession)
        {
            if (filterContext.Exception != null)
                return;

            FulltextSession.FlushChanges();
        }
    }
}

The IFulltextSession for every request is created in an override of OnActionExecuting by assigning the result of StartSession to the FulltextSession property. This way, every controller that inherits from SearchController automatically gets an open session that is ready for use. In the override of OnActionExecuted we tell the FulltextSession to flush all pending changes. The using statement ensures that it is properly disposed of.

Using an IoC-Container

 

Following is an example installer for Castle Windsor:

public class SphinxConnectorInstaller : IWindsorInstaller
{
    public void Install(IWindsorContainer container, IConfigurationStore store)
    {
        container.Register(Component.For<IFulltextStore>().
                                     Instance(new FulltextStore().Initialize()).
                                     LifestyleSingleton(),
                           Component.For<IFulltextSession>().
                                     UsingFactoryMethod(kernel =>
                                         kernel.Resolve<IFulltextStore>().StartSession()).
                                     LifestylePerWebRequest());
    }
}

In this example, we set up Castle Windsor so that it can create both IFulltextStore and IFulltextSession. If you wanted to create the IFulltextSession yourself (by injecting IFulltextStore into your classes and calling StartSession), you could remove the corresponding code from the installer.

We instruct Windsor to use the singleton lifestyle for IFulltextStore, which means that Windsor will create one instance per container. Singleton is in fact Windsor’s default lifestyle, but in cases like this I like to make it explicit, so that developers who are not familiar with Windsor immediately see what’s going on. For IFulltextSession we use LifestylePerWebRequest so that Windsor creates an instance for each request; it will also automatically call Dispose at the end of each request, so we don’t have to worry about that. If you wanted Windsor to also call FlushChanges, you could do so with the help of Windsor’s OnDestroy method.
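A sketch of what that could look like in the installer above (assuming pending changes should always be flushed when the session is released):

Component.For<IFulltextSession>().
          UsingFactoryMethod(kernel => kernel.Resolve<IFulltextStore>().StartSession()).
          LifestylePerWebRequest().
          OnDestroy(session => session.FlushChanges());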

Initialization at Application Startup

 

Like with the first approach, we create a base controller, this time with a static property that holds the IFulltextStore instance. The instance is initialized in the Global.asax.cs file in Application_Start:

public abstract class SearchController : Controller 
{
    public static IFulltextStore FulltextStore { get; set; }

    protected IFulltextSession FulltextSession { get; private set; }

    // Overrides of OnActionExecuting and OnActionExecuted omitted
}

protected void Application_Start()
{
    AreaRegistration.RegisterAllAreas();

    RegisterGlobalFilters(GlobalFilters.Filters);
    RegisterRoutes(RouteTable.Routes);

    InitFulltextStore();
}

private static void InitFulltextStore()
{
    IFulltextStore fulltextStore = new FulltextStore().Initialize();
    fulltextStore.ConnectionString.IsThis("pooling=true");

    SearchController.FulltextStore = fulltextStore;
}


Importing Data into Sphinx RT-Indexes with SphinxConnector.NET’s Fluent API

by Dennis 19. November 2012 12:04

If you are facing the task of importing data into a Sphinx RT-index, SphinxConnector.NET’s fluent API makes this really easy with just a couple of lines of code (the document model class is omitted for brevity):

void Import()
{
    IFulltextStore fulltextStore = new FulltextStore().Initialize();
    fulltextStore.ConnectionString.IsThis("pooling=true");

    int count = 0;

    using (IFulltextSession session = fulltextStore.StartSession())
    {
        foreach (var document in GetDocuments())
        {
            session.Save(document);

            if (++count % fulltextStore.Settings.SaveBatchSize == 0)
                session.FlushChanges();
        }

        session.FlushChanges();   
    }
}   

The important part is the call to FlushChanges each time a batch of documents has been passed to Save. This avoids high memory usage when importing many documents, because SphinxConnector.NET has to keep each document in memory until FlushChanges is called (though for smaller datasets it might be acceptable to flush all changes at the end of the import process).

Not only is this much simpler than writing SphinxQL by hand, it’s also faster because of SphinxConnector.NET’s automatic batching. The default value for SaveBatchSize is 16, which already provides good performance, but it can of course be adjusted for environments where a higher batch size yields even better performance.
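Adjusting the batch size is a single setting on the store, for example (the value is just an illustration):

fulltextStore.Settings.SaveBatchSize = 64;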


Doing Full-Text Searches in .NET with Sphinx - Part 1: Introduction

by Dennis 25. October 2012 11:16

This is the first post in a series (part 2, part 3) that is intended as an introduction to Sphinx for .NET developers who have not yet heard of Sphinx and are looking for a powerful full-text search engine for their websites or applications. The first part of this series serves as an introduction to Sphinx, and over the next parts we’ll build an ASP.NET MVC site that uses Sphinx.

What is Sphinx?

 

Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It's written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems.

Source: http://sphinxsearch.com/about/sphinx/

As we can see, Sphinx runs as a stand-alone server (as opposed to embedded solutions) on pretty much every important operating system. Its key benefits are high search and indexing performance, scalability, and high-quality search results. It is used by a variety of sites from different domains such as news, e-commerce, and social media, e.g. Slashdot, Craigslist, and Tumblr. Before you can use Sphinx for full-text searches, you need to import your data into it.

How to get your data into Sphinx

 

First, let’s talk about some terminology: Sphinx stores our data in an index, and each entry in an index is called a document. An index is comparable to a table in a database, and a document is analogous to a row in a database table. A document can contain text and attributes and must have a unique id. The text is the data that we want to search through, while attributes contain additional data that is stored along with our text. Attributes can be used to hold data like dates, numbers, booleans, and strings. The latter are especially interesting and useful, as we’ll see in the next part.

So, how do we get our documents into Sphinx? There are several answers to that, and it can be confusing for beginners at first, but once you get the hang of it, it’s not that difficult.

Index Types

As already mentioned, Sphinx stores documents in an index. Let’s first take a look at the index types that Sphinx provides. Sphinx comes with two different index types: disk-based indexes (also referred to as batch indexes) and real-time (RT) indexes. The name disk-based index can be a bit confusing, because RT indexes are of course also stored on disk: RT indexes consist of so-called disk chunks, which in turn are just regular disk-based indexes. The difference between the two index types is the way they are filled with data.

Disk-Based or Batch Indexes

With disk-based indexes, there are two ways to add documents: from a database or from an XML pipe. A disk-based index always needs to have a data source configured. As stated before, this source can be either a database or a stream of XML data. Sphinx can talk to many databases directly; it supports MSSQL, MySQL, and PostgreSQL natively, and other databases can be accessed via ODBC. If a database isn’t supported, or the data is not stored in one, it can be exported to XML in a schema that Sphinx understands and be processed from there.

The indexing for disk-based indexes is not done by the Sphinx search server itself, but by a separate program aptly called indexer. It fetches the data from a data source and creates the index files which are then served to clients by a program called searchd (the actual search server). Once a disk-based index is created, no new documents can be added to it, though attributes of existing documents can be updated. So how does one add new documents to an index? The easiest way is to just run the indexer again and re-create the index from the ground up. Schedule an indexer run every 20 minutes or so and new documents are available in the index with a maximum delay of 20 minutes plus the time it takes to create the index. While this seems crude, it can be a valid approach for smaller datasets because of its simplicity.
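A typical indexing run that (re-)builds all configured indexes and tells a running searchd to pick up the new files looks roughly like this (paths and options may differ per setup; consult the Sphinx documentation):

indexer --config /path/to/sphinx.conf --all --rotate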

The approach commonly used for larger datasets is a so-called main+delta scheme, where one has a big index that is created once (main) and a smaller one that holds new documents (delta). Since the delta index is small, it can be rebuilt regularly. As the delta index grows, it can be merged into the main index to keep indexing time short.
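Rebuilding the delta and merging it into the main index might then look roughly like this (index names are placeholders; see the Sphinx documentation for the details of --merge):

indexer --config /path/to/sphinx.conf --rotate delta
indexer --config /path/to/sphinx.conf --merge main delta --rotate

Next, let’s take a look at how RT-indexes work: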

Real-Time Indexes

The most notable difference between disk-based indexes and RT-indexes is the fact that the latter are not created by the indexer program. Then how do my documents get in there, you ask? Simple: we just INSERT them. With RT-indexes we can insert, update, and delete documents in real-time (hence the name); all changes to an RT-index become visible nearly instantaneously. These operations are done via Sphinx’s own query language, SphinxQL, which supports INSERT, UPDATE, and DELETE statements similar to SQL. To reiterate (because new Sphinx users are sometimes confused by this): unlike disk-based indexes, real-time indexes are independent of the data source! It does not matter where your data comes from; you retrieve it from any source and then insert it into a real-time index via an INSERT INTO … statement, just as you would insert data into a database.
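For illustration, inserting a document via SphinxQL could look roughly like this (index and column names are made up for the example):

INSERT INTO books_rt (id, title, content) VALUES (1, 'Sphinx basics', 'Full-text search for everyone');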

So which index type should I use?

Unfortunately, that is not an easy question to answer. It really depends on each project’s specific requirements but here are some general guidelines:

  • RT-indexes are missing some features (infix indexing for example) that disk-based indexes have. So if you can’t live without that, disk-based indexes are the way to go (Update: Infixes are supported as of Sphinx 2.1.1)
  • If you need instantaneous index updates, RT-indexes are the way to go (not entirely true, see below)
  • The indexing speed of disk-based indexes is often higher than that of RT-indexes

It is also possible to use both index types, e.g. disk-based indexes for existing data and RT-indexes for new documents. Also, if you start out with disk-based indexes and decide to switch to RT-indexes later, Sphinx provides an easy way to convert your existing indexes via the ATTACH INDEX statement.
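The statement itself is a one-liner, roughly of this form (index names are placeholders):

ATTACH INDEX diskindex TO RTINDEX rtindex;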

Conclusion

 

In the first part of this series we took a high-level look at Sphinx - what index types it provides, how indexing works and different strategies for indexing data. In the next part we’ll get our hands dirty and do some actual coding!
