Indexing Office and PDF Files With Sphinx and .NET

by Dennis 6. February 2013 13:19

Sphinx is a great full-text search engine with many amazing features, but one feature is missing that would make it even better: the ability to directly index Word, Excel, PowerPoint, and PDF files. How to index these kinds of documents with Sphinx is a question that comes up often in the Sphinx forum. Today I’d like to show you an easy way to extract text from these document types and store it in a Sphinx real-time index from your .NET application.

There are a bunch of tools and libraries out there that claim to be able to extract text from various document formats. As it is a pretty hard task to support many formats and extract text reliably, the quality of each library varies greatly. One tool that stands out is the Apache Tika™ toolkit. It is a Java library that

“detects and extracts metadata and structured text content from various documents using existing parser libraries.”

And it is very good at it. Among others, it supports Microsoft Office, Open Document (ODF), and PDF files. But wait, I said Java library, didn’t I? “What am I supposed to do with a Java library in my .NET application?”, you might ask. Well, we’ll just convert it from Java to .NET using IKVM.NET. IKVM.NET is a .NET implementation of a Java Virtual Machine (JVM) which can be used as a drop-in replacement for Java. It also comes with a Java bytecode to CIL translator named ikvmc that we can use to build a .NET version of Apache Tika. In the next section, I’ll walk through the steps required to do this. At the end of this article you can download a complete sample application that uses Tika to extract text from some files and stores it in a Sphinx real-time index via SphinxConnector.NET.

Creating a .NET Version of Apache Tika

To create your own .NET version of Apache Tika you need to:

Download IKVM.NET
Download the .jar file from the Tika project page (the current version at the time of writing is 1.3)
Extract IKVM.NET to a folder of your choice
Optional: Add the bin folder of IKVM.NET to your %PATH% variable
Execute the following command (add/change the paths to ikvmc.exe and tika-app-1.3.jar if needed):

ikvmc.exe -target:library -version:1.3.0.0 -out:Apache.Tika.dll tika-app-1.3.jar

Let’s walk through the command line parameters: With -target:library we tell ikvmc to convert the jar to a class library. This is needed because the jar file is also usable as a standalone console/GUI application, i.e. it contains a main() method, which by default would cause ikvmc to generate an exe file. Next, we specify the version for our output DLL because otherwise ikvmc would set the version to 0.0.0.0. Finally, we specify the output file name via -out: and the path to the Tika jar file.

After hitting Enter, ikvmc starts to translate the Java library to .NET. It’ll output a lot of warning messages, but will eventually finish and produce a working DLL. Note that if you want to sign the output assembly you can do so by specifying a key file via the -keyfile: command line option.
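
For example, a signed build could look like the following. This is only a sketch: Tika.snk is a hypothetical key file name, created here with the sn.exe tool that ships with the .NET SDK; the remaining parameters are the same as in the command above.

```
sn.exe -k Tika.snk
ikvmc.exe -target:library -version:1.3.0.0 -keyfile:Tika.snk -out:Apache.Tika.dll tika-app-1.3.jar
```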

Extracting Text from Documents

Now that we've created a .NET library of Tika, we can start extracting text from documents. I’ve created a small wrapper that provides methods to perform the extraction. To build the wrapper DLL we need to add references to a couple of IKVM.NET libraries:

Note that an application that uses Tika needs to reference more of IKVM.NET’s DLLs; these are just the files required to compile the wrapper project.

The AutoTextExtractor class, which handles the extraction of text from files and binary data (useful if your documents are stored in a database), and the TextExtractionResult class are based on code by Kevin Miller:

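// The Java types keep their package names as namespaces after the ikvmc conversion
using System;
using System.Collections.Generic;
using System.Linq;
using java.io;
using javax.xml.transform;
using javax.xml.transform.sax;
using javax.xml.transform.stream;
using org.apache.tika.io;
using org.apache.tika.metadata;
using org.apache.tika.parser;

public class AutoTextExtractor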
{
  public TextExtractionResult Extract(string filePath,OutputType outputType = OutputType.Text)
  {
      return Extract(System.IO.File.ReadAllBytes(filePath), outputType);
  }

  public TextExtractionResult Extract(byte[] data, OutputType outputType = OutputType.Text)
  {
      var parser = new AutoDetectParser();
      var metadata = new Metadata();

      using (Writer outputWriter = new StringWriter())
      using (InputStream inputStream = TikaInputStream.get(data, metadata))
      {
          parser.parse(inputStream, GetTransformerHandler(outputWriter, outputType), 
              metadata, new ParseContext());
          
          return AssembleExtractionResult(outputWriter.toString(), metadata);
      }
  }

  private static TextExtractionResult AssembleExtractionResult(string text, Metadata metadata)
  {
      Dictionary<string, string> metaDataResult = metadata.names().
          ToDictionary(name => name, name => String.Join(", ", metadata.getValues(name)));

      string contentType = metaDataResult["Content-Type"];

      return new TextExtractionResult
      {
          Text = text,
          ContentType = contentType,
          Metadata = metaDataResult
      };
  }

  private TransformerHandler GetTransformerHandler(Writer outputWriter, OutputType outputType)
  {
      var factory = (SAXTransformerFactory)TransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();
      // XSLT output method names are lowercase ("text", "html", "xml")
      handler.getTransformer().setOutputProperty(OutputKeys.METHOD, outputType.ToString().ToLowerInvariant());
      handler.setResult(new StreamResult(outputWriter));

      return handler;
  }
}

Here’s the TextExtractionResult class:

public class TextExtractionResult
{
    public string Text { get; set; }
    
    public string ContentType { get; set; }
    
    public IDictionary<string, string> Metadata { get; set; }
}

And the OutputType enumeration:

public enum OutputType
{
    Text,
    Html,
    Xml
}

Demo Application

I’ve created a small demo application that contains a DLL that wraps Tika with the help of the aforementioned classes, and a console application that demonstrates how to extract and store the contents of some files in a Sphinx real-time index with SphinxConnector.NET. The code that does the extraction is pretty simple:

private static SphinxDocumentModel[] GetDocuments()
{
    AutoTextExtractor textExtractor = new AutoTextExtractor();

    int id = 1;

    return (from filePath in Directory.EnumerateFiles(@"..\..\..\..\testfiles")
            select new SphinxDocumentModel
            {
                Id = id++,
                FileContent = textExtractor.Extract(filePath).Text,
                FilePath = Path.GetFullPath(filePath)
            }).ToArray();
}
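
The query above assigns ids by incrementing a captured variable inside the select clause, which works because ToArray() enumerates the sequence exactly once. If you prefer to avoid the side effect, the Select overload that passes the element index is an alternative. The following is a minimal sketch of that idea, not the demo application's code; DocumentNumbering and Number are hypothetical names, and a KeyValuePair stands in for SphinxDocumentModel to keep the example self-contained.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the demo application.
public static class DocumentNumbering
{
    // Pairs each path with a 1-based id taken from its position in the sequence.
    public static KeyValuePair<int, string>[] Number(IEnumerable<string> filePaths)
    {
        return filePaths
            .Select((path, index) => new KeyValuePair<int, string>(index + 1, path))
            .ToArray();
    }
}
```

In GetDocuments() the same overload could be applied to Directory.EnumerateFiles(...), building a SphinxDocumentModel from the path and index instead of a KeyValuePair.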

I’ll omit the code that saves the documents to the index, as it is straightforward. Tip: if you are working with big files you might have to increase Sphinx’s max_packet_size setting.
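
The setting belongs in the searchd section of the Sphinx configuration file. A sketch, where 32M is a purely illustrative value that you should size to your largest document (as far as I know, the default is 8M):

```
searchd
{
    # other searchd settings omitted
    max_packet_size = 32M
}
```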

The archive contains all required libraries so it’s pretty big (IKVM.NET and Tika alone take up about 50 MB in space).

Downloads:

[1] Zip-Package (22 MB)

[2] 7z-Package (18 MB)

Tags: sphinx, .NET, indexing, office documents, pdf

How-to | Tutorial

Optimized Attribute Filtering with SphinxConnector.NET’s Fluent API

by Dennis 1. February 2013 12:11

The MySQL Performance Blog recently published an interesting article about optimizing Sphinx queries that only filter by an attribute (i.e. do not contain a full-text query). I recommend reading the article first and then coming back here, but here’s a quick summary: a Sphinx query that only filters by an attribute may be relatively slow compared to an equivalent query in a regular DBMS. The reason is that one cannot create indexes (as in B-tree indexes) for attributes in Sphinx as one would do in a DBMS. So to retrieve the results of such a query, Sphinx has to perform a full scan of the index, which is relatively costly depending on the size of the index.

The article describes a neat trick to get around this limitation: by adding a full-text indexed field for an attribute and querying that, one can achieve a greatly improved query time. In this post I’d like to demonstrate how this technique can be used with SphinxConnector.NET’s fluent API in conjunction with a real-time index.

The index in the article’s example contains data about books, so I’ll be using that here as well. These documents have an integer attribute for a user id that we’d like to store as a full-text indexed field. Let’s take a look at what the document model should look like and which additional settings need to be applied.

To add the user id attribute to the full-text index it needs to be converted to a string. We’ll also add a prefix to each value to avoid it being included in the results of a “regular” full-text query. To do this, we add a string property to the document model that returns the converted and prefixed value:

public class CatalogItem
{
    public int Id { get; set; }

    public int UserId { get; set; }

    public string Title { get; set; }

    public string UserIdKey
    {
        get { return "userkey_" + UserId; }
    }
}

As of Version 3.2, SphinxConnector.NET will automatically exclude any read-only property when selecting the results of a query, so no further setup is required here (it will of course still be inserted into the index during a save).

In previous versions of SphinxConnector.NET the UserIdKey property would have to be configured as follows:

fulltextStore.Conventions.IsFulltextFieldOnly = memberInfo => memberInfo.Name == "UserIdKey";

A query that uses the new field would then look like this:

IList<CatalogItem> results = session.Query<CatalogItem>().
                                     Match("@UserIdKey userkey_42").
                                     ToList();
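
If several queries filter by user id, the prefix and the field operator are easy to get out of sync with the UserIdKey property. One option is to build the match expression in a single place. The following is a small sketch under that assumption; UserKeyQuery and its methods are hypothetical names and not part of SphinxConnector.NET.

```csharp
using System.Globalization;

// Hypothetical helper that mirrors the prefix used by CatalogItem.UserIdKey.
public static class UserKeyQuery
{
    private const string Prefix = "userkey_";

    // Returns the keyword stored in the full-text field, e.g. "userkey_42".
    public static string ToKeyword(int userId)
    {
        return Prefix + userId.ToString(CultureInfo.InvariantCulture);
    }

    // Returns an extended-syntax expression limited to one field,
    // e.g. "@UserIdKey userkey_42".
    public static string ForField(string fieldName, int userId)
    {
        return "@" + fieldName + " " + ToKeyword(userId);
    }
}
```

The query above could then pass UserKeyQuery.ForField("UserIdKey", 42) to Match(), and the UserIdKey getter could return UserKeyQuery.ToKeyword(UserId).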

For the sake of completeness, here’s the corresponding Sphinx configuration:

index catalog
{
    type = rt
    path = catalog

    rt_field = title
    rt_field = useridkey
    rt_attr_string = title
    rt_attr_uint = userid    
}

Tags: sphinx, Sphinxconnector.NET, performance

How-to

SphinxConnector.NET 3.2 has been released

by Dennis 30. January 2013 11:50

We're pleased to announce the immediate availability of a new release of SphinxConnector.NET! Among other things, we've been busy adding support for Sphinx 2.1, in the course of which we've made several optimizations that should improve performance and reduce memory usage with SphinxQL and the fluent API.

The latter now properly supports enums and has gotten support for JSON attributes that are going to be introduced with Sphinx 2.1. The methods First() and FirstOrDefault() can now also be executed as futures, and we've added the possibility to perform operations like attaching and flushing indexes to the fluent API.

There are several other additions, improvements, and bugfixes which are listed in the version history.

Announcements

A Quick Way to Set Up Logging during Development

by Dennis 18. January 2013 09:29

I’ve been asked a few times if there’s a quick way to get logging output from SphinxConnector.NET without setting up a “real” logging framework like NLog. Here’s one: Common.Logging comes with two adapters named TraceLoggerFactoryAdapter and ConsoleOutLoggerFactoryAdapter. The latter (obviously) logs messages to the console, while the former logs messages via .NET’s Trace class. One nice thing about the trace log is that it can be accessed via Visual Studio’s ‘Output’ window (CTRL+ALT+O) if your application is running with a debugger attached (F5).

Here is the relevant code:

[Conditional("DEBUG")]
private static void SetupLogging()
{
    LogManager.Adapter = new TraceLoggerFactoryAdapter
    {
        Level = LogLevel.All,
        ShowLevel = true,
        ShowDateTime = true,
    };
}

I also added a Conditional attribute to the setup method so that calls to it are only compiled into builds that define the DEBUG symbol, i.e. a debug configuration.
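
To illustrate how the Conditional attribute behaves, here is a small self-contained sketch that does not depend on Common.Logging. ConditionalDemo and its members are hypothetical names. The compiler drops the call to DebugOnly (including the evaluation of its argument) when the calling code is compiled without the DEBUG symbol.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical demo class, only meant to show the effect of the attribute.
public static class ConditionalDemo
{
    public static readonly List<string> Calls = new List<string>();

    // Calls to this method are compiled only when DEBUG is defined at the call site.
    [Conditional("DEBUG")]
    public static void DebugOnly(string message)
    {
        Calls.Add(message);
    }

    // Returns 1 when compiled with DEBUG defined, 0 otherwise.
    public static int Run()
    {
        Calls.Clear();
        DebugOnly("setup logging");
        return Calls.Count;
    }
}
```

This is why SetupLogging() can be called unconditionally at application startup: in a release configuration the call simply disappears.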

How-to

Using SphinxConnector.NET with ASP.NET MVC

by Dennis 21. December 2012 11:41

As using Sphinx from a web application is probably the most common use case, I thought I’d post some guidelines and examples on how to use the fluent API in an ASP.NET MVC application with regard to setup and proper handling of IFulltextStore and IFulltextSession. The documentation already mentions that there should (usually) be one instance of the FulltextStore per application and one IFulltextSession per thread/(web-) request. Let’s take a look at a few different approaches to this:

Using Lazy<T>

This approach makes use of the Lazy<T> class that was introduced with .NET 4.0. We create a base controller that holds the IFulltextStore instance which will be initialized upon the first access. Lazy<T> will make sure that the FulltextStore is created only once in a thread-safe way.

public abstract class SearchController : Controller
{
    private static readonly Lazy<IFulltextStore> Store = new Lazy<IFulltextStore>(() =>
    {
        IFulltextStore fulltextStore = new FulltextStore().Initialize();
        fulltextStore.ConnectionString.IsThis("pooling=true");

        return fulltextStore;
    });

    protected static IFulltextStore FulltextStore
    {
        get { return Store.Value; }
    }

    protected IFulltextSession FulltextSession { get; private set; }

    protected override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        FulltextSession = FulltextStore.StartSession();
    }

    protected override void OnActionExecuted(ActionExecutedContext filterContext)
    {
        if (filterContext.IsChildAction || FulltextSession == null)
            return;

        using (FulltextSession)
        {
            if (filterContext.Exception != null)
                return;

            FulltextSession.FlushChanges();
        }
    }
}

The IFulltextSession for every request is created in an override of OnActionExecuting by assigning the result of StartSession to the FulltextSession property. This way, every controller that inherits from SearchController automatically gets an open session that is ready for use. In the override of OnActionExecuted we tell the FulltextSession to flush all pending changes. The using statement ensures that it is properly disposed of.
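
The once-only guarantee of Lazy<T> can be checked in isolation. The following sketch is not tied to SphinxConnector.NET: LazyDemo is a hypothetical class, and a plain object stands in for the FulltextStore. It reads the lazy value from several threads and counts how often the factory delegate runs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical demo class; object is a placeholder for the real store type.
public static class LazyDemo
{
    private static int _factoryCalls;

    // The constructor taking only a factory delegate uses
    // LazyThreadSafetyMode.ExecutionAndPublication, so the delegate runs once.
    private static readonly Lazy<object> Store = new Lazy<object>(() =>
    {
        Interlocked.Increment(ref _factoryCalls);
        return new object();
    });

    public static int FactoryCalls
    {
        get { return _factoryCalls; }
    }

    // Reads Store.Value from several worker threads and returns what each one saw.
    public static object[] ResolveConcurrently(int workers)
    {
        var results = new object[workers];
        Parallel.For(0, workers, i => results[i] = Store.Value);
        return results;
    }
}
```

Every worker receives the same instance and FactoryCalls stays at 1, which is the behavior the base controller relies on for its static Store field.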

Using an IoC-Container

Following is an example installer for Castle Windsor:

public class SphinxConnectorInstaller : IWindsorInstaller
{
    public void Install(IWindsorContainer container, IConfigurationStore store)
    {
        container.Register(Component.For<IFulltextStore>().
                                     Instance(new FulltextStore().Initialize()).
                                     LifestyleSingleton(),
                           Component.For<IFulltextSession>().
                                     UsingFactoryMethod(kernel =>
                                         kernel.Resolve<IFulltextStore>().StartSession()).
                                     LifestylePerWebRequest());
    }
}

In this example, we set up Castle Windsor so that it can create both IFulltextStore and IFulltextSession. If you wanted to create IFulltextSession yourself (by injecting IFulltextStore into your classes and calling StartSession), you could remove the corresponding code from the installer.

We instruct Windsor to use the Singleton lifestyle for IFulltextStore, which means that Windsor will create one instance per container. Singleton is in fact Windsor’s default lifestyle, but in cases like this I like to make that explicit, so that developers who are not familiar with Windsor immediately see what’s going on. For IFulltextSession we set LifestylePerWebRequest so that Windsor will create an instance for each request; it will also automatically call Dispose at the end of each request, so we don’t have to worry about that. If you wanted Windsor to also call FlushChanges, you could do so with the help of Windsor’s OnDestroy method.

Initialization at Application Startup

As with the first approach, we create a base controller, this time with a static property that holds the IFulltextStore instance. The instance is initialized in the Global.asax.cs file in Application_Start:

public abstract class SearchController : Controller 
{
    public static IFulltextStore FulltextStore { get; set; }

    protected IFulltextSession FulltextSession { get; private set; } 
    
    //Overrides of OnActionExecuting and OnActionExecuted omitted
}

protected void Application_Start()
{
    AreaRegistration.RegisterAllAreas();

    RegisterGlobalFilters(GlobalFilters.Filters);
    RegisterRoutes(RouteTable.Routes);

    InitFulltextStore();
}

private static void InitFulltextStore()
{
    IFulltextStore fulltextStore = new FulltextStore().Initialize();
    fulltextStore.ConnectionString.IsThis("pooling=true");

    SearchController.FulltextStore = fulltextStore;
}

Tags: ASP.NET MVC, Sphinxconnector.NET, sphinx

How-to
