Indexing Office and PDF Files With Sphinx and .NET

by Dennis 6. February 2013 13:19

Sphinx is a great full-text search engine with many amazing features, but there is one feature missing that would make it even better: the ability to directly index Word, Excel, PowerPoint, and PDF files. How to index these kinds of documents with Sphinx is a question that comes up often in the Sphinx forum. Today I’d like to show you an easy way to extract text from these document types and store it in a Sphinx real-time index from your .NET application.

There are a bunch of tools and libraries out there that claim to be able to extract text from various document formats. Since supporting many formats and extracting text reliably is a hard problem, the quality of these libraries varies greatly. One tool that stands out is the Apache Tika™ toolkit. It is a Java library that

“detects and extracts metadata and structured text content from various documents using existing parser libraries.”

And it is really very good at it. Among others, it supports Microsoft Office, Open Document (ODF), and PDF files. But wait, I said Java library, didn’t I? “What am I supposed to do with a Java library in my .NET application?”, you might ask. Well, we’ll just convert it from Java to .NET using IKVM.NET. IKVM.NET is a .NET implementation of a Java Virtual Machine (JVM) which can be used as a drop-in replacement for Java. And it comes with a Java-bytecode-to-CIL translator named ikvmc that we can use to build a .NET version of Apache Tika. In the next section, I’ll walk through the steps required to do this. At the end of this article you can download a complete sample application that uses Tika to extract text from some files and stores them in a Sphinx real-time index via SphinxConnector.NET.

Creating a .NET Version of Apache Tika


To create your own .NET version of Apache Tika you need to:

  1. Download IKVM.NET
  2. Download the .jar file from the Tika project page (the current version at the time of writing is 1.3)
  3. Extract IKVM.NET to a folder of your choice
  4. Optional: Add the bin folder of IKVM.NET to your %PATH% variable
  5. Execute the following command (add/change the paths to ikvmc.exe and tika-app-1.3.jar if needed):
ikvmc.exe -target:library -version:1.3.0.0 -out:Apache.Tika.dll tika-app-1.3.jar

Let’s walk through the command line parameters: with -target:library we tell ikvmc to convert the jar to a class library. This is needed because the jar file is also usable as a standalone console/GUI application, i.e. it contains a main() method, which by default would cause ikvmc to generate an exe file. Next, we specify the version for our output DLL, because otherwise ikvmc would set it to 0.0.0.0. Finally, we specify the output file name via -out: and the path to the Tika jar file.

After hitting Enter, ikvmc starts to translate the Java library to .NET. It’ll output a lot of warning messages, but will eventually finish and produce a working DLL. Note that if you want to sign the output assembly you can do so by specifying a key file via the -keyfile: command line option.
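For instance, a signed build could be produced like this (MyKeyFile.snk is just a placeholder for your own key file):

ikvmc.exe -target:library -version:1.3.0.0 -keyfile:MyKeyFile.snk -out:Apache.Tika.dll tika-app-1.3.jar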

Extracting Text from Documents


Now that we've created a .NET library of Tika, we can start extracting text from documents. I’ve created a small wrapper that provides methods to perform the extraction. To build the wrapper DLL we need to add references to a couple of IKVM.NET libraries:

[Screenshot: IKVM.NET assemblies referenced by the wrapper project]

Note that an application that uses Tika needs to reference more of IKVM.NET’s DLLs; the ones shown above are just those required to compile the wrapper project.

The AutoTextExtractor class, which handles the extraction of text from files and from binary data (useful if your documents are stored in a database), and the TextExtractionResult class are based on code by Kevin Miller:

using System;
using System.Collections.Generic;
using System.Linq;
using java.io;
using javax.xml.transform;
using javax.xml.transform.sax;
using javax.xml.transform.stream;
using org.apache.tika.io;
using org.apache.tika.metadata;
using org.apache.tika.parser;

public class AutoTextExtractor
{
  public TextExtractionResult Extract(string filePath, OutputType outputType = OutputType.Text)
  {
      return Extract(System.IO.File.ReadAllBytes(filePath), outputType);
  }

  public TextExtractionResult Extract(byte[] data, OutputType outputType = OutputType.Text)
  {
      // AutoDetectParser determines the document type and delegates
      // to the appropriate Tika parser.
      var parser = new AutoDetectParser();
      var metadata = new Metadata();

      using (Writer outputWriter = new StringWriter())
      using (InputStream inputStream = TikaInputStream.get(data, metadata))
      {
          parser.parse(inputStream, GetTransformerHandler(outputWriter, outputType), 
              metadata, new ParseContext());
          
          return AssembleExtractionResult(outputWriter.toString(), metadata);
      }
  }

  private static TextExtractionResult AssembleExtractionResult(string text, Metadata metadata)
  {
      // Copy Tika's metadata into a plain dictionary; multi-valued
      // entries are joined into a single comma-separated string.
      Dictionary<string, string> metaDataResult = metadata.names().
          ToDictionary(name => name, name => String.Join(", ", metadata.getValues(name)));

      string contentType = metaDataResult["Content-Type"];

      return new TextExtractionResult
      {
          Text = text,
          ContentType = contentType,
          Metadata = metaDataResult
      };
  }

  private TransformerHandler GetTransformerHandler(Writer outputWriter, OutputType outputType)
  {
      var factory = (SAXTransformerFactory)TransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();

      // The XSLT output method must be "text", "html", or "xml" (lowercase),
      // so normalize the enum name accordingly.
      handler.getTransformer().setOutputProperty(OutputKeys.METHOD,
          outputType.ToString().ToLowerInvariant());
      handler.setResult(new StreamResult(outputWriter));

      return handler;
  }
}

Here’s the TextExtractionResult class:

public class TextExtractionResult
{
    public string Text { get; set; }
    
    public string ContentType { get; set; }
    
    public IDictionary<string, string> Metadata { get; set; }
}

And the OutputType enumeration:

public enum OutputType
{
    Text,
    Html,
    Xml
}
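
With all three pieces in place, using the wrapper boils down to a few lines. Here’s a minimal sketch (the file name is just an example):

var extractor = new AutoTextExtractor();

TextExtractionResult result = extractor.Extract("MyDocument.pdf");

Console.WriteLine(result.ContentType); // e.g. application/pdf
Console.WriteLine(result.Text);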

Demo Application


I’ve created a small demo application that contains a DLL that wraps Tika with the help of the aforementioned classes, and a console application that demonstrates how to extract and store the contents of some files in a Sphinx real-time index with SphinxConnector.NET. The code that does the extraction is pretty simple:

private static SphinxDocumentModel[] GetDocuments()
{
    AutoTextExtractor textExtractor = new AutoTextExtractor();

    int id = 1;

    // Extract the text of each file in the test folder and assign
    // sequential document ids for the real-time index.
    return (from filePath in Directory.EnumerateFiles(@"..\..\..\..\testfiles")
            select new SphinxDocumentModel
            {
                Id = id++,
                FileContent = textExtractor.Extract(filePath).Text,
                FilePath = Path.GetFullPath(filePath)
            }).ToArray();
}
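
For reference, a document model matching the properties used above might look like this (a sketch; the actual class is included in the demo project):

public class SphinxDocumentModel
{
    public int Id { get; set; }

    public string FileContent { get; set; }

    public string FilePath { get; set; }
}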

I’ll omit the code that saves the documents to the index, as it is straightforward. Tip: if you are working with big files, you might have to increase Sphinx’s max_packet_size setting.
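For example, the limit could be raised in the searchd section of your sphinx.conf (32M is just an example value; the default is 8M):

searchd
{
    # ...
    max_packet_size = 32M
}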

The archive contains all required libraries, so it’s pretty big (IKVM.NET and Tika alone take up about 50 MB).

Downloads:

[1] Zip-Package (22 MB)

[2] 7z-Package (18 MB)


How-to | Tutorial

SphinxConnector.NET 3.0.4 released

by Dennis 11. September 2012 10:13

This is just a small maintenance release that contains a few bugfixes. A list of resolved issues is available in the version history. NuGet users can update to the latest version via the package manager; a ZIP package can be downloaded from the download page.


Announcements

SphinxConnector.NET 3.0 has been released

by Dennis 3. September 2012 10:03

We are pleased to announce that the new major version of SphinxConnector.NET is now available for download!

Those of you who have been following the blog already know about the big new feature coming with this release: the fluent query API. It provides you with a LINQ-like way to compose your full-text queries. It operates directly on your document models and also lets you comfortably save and delete documents from real-time indexes.

A description with much more details is available on the features page.

Another highlight of this release is the newly added support for the Mono runtime. Additionally, we've upgraded Common.Logging to version 2, which provides support for recent releases of the supported logging frameworks. We've also added support for running SphinxConnector.NET in medium-trust environments. There are a bunch of other improvements which are listed in the version history.

SphinxConnector.NET is now available as a NuGet package, which we know many of you have been waiting for!
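For Package Manager Console users, installing should be a one-liner (assuming the package id is SphinxConnector):

PM> Install-Package SphinxConnector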

Licensing and Upgrading

With the new release we're switching to a subscription-based licensing system. All new purchases and upgrades come with a 1-year upgrade subscription which gives you access to all major and minor releases made during the subscription period. At the end of the subscription period you can renew your license for just 40% of the then-current price.

If you bought your license in 2012, you will receive SphinxConnector.NET 3.0 and all other releases made this year for free! Afterwards, you can renew your license under the conditions outlined above.

If you bought your license before 2012, you can also renew your license for just 40% of the current price!

We are also introducing a new license type, the 'Large Team License' for up to eight developers, to make up for the fact that we had to raise the price of the site license quite a bit more than we wished. If you have purchased a Site License and are eligible, you can downgrade to a Large Team License.

You can now also purchase a premium support subscription along with your license or license renewal. All details can be found on our purchase page.

If you would like to send us feedback about the new version, you can use the contact form or send us an e-mail to contact@sphinxconnector.net.


Announcements

Introducing the Fluent Query API Part 4 of n: Saving and Deleting Documents

by Dennis 21. August 2012 14:54

Disclaimer: The API presented here is still under development, so it may change before the final release. If you have any suggestions or comments, post them here, over at UserVoice, or drop me a mail!

While SphinxConnector.NET 3.0 is nearing completion and will be released soon, I wanted to write about another nice feature of the fluent API. In the first few posts we’ve focused exclusively on querying. But of course, you’ll also be able to save and delete documents in your (real-time) indexes.

Saving Documents

 

We’ll use the following class as our document model:

public class Book
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string Author { get; set; }
    public decimal Price { get; set; }
    public bool EbookAvailable { get; set; }
    public DateTime ReleaseDate { get; set; }
    public IList<long> Categories { get; set; }
    public int Weight { get; set; }
}

which is based on the following index definition:

index books
{
    type                 = rt
    path                 = books

    rt_field             = title
    rt_attr_string       = title
    rt_field             = author
    rt_attr_string       = author
    rt_attr_float        = price
    rt_attr_timestamp    = releasedate
    rt_attr_uint         = ebookavailable
    rt_attr_multi        = categories

    charset_table        = 0..9, A..Z->a..z, a..z
    charset_type         = utf-8
}

The IFulltextSession interface provides a Save method with two overloads so we can either save a single document or an enumerable of documents.

void Save(object document);

void Save<TDocument>(IEnumerable<TDocument> documents);

Let’s insert two books into our index:

IFulltextStore fulltextStore = new FulltextStore().Initialize();

using (IFulltextSession session = fulltextStore.StartSession())
{
    session.Save(new Book
    {
        Id = 1,
        Author = "George R.R. Martin",
        Title = "A Game of Thrones: A Song of Ice and Fire: Book One",
        EbookAvailable = true,
        Categories = new long[] { 1, 2 },
        Price = 5.60m,
        ReleaseDate = new DateTime(1997, 8, 4)
    });

    session.Save(new Book
    {
        Id = 2,
        Author = "George R.R. Martin",
        Title = "A Clash of Kings: A Song of Ice and Fire: Book Two",
        EbookAvailable = true,
        Categories = new long[] { 1, 2 },
        Price = 7.10m,
        ReleaseDate = new DateTime(2000, 9, 5)
    });

    session.FlushChanges();
}

The above code is pretty straightforward: we create two instances of the Book class and pass them to the Save method. When we’re done, we call FlushChanges to tell SphinxConnector.NET that all pending saves and deletes should be executed.

This is where things get interesting: When SphinxConnector.NET detects that it needs to save more than one document, it inserts them in batches by generating a single REPLACE statement for each batch:

REPLACE INTO `books` (id, title, author, price, ebookavailable, releasedate, categories) 
VALUES (1, 'A Game of Thrones: A Song of Ice and Fire: Book One', 'George R.R. Martin', 5.60, 1,
870645600, (1, 2)), (2, 'A Clash of Kings: A Song of Ice and Fire: Book Two', 'George R.R. Martin', 7.10, 1,
968104800, (1, 2))

This yields a speed-up of several orders of magnitude compared to inserting documents one by one. The batch size is of course configurable, so you can fine-tune it for your workload.

Deleting Documents

 

To delete documents from real-time indexes we’ll use the Delete methods that the IFulltextSession interface provides. We can either pass in one or more IDs of the documents to delete, or an instance of a document to delete:

using (IFulltextSession session = fulltextStore.StartSession())
{
    session.Delete<Book>(1, 2);

    session.FlushChanges();
}

Let’s take a look at the generated SphinxQL:

DELETE FROM `books` WHERE id IN (1, 2)

Again, SphinxConnector.NET takes into account that more than one document should be deleted and generates a single DELETE statement instead of two, thus avoiding unnecessary network round-trips.

Transaction Handling

 

When a call to FlushChanges is made, SphinxConnector.NET executes all saves and deletes within a new transaction for each affected index. The reason for this is that Sphinx limits transactions to a single real-time index. This is something that can get pretty messy when handled manually, but SphinxConnector.NET will take care of that for you.

Using the TransactionScope class in conjunction with a full-text session is also fully supported.
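For illustration, a session wrapped in a TransactionScope might look like this (a sketch; Book and fulltextStore are from the examples above):

using (var scope = new TransactionScope())
using (IFulltextSession session = fulltextStore.StartSession())
{
    session.Save(new Book
    {
        Id = 3,
        Author = "George R.R. Martin",
        Title = "A Storm of Swords: A Song of Ice and Fire: Book Three"
    });

    session.FlushChanges();

    // Completing the scope commits the work done within it.
    scope.Complete();
}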


Breaking Changes in SphinxConnector.NET 3.0

by Dennis 19. July 2012 17:41

After looking into the fluent query API that will be shipping with SphinxConnector.NET 3.0 in the last three posts, today we’ll be looking at a not so pleasant topic: the breaking changes of the next release.

Requirements Changes


The first thing to mention is that SphinxConnector.NET 3.0 needs at least .NET 3.5 to run. If you’ve read the posts about the fluent query API, you probably guessed that already, as it makes heavy use of features only available in .NET 3.5, like expression trees. The next big change is that support for Sphinx versions < 2.0.1 has been dropped. The reason for this is mainly that SphinxQL has greatly improved with the V2 release of Sphinx, and many of these improvements are being used by the fluent query API. For those of you who are still using an older Sphinx version, SphinxConnector.NET 2.x will continue to be available and will also receive bug fixes if necessary. However, if you haven’t yet updated to Sphinx 2.x, now is a great time!

Removal of obsolete members


All methods and properties that have been marked as obsolete in SphinxConnector.NET V2 have been removed.

New Namespace for the native API


The native API, i.e. everything that revolves around the SphinxClient class, has been moved into its own namespace which (unsurprisingly) is called ‘NativeAPI’. With the addition of a new query API, this is a logical step to keep things organized within the assembly.

A Namespace for Common Types


Types that are used in more than one kind of API have been moved to a namespace named ‘Common’. This applies to classes like SphinxHelper and SphinxException and a few other types that have been added in V3. One could argue that they could have just been left in the root namespace, but in my opinion that tends to lead to clutter, especially as more classes get added over time.

The only class that is contained in the root namespace is the new SphinxConnectorLicensing class. It has just one method, SetLicense, which should make it pretty clear where the license key belongs. Since the introduction of the SphinxQL API with V2, there has sometimes been confusion about where the license key goes, because it had to be assigned to the License property of the SphinxClient class, which is not that obvious if you’re only using SphinxQL. That property is now gone, and hopefully any confusion about the license key with it.
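In practice, setting the key should now be a single call at application startup, something like this (assuming SetLicense takes the key as a string; the key itself is a placeholder):

SphinxConnectorLicensing.SetLicense("<your license key>");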

And finally, the namespaces have been renamed so that the root namespace is now named ‘SphinxConnector’. This also means that the V3 assembly will be named ‘SphinxConnector.dll’.

Conclusion


While breaking changes are certainly annoying, I think that in this case it’s only half as bad. You have to set your license key only once per application, so that’s just a small change. The namespace changes should also be easy to make with Visual Studio’s refactoring capabilities, and even easier if you are using a tool like ReSharper.
