Doing Full-Text Searches in .NET with Sphinx - Part 2: Data Import

by Dennis 18. April 2013 11:52

This is the second post in a series intended as an introduction to Sphinx for .NET developers who have not yet heard of Sphinx and are looking for a powerful full-text search engine for their websites or applications. The first part of this series served as an introduction to Sphinx; in this part, we’ll get our hands dirty and start working on our ASP.NET MVC based sample application.

We’ll be creating a simple website that will allow us to search through Stackoverflow data (or rather the posts made by Stackoverflow users). The Stackoverflow team kindly provides this data in an XML format under a Creative Commons license. As the dataset is pretty large (the posts alone are 7 GB uncompressed), we’ll only use about 10,000 posts for demonstration purposes. A file with only these posts is included in the download, so you won’t have to download the whole archive.

Having an existing set of data that needs to be imported into Sphinx is probably the most common scenario when people start to work with Sphinx. We will begin by creating the document model class and the Sphinx configuration file. We will be storing the documents in a real-time index, so we also need to create a small program to read the input data from the XML file.

The current state of this sample application is available for download; it uses .NET 4.0, ASP.NET MVC 3, Sphinx 2.1.1, and SphinxConnector.NET, and contains a Visual Studio 2010 solution. The solution contains the console application for importing data and the web application, which will be described in detail in the next part. Please note that the layout of the website is simple and might have some display quirks, as I didn’t want to spend too much time on making it look pretty.

Creating the Document Model


The first step is to determine which part of the data should be stored in the index. One approach is to store only the bare minimum that is needed for full-text searches. Often, this requires that after the search the original data is retrieved from the data source, most likely a database, to display a meaningful search result to the user. As this requires one or more additional network round trips, I personally like to store as much data in the index as is needed (and feasible) to avoid this overhead.

In this example we’ll be storing everything in the index, because it reduces the complexity of the example. For instance, we use Sphinx to display the questions on the front page, which does not involve searching at all. But storing your complete dataset in Sphinx is usually not a good idea, because it can drastically increase the resources Sphinx needs. Also, Sphinx isn’t a database and shouldn’t be treated as such; for instance, it doesn’t execute certain types of queries as efficiently as a database would (though there are some tricks for that). We are doing this here for the sake of simplicity, and it might even be feasible in some cases, but it should be done with care!

Let’s take a look at our input data. The Stackoverflow data archive contains a readme file with a description of the input data. Here is the relevant part for the posts.xml file which we are interested in:

posts.xml
       - Id
       - PostTypeId
          - 1: Question
          - 2: Answer
       - ParentID (only present if PostTypeId is 2)
       - AcceptedAnswerId (only present if PostTypeId is 1)
       - CreationDate
       - Score
       - ViewCount
       - Body
       - OwnerUserId
       - LastEditorUserId
       - LastEditorDisplayName
       - LastEditDate
       - LastActivityDate
       - CommunityOwnedDate
       - ClosedDate
       - Title
       - Tags
       - AnswerCount
       - CommentCount
       - FavoriteCount

We’ll leave out some attributes that are not needed in our example which leads to the following model class:

public class Post
{
    public int Id { get; set; }

    public PostType PostType { get; set; }
    public long ParentId { get; set; }
    public long AcceptedAnswerId { get; set; }
    public string Body { get; set; }
    public string Title { get; set; }
    public string Tags { get; set; }
    public long Score { get; set; }
    public long ViewCount { get; set; }
    public long AnswerCount { get; set; }
    public DateTime CreationDate { get; set; }
    
    public int Weight { get; set; }
}

You might have already noticed that we’ve added a property named ‘Weight’ to our document class which is not present in the data source. When Sphinx searches the index for results it assigns each match a weight depending on its relevancy to the search query. The higher the weight, the more relevant it is for the query, and because we might want to use the weight in our own relevancy calculations, we need to add a property for it to our model.

Configuring Sphinx


Next we need to create the configuration for the Sphinx server. By default Sphinx will look for a file named sphinx.conf in the directory where the executable resides so we’ll name it just that. The configuration file consists of several sections: one for the server itself, one for the indexer program which is not relevant for our sample, and the index configuration. For our documents the index configuration looks like this:

index posts
{
    type                    = rt
    path                    = posts
    
    rt_field                = body 
    rt_attr_string          = body 
    rt_field                = title     
    rt_attr_string          = title
    rt_field                = tags          
    rt_attr_string          = tags 

    rt_attr_uint            = parentid
    rt_attr_uint            = score 
    rt_attr_uint            = viewcount
    rt_attr_uint            = posttype
    rt_attr_uint            = answercount
    rt_attr_timestamp       = creationdate    
    rt_attr_uint            = acceptedanswerid
    rt_attr_uint            = votecount
    
    charset_type            = utf-8
    min_word_len            = 1    
    dict                    = keywords                    
    expand_keywords         = 1                            
    rt_mem_limit            = 1024M
    
    charset_table           = 0..9, a..z, A..Z->a..z, U+DF, \
                              U+FC->u, U+DC->u, U+FC,U+DC, \
                              U+F6->o, U+D6->o, U+F6,U+D6, \
                              U+E4->a, U+C4->a, U+C4,U+E4, \
                              U+E1->a, U+C1->a, U+E9->e, U+C9->e, \
                              U+410..U+42F->U+430..U+44F, U+430..U+44F, U+00E6, \
                              U+00C6->U+00E6, U+01E2->U+00E6, U+01E3->U+00E6, \
                              U+01FC->U+00E6, U+01FD->U+00E6, U+1D01->U+00E6, \
                              U+1D02->U+00E6, U+1D2D->U+00E6, U+1D46->U+00E6
    
    blend_chars             = U+23, U+2B, -, ., @, &

    stopwords               = stopwords.txt
}

Let’s look through the different configuration options: first we specify the index type and the path where the index files should be stored. Next we declare the fields and attributes; it is not necessary to explicitly declare the id attribute, Sphinx does that automatically for us.

The next thing to note is the declaration of the full-text fields, which are the fields Sphinx searches when it processes a full-text query. To be able to retrieve the contents of these fields, so that we can display them to the user, we also declare a string attribute for each of them. Why is it necessary to declare both? When keywords are inserted into the index, they are hashed, so there is no way to recover the original text from the full-text index itself. This is where string attributes come into play: they let us store and retrieve the original content, which we then use to display the search result to the user without accessing the database at all.

Next we specify the charset to use for our documents, UTF-8 in our case. For the minimum word length we use one, the dictionary type is set to ‘keywords’, we enable Sphinx’s internal keyword expansion to allow for partial word matches, and we set the memory limit for RT index RAM chunks to 1024 MB.

The charset table is up next. It defines which characters are recognized by Sphinx and also allows remapping one character to another. It is important to understand that if a character is not in the charset table, Sphinx will treat it as a separator during indexing. To configure the charset table we can use the character itself and/or its Unicode code point. In our charset table we first add the digits 0 to 9 and the lower-case letters, and remap upper-case letters to their lower-case equivalents. Additionally, we define some more mappings, e.g. German umlauts are mapped to their non-umlaut equivalents, like ü -> u. When defining your own charset table you can refer to the Sphinx wiki, which contains a page with ready-made charset tables.

Lastly, we define the blend chars and configure a file with stop words. Blend chars are characters that are treated by Sphinx as both regular characters and separators. Suppose that you’re indexing email addresses and configure the ‘@’ symbol as a blend char, an email address such as ‘foo@bar.com’ will be indexed as a whole and as ‘foo’ and ‘bar.com’. This enables users to find an email address even if they don’t remember the domain or search for addresses that belong to a particular domain.

The stop words file contains a list of words that should be ignored by Sphinx during indexing. While any word can be configured as a stop word, most of the time frequently occurring words such as ‘the’, ‘is’, etc. are configured as stop words because it reduces the size of the full-text index and thereby improves query times.
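For illustration, a stop words file is just a plain text file with one word per line; a minimal English example might look like the sketch below (the actual word list is up to you, only the file name has to match the stopwords setting in the configuration above):

```
a
an
and
is
of
the
to
```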

Importing the Documents


To import the Stackoverflow data we create a simple console application that reads the input data from the XML file and uses SphinxConnector.NET’s fluent API to save the data into the index. Here’s the ImportPosts method:

public static void ImportPosts()
{
    IFulltextStore fulltextStore = new FulltextStore().Initialize();

    using (IFulltextSession session = fulltextStore.StartSession())
    {
        session.Advanced.TruncateIndex<Post>();

        int count = 0;

        foreach (var post in GetPosts())
        {
            session.Save(post);

            // Flush after each full batch (pre-increment avoids flushing
            // after the very first document when count is still zero)
            if (++count % fulltextStore.Settings.SaveBatchSize == 0)
                session.FlushChanges();
        }

        session.FlushChanges();
    }
}

The FulltextStore class is the entry point to the fluent API. It is used to configure the connection string, mapping conventions and other settings. It also serves as a factory for creating instances of the IFulltextSession interface which provides the methods we need to execute full-text queries and to save and delete documents from RT-indexes. In the above code we’re creating an instance of the IFulltextSession interface by calling StartSession and then proceed to insert the documents. The GetPosts method is responsible for reading the documents from the XML file.
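The XML parsing itself is plain .NET and not specific to SphinxConnector.NET. Here is a minimal sketch of what a GetPosts implementation might look like, using a streaming XmlReader so the whole 7 GB file never has to fit into memory. The attribute names follow the readme excerpt above and the file name is an assumption; it requires using directives for System.Xml and System.Globalization, and error handling is omitted:

```csharp
private static IEnumerable<Post> GetPosts()
{
    // Stream the file row by row instead of loading the whole document
    using (var reader = XmlReader.Create("posts.xml"))
    {
        while (reader.ReadToFollowing("row"))
        {
            yield return new Post
            {
                Id = int.Parse(reader.GetAttribute("Id")),
                PostType = (PostType)int.Parse(reader.GetAttribute("PostTypeId")),
                // Optional attributes are missing depending on the post type
                ParentId = long.Parse(reader.GetAttribute("ParentId") ?? "0"),
                AcceptedAnswerId = long.Parse(reader.GetAttribute("AcceptedAnswerId") ?? "0"),
                Body = reader.GetAttribute("Body"),
                Title = reader.GetAttribute("Title") ?? "",
                Tags = reader.GetAttribute("Tags") ?? "",
                Score = long.Parse(reader.GetAttribute("Score") ?? "0"),
                ViewCount = long.Parse(reader.GetAttribute("ViewCount") ?? "0"),
                AnswerCount = long.Parse(reader.GetAttribute("AnswerCount") ?? "0"),
                CreationDate = DateTime.Parse(reader.GetAttribute("CreationDate"),
                                              CultureInfo.InvariantCulture)
            };
        }
    }
}
```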

Conclusion


In the second installment of this series we created our document model and a Sphinx real-time index setup for these documents. We then created a small console application that performs the import of the documents from the source XML file. In the next part, we’ll create a website that’ll allow us to search the index we just created.


SphinxConnector.NET 3.4 has been released

by Dennis 16. April 2013 11:45

We're pleased to announce that SphinxConnector.NET 3.4 is available for download and via NuGet!

With the new version you can now use the fluent API to execute SphinxQL queries and take advantage of its object mapping capabilities. This gives you the ability to gradually move existing SphinxQL queries to the fluent API, and to execute queries not yet supported by the fluent API without having to manually construct your document objects:

using (IFulltextSession session = fulltextStore.StartSession())
{
    ISphinxQLExecutor executor = session.Advanced.CreateSphinxQLExecutor();

    var parameters = new { query = "a product" };
    var results = executor.Query<Product>("SELECT * FROM products WHERE MATCH(@query)",
                                          parameters);
}

Please note that this feature is only available with .NET 4.

Also, the fluent API now allows grouping by multiple attributes (Sphinx 2.1.2), and projections into anonymous types now support using the document object as a property as well as creating nested anonymous types.


SphinxConnector.NET 3.3.2 has been released

by Dennis 12. March 2013 15:44

SphinxConnector.NET 3.3.2 has just been made available for download and via NuGet. A list of resolved issues is available in the version history.


SphinxConnector.NET 3.3 has been released

by Dennis 27. February 2013 13:50

This release further improves support for the just released beta of Sphinx 2.1 by allowing users to save JSON attributes to real-time indexes with the fluent API. The fluent API now also supports new query options and the creation of snippets within a query via the GetSnippets() extension method.


The native API has gotten support for new query flags and sub-selects introduced with Sphinx 2.1. A list of all changes is available in the version history.


Indexing Office and PDF Files With Sphinx and .NET

by Dennis 6. February 2013 13:19

Sphinx is a great full-text search engine with many amazing features, but there is one feature missing that would make it even better: the ability to directly index Word, Excel, PowerPoint, and PDF files. How to index these kinds of documents with Sphinx is a question that comes up often in the Sphinx forum. Today I’d like to show you an easy way to extract text from these document types and store them in a Sphinx real-time index from your .NET application.

There are a bunch of tools and libraries out there that claim to be able to extract text from various document formats. As it is a pretty hard task to support many formats and extract text reliably, the quality of each library varies greatly. One tool that stands out is the Apache Tika™ toolkit. It is a Java library that

“detects and extracts metadata and structured text content from various documents using existing parser libraries.”

And it is really very good at it. Amongst others, it supports Microsoft Office, Open Document (ODF), and PDF files. But wait, I said Java library, didn’t I? “What am I supposed to do with a Java library in my .NET application?”, you might ask. Well, we’ll just convert it from Java to .NET using IKVM.NET. IKVM.NET is a .NET implementation of a Java Virtual Machine (JVM) which can be used as a drop-in replacement for Java. And it comes with a Java bytecode to CIL translator named ikvmc that we can use to build a .NET version of Apache Tika. In the next section, I’ll walk through the steps required to do this. At the end of this article you can download a complete sample application that uses Tika to extract text from some files and stores them in a Sphinx real-time index via SphinxConnector.NET.

Creating a .NET Version of Apache Tika


To create your own .NET version of Apache Tika you need to:

  1. Download IKVM.NET
  2. Download the .jar file from the Tika project page (the current version at the time of writing is 1.3)
  3. Extract IKVM.NET to a folder of your choice
  4. Optional: Add the bin folder of IKVM.NET to your %PATH% variable
  5. Execute the following command (add/change the paths to ikvmc.exe and tika-app-1.3.jar if needed):
ikvmc.exe -target:library -version:1.3.0.0 -out:Apache.Tika.dll tika-app-1.3.jar

Let’s walk through the command line parameters: With -target:library we tell ikvmc to convert the jar to a class library. This is needed because the jar file is also usable as a standalone console/gui application, i.e. contains a main() method, which by default would cause ikvmc to generate an exe file. Next, we specify the version for our output DLL because otherwise ikvmc would set the version to 0.0.0.0. Finally we specify the output file name via -out: and the path to the Tika jar file.

After hitting Enter, ikvmc starts to translate the Java library to .NET. It’ll output a lot of warning messages, but will eventually finish and produce a working DLL. Note that if you want to sign the output assembly you can do so by specifying a key file via the -keyfile: command line option.

Extracting Text from Documents


Now that we've created a .NET library of Tika, we can start extracting text from documents. I’ve created a small wrapper that provides methods to perform the extraction. To build the wrapper DLL we need to add references to a couple of IKVM.NET libraries:

[Screenshot: IKVM.NET assemblies referenced by the wrapper project]

Note that you need to reference more of IKVM.NET’s DLLs in an application that uses Tika; these are just the files required to compile the wrapper project.

The AutoTextExtractor class, which handles the extraction of text from files and binary data (useful if your documents are stored in a database), and the TextExtractionResult class are based on code by Kevin Miller:

public class AutoTextExtractor
{
  public TextExtractionResult Extract(string filePath, OutputType outputType = OutputType.Text)
  {
      return Extract(System.IO.File.ReadAllBytes(filePath), outputType);
  }

  public TextExtractionResult Extract(byte[] data, OutputType outputType = OutputType.Text)
  {
      var parser = new AutoDetectParser();
      var metadata = new Metadata();

      using (Writer outputWriter = new StringWriter())
      using (InputStream inputStream = TikaInputStream.get(data, metadata))
      {
          parser.parse(inputStream, GetTransformerHandler(outputWriter, outputType), 
              metadata, new ParseContext());
          
          return AssembleExtractionResult(outputWriter.toString(), metadata);
      }
  }

  private static TextExtractionResult AssembleExtractionResult(string text, Metadata metadata)
  {
      Dictionary<string, string> metaDataResult = metadata.names().
          ToDictionary(name => name, name => String.Join(", ", metadata.getValues(name)));

      string contentType = metaDataResult["Content-Type"];

      return new TextExtractionResult
      {
          Text = text,
          ContentType = contentType,
          Metadata = metaDataResult
      };
  }

  private TransformerHandler GetTransformerHandler(Writer outputWriter, OutputType outputType)
  {
      var factory = (SAXTransformerFactory)TransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();
      handler.getTransformer().setOutputProperty(OutputKeys.METHOD, outputType.ToString());
      handler.setResult(new StreamResult(outputWriter));

      return handler;
  }
}

Here’s the TextExtractionResult class:

public class TextExtractionResult
{
    public string Text { get; set; }
    
    public string ContentType { get; set; }
    
    public IDictionary<string, string> Metadata { get; set; }
}

And the OutputType enumeration:

public enum OutputType
{
    Text,
    Html,
    Xml
}
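To give an idea of how the pieces fit together, here is a short usage sketch of the wrapper classes above (the file path is made up for the example, and the content type shown in the comment is what Tika would typically report for a PDF):

```csharp
var extractor = new AutoTextExtractor();

// Tika auto-detects the format, so the same call works for Office and PDF files
TextExtractionResult result = extractor.Extract(@"C:\docs\report.pdf");

Console.WriteLine(result.ContentType); // e.g. "application/pdf"
Console.WriteLine(result.Text);        // the extracted plain text

// Metadata contains entries such as author or creation date, where available
foreach (var entry in result.Metadata)
    Console.WriteLine("{0} = {1}", entry.Key, entry.Value);
```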

Demo Application


I’ve created a small demo application that contains a DLL that wraps Tika with the help of the aforementioned classes, and a console application that demonstrates how to extract and store the contents of some files in a Sphinx real-time index with SphinxConnector.NET. The code that does the extraction is pretty simple:

private static SphinxDocumentModel[] GetDocuments()
{
    AutoTextExtractor textExtractor = new AutoTextExtractor();

    int id = 1;

    return (from filePath in Directory.EnumerateFiles(@"..\..\..\..\testfiles")
            select new SphinxDocumentModel
            {
                Id = id++,
                FileContent = textExtractor.Extract(filePath).Text,
                FilePath = Path.GetFullPath(filePath)
            }).ToArray();
}

I’ll omit the code that saves the documents to the index, as it is straightforward. Tip: if you are working with big files, you might have to increase Sphinx’s max_packet_size setting.
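For reference, max_packet_size lives in the searchd section of sphinx.conf; a minimal sketch (32M is an arbitrary example value, the Sphinx default is 8M):

```
searchd
{
    max_packet_size = 32M
}
```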

The archive contains all required libraries so it’s pretty big (IKVM.NET and Tika alone take up about 50 MB in space).

Downloads:

[1] Zip-Package (22 MB)

[2] 7z-Package (18 MB)
