Automatic Generation of Document Ids With the Fluent API

by Dennis 15. August 2013 09:48

Sometimes you’ll want to insert documents into a Sphinx real-time index that don’t come with an id assigned, for instance because they don’t originate from a database. As a matter of fact, the search functionality for this website is implemented without first going through a database: the content is both scraped by a crawler and read from the corresponding files, and then stored directly in a Sphinx real-time index.

So, wouldn’t it be nice if SphinxConnector.NET could automatically assign an id to these documents before saving, so you don’t have to do it yourself? A user suggested exactly that, and we have implemented it for the next release (version 3.7).

Two new members have been added to IFulltextStore.Conventions:

public Func<object, bool> DocumentNeedsIdAssigned { get; set; }

public Func<object, object> DocumentIdGenerator { get; set; }

The first, DocumentNeedsIdAssigned, is responsible for telling SphinxConnector.NET whether a document that is about to be saved needs an id assigned. It comes with a default that returns false if the document in question already has an id with a value greater than 0, and true otherwise.
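To make the rule concrete, here is a minimal sketch of a predicate that follows the behavior described above. It is not the library’s built-in default: the Webpage class and the IdConventions holder are hypothetical names made up for this illustration, and we simply assume the id is a long property called Id.

```csharp
using System;

// Hypothetical document type for this sketch; only the Id property matters.
public class Webpage
{
    public long Id { get; set; }
    public string Title { get; set; }
}

// Hypothetical holder class, not part of SphinxConnector.NET.
public static class IdConventions
{
    // An id greater than 0 counts as "already assigned";
    // 0 (the default for a new object) or a negative value means an id is needed.
    public static readonly Func<object, bool> NeedsId = document =>
    {
        var webpage = document as Webpage;
        if (webpage == null)
        {
            throw new ArgumentException("Unsupported document type.", "document");
        }

        return webpage.Id <= 0;
    };
}
```

A delegate of this shape matches the Func<object, bool> signature of DocumentNeedsIdAssigned, so it could be assigned to that convention if the default does not fit your documents.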

The second, DocumentIdGenerator, is responsible for generating the id. It receives the document that needs to have an id generated (in case the generator has to inspect it before creating an id) and should return the generated id. There is no default generator, so one needs to be provided by the developer for this functionality to work. Let’s have a look at some id generation algorithms you could use:

Id Generation Algorithms

For an algorithm to be suitable for use with Sphinx, it has to generate a unique, positive 32-bit or 64-bit integer for each document. There are a couple of id generation algorithms that fulfill this requirement: for example, you could use a simple sequential generation algorithm, or the hi/lo algorithm, which some of you might know from NHibernate. Both need some kind of persistent storage that provides transactional semantics to avoid generating duplicate ids. If you are only performing a one-time import, though, a simple sequential id generator that increments id values in memory is sufficient.
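The two approaches mentioned above can be sketched roughly as follows. Both classes are hypothetical and written only for this post. The hi/lo variant takes a delegate that hands out the next "hi" value; in a real setup that delegate would have to be backed by transactional persistent storage, which this sketch leaves out.

```csharp
using System;
using System.Threading;

// In-memory sequential generator: only suitable for a single process,
// e.g. a one-time import, because nothing is persisted.
public sealed class SequentialIdGenerator
{
    private long _lastId;

    public SequentialIdGenerator(long lastUsedId = 0)
    {
        if (lastUsedId < 0) throw new ArgumentOutOfRangeException("lastUsedId");
        _lastId = lastUsedId;
    }

    // Interlocked.Increment makes concurrent calls safe.
    public long NextId()
    {
        return Interlocked.Increment(ref _lastId);
    }
}

// Hi/lo generator: fetches a "hi" value once per block and then hands out
// blockSize ids from memory before it needs the next "hi" value.
public sealed class HiLoIdGenerator
{
    private readonly Func<long> _nextHi; // assumption: backed by transactional storage
    private readonly int _blockSize;
    private readonly object _lock = new object();
    private long _hi;
    private int _lo;

    public HiLoIdGenerator(Func<long> nextHi, int blockSize)
    {
        if (nextHi == null) throw new ArgumentNullException("nextHi");
        if (blockSize <= 0) throw new ArgumentOutOfRangeException("blockSize");
        _nextHi = nextHi;
        _blockSize = blockSize;
        _lo = blockSize; // forces a "hi" fetch on the first call
    }

    public long NextId()
    {
        lock (_lock)
        {
            if (_lo >= _blockSize)
            {
                _hi = _nextHi();
                _lo = 0;
            }

            // The +1 keeps ids positive even when the first "hi" value is 0.
            return _hi * _blockSize + _lo++ + 1;
        }
    }
}
```

The trade-off is the usual one: the sequential generator is trivial but loses its state on restart, while hi/lo only touches the storage once per block at the cost of gaps in the id sequence when a process stops mid-block.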

Alternatively, you could use an id generator such as Snowflake (created by Twitter) or Flake. Snowflake has a .NET port, and RustFlakes is a derivative of Flake for .NET that is also available via NuGet. Both can generate unique, positive 64-bit integers without the need for persistent storage.
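To give an idea of why no persistent storage is needed, here is a simplified sketch of the Snowflake bit layout: a millisecond timestamp in the upper bits, followed by a 10-bit worker id and a 12-bit per-millisecond sequence. This is our own illustration, not the API of the Snowflake .NET port or of RustFlakes; the class name, the custom epoch, and the injectable clock are all assumptions made for the example.

```csharp
using System;

// Hypothetical, simplified Snowflake-style generator (not a real library's API).
public sealed class SnowflakeStyleIdGenerator
{
    private const int WorkerIdBits = 10;
    private const int SequenceBits = 12;
    private const long MaxWorkerId = (1L << WorkerIdBits) - 1;
    private const long SequenceMask = (1L << SequenceBits) - 1;

    // Arbitrary custom epoch chosen for this sketch.
    private static readonly DateTime Epoch =
        new DateTime(2013, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    private readonly long _workerId;
    private readonly Func<long> _currentMillis;
    private readonly object _lock = new object();
    private long _lastTimestamp = -1;
    private long _sequence;

    public SnowflakeStyleIdGenerator(long workerId)
        : this(workerId, SystemMillis)
    {
    }

    // The clock is injectable so the generator can be exercised deterministically.
    public SnowflakeStyleIdGenerator(long workerId, Func<long> currentMillis)
    {
        if (workerId < 0 || workerId > MaxWorkerId)
            throw new ArgumentOutOfRangeException("workerId");
        if (currentMillis == null) throw new ArgumentNullException("currentMillis");
        _workerId = workerId;
        _currentMillis = currentMillis;
    }

    private static long SystemMillis()
    {
        return (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;
    }

    public long NextId()
    {
        lock (_lock)
        {
            long timestamp = _currentMillis();
            if (timestamp < _lastTimestamp)
                throw new InvalidOperationException("Clock moved backwards.");

            if (timestamp == _lastTimestamp)
            {
                _sequence = (_sequence + 1) & SequenceMask;
                if (_sequence == 0)
                {
                    // Sequence exhausted for this millisecond: spin until the next one.
                    while ((timestamp = _currentMillis()) <= _lastTimestamp) { }
                }
            }
            else
            {
                _sequence = 0;
            }

            _lastTimestamp = timestamp;
            return (timestamp << (WorkerIdBits + SequenceBits))
                 | (_workerId << SequenceBits)
                 | _sequence;
        }
    }
}
```

Uniqueness across machines rests entirely on each process being given a distinct worker id, which is the one piece of configuration this style of generator needs. For production use, one of the existing libraries mentioned above is the safer choice.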

Example

As stated above, this site’s search functionality is based on Sphinx with the documents being stored in a real-time index. Indexing is done by crawling the site and directly storing the documents in the index, which requires us to generate an id.

Before, we generated and assigned the id manually prior to each call to Save. Now, we assign the id generation method to IFulltextStore.Conventions.DocumentIdGenerator and SphinxConnector.NET will take care of both invoking this method and assigning the id on each call to Save:

var idGenerator = new IdGenerator();
var fulltextStore = new FulltextStore().Initialize();
fulltextStore.Conventions.DocumentIdGenerator = _ => idGenerator.NextId();

using (var session = fulltextStore.StartSession())
{
    var webpage = new Webpage //<-- Our document, note that we don’t assign an id
    {
        Title = "SphinxConnector.NET",
        Url = "http://www.sphinxconnector.net",
        Content = "Sphinx .NET API"
    };

    session.Save(webpage); //<-- Id will be generated and assigned to document here
    session.FlushChanges();
}
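If you are wondering how the two conventions play together on Save, the following standalone sketch may help. It is purely conceptual and not SphinxConnector.NET’s actual implementation: the SaveSketch helper and the Webpage class with a long Id property are hypothetical, and the real library will differ in its details.

```csharp
using System;

// Hypothetical document type for this sketch.
public class Webpage
{
    public long Id { get; set; }
    public string Title { get; set; }
}

// Hypothetical helper, not part of SphinxConnector.NET.
public static class SaveSketch
{
    public static void AssignIdIfNeeded(
        Webpage document,
        Func<object, bool> needsIdAssigned,
        Func<object, object> idGenerator)
    {
        // Step 1: ask the predicate whether the document still needs an id.
        if (!needsIdAssigned(document))
        {
            return;
        }

        // Step 2: there is no default generator, so a missing one is an error here.
        if (idGenerator == null)
        {
            throw new InvalidOperationException("No id generator configured.");
        }

        // Step 3: generate the id and store it on the document.
        document.Id = Convert.ToInt64(idGenerator(document));
    }
}
```

A document that already carries an id passes through untouched, which is why mixing documents with and without ids in the same session is not a problem in this model.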

Conclusion

This new functionality allows you to delegate the task of generating an id and assigning it to a document to SphinxConnector.NET. By using existing id generators such as Snowflake or RustFlakes, you can create your setup within a couple of minutes and just a few lines of code.
