Hacking around with search and strongly typed models.

Hi all, season's greetings! 

Welcome to another edition of 24 days in Umbraco; I'm here today to talk about hacking with search and strongly typed models. I've written about the benefits of using them with Umbraco on Skrift.

As some of you may know, I'm a strong proponent of using strongly typed models within Umbraco. I'm part of the core team working on the Ditto project and have written about using it in the past.

One of the many things I've always wanted to integrate, in a nice way, using strongly typed models is search. Umbraco has excellent search capabilities utilizing Examine, which wraps around Lucene, but every example I've seen requires extensive configuration which, to my mind, leads to a disconnect from the strongly typed methodology.

What follows is an approach I have been using recently which I feel works really quite well to bridge that disconnect. This is based on a snippet that Hendy Racher showed me last year that I fixed and expanded upon.

I'll take you through some code that allows for multi-word, wildcard-enabled search that can be wired up with some simple attributes. I've also published an example project on GitHub containing functional demonstration code.

By the end of this article you will be able to simply do the following to return the results of your query.

IEnumerable<SearchMatch> result = SearchEngine.SearchSite(query);

Warning: Some of this gets quite technical and there will be lots of code snippets so lay off the Eggnog until I'm finished. If you're not as nerdy as I am it might get a bit boring. The results are worth the read though.

Mapping Basic Properties 

So let's get started with an example model representing my home page.

/// <summary>
/// The home document type.
/// </summary>
[SearchCategory(new[] { "Content" })]
public class Home : PageBase
{
    /// <summary>
    /// Gets or sets the image.
    /// </summary>
    public virtual Image Image { get; set; }

    /// <summary>
    /// Gets or sets the body text.
    /// </summary>
    [SearchMergedField]
    public virtual HtmlString BodyText { get; set; }
}

There are two attributes here that you will have noticed: SearchCategory and SearchMergedField. The former takes an array of category names and the latter instructs the search engine that I want to index this property for full text searching. There's nothing fancy about them so I won't display them here; you'll be able to see them on GitHub.
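
For orientation only, here is a rough sketch of what the two attributes might look like. The shapes are inferred purely from how they are used in this article (a Categories collection on one, an optional ExamineKey on the other), so treat every detail as an assumption; the real versions in the demo project may well differ.

```csharp
using System;

/// <summary>
/// Sketch only: tags a document type with one or more search categories.
/// </summary>
[AttributeUsage(AttributeTargets.Class, AllowMultiple = false, Inherited = true)]
public sealed class SearchCategoryAttribute : Attribute
{
    public SearchCategoryAttribute(string[] categories)
    {
        // Guard against null so callers can always enumerate safely.
        this.Categories = categories ?? new string[0];
    }

    /// <summary>Gets the category names to write to the category field.</summary>
    public string[] Categories { get; private set; }
}

/// <summary>
/// Sketch only: marks a property whose value should be merged into the full text field.
/// </summary>
[AttributeUsage(AttributeTargets.Property, AllowMultiple = false, Inherited = true)]
public sealed class SearchMergedFieldAttribute : Attribute
{
    public SearchMergedFieldAttribute()
    {
    }

    public SearchMergedFieldAttribute(string examineKey)
    {
        this.ExamineKey = examineKey;
    }

    /// <summary>
    /// Gets the Examine field key to read. When null, the property name is used instead.
    /// </summary>
    public string ExamineKey { get; private set; }
}
```

The optional key is there for cases where the property name on the model does not match the alias stored in the index.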

Collecting those properties is done in almost the same manner as you would gather node data: by tapping into Examine's GatheringNodeData event. There are a few differences though, as you will see.

/// <summary>
/// Gathers the information from each node to add to the Examine index.
/// </summary>
/// <param name="sender">The sender.</param>
/// <param name="e">The event arguments containing information about the nodes to be gathered.</param>
/// <param name="helper">The <see cref="UmbracoHelper"/> to help gather node data.</param>
// ReSharper disable once UnusedParameter.Local
private void GatheringNodeData(object sender, IndexingNodeDataEventArgs e, UmbracoHelper helper)
{
    StringBuilder mergedDataStringBuilder = new StringBuilder();
    StringBuilder categoryStringBuilder = new StringBuilder();

    // Convert the property and use reflection to grab the output property value adding it to the merged property collection.
    IPublishedContent content = null;

    switch (e.IndexType)
    {
        case "content":
            content = helper.TypedContent(e.NodeId);
            break;
        case "media":
            content = helper.TypedMedia(e.NodeId);
            break;
    }

    if (content == null)
    {
        return;
    }

    Type doctype = ContentHelper.Instance.GetRegisteredType(content.DocumentTypeAlias);

    List<string> mergedProperties = new List<string>();

    if (doctype != null)
    {
        // Match the Ditto properties filters.
        PropertyInfo[] properties =
            doctype.GetProperties(BindingFlags.Public | BindingFlags.Instance)
                    .Where(x => x.CanWrite)
                    .ToArray();

        // ReSharper disable once LoopCanBeConvertedToQuery
        foreach (PropertyInfo property in properties)
        {
            SearchMergedFieldAttribute attr = property.GetCustomAttribute<SearchMergedFieldAttribute>(true);

            if (attr == null)
            {
                continue;
            }

            mergedProperties.Add(!string.IsNullOrWhiteSpace(attr.ExamineKey) ? attr.ExamineKey : property.Name);

            // Look for any custom search resolvers to convert the information to a useful search result.
            SearchResolverAttribute resolverAttribute = property.GetCustomAttribute<SearchResolverAttribute>(true);

            // Combine property values.
            foreach (KeyValuePair<string, string> field in e.Fields.Distinct())
            {
                if (mergedProperties.Distinct().InvariantContains(field.Key))
                {
                    if (resolverAttribute != null)
                    {
                        SearchValueResolver resolver = (SearchValueResolver)Activator.CreateInstance(resolverAttribute.ResolverType);
                        mergedDataStringBuilder.AppendFormat(" {0}", helper.StripHtml(resolver.ResolveValue(resolverAttribute, content, property, field.Value, Thread.CurrentThread.CurrentUICulture)));
                    }
                    else
                    {
                        mergedDataStringBuilder.AppendFormat(" {0}", helper.StripHtml(field.Value));
                    }

                    mergedProperties.Remove(!string.IsNullOrWhiteSpace(attr.ExamineKey) ? attr.ExamineKey : property.Name);
                }
            }
        }

        // Combine categories.
        SearchCategoryAttribute categoryAttribute = doctype.GetCustomAttribute<SearchCategoryAttribute>();

        if (categoryAttribute != null)
        {
            if (categoryAttribute.Categories.Any())
            {
                foreach (string category in categoryAttribute.Categories)
                {
                    categoryStringBuilder.AppendFormat("{0} ", category);
                }
            }
        }
    }

    e.Fields[SearchConstants.CategoryField] = categoryStringBuilder.ToString().Trim();
    e.Fields[SearchConstants.MergedDataField] = mergedDataStringBuilder.ToString().Trim();
}

That's a lot of code for an article! So... what's going on here?  

For each node that is passed to the handler I grab, from a cache, the type that I have created to represent it. I then loop through the type's properties and check for the merged field attribute mentioned above. If one is present I grab the property value, pass it through a resolver if the property has one, strip any HTML and append it to the merged data field. Finally, if there is a category attribute present on the type, I add its categories to the category field.

That works great for simple properties like strings but there are more complicated properties out there that you might want to make searchable, like JSON. Here's where we get a bit fancy.

Enter SearchResolverAttribute

You'll have noticed in the code there is an attribute referenced called SearchResolverAttribute. This tells the event handler that we need to do something a little bit different in order to get something suitable for indexing. This is a little idea I stole, er, borrowed from the Ditto project which ends up being a powerful concept. Here's a simple example of one for parsing the filename of an image.

/// <summary>
/// The image filename search resolver. Used to resolve a value suitable for indexing with Examine.
/// </summary>
public class ImageFileSearchValueResolver : SearchValueResolver<SearchResolverAttribute>
{
    /// <summary>
    /// Performs the value resolution.
    /// </summary>
    /// <returns>
    /// The <see cref="string"/> representing the converted value.
    /// </returns>
    public override string ResolveValue()
    {
        string umbracoFile = Constants.Conventions.Media.File;
        return this.Content.GetPropertyValue<ImageCropDataSet>(umbracoFile).Src;
    }
}
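
You may have spotted that the event handler calls ResolveValue with five arguments while the resolver overrides a parameterless version. The base classes aren't shown in this article, so the following is only a guess at how they could bridge the two: the long overload stores the context and then defers to the short one. Umbraco's IPublishedContent is replaced with plain object so the sketch compiles on its own, and UpperCaseSearchValueResolver is a made-up resolver used purely for demonstration.

```csharp
using System;
using System.Globalization;
using System.Reflection;

/// <summary>Sketch only: points a property at the resolver that produces its indexable value.</summary>
[AttributeUsage(AttributeTargets.Property)]
public class SearchResolverAttribute : Attribute
{
    public SearchResolverAttribute(Type resolverType)
    {
        this.ResolverType = resolverType;
    }

    public Type ResolverType { get; private set; }
}

public abstract class SearchValueResolver
{
    // 'object' stands in for IPublishedContent in this sketch.
    public object Content { get; private set; }

    public PropertyInfo Property { get; private set; }

    public string RawValue { get; private set; }

    public CultureInfo Culture { get; private set; }

    // Called by the event handler: store the context, then defer to the override.
    public virtual string ResolveValue(
        SearchResolverAttribute attribute,
        object content,
        PropertyInfo property,
        string rawValue,
        CultureInfo culture)
    {
        this.Content = content;
        this.Property = property;
        this.RawValue = rawValue;
        this.Culture = culture;
        return this.ResolveValue();
    }

    // Implemented by each concrete resolver.
    public abstract string ResolveValue();
}

public abstract class SearchValueResolver<TAttribute> : SearchValueResolver
    where TAttribute : SearchResolverAttribute
{
    // Gives resolvers typed access to their own attribute.
    public TAttribute Attribute { get; private set; }

    public override string ResolveValue(
        SearchResolverAttribute attribute,
        object content,
        PropertyInfo property,
        string rawValue,
        CultureInfo culture)
    {
        this.Attribute = (TAttribute)attribute;
        return base.ResolveValue(attribute, content, property, rawValue, culture);
    }
}

// Hypothetical resolver, for demonstration only.
public class UpperCaseSearchValueResolver : SearchValueResolver<SearchResolverAttribute>
{
    public override string ResolveValue()
    {
        return (this.RawValue ?? string.Empty).ToUpperInvariant();
    }
}
```

With something like this in place, a concrete resolver only ever has to implement the parameterless method and read whatever it needs from its own properties.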

This approach can be applied to more complicated examples. Here's one using the excellent multilingual package Vorto.  

/// <summary>
/// The Vorto search resolver. Used to resolve a value suitable for indexing with Examine.
/// </summary>
public class VortoSearchValueResolver : SearchValueResolver<SearchResolverAttribute>
{
    /// <summary>
    /// Performs the value resolution.
    /// </summary>
    /// <returns>
    /// The <see cref="string"/> representing the converted value.
    /// </returns>
    public override string ResolveValue()
    {
        IEnumerable<Language> languages = LocalizationHelper.GetInstalledLanguages();
        StringBuilder stringBuilder = new StringBuilder();
        VortoValue vortoValue = JsonConvert.DeserializeObject<VortoValue>(this.RawValue);
        string name = this.Property.Name;

        foreach (Language language in languages)
        {
            string iso = language.IsoCode;
            if (this.Content.HasVortoValue(name, iso))
            {
                object value;

                // Umbraco method Parse internal links fails since we are operating on a background thread.
                try
                {
                    value = this.Content.GetVortoValue(name, iso);
                }
                catch
                {
                    value = vortoValue.Values[iso];
                }

                stringBuilder.Append(string.Format(SearchConstants.CultureTemplate, iso, value));
            }
        }

        return stringBuilder.ToString();
    }
}

Grabbing the value to display is fairly simple due to the extension methods available with Vorto, but here you can see that we have to pull another trick out of the hat to deal with the different cultures stored within Vorto's JSON object. We need to think outside the box.

Warning: This is where the hacking part of the article and code appears.  

Traditional approaches to multilingual sites within Umbraco involve using multiple copies of the same site with a different Examine index for each site to avoid search pollution. Since we are operating with a single index we need a way to store the values so they can be retrieved individually. We're going to use regular expressions to help us out.

Regular expressions

Regular expressions, for the uninitiated, are special text strings for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in Windows Explorer. Used correctly they can be a powerful tool in a developer's arsenal. Used incorrectly, however, you can cause all sorts of problems such as creating a hole in the space-time continuum t0 invOke the hiv3-mind repr3s3nting chao$. With ou7 Orrderrr. The NezPerdi4n h!ve-mind of ch@os. H£ C0M£Z. ZALGO!

Ahem...

The regular expression that will do most of the work is this doozy.

"\u0000SearchDemoCulture:[^\u0000]+:(?<replacement>[^\u0000]+)\u0000";

I'm being a bit of a smart-arse here using the UTF-16 null character to wrap the expression but I wanted something that people wouldn't type.

Basically, this allows me to match my culture-specific values, which the Vorto resolver formats to a specific layout, and, with some sleight of hand, replace the full value in the Lucene field result with my content. You'll see how I do it below.
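
To make the round trip concrete, here is a small standalone sketch. The template is an assumption: SearchConstants.CultureTemplate isn't shown in this article, so the format below is simply the one the pattern above would match. CultureMarkerDemo, Wrap and Unwrap are names invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CultureMarkerDemo
{
    // Assumed to mirror SearchConstants.CultureTemplate: {0} is the ISO code, {1} the value.
    public const string CultureTemplate = "\u0000SearchDemoCulture:{0}:{1}\u0000";

    // The pattern from the article. The named group captures only the value.
    public static readonly Regex AllCultureRegex = new Regex(
        "\u0000SearchDemoCulture:[^\u0000]+:(?<replacement>[^\u0000]+)\u0000",
        RegexOptions.IgnoreCase);

    // What a resolver would append to the merged field for one culture.
    public static string Wrap(string isoCode, string value)
    {
        return string.Format(CultureTemplate, isoCode, value);
    }

    // Swaps every wrapped block for the bare value it contains.
    public static string Unwrap(string fieldValue)
    {
        return AllCultureRegex
            .Replace(fieldValue, m => m.Groups["replacement"].Value + " ")
            .Trim();
    }
}
```

Calling Unwrap on Wrap("en-GB", "Hello") + Wrap("da-DK", "Hej") gives back "Hello Hej". One thing to watch: because the culture part of the pattern is greedy, a value that itself contains a colon will be cut at its last colon.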

The Anatomy of a Search Result

Our search is made up of five classes. These classes are fairly simple to follow but combined provide us with some powerful tools. 

  1. SearchConstants
  2. SearchMatch
  3. SearchRequest
  4. SearchResponse
  5. SearchEngine

Most of the work happens in SearchRequest so I'll only highlight code from there. 

When a query comes in we want to ensure that it is set up as a multi-word GroupedOr IBooleanOperation. We do that as follows:

IBooleanOperation searchCriteria = searchProvider.CreateSearchCriteria().OrderBy(string.Empty);

if (!string.IsNullOrWhiteSpace(this.Query))
{
    searchCriteria = searchProvider
        .CreateSearchCriteria()
        .GroupedOr(SearchConstants.MergedDataField.AsEnumerableOfOne(),
        this.Query.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries).Select(w => w.Trim().MultipleCharacterWildcard())
        .ToArray());
}

That code allows us to search for any word that starts with any of the individual words that make up our query.
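
If it helps to picture what that produces, here is a plain C# sketch with no Examine dependency. It assumes, as I understand the extension, that MultipleCharacterWildcard() turns a word into a prefix term by appending *. The field name "mergedData" is a placeholder (the real one lives in SearchConstants.MergedDataField), and the raw query string is only an approximation of what ends up being sent to Lucene.

```csharp
using System;
using System.Linq;

public static class WildcardQueryDemo
{
    // Placeholder field name, not the value used by the demo project.
    private const string MergedDataField = "mergedData";

    // Splits the query on spaces and turns each word into a prefix term.
    public static string[] ToPrefixTerms(string query)
    {
        return (query ?? string.Empty)
            .Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim() + "*")
            .ToArray();
    }

    // Approximates the OR-grouped query as a raw Lucene query string.
    public static string ToRawLuceneQuery(string query)
    {
        return string.Join(" ", ToPrefixTerms(query).Select(t => MergedDataField + ":" + t));
    }
}
```

So a query of "umb sea" becomes the terms umb* and sea*, and a node matches if any word in its merged field starts with either of them.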

We run the search as standard, and then it all gets a little crazy.

In our results we could potentially have matches that don't match our current culture. We don't want to pollute our search, so we go through the following process:

  1. Loop through the results and check for any values within them that match our culture pattern but belong to a different culture.
  2. If we find a match we replace that value with an empty string in our field result.
  3. We then check whether there are any matches left over in our field result by splitting up the query and using a wildcard regular expression to do a quick search. (told you it gets hacky!)
  4. If there are any results left over then we pass the result to a highlighter to format the result for the user.

Wanna see some more code? Of course you do!

Analyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
Formatter formatter = new SimpleHTMLFormatter("<strong>", "</strong>");

foreach (SearchResult searchResult in searchResults.OrderByDescending(x => x.Score))
{
    // Check to see if the result is culture specific.
    // This is a bit hacky but there is no way with property wrappers like Vorto to separate the results into 
    // different indexes so we have to fall back to regular expressions.
    string fieldResult = searchResult.Fields[SearchConstants.MergedDataField];
    RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Multiline;

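    // Escape each term so characters such as '+' or '(' in the query cannot break the patterns built from it.
    string opts = $"({string.Join("|", this.Query.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries).Select(t => Regex.Escape(t)))})";

    // First check to see if there are any matches for any installed languages and remove any
    // that are not in our culture collection.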
    // ReSharper disable once LoopCanBeConvertedToQuery
    foreach (Language language in this.languages)
    {
        if (!this.Cultures.Contains(language.CultureInfo))
        {
            fieldResult = Regex.Replace(
                fieldResult,
                string.Format(SearchConstants.CultureRegexTemplate, language.IsoCode, opts),
                string.Empty,
                options);
        }
    }

    // Now clean up the languages we do have a result for.
    MatchCollection matches = AllCultureRegex.Matches(fieldResult);

    foreach (Match match in matches)
    {
        if (match.Success)
        {
            string replacement = match.Groups["replacement"].Value;

            fieldResult = Regex.Replace(
            fieldResult,
            Regex.Escape(match.Value),
            replacement + " ",
            options);
        }
    }

    // Now check to see if we have any match left over. If not, skip this result.
    if (!new Regex(string.Format(SearchConstants.QueryRegexTemplate, opts), options).Match(fieldResult).Success)
    {
        continue;
    }

    this.AddSearchMatch(analyzer, formatter, searchResults, searchResponse, searchResult, fieldResult);
}
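
The loop above leans on two templates from SearchConstants that aren't shown in this article. The standalone sketch below fills them in with guesses so the whole filter can be run outside Umbraco; CultureRegexTemplate and QueryRegexTemplate here are assumptions, the languages are reduced to plain ISO code strings, and CultureFilterDemo is a name invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CultureFilterDemo
{
    private const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Multiline;

    // Assumed shapes; the real templates live in SearchConstants and may differ.
    private const string CultureRegexTemplate = "\u0000SearchDemoCulture:{0}:[^\u0000]+\u0000";
    private const string QueryRegexTemplate = @"\b{0}\w*";

    private static readonly Regex AllCultureRegex = new Regex(
        "\u0000SearchDemoCulture:[^\u0000]+:(?<replacement>[^\u0000]+)\u0000",
        Options);

    // Returns the cleaned field text, or null when no query term matches any more.
    public static string Filter(
        string fieldResult,
        string query,
        IEnumerable<string> installedIsoCodes,
        ICollection<string> wantedIsoCodes)
    {
        // Build "(term1|term2)", escaping each term so user input cannot alter the pattern.
        string[] terms = query.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
        string opts = "(" + string.Join("|", terms.Select(t => Regex.Escape(t))) + ")";

        // Blank out values that belong to cultures we are not searching.
        foreach (string iso in installedIsoCodes.Where(i => !wantedIsoCodes.Contains(i)))
        {
            string pattern = string.Format(CultureRegexTemplate, Regex.Escape(iso));
            fieldResult = Regex.Replace(fieldResult, pattern, string.Empty, Options);
        }

        // Unwrap the values we do want, leaving just their text.
        fieldResult = AllCultureRegex.Replace(fieldResult, m => m.Groups["replacement"].Value + " ");

        // Keep the result only if a word still starts with one of the query terms.
        bool stillMatches = Regex.IsMatch(fieldResult, string.Format(QueryRegexTemplate, opts), Options);
        return stillMatches ? fieldResult.Trim() : null;
    }
}
```

Given a field holding an English and a Danish value, searching with only en-GB wanted keeps the node when the term appears in the English text and drops it when the term appears only in the Danish text.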

Follow all that? Mental!...

After that, the highlighter does its magic and we have a nice pageable collection of search matches that can be broken down into their individual categories for display. 

Ain't that great! You've now totally nailed search and can stick it in your box of magic tricks to impress your boss and clients. 

So I hope some of this is useful to other strongly typed Umbracians out there. I dunno if it's best practice (I'm no Examine guru) but I like it and it's made my day-to-day work a lot easier.

Let me know what you think about it all in the comments below. A working code example is hosted on GitHub for you to play with and incorporate in your own work if you like.

Now go get that Eggnog inside you. You've earned it!

James Jackson-South

James is on Twitter as