27 November 2013

Search with Sitecore - Article - 5 - Auto Complete For Search

Dear all friends, this is the fifth and final article in this series. Following is the list of articles for reference:

1) Search with Sitecore - Article-1 - Introduction 
2) Search with Sitecore - Article - 2 - Configure Advance Database Crawler and More
3) Search with Sitecore - Article - 3 - Crawler to Crawl Media File like PDF, Document  
4) Search with Sitecore - Article - 4 - Create Search API
5) Search with Sitecore - Article - 5 - Auto Complete For Search

Every implementer of search functionality dreams of providing a true auto-complete feature for their search text box, similar to Google search. The basic class and the inspiration for it came from the following Stack Overflow answer:
http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene

So let's analyze our problem: "we need to provide legitimate search keywords while the user types in our search text box, and it should be fast enough that the end user gets a true type-ahead experience." There are three pinpointed issues in the above problem statement:

a) We need to find the source for the repository.
b) We need to build the repository once we have found that source, and it should be easily accessible and organised.
c) We need to query that repository and show results in the text box as the user types.

Let's look closer at the problem statement. I said legitimate because the keywords should come from the content we already have in our Sitecore content repository. That content could come from routine content like blogs, news, content pages, media file names, metadata, item names: anything within the Sitecore content universe. Of course it should also be valid content, meaning it should not be content that is restricted by the business.
So my first problem is where to find such content in an easily accessible form, rather than going through the whole Sitecore content tree.

Answer: We already have such content indexed in our existing indexes like 'SearchItems', 'Documents' and 'Products' (please read the previous articles). They are the perfect places to look, as they have already indexed all the necessary keywords and are still compliant with the business rules, since they were built with that point of view. So the answer is that I already have the repository with me. But it is not organised/refined, and there are three different indexes to look through for data.

The next issue is that I need to build such a repository.
Answer: The best way is to create a new Lucene index, built by reading the existing indexes, which will store the full list of keywords. To read the existing indexes and create a new index out of them I need the full Lucene.NET library and probably a few other third-party libraries. This task could also be done through a Sitecore scheduler task, but I prefer to do it through a console application which can be scheduled with Task Scheduler on the deployment servers. We will name this index 'AutoUpdate'.

The main class responsible for reading the existing indexes and building the new one follows; the rest of the classes are mere helpers and initiators. So let's get into the code.


using System;
using System.Linq;
using log4net;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.NGram;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using SpellChecker.Net.Search.Spell;

namespace AutoUpdateIndexer
{
    /// <summary>
    /// Search term auto-completer, works for single terms (so use on the last term of the query).
    /// Returns more popular terms first.
    /// </summary> 
    public class SearchAutoComplete
    {
        public static ILog log = log4net.LogManager.GetLogger(typeof(SearchAutoComplete));
        public int MaxResults { get; set; }

        private static readonly Lucene.Net.Util.Version kLuceneVersion = Lucene.Net.Util.Version.LUCENE_CURRENT;

        private static readonly String kGrammedWordsField = "words";

        private static readonly String kSourceWordField = "sourceWord";

        private static readonly String kCountField = "count";

        private static readonly String[] kEnglishStopWords = {
            "a", "an", "and", "are", "as", "at", "be", "but", "by",
            "for", "i", "if", "in", "into", "is",
            "no", "not", "of", "on", "or", "s", "such",
            "t", "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with"
        };

        public bool IsFirstTime { get; set; }

        private readonly Directory m_directory;

        private IndexReader m_reader;

        private IndexSearcher m_searcher;

        public SearchAutoComplete(string autoCompleteDir) :
            this(FSDirectory.Open(new System.IO.DirectoryInfo(autoCompleteDir)))
        {
        }

        public SearchAutoComplete(Directory autoCompleteDir, int maxResults = 8)
        {
            this.m_directory = autoCompleteDir;
            MaxResults = maxResults;

            ReplaceSearcher();
        }

        /// <summary>
        /// Find terms matching the given partial word that appear in the highest number of documents.</summary>
        /// <param name="term">A word or part of a word</param>
        /// <returns>A list of suggested completions</returns>
        public string[] SuggestTermsFor(string term)
        {
            if (m_searcher == null)
                return new string[] { };

            // get the top terms for query
            Query query = new TermQuery(new Term(kGrammedWordsField, term.ToLower()));

            Sort sort = new Sort(new SortField(kCountField, SortField.INT));

            TopDocs docs = m_searcher.Search(query, null, MaxResults, sort);
            string[] suggestions = docs.ScoreDocs.Select(doc =>
                m_reader.Document(doc.doc).Get(kSourceWordField)).ToArray();

            return suggestions;
        }

        /// <summary>
        /// Open the index in the given directory and create a new index of word frequency for the 
        /// given index.</summary>
        /// <param name="sourceDirectory">Directory containing the index to count words in.</param>
        /// <param name="fieldToAutocomplete">The field in the index that should be analyzed.</param>
        public void BuildAutoCompleteIndex(Directory sourceDirectory, Directory TargetDirectory, bool verbose)
        {
            // build a dictionary (from the spell package)
            using (IndexReader sourceReader = IndexReader.Open(sourceDirectory, true))
            {

                string[] fieldNames = sourceReader.GetFieldNames(IndexReader.FieldOption.ALL).ToArray();
                foreach (string fieldToAutocomplete in fieldNames)
                {
                    // index display-name, name and plain fields, but skip date and threshold fields
                    if ((fieldToAutocomplete.Contains("__display name") || fieldToAutocomplete.Contains("_name") || !fieldToAutocomplete.Contains('_')) && !fieldToAutocomplete.Contains("date") && !fieldToAutocomplete.Contains("threshold"))
                    {
                        LuceneDictionary dict = new LuceneDictionary(sourceReader, fieldToAutocomplete);

                        // code from
                        // org.apache.lucene.search.spell.SpellChecker.indexDictionary(
                        // Dictionary)
                        //IndexWriter.Unlock(m_directory);

                        // use a custom analyzer so we can do EdgeNGramFiltering
                        AutoCompleteAnalyzer analyzer = new AutoCompleteAnalyzer();
                        using (var writer = new IndexWriter(TargetDirectory, analyzer, IsFirstTime, IndexWriter.MaxFieldLength.UNLIMITED))
                        {
                            writer.SetMergeFactor(300);
                            writer.SetMaxBufferedDocs(150);

                            // go through every word, storing the original word (incl. n-grams)
                            // and the number of documents it occurs in
                            double num;
                            Guid guid;
                            foreach (string word in dict)
                            {
                                if (word.Length < UtilitySettings.AllowedMinimumWordLengthToBeIndexed)
                                    continue; // too short - skip
                                if (word.Length > UtilitySettings.AllowedMaxWordLengthToBeIndexed)
                                    continue; // too long - skip


                                if (!word.Contains('<') && !word.Contains('>') && !word.Contains('/') && !word.Contains('\\') && !isNotFile(word) && !word.Contains('@') && !word.Contains('&') && !double.TryParse(word, out num) && !Guid.TryParse(word, out guid))
                                {
                                    // ok index the word
                                    // use the number of documents this word appears in
                                    int freq = sourceReader.DocFreq(new Term(fieldToAutocomplete, word));
                                    if (verbose)
                                    {                                        
                                        log.Info(string.Format("Frequency {0} of this word {1}", freq, word));
                                    }
                                    var doc = MakeDocument(fieldToAutocomplete, word, freq);
                                    writer.AddDocument(doc);
                                }
                            }
                            writer.Optimize();
                        }
                    }
                }
            }

            // re-open our reader
            //ReplaceSearcher();
        }

        private static Document MakeDocument(String fieldToAutocomplete, string word, int frequency)
        {
            var doc = new Document();
            doc.Add(new Field(kSourceWordField, word, Field.Store.YES, Field.Index.NOT_ANALYZED)); // orig term
            doc.Add(new Field(kGrammedWordsField, word, Field.Store.YES, Field.Index.ANALYZED)); // grammed
            doc.Add(new Field(kCountField, frequency.ToString(), Field.Store.NO, Field.Index.NOT_ANALYZED)); // count
            return doc;
        }

        private void ReplaceSearcher()
        {
            if (IndexReader.IndexExists(m_directory))
            {
                if (m_reader == null)
                    m_reader = IndexReader.Open(m_directory, true);
                else
                    m_reader.Reopen();

                m_searcher = new IndexSearcher(m_reader);
            }
            else
            {
                m_searcher = null;
            }
        }

        private bool isNotFile(string word)
        {
            // Despite the name, this returns true when the word looks like a file
            // name or markup fragment, so the caller's !isNotFile(word) check
            // excludes file-like terms from the index.
            if (string.IsNullOrEmpty(word))
                return false;

            return word.Contains(".png") || word.Contains(".jpeg") || word.Contains(".jpg") || word.Contains(".gif") || word.Contains(".tif") || word.Contains(".ico") || word.Contains(".bmp") || word.Contains(".aspx") || word.Contains("&amp");
        }
    }

    public class AutoCompleteAnalyzer : Analyzer
    {
        private static readonly Lucene.Net.Util.Version kLuceneVersion = Lucene.Net.Util.Version.LUCENE_24;
        private static readonly String[] kEnglishStopWords = {
            "a", "an", "and", "are", "as", "at", "be", "but", "by",
            "for", "i", "if", "in", "into", "is",
            "no", "not", "of", "on", "or", "s", "such",
            "t", "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with"
        };

        public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
        {
            TokenStream result = new StandardTokenizer(kLuceneVersion, reader);
            result = new StandardFilter(result);
            result = new LowerCaseFilter(result);
            result = new ASCIIFoldingFilter(result);
            result = new StopFilter(false, result, StopFilter.MakeStopSet(kEnglishStopWords));
            result = new EdgeNGramTokenFilter(
                result, Lucene.Net.Analysis.NGram.EdgeNGramTokenFilter.Side.FRONT, 1, 20);
            return result;
        }
    }
}

The main function is 'BuildAutoCompleteIndex', which reads the index from the source directory, loops through each valid field and, after due filtration, writes the terms found into the target directory, which is our new index. This class also has a function which can be used to query the target directory for terms.
The full project is shared on GitHub.
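The piece that makes prefix matching work in the class above is the EdgeNGramTokenFilter in the custom analyzer: every word is indexed together with all of its leading fragments, so typing "boo" finds "books". A rough Python equivalent of front edge n-grams with the same size range (1 to 20) used in the analyzer, purely as an illustration:

```python
def edge_ngrams(word, min_gram=1, max_gram=20):
    """Front edge n-grams, like Lucene's EdgeNGramTokenFilter with Side.FRONT."""
    upper = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, upper + 1)]

print(edge_ngrams("books"))  # ['b', 'bo', 'boo', 'book', 'books']
```

Because every prefix becomes its own token in the 'words' field, a plain TermQuery on the typed fragment is enough at query time; no expensive wildcard or prefix query is needed.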

Figure 5.1 Console Screen of Search


The next issue is to query this index in a way that is feasible and gives a true experience to the user. First we need to create a new class which provides functions to query the above new index.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Lucene.Net.Store;
using Lucene.Net.Index;
using Lucene.Net.Search;
using SpellChecker.Net.Search.Spell;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.NGram;
using Lucene.Net.Documents;
using Portal.AppServices.Utilities;

namespace Portal.EShopServices.Search.Searcher
{
    /// <summary>
    /// Search term auto-completer, works for single terms (so use on the last term of the query).
    /// Returns more popular terms first.
    /// </summary>
    public class SearchAutoComplete
    {
        public int MaxResults { get; set; }       

        private static readonly String kGrammedWordsField = "words";

        private static readonly String kSourceWordField = "sourceWord";

        private static readonly String kCountField = "count";

        private readonly Directory m_directory;

        private IndexReader m_reader;

        private IndexSearcher m_searcher;

        public SearchAutoComplete(string autoCompleteDir) :
            this(FSDirectory.Open(new System.IO.DirectoryInfo(autoCompleteDir)))
        {
        }

        public SearchAutoComplete(Directory autoCompleteDir, int maxResults = 100)
        {
            this.m_directory = autoCompleteDir;
            MaxResults = maxResults;
            ReplaceSearcher();
        }

        /// <summary>
        /// Find terms matching the given partial word that appear in the highest number of documents.</summary>
        /// <param name="term">A word or part of a word</param>
        /// <returns>A list of suggested completions</returns>
        public string[] SuggestTermsFor(string term)
        {
            if (m_searcher == null)
                return new string[] { };

            // get the top terms for query
            Query query = new TermQuery(new Term(kGrammedWordsField, term.ToLower()));
            Sort sort = new Sort(new SortField(kCountField, SortField.INT));
            TopDocs docs = m_searcher.Search(query, null, MaxResults, sort);
            string[] suggestions = docs.ScoreDocs.Select(doc =>
                m_reader.Document(doc.doc).Get(kSourceWordField)).ToArray();

            return suggestions;
        }

        private void ReplaceSearcher()
        {
            if (IndexReader.IndexExists(m_directory))
            {
                if (m_reader == null)
                    m_reader = IndexReader.Open(m_directory, true);
                else
                    m_reader.Reopen();

                m_searcher = new IndexSearcher(m_reader);
            }
            else
            {
                m_searcher = null;
            }
        }
    }    
}


The main function in this class is 'SuggestTermsFor', which takes a string argument to search for and returns the possible suggestions as a string array. The class is initialized with the directory path where the AutoUpdate index is located.
To call 'SuggestTermsFor' we need an ASPX .NET page which will act as a resource that can answer our Ajax queries initiated from the jQuery getJSON method. The HTML/designer side of the ASPX page is almost blank, with just the following declaration:
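Conceptually, 'SuggestTermsFor' is just a lookup of the typed prefix against the grammed 'words' field, ordered by the stored 'count'. A hypothetical in-memory equivalent in Python, to show the shape of what the Lucene query does:

```python
def suggest(prefix, term_freq, max_results=8):
    """Return source words matching the prefix, most frequent first."""
    matches = [(w, f) for w, f in term_freq.items()
               if w.startswith(prefix.lower())]
    matches.sort(key=lambda pair: pair[1], reverse=True)  # popular terms first
    return [w for w, _ in matches[:max_results]]

freqs = {"sitecore": 40, "search": 120, "seo": 15}
print(suggest("se", freqs))  # ['search', 'seo']
```

The real class delegates the prefix matching to the edge n-grams baked in at index time and the ordering to a Sort on the count field, but the observable behaviour is the same.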

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Ajax.aspx.cs" Inherits="Portal.Web.Services.Ajax" %>

The code page of the above page has following class:

using Portal.AppServices.Utilities;
using Portal.EShopServices.Search.Searcher;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;

namespace Portal.Web.Services
{
    public partial class Ajax : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            if (Request.QueryString["term"] != null)
            {
                string searchTerm = ValidationHelper.ValidateToString(Request.QueryString["term"], "");
                SearchTerm(searchTerm);
            }
        }

        protected void SearchTerm(string searchTerm)
        {
            string response = string.Empty;
            string[] suggestions = null;
            List<MyCustomData> mcd = null;
            if (searchTerm != null && searchTerm.Length > 1)
            {
                SearchAutoComplete searchIndex1 = new SearchAutoComplete(Searcher.GetAutoUpdateIndexDirectory(), Portal.AppServices.Constants.UtilitySettings.MaximumResultinAutoUpdate);
                suggestions = searchIndex1.SuggestTermsFor(searchTerm);
                if (suggestions != null && suggestions.Length > 1)
                    suggestions = suggestions.Distinct().ToArray<string>();

                mcd = new List<MyCustomData>();
                if (suggestions != null && suggestions.Length > 0)
                {
                    int count = 0;
                    foreach (string st in suggestions)
                    {
                        if (!st.Contains("<") && !st.Contains(">") && !st.Contains('&') && count <= Portal.AppServices.Constants.UtilitySettings.MaximumResultinAutoUpdate)
                        {                            
                            MyCustomData mc = new MyCustomData();
                            mc.value = st;
                            mc.id = count.ToString();
                            mcd.Add(mc);
                            count++;
                        }
                    }
                }
            }
            System.Web.Script.Serialization.JavaScriptSerializer jss = new System.Web.Script.Serialization.JavaScriptSerializer();
            response = jss.Serialize(mcd);
            Response.ContentType = "application/json";
            Response.ClearContent();
            Response.Write(response);
            Response.End();
        }

        /// <summary>
        /// This custom type is used to return the data for the Search Term
        /// </summary>
        public class MyCustomData
        {
            public string id { get; set; }
            public string value { get; set; }
        }
    }
}

In the above code, on the page load event we check whether the Request has a query string value called "term"; if yes, we call the internal function 'SearchTerm' and pass the value from the query string as a parameter. That function in turn calls our 'SearchAutoComplete' class above and gets the result as a string array. Once we get the array, we convert it into a strongly typed entity collection with just two attributes, ID and Value, and finally serialize that collection into a JSON string and send it back as the response.
Following is the client-side code which finally gives life to the text box we are looking to build.
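For reference, the response body the page writes is a plain JSON array of id/value pairs, which the jQuery UI autocomplete widget can consume directly because each object carries a `value` property. A small Python sketch of the same serialization step:

```python
import json

def to_autocomplete_json(suggestions):
    """Serialize suggestions into the [{'id': ..., 'value': ...}] shape."""
    payload = [{"id": str(i), "value": s} for i, s in enumerate(suggestions)]
    return json.dumps(payload)

print(to_autocomplete_json(["search", "sitecore"]))
# [{"id": "0", "value": "search"}, {"id": "1", "value": "sitecore"}]
```

Keeping the payload this minimal matters, since the endpoint is hit on nearly every keystroke.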

<link rel="stylesheet" type="text/css" href="/themes/jquery-ui/css/ui-lightness/jquery-ui-1.10.3.custom.min.css">
<script type="text/javascript">
    $(document).ready(function () {
        getSearchSuggestions();
        //this make sure that it triggers only when header search textbox have some value.
        $('#' + '<%= imgSearch.ClientID  %>').bind("click", function () {
            var testTextBox = $('#' + '<%= txtSearch.ClientID  %>');
            if ($.trim($(testTextBox).val()) == "") {
                return false;
            }
        });
    });

    function getSearchSuggestions() {
        var cache = {};
        var testTextBox = $('#' + '<%= txtSearch.ClientID  %>');
        var hdnTextBox = $('#' + '<%= hdnSearchText1.ClientID  %>');
        var txtSearch;
        testTextBox.keypress(function (e) {
            code = (e.keyCode ? e.keyCode : e.which);
            if (code == 13) {
                txtSearch = $.trim($(testTextBox).val());
                if (txtSearch != '') {
                    $(hdnTextBox).val(txtSearch);
                    $('#' + '<%= imgSearch.ClientID  %>').click();
                }
            }
        });
        var myterms;
        $(testTextBox).autocomplete({
            minLength: 3,
            source: function (request, response) {
                var term = request.term;
                if (term in cache) {
                    response(cache[term]);
                    return;
                }
                $.getJSON('<%= this.RootPathUrl  %>', request, function (data, status, xhr) {
                    cache[term] = data;
                    response(data);
                });
            },
            appendTo: "#ulsuggestion",
            delay: 400,
            autoFocus: true
        });
    }
 </script>
<div>
<asp:TextBox ID="txtSearch" runat="server" autocomplete="off"></asp:TextBox>
 <div id="ulsuggestion">
 </div>
<%-- On Server side function you may write any kind of code which you want to do on selection of keyword to search --%>
<asp:ImageButton ID="imgSearch" runat="server" CssClass="display-none width-0" OnClick="imgSearch_Click1" />
</div>
Figure 5.2 Auto Search Feature

We used the jQuery UI plugin to get the effect of a menu appearing just below the text box with the suggestions. Our suggestions will be added to the div named 'ulsuggestion'. The code is simple: on document ready, jQuery binds a key-press event to the text box, which lets the user get suggestions through a background Ajax request to our Ajax.aspx page. Following is an example from a working live website.
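One detail worth noticing in the script above is the `cache` object in the source callback: it memoizes results by the typed term, so retyping the same fragment never hits the server twice. The same pattern, sketched in Python with a stand-in for the $.getJSON call:

```python
def make_cached_source(fetch):
    """Memoize suggestion lookups by term, mirroring the jQuery cache object."""
    cache = {}
    def source(term):
        if term not in cache:
            cache[term] = fetch(term)  # only fetch on a cache miss
        return cache[term]
    return source

calls = []
def fake_fetch(term):          # stands in for the Ajax request
    calls.append(term)
    return [term + "s"]

source = make_cached_source(fake_fetch)
source("book")
source("book")
print(calls)  # ['book'] - the second lookup was served from the cache
```

Combined with the `delay: 400` and `minLength: 3` options, this keeps the number of requests to Ajax.aspx low even for fast typists.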

So the summary of this article is that we successfully built a new repository based on the existing indexes, and then with the help of jQuery we were able to pull suggestion keywords from the new repository in real time. I know the way I used this concept may not be perfect; for example, I could use a Windows service to build the AutoUpdate index, use SignalR to push new data from the server, or use language suggestions, etc. But whichever way you choose to implement the above, the end result will be the same, with some plus or minus in performance.

The summary of this whole series is that we started by understanding the Lucene index, learned how to create and configure a new index, how to crawl media files, and in the end how to provide auto-complete suggestions for our search text box. With a few more features and bits you can similarly build end-to-end enterprise search in Sitecore versions 6.0 to 6.6.

Sitecore 7.0 is the newest version and has altogether different search capabilities, where we need to write less code and reuse more. My next project is in Sitecore 7.0, where I may implement search the new way. Whatever I do, I will keep sharing my learning, and I request you all to give praise or curses, but please provide the feedback.

