07 September 2013

Search with Sitecore - Article - 1 - Introduction

Dear friends and colleagues, for the past three months I have been involved in a project. I faced many challenges in it, and I am here to share them with you all. I hope these articles help you in your own projects.

This series of articles is about search with Sitecore, mainly using Lucene. I know there are already many articles on the internet on this subject, but I have not found a single article or series that covers it end to end. This is going to be a long series, but I hope it will cover all aspects of search. I welcome my friends and colleagues to send suggestions or to point out anything I get technically wrong.

This is the first article in the series: Article 1, Introduction.

To start with, let's put some facts on the table and analyse them. In my opinion Sitecore is one of the best CMS products available on the .NET platform, and it is open in nature (apart from the commercial aspect).
a) Sitecore uses Lucene as the back end for its search.
b) Sitecore stores its information in relational databases, in general named core, master and web.
c) Lucene is an open source project written entirely in Java; for the .NET world it has been ported as Lucene.NET.
d) Lucene stores its index on the file system, so that searching is fast compared with storing the index in a relational database.
Now let's put these facts together and see how Sitecore implements search for its own purposes. The diagram below is not official; I created it just for my own understanding.
Sitecore search diagram explaining how it uses Lucene
Figure 1.1

So, in summary, Sitecore first builds the index by reading the database and stores that index on the file system. At search time Sitecore reads that index and fetches the appropriate items from the database based on keys (GUID, URI, etc.).
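To make this two-step flow a little more concrete, here is a minimal sketch in Python. It is not Sitecore or Lucene.NET code: the `DATABASE` dictionary, the item IDs and the JSON index file are all assumptions made up for illustration, and a real Lucene index uses its own file format. It only shows the shape of the idea: build an index from the database once, store it on disk, and at search time read the index to get keys and then fetch the items by key.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for a Sitecore database: item ID -> item fields.
DATABASE = {
    "item-1": {"title": "Home", "text": "Welcome to our site"},
    "item-2": {"title": "Products", "text": "Our products and services"},
    "item-3": {"title": "Contact", "text": "Contact our sales team"},
}


def build_index(database, index_path):
    """Read every item from the database and write a word -> item IDs map to disk."""
    index = {}
    for item_id, fields in database.items():
        for value in fields.values():
            for word in value.lower().split():
                index.setdefault(word, set()).add(item_id)
    # Sets are not JSON serialisable, so store sorted lists instead.
    serialisable = {word: sorted(ids) for word, ids in index.items()}
    index_path.write_text(json.dumps(serialisable), encoding="utf-8")


def search(word, database, index_path):
    """Look the word up in the on-disk index, then fetch matching items by key."""
    index = json.loads(index_path.read_text(encoding="utf-8"))
    item_ids = index.get(word.lower(), [])
    return [(item_id, database[item_id]["title"]) for item_id in item_ids]


def demo():
    """Build the index in a temporary folder and run one search against it."""
    with tempfile.TemporaryDirectory() as folder:
        index_path = Path(folder) / "index.json"
        build_index(DATABASE, index_path)
        return search("our", DATABASE, index_path)


if __name__ == "__main__":
    print(demo())
```

Note that the search step never scans the item text itself. It only reads the index file and then asks the database for items by ID, which is the reason the index is worth building in the first place.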

But to go into depth on how all this happens, and later on how we build our own indexes and use them to search, we first need to understand how a Lucene index (which is based on the inverted index algorithm) is built and how we use it. Once we know this, we are in the game.

  1. Lucene is a text search engine API (Wiki).
  2. Lucene stores the indexed information in documents. A document is a record in the index. It contains a list of fields, and each field has a name and a textual value. A collection of such documents is called a Lucene index.
  3. A term is Lucene's unit of indexing; a term is often a word.
  4. To turn the text of these documents into terms, Lucene needs analysers. For example, the Standard Analyser is widely used to index English and other Latin-based languages. For other concepts please read this. Sitecore also has its own analyser for indexing items.
  5. Lucene will not automatically start creating the index by looking into your DB. It is an API that provides classes to create, maintain and search the index. In other words, we need a crawler that reads our information item by item and stores it in Lucene documents.
  6. In Figure 1.1, the 'Index Building Process' is nothing but a crawler, which is triggered each time you publish an item or manually rebuild the index from the Sitecore Control Panel.
  7. You can find the Lucene.NET DLL file in the bin directory of your Sitecore installation.
  8. Sitecore also provides its own API under the namespace 'Sitecore.Search'. This namespace contains several useful classes that help you search the Lucene index without worrying about how to use Lucene.NET directly.
  9. Sitecore comes with a default index called 'System'.
  10. This index is built using the 'Sitecore.Search.Crawlers.DatabaseCrawler' class, which internally uses the Sitecore analyser class within the Sitecore.Search namespace.
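The ideas of document, field, term and analyser from the list above can be sketched in a few lines of Python. This is a toy inverted index under my own assumptions, not how Lucene is actually implemented: the `analyse` function, the stop word list and the `ToyIndex` class are hypothetical names invented for this illustration, and the real Standard Analyser does considerably more.

```python
from collections import defaultdict

# A tiny, made-up stop word list; real analysers ship with their own.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def analyse(text):
    """Toy analyser: lowercase, split on non-alphanumerics, drop stop words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return [word for word in cleaned.split() if word not in STOP_WORDS]


class ToyIndex:
    """A collection of documents plus an inverted index over their fields."""

    def __init__(self):
        self.documents = []  # each document is a dict: field name -> text
        self.postings = defaultdict(set)  # (field name, term) -> document numbers

    def add_document(self, fields):
        """Store the document and record which terms occur in which field."""
        doc_number = len(self.documents)
        self.documents.append(fields)
        for field_name, value in fields.items():
            for term in analyse(value):
                self.postings[(field_name, term)].add(doc_number)
        return doc_number

    def search(self, field_name, query):
        """Return the documents whose field contains every term of the query."""
        terms = analyse(query)
        if not terms:
            return []
        matches = set.intersection(
            *(self.postings.get((field_name, term), set()) for term in terms)
        )
        return [self.documents[number] for number in sorted(matches)]
```

The word "inverted" comes from the `postings` map: instead of going from a document to its words, we go from a term to the documents that contain it, so a query only has to look up a few keys. The same analyser is applied to both the indexed text and the query, otherwise the terms would not match.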
Sitecore System Index
Figure 1.2
As you can see in the above screenshot, Sitecore's 'System' index gets its information from two databases, namely Core and Master, listed under the Locations element.
This also means the implementer or developer is free to create their own index for the delivery database (Web). There is a wonderful open extension called 'SitecoreSearchContrib', written by Alex Shyba, one of the most active bloggers in the Sitecore community. His tool has helped a lot of developers, including me, who want to implement search within Sitecore. In upcoming articles we are going to use that extension to show how, together with other available tricks, we can build enterprise search without using any expensive third-party search engine.

To summarize this first article, we need to create the following modules or classes, which will become the building blocks for our search.
  • We need to configure a new index within the Sitecore configuration, based on our criteria for 'what to search'.
  • We need to write code, or use the extension, to tell Sitecore how to crawl for that particular index.
  • We need to write code, or use the 'search part' of the extension, so that we can return appropriate results for the searched query.
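As a rough picture of how these three building blocks fit together, here is a small Python sketch. Everything in it is an assumption for illustration only: `IndexConfig`, `Item`, `crawl` and `search` are hypothetical names, the item paths and templates are made up, and none of this is the Sitecore or SitecoreSearchContrib API, which later articles will deal with.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical content item, loosely modelled on a CMS item."""
    item_id: str
    path: str
    template: str
    text: str


@dataclass
class IndexConfig:
    """Building block 1: 'what to search'. Not Sitecore's real config schema."""
    name: str
    root_path: str
    templates: frozenset


def crawl(items, config):
    """Building block 2: index only the items that the config asks for."""
    index = {}
    for item in items:
        under_root = (
            item.path == config.root_path
            or item.path.startswith(config.root_path + "/")
        )
        if not under_root or item.template not in config.templates:
            continue
        for word in item.text.lower().split():
            ids = index.setdefault(word, [])
            if item.item_id not in ids:
                ids.append(item.item_id)
    return index


def search(index, query):
    """Building block 3: return IDs of items that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


ITEMS = [
    Item("item-1", "/sitecore/content/home", "Page", "Welcome to the home page"),
    Item("item-2", "/sitecore/content/home/news", "Page", "Latest news about search"),
    Item("item-3", "/sitecore/content/home/settings", "Settings", "Search settings page"),
    Item("item-4", "/sitecore/system/tasks", "Page", "Scheduled search tasks"),
]

config = IndexConfig("site_search", "/sitecore/content/home", frozenset({"Page"}))
index = crawl(ITEMS, config)
```

With this sample data, `search(index, "search")` only finds `item-2`: `item-3` is skipped because of its template and `item-4` because it lives outside the configured root. That filtering is the whole point of the first building block.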
In other articles in this series we will also cover further aspects of search, such as the following:
  • How to build an 'auto suggest as you type' feature, aka the Google way.
  • How to parse PDF and Word files so that your search results can return text even from those files.
Thanks for reading this article. Please let me know your comments and feedback.