07 October 2013

Search with Sitecore - Article - 2 - Configure Advance Database Crawler and More

Dear friends, I hope you have read the first article in this series. Here is the full list of articles:

1) Search with Sitecore - Article-1 - Introduction 
2) Search with Sitecore - Article - 2 - Configure Advance Database Crawler and More
3) Search with Sitecore - Article - 3 - Crawler to Crawl Media File like PDF, Document  
4) Search with Sitecore - Article - 4 - Create Search API
5) Search with Sitecore - Article - 5 - Auto Complete For Search

This is the next instalment in the series. Before showing you what we do in this article, I need to set the record straight.
  1. I use Shyba's Advanced Database Crawler (ADV). There are already very good articles that cover it at length. For reference, the following are the best I found and referred to during my work.
    1. A three-article series showing how to use the ADV.
    2. Another good article which explains every term used in the ADV configuration files.
In our project the content section sits, as usual, under the Home node. We also have Global Reference and a product catalogue, both of which live outside the Home node. We use Lucene-based search for day-to-day retrieval of items as well; this strategy can be debated, but for now that is what we have done. The overall content tree is shown below. I share all this to set the context for the series.
Content Tree - Figure 2.1
Because of this we have three separate custom indexes configured using the Advanced Database Crawler:
  • Home
  • GlobalReference
  • Product
All of the above indexes can be built using the Sitecore index-building wizard from the Control Panel. Each index has its own configuration file under ~\App_Config\Include\Shared\, and each file initially contains just the default configuration suggested for the Advanced Database Crawler in the articles mentioned above.
Once you build the indexes using the wizard, you will see a folder named after each index in the data folder, as follows:
Index Folders - Figure 2.2
Next we optimize each index configuration file so that we know clearly what gets indexed. As I need to provide front-end search for my website, I will focus on two indexes, namely Home and Product. To optimize them I followed the articles listed above, but mainly I set the following property to false:
<IndexAllFields>false</IndexAllFields>
I also updated the list of fields to include, so that it contains only the fields relevant to my project. Then I provided the list of fields to exclude. This is very important, as I don't want to index fields such as 'UpdatedBy', 'Revision' etc. The exclude list is given below; others can use it as it is, because it is the same in most cases.
             <include hint="list:ExcludeField">
                  <!-- Workflow Fields -->
                  <fieldId>{A4F985D9-98B3-4B52-AAAF-4344F6E747C6}</fieldId>
                  <fieldId>{3E431DE1-525E-47A3-B6B0-1CCBEC3A8C98}</fieldId>
                  <fieldId>{001DD393-96C5-490B-924A-B0F25CD9EFD8}</fieldId>
                  <fieldId>{CA9B9F52-4FB0-4F87-A79F-24DEA62CDA65}</fieldId>
                  <!-- Statistics Fields -->
                  <fieldId>{25BED78C-4957-4165-998A-CA1B52F67497}</fieldId>
                  <fieldId>{5DD74568-4D4B-44C1-B513-0AF5F4CDA34F}</fieldId>
                  <fieldId>{8CDC337E-A112-42FB-BBB4-4143751E123F}</fieldId>
                  <fieldId>{D9CF14B1-FA16-4BA6-9288-E8A174D4D522}</fieldId>
                  <fieldId>{BADD9CF9-53E0-4D0C-BCC0-2D784C282F6A}</fieldId>
                  <!-- Security Fields -->
                  <fieldId>{52807595-0F8F-4B20-8D2A-CB71D28C6103}</fieldId>
                  <fieldId>{DEC8D2D5-E3CF-48B6-A653-8E69E2716641}</fieldId>
                  <!-- Publish Fields -->
                  <fieldId>{86FE4F77-4D9A-4EC3-9ED9-263D03BD1965}</fieldId>
                  <fieldId>{7EAD6FD6-6CF1-4ACA-AC6B-B200E7BAFE88}</fieldId>
                  <fieldId>{74484BDF-7C86-463C-B49F-7B73B9AFC965}</fieldId>
                  <fieldId>{9135200A-5626-4DD8-AB9D-D665B8C11748}</fieldId>
                  <!-- Lifetime Fields -->
                  <fieldId>{4C346442-E859-4EFD-89B2-44AEDF467D21}</fieldId>
                  <fieldId>{B8F42732-9CB8-478D-AE95-07E25345FB0F}</fieldId>
                  <fieldId>{C8F93AFE-BFD4-4E8F-9C61-152559854661}</fieldId>
                  <!-- Layout Fields -->
                  <fieldId>{F1A1FE9E-A60C-4DDB-A3A0-BB5B29FE732E}</fieldId>
                  <fieldId>{B03569B1-1534-43F2-8C83-BD064B7D782C}</fieldId>
                  <fieldId>{4C9312A5-2E4E-42F8-AB6F-B8DB8B82BF22}</fieldId>
                  <fieldId>{9FB734CC-8952-4072-A2D4-40F890E16F56}</fieldId>
                  <fieldId>{A4879E42-0270-458D-9C19-A20AF3C2B765}</fieldId>
                  <fieldId>{8546D6E6-0749-4591-90F3-CEC033D6E8D8}</fieldId>
                  <!-- Insert Options Fields -->
                  <fieldId>{83798D75-DF25-4C28-9327-E8BAC2B75292}</fieldId>
                  <fieldId>{1172F251-DAD4-4EFB-A329-0C63500E4F1E}</fieldId>
                  <!-- Help Fields -->
                  <fieldId>{56776EDF-261C-4ABC-9FE7-70C618795239}</fieldId>
                  <fieldId>{577F1689-7DE4-4AD2-A15F-7FDC1759285F}</fieldId>
                  <fieldId>{9541E67D-CE8C-4225-803D-33F7F29F09EF}</fieldId>
                  <!-- Appearance Menu Fields -->
                  <fieldId>{D3AE7222-425D-4B77-95D8-EE33AC2B6730}</fieldId>
                  <fieldId>{B5E02AD9-D56F-4C41-A065-A133DB87BDEB}</fieldId>
                  <fieldId>{D85DB4EC-FF89-4F9C-9E7C-A9E0654797FC}</fieldId>
                  <fieldId>{A0CB3965-8884-4C7A-8815-B6B2E5CED162}</fieldId>
                  <fieldId>{39C4902E-9960-4469-AEEF-E878E9C8218F}</fieldId>
                  <fieldId>{06D5295C-ED2F-4A54-9BF2-26228D113318}</fieldId>
                  <fieldId>{9C6106EA-7A5A-48E2-8CAD-F0F693B1E2D4}</fieldId>
                  <fieldId>{0C894AAB-962B-4A84-B923-CB24B05E60D2}</fieldId>
                  <fieldId>{079AFCFE-8ACA-4863-BDA7-07893541E2F5}</fieldId>
                  <fieldId>{BA3F86A2-4A1C-4D78-B63D-91C2779C1B5E}</fieldId>
                  <fieldId>{6FD695E7-7F6D-4CA5-8B49-A829E5950AE9}</fieldId>
                  <fieldId>{C7C26117-DBB1-42B2-AB5E-F7223845CCA3}</fieldId>
                  <fieldId>{F6D8A61C-2F84-4401-BD24-52D2068172BC}</fieldId>
                  <fieldId>{41C6CC0E-389F-4D51-9990-FE35417B6666}</fieldId>
                  <!-- Advanced Fields -->
                  <fieldId>{1B86697D-60CA-4D80-83FB-7555A2E6CE1C}</fieldId>
                  <fieldId>{F7D48A55-2158-4F02-9356-756654404F73}</fieldId>
                  <fieldId>{B0A67B2A-8B07-4E0B-8809-69F751709806}</fieldId>
                  <!-- Tasks Fields -->
                  <fieldId>{56C15C6D-FD5A-40CA-BB37-64CEEC6A9BD5}</fieldId>
                  <fieldId>{1D99005E-65CA-45CA-9D9A-FD7016E23F1E}</fieldId>
                  <fieldId>{ABE5D54C-59D7-41E6-8D3F-C1A3E4EC9B9E}</fieldId>
                  <fieldId>{2ED9C4D0-9EFF-490D-A40A-B5D856499C40}</fieldId>
                  <fieldId>{BB6C8540-118E-4C49-9157-830576D7345A}</fieldId>
                  <!-- Validation Rules Fields -->
                  <fieldId>{C2F5B2B5-71C1-431E-BF7F-DBDC1E5A2F83}</fieldId>
                  <fieldId>{57CBCA4C-8C94-446C-B8CA-7D8DC54F4285}</fieldId>
                  <fieldId>{86B52EEF-078E-4D9E-80BF-888287070E6C}</fieldId>
                  <fieldId>{F47C0D78-61F9-479C-96DF-1159727D32C6}</fieldId>
                </include>
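A long list of GUIDs like this is easy to damage when copying it between configuration files. The following is a minimal Python sketch, not part of Sitecore or the crawler, that scans a configuration fragment for repeated or malformed <fieldId> values before you rebuild the index. The function name check_field_ids and the SAMPLE fragment are made up for illustration.

```python
import re
from collections import Counter

# Sitecore-style braced GUID, e.g. {A4F985D9-98B3-4B52-AAAF-4344F6E747C6}
GUID_RE = re.compile(r"\{[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}")
# Text between <fieldId> and </fieldId>, ignoring surrounding whitespace
FIELD_RE = re.compile(r"<fieldId>\s*(.*?)\s*</fieldId>")


def check_field_ids(config_fragment):
    """Return (duplicates, malformed) for the <fieldId> values in a fragment."""
    values = FIELD_RE.findall(config_fragment)
    # Compare case-insensitively, since GUIDs may be typed in either case
    counts = Counter(value.upper() for value in values)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    malformed = [value for value in values if not GUID_RE.fullmatch(value)]
    return duplicates, malformed


# Hypothetical sample: one ID is repeated and the last one is truncated.
SAMPLE = """
<include hint="list:ExcludeField">
  <fieldId>{A4F985D9-98B3-4B52-AAAF-4344F6E747C6}</fieldId>
  <fieldId>{BA3F86A2-4A1C-4D78-B63D-91C2779C1B5E}</fieldId>
  <fieldId>{BA3F86A2-4A1C-4D78-B63D-91C2779C1B5E}</fieldId>
  <fieldId>{57CBCA4C-8C94-446C-B8CA}</fieldId>
</include>
"""

if __name__ == "__main__":
    dups, bad = check_field_ids(SAMPLE)
    print("duplicates:", dups)
    print("malformed:", bad)
```

To check a real file, read its text and pass it to check_field_ids instead of SAMPLE. A repeated ID is probably harmless, but a truncated one is worth fixing before the rebuild.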
The Include list uses the following syntax. Of course, your list of included fields will be different, as it depends on your own templates.
<include hint="list:IncludeField">
    <fieldId>{8CDC337E-A112-42FB-BBB4-4143751E123F}</fieldId>
</include>
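To reason about how IndexAllFields, the include list and the exclude list combine, a small mental model can help. The Python sketch below is only that: it is not the crawler's code, and it assumes that an excluded field is dropped even when it is also included. The crawler's real precedence rules may differ, so confirm the result by rebuilding the index and inspecting it. The function name fields_to_index and the short placeholder IDs are hypothetical.

```python
def fields_to_index(item_fields, index_all_fields, include_ids=(), exclude_ids=()):
    """Conceptual model: return the field IDs of an item that would be indexed.

    Assumption (not verified against the crawler): exclude wins over include.
    """
    include = {field_id.upper() for field_id in include_ids}
    exclude = {field_id.upper() for field_id in exclude_ids}

    if index_all_fields:
        # Start from every field on the item
        selected = list(item_fields)
    else:
        # Start only from the fields named in the include list
        selected = [f for f in item_fields if f.upper() in include]

    # Drop anything named in the exclude list
    return [f for f in selected if f.upper() not in exclude]


if __name__ == "__main__":
    fields = ["{TITLE}", "{BODY}", "{REVISION}"]  # placeholder IDs, not real GUIDs
    print(fields_to_index(fields, False,
                          include_ids=["{TITLE}", "{BODY}"],
                          exclude_ids=["{REVISION}"]))
```

Under this model, setting IndexAllFields to false with an empty include list indexes no fields at all, which is why the include list has to be filled in for your own templates.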
The above configuration makes my index slimmer and more accurate from a content perspective. After updating the configuration file I rebuilt the index to be certain that my configuration is correct. Next, to be more certain about what lies in my index, I used another community tool which can show the content of a Lucene index. To run this tool you need the version of it that matches your Lucene version, and you also need a JVM. Here is the link which does the job...
The next step is to start writing the custom code / API, which uses the Advanced Database Crawler search libraries and is customized for my project. Before getting into the code, let me tell you the requirements I have:
  • Search should work for any search keyword; in other words, it is free-text search.
  • Each search result should also show a description, or the part of the text where the searched word was found, with that word highlighted.
  • Each search result should also show its URL.
  • Search results should be prioritized: products first, then browsable content, and file results last.
  • File search should also be able to show content from PDF, Word, text and HTML files. All files are stored within the Media Library.
  • For a file result, the file type should be shown as an icon next to the link.
  • The search box in the header, which is available on every page, should offer auto-complete, with suggestions drawn from keywords relevant to the content.
I will not say the above requirements amount to a mini Google within the site, but yes, they are inspired by it.
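Two of these requirements, highlighting the searched word in an excerpt and ordering results by type, can be illustrated independently of Sitecore. The Python sketch below is only an illustration of the idea, not the search API built later in this series. The result fields kind, score and url, the radius parameter and the use of <strong> for highlighting are all assumptions made for the example.

```python
import html
import re

# Assumed ordering from the requirements: products, then content, then files.
PRIORITY = {"product": 0, "content": 1, "file": 2}


def highlight_snippet(text, keyword, radius=40):
    """Return an excerpt around the first hit, with every hit in <strong> tags."""
    if not keyword:
        return html.escape(text[: 2 * radius])
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    first = pattern.search(text)
    if first is None:
        # No hit: fall back to the start of the text
        return html.escape(text[: 2 * radius])

    start = max(0, first.start() - radius)
    end = min(len(text), first.end() + radius)
    excerpt = text[start:end]

    # Escape the plain text between hits and wrap each hit
    parts, last = [], 0
    for match in pattern.finditer(excerpt):
        parts.append(html.escape(excerpt[last:match.start()]))
        parts.append("<strong>" + html.escape(match.group(0)) + "</strong>")
        last = match.end()
    parts.append(html.escape(excerpt[last:]))

    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + "".join(parts) + suffix


def rank_results(results):
    """Sort by result type first, then by descending relevance score."""
    return sorted(
        results,
        key=lambda r: (PRIORITY.get(r["kind"], len(PRIORITY)), -r["score"]),
    )


if __name__ == "__main__":
    print(highlight_snippet("Sitecore search", "SEARCH"))
    hits = [
        {"kind": "file", "score": 0.9, "url": "/a.pdf"},
        {"kind": "product", "score": 0.2, "url": "/p1"},
    ]
    print([hit["url"] for hit in rank_results(hits)])
```

Sorting on the result type before the score means a weakly matching product still appears above a strongly matching file, which is what the prioritization requirement asks for.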
Taking a cue from the requirements, the first thing I noticed is the need to search inside files. At the moment Sitecore has no built-in mechanism that can crawl files. This means I need to write my own crawler that can parse and crawl PDF and other files. My next article is devoted entirely to how to crawl and index files within the Media Library.