Thursday, January 12, 2012

How to reduce Lucene Index size in CQ / WEM

Use Case:
1) The index is huge and CQ takes a long time to start.
2) The index is huge and takes a long time to rebuild (also see http://www.wemblog.com/2011/09/how-to-reindex-large-repository.html for this).
3) You use a different search service, such as Solr or FAST, for indexing.
4) You don't use full-text search of documents on your site.

Solution:

Step 1: Find the tika-config.xml file

For CQ 5.4 and earlier

If you do not need full-text indexing of these documents, you can disable it in tika-config.xml (crx-quickstart/server/runtime/0/_crx/WEB-INF/classes/org/apache/jackrabbit/core/query/lucene/tika-config.xml). If this folder structure does not exist, create it. To get the original tika-config.xml, copy crx-quickstart/server/runtime/0/_crx/WEB-INF/lib/jackrabbit-core-*.jar to another location, rename it to .zip, unzip it, and look in org/apache/jackrabbit/core/query/lucene.
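The manual copy, rename and unzip can also be scripted. The following is only a minimal sketch in Python: the function name extract_tika_config is made up for illustration, and the glob pattern assumes the jar sits in a lib (or libs) folder under WEB-INF, so adjust it to your installation.

```python
import glob
import os
import zipfile

# Location of the Tika config inside the jackrabbit-core jar.
TIKA_ENTRY = "org/apache/jackrabbit/core/query/lucene/tika-config.xml"


def extract_tika_config(jar_path, classes_dir):
    """Copy tika-config.xml out of the jar into classes_dir, keeping its package path.

    Hypothetical helper, not part of CQ or Jackrabbit.
    """
    with zipfile.ZipFile(jar_path) as jar:
        # extract() recreates the org/apache/... folders under classes_dir
        # and returns the path of the file it wrote.
        return jar.extract(TIKA_ENTRY, classes_dir)


if __name__ == "__main__":
    web_inf = "crx-quickstart/server/runtime/0/_crx/WEB-INF"
    # Assumed layout: matches either a "lib" or a "libs" folder.
    jars = glob.glob(os.path.join(web_inf, "lib*", "jackrabbit-core-*.jar"))
    if jars:
        print(extract_tika_config(jars[0], os.path.join(web_inf, "classes")))
```

A jar is a zip archive, so no renaming is needed here; the extracted file lands directly in the WEB-INF/classes override location, ready to be edited.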

For CQ 5.5

1) Find the jackrabbit-core jar file and extract the Tika config:
find ./crx-quickstart/launchpad/felix -name "jackrabbit-core*.jar" | xargs -I {} jar -xvf {} org/apache/jackrabbit/core/query/lucene/tika-config.xml

2) Edit the extracted org/apache/jackrabbit/core/query/lucene/tika-config.xml file as described in Step 2.

3) Update the jackrabbit-core jar file with the modified tika-config.xml file:
find ./crx-quickstart/launchpad/felix -name "jackrabbit-core*.jar" | xargs -I {} jar -uvf {} org/apache/jackrabbit/core/query/lucene/tika-config.xml

Step 2: Modify the file

To stop a document type from being parsed, assign org.apache.tika.parser.EmptyParser as the parser class for its MIME type.

For example, to stop indexing Excel spreadsheets:

<parser class="org.apache.tika.parser.EmptyParser">
<mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
</parser>

To disable PDF parsing, remove the following entry:
<parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
<mime>application/pdf</mime>
</parser>

Both changes reduce the size of the Lucene index in CQ, because the text of these documents is no longer extracted and indexed.
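If you need to switch off several MIME types, the edit can be scripted. The sketch below combines the two techniques above: it removes the MIME type from whichever parser currently handles it and registers it under EmptyParser instead. The function name disable_mime_type is hypothetical, and the code assumes the parser entries sit inside a parsers element (falling back to the root element otherwise), so check this against your own tika-config.xml.

```python
import xml.etree.ElementTree as ET

EMPTY_PARSER = "org.apache.tika.parser.EmptyParser"


def disable_mime_type(config_xml, mime_type):
    """Return config_xml with mime_type handed over to EmptyParser.

    Hypothetical helper; assumes <parser> entries live under <parsers>.
    """
    root = ET.fromstring(config_xml)
    container = root.find("parsers")
    if container is None:
        container = root
    for parser in list(container.findall("parser")):
        if parser.get("class") == EMPTY_PARSER:
            continue
        matches = [m for m in parser.findall("mime")
                   if (m.text or "").strip() == mime_type]
        for mime in matches:
            parser.remove(mime)
        # Drop a parser entry that no longer handles any MIME type.
        if matches and parser.find("mime") is None:
            container.remove(parser)
    # Register the MIME type with the parser that extracts nothing.
    empty = ET.SubElement(container, "parser", {"class": EMPTY_PARSER})
    ET.SubElement(empty, "mime").text = mime_type
    return ET.tostring(root, encoding="unicode")
```

Calling disable_mime_type(xml_text, "application/pdf") and writing the result back to tika-config.xml would have the same effect as the manual edits. Note that ElementTree drops XML comments when it rewrites the file, so keep a backup of the original.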



Note: To reduce the Lucene index size further, you can also disable highlighting support by adding the following parameter to workspace.xml:

<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
<param name="path" value="${wsp.home}/index"/>
<!-- add below param -->
<param name="supportHighlighting" value="false"/>
</SearchIndex>
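With several workspaces or instances to update, the same parameter can be set by a script. This is a rough sketch only: set_search_index_param is a made-up helper name, and it assumes workspace.xml contains a single SearchIndex element, as in the snippet above.

```python
import xml.etree.ElementTree as ET


def set_search_index_param(workspace_xml, name, value):
    """Add or update <param name=... value=.../> inside <SearchIndex>.

    Hypothetical helper; assumes one <SearchIndex> element per file.
    """
    root = ET.fromstring(workspace_xml)
    if root.tag == "SearchIndex":
        search_index = root
    else:
        search_index = root.find(".//SearchIndex")
    if search_index is None:
        raise ValueError("no <SearchIndex> element found")
    for param in search_index.findall("param"):
        if param.get("name") == name:
            param.set("value", value)  # update the existing entry
            break
    else:
        # No entry with this name yet, so append a new one.
        ET.SubElement(search_index, "param", {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")
```

For example, set_search_index_param(xml_text, "supportHighlighting", "false") returns the updated XML as a string. The same helper could set other SearchIndex parameters such as resultFetchSize.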

For an example, please refer to this tika-config.xml file.

A useful link for tuning your search index further:

http://wiki.apache.org/jackrabbit/Search
Try tuning "resultFetchSize" and other parameters.

Step 3: Disable indexing using the indexing_config.xml file

Follow the instructions at the link below. You can add your own node types to reduce the index size further. You can use the attached indexing_config file.

http://dev.day.com/content/kb/home/cq5/CQ5SystemAdministration/SearchIndexingConfig.html

Other useful links for reducing the Lucene index:

http://dev.day.com/content/kb/home/cq5/CQ5SystemAdministration/how-to-optimize-lucene-index-to-gain-efficiency.html

http://wiki.apache.org/jackrabbit/IndexingConfiguration

http://dev.day.com/content/kb/home/cq5/CQ5SystemAdministration/BoostInSearch.html

http://dev.day.com/content/kb/home/cq5/CQ5Troubleshooting/performancetuningtips.html#TIP05

There is also an open issue about speeding up startup when the index is large: https://issues.apache.org/jira/browse/JCR-3107

Important: You have to rebuild the index after making the changes above.



Special thanks to Andrew Khoury and other members of Adobe for sharing this information.

11 comments:

  1. How can I disable the built-in Lucene indexing and integrate CQ5.5 with Apache Solr?

    Replies
    1. Hello,

      Unfortunately, index files are required for core CQ functionality to work, so you cannot disable Lucene completely. However, you can reduce the size of the Lucene index as described above.

      Yogesh

  2. Hi Yogesh,

    Any updates on the above question?

  3. Hi Yogesh,
    What is the process for integrating CQ5.6 with Solr search?
    Please share the Solr + CQ demo package.

    Replies
    1. Hello,

      Unfortunately, I do not have a demo package, but this integration is very much possible. I think CQ6.0 has this OOTB.

      Yogesh

  4. Yes, it would be interesting to hear about the Solr integration. Thank you :)

    Replies
    1. Please check http://www.gastongonzalez.com/tech-blog/2013/9/13/integrating-apache-solr-with-adobe-cq-aem.html for CQ integration with Solr.

  5. Hi Yogesh,
    There is some confusion regarding the indexing configuration.
    The Jackrabbit documentation at http://wiki.apache.org/jackrabbit/IndexingConfiguration talks about including the indexing configuration in both repository.xml and workspace.xml.

    Line taken from Jackrabbit indexing config:
    If you wish to configure the indexing behaviour you need to add a parameter to the SearchIndex element in your workspace.xml and repository.xml file.

    indexing_config.xml:

    <param name="path" value="${wsp.home}/index"/>
    <param name="resultFetchSize" value="50"/>
    <param name="indexingConfiguration" value="${wsp.home}/indexing_config.xml"/>
    <param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>


    But you mention including it only in workspace.xml. Can you please comment on this?

    Replies
    1. Hello Bala,

      If you are not using multiple workspaces, including the indexing config in workspace.xml should work.

      Yogesh

  6. Hi Yogesh,

    We have learned that AEM 6.0 has an embedded Apache Solr engine running on CRX3 (Oak). Is there any concrete documentation on optimizing the index according to business requirements?

    Once a custom indexing schema is set, will queries using JQL, SQL2, XPath, etc. honor our custom index schema?

    Thanks,
    Rohit

    Replies
    1. Hello Rohit,

      Unfortunately, I don't have a step-by-step guide for AEM 6, but you can refer to http://jackrabbit.apache.org/oak/docs/query.html for how to set up Solr with Oak.

      Yogesh
