Thursday, September 15, 2011

How to fix LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180) error

Symptom: Uploading certain documents (Word, Excel, PDF) brings the system down with an out-of-memory (OOM) error. Because these documents are still under workflow, restarting CQ does not help either.

Resolution:

1) Add -Dcom.day.crx.persistence.tar.IndexMergeDelay=0 to the startup script. This ensures that the synchronous index merge that runs after uploading a document or changing large properties does not cause problems (there is a known issue with synchronous index merging).

2) Modify the SearchIndex element in your repository.xml and in workspace.xml:

change

<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
<param name="path" value="${wsp.home}/index"/>
<param name="resultFetchSize" value="50"/>
</SearchIndex>

to

<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
<param name="path" value="${wsp.home}/index"/>
<param name="resultFetchSize" value="50"/>
<param name="forkJavaCommand" value="nice java -Xmx32m"/>
<param name="extractorPoolSize" value="2"/>
</SearchIndex>


For the above option to work in CRX 2.1, make sure that hotfix pack 2.1.0.9 is installed. It works out of the box in CRX 2.2.

Please note that this will not make those files process successfully, but it will get rid of the OOM error so that your system does not go down.

Other option: If you do not really care about full-text indexing of these documents, you can disable indexing for them in tika-config.xml (crx-quickstart/server/runtime/0/_crx/WEB-INF/classes/org/apache/jackrabbit/core/query/lucene/tika-config.xml). If this folder structure is not present, you have to create it. The original tika-config.xml can be found by unzipping crx-quickstart/server/runtime/0/_crx/WEB-INF/libs/jackrabbit-core-*.jar (copy it to some other location, rename it to .zip, and unzip it) and then going to org/apache/jackrabbit/core/query/lucene.

To stop a document type from being parsed, set org.apache.tika.parser.EmptyParser as the parser class for its MIME type.

For example (to not index Excel sheets):

<parser class="org.apache.tika.parser.EmptyParser">
<mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
</parser>

To disable PDF parsing, you can remove this entry:
<parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
<mime>application/pdf</mime>
</parser>

The above method will also help you reduce the Lucene index size in CQ.
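
As a sketch of how several document types could be excluded at once, the fragment below maps the legacy and OOXML Word and Excel MIME types to EmptyParser. It assumes that one parser entry may carry several mime children and that these entries sit next to the existing parser entries; check both against the tika-config.xml extracted from your own jackrabbit-core jar. If another parser entry already lists one of these MIME types, it is probably safest to remove it from that entry.

```xml
<!-- Assumption: goes alongside the existing <parser> entries in tika-config.xml -->
<parser class="org.apache.tika.parser.EmptyParser">
  <!-- Excel: legacy .xls and OOXML .xlsx -->
  <mime>application/vnd.ms-excel</mime>
  <mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
  <!-- Word: legacy .doc and OOXML .docx -->
  <mime>application/msword</mime>
  <mime>application/vnd.openxmlformats-officedocument.wordprocessingml.document</mime>
</parser>
```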

Note: To reduce the Lucene index size, you can also add the following in workspace.xml:

<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
<param name="path" value="${wsp.home}/index"/>
<!-- add below param -->
<param name="supportHighlighting" value="false"/>
</SearchIndex>
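
If you apply both the OOM workaround from step 2 and the highlighting setting, the combined element might look like the sketch below. It only merges the parameters already shown in this post; the values are the ones used above, not tuned recommendations.

```xml
<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
  <param name="path" value="${wsp.home}/index"/>
  <param name="resultFetchSize" value="50"/>
  <!-- run text extraction in a separate, low-priority JVM with a small heap -->
  <param name="forkJavaCommand" value="nice java -Xmx32m"/>
  <param name="extractorPoolSize" value="2"/>
  <!-- skip storing highlighting data to keep the index smaller -->
  <param name="supportHighlighting" value="false"/>
</SearchIndex>
```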

4 comments:

  1. For option 1 (adding -Dcom.day.crx.persistence.tar.IndexMergeDelay=0 to the startup script), where exactly should this go when CQ runs as part of a WebLogic app server?

    1. In WebLogic it can go anywhere you define JVM system parameters, since this is a system parameter.