Friday, January 27, 2012

How to fix an out-of-sync cluster issue in CQ / CRX / WEM / ADEP

Use case: Your cluster is out of sync, or you see the following error in the log:

27.01.2012 14:09:12 *WARN * ClusterTarSet: Could not open (ClusterTarSet.java, line 710)
java.io.IOException: This cluster node and the master are out of sync. Operation stopped.
Please ensure the repository is configured correctly.
To continue anyway, please delete the index and data tar files on this cluster node and restart.
Please note the Lucene index may still be out of sync unless it is also deleted.
Repository home: /crx-quickstart/crx.0000, workspace: crx.default,
file name: data_00037.tar, position:13252096, expected: 109, got: 116,
last data file in cluster node: 37

Or

*INFO * ClusterTarSet: close /local/dss/apps/crx-quickstart/crx.0000 crx.default (ClusterTarSet.java, line 408)
27.01.2012 14:18:11 *ERROR* RepositoryImpl: Failed to initialize workspace 'crx.default' (RepositoryImpl.java, line 540)
javax.jcr.RepositoryException: Cannot instantiate persistence manager com.day.crx.persistence.tar.TarPersistenceManager
at com.day.crx.core.CRXRepositoryImpl$CRXWorkspaceInfo.createPersistenceManager(CRXRepositoryImpl.java:1101)
at com.day.crx.core.CRXRepositoryImpl$CRXWorkspaceInfo.doInitialize(CRXRepositoryImpl.java:1117)
at org.apache.jackrabbit.core.RepositoryImpl$WorkspaceInfo.initialize(RepositoryImpl.java:1998)
at org.apache.jackrabbit.core.RepositoryImpl.initStartupWorkspaces(RepositoryImpl.java:533)
at com.day.crx.core.CRXRepositoryImpl.initStartupWorkspaces(CRXRepositoryImpl.java:279)
at org.apache.jackrabbit.core.RepositoryImpl.<init>(RepositoryImpl.java:342)
at com.day.crx.core.CRXRepositoryImpl.<init>(CRXRepositoryImpl.java:225)
at com.day.crx.core.CRXRepositoryImpl.<init>(CRXRepositoryImpl.java:267)
at com.day.crx.core.CRXRepositoryImpl.create(CRXRepositoryImpl.java:185)
Caused by: java.io.IOException: This cluster node and the master are out of sync. Operation stopped.
Please ensure the repository is configured correctly.
To continue anyway, please delete the index and data tar files on this cluster node and restart.
Please note the Lucene index may still be out of sync unless it is also deleted.
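If you want to check a log for this condition automatically, the following is a minimal sketch in Python. It assumes the message text and the "file name / position / expected / got" fields look exactly like the excerpts above; the function name find_out_of_sync is made up for this example and is not part of CRX.

```python
import re

# Marker text copied from the log excerpts above.
OUT_OF_SYNC_MARKER = "This cluster node and the master are out of sync"

# ASSUMPTION: the detail fields are formatted as in the first excerpt.
DETAIL_PATTERN = re.compile(
    r"file name:\s*(?P<file>\S+?),\s*"
    r"position:\s*(?P<position>\d+),\s*"
    r"expected:\s*(?P<expected>\d+),\s*"
    r"got:\s*(?P<got>\d+)"
)


def find_out_of_sync(log_text):
    """Return None if the marker is absent.

    Otherwise return a dict with the parsed details, which is empty
    when the marker is present but the detail fields are not.
    """
    if OUT_OF_SYNC_MARKER not in log_text:
        return None
    match = DETAIL_PATTERN.search(log_text)
    if match is None:
        return {}
    details = match.groupdict()
    for key in ("position", "expected", "got"):
        details[key] = int(details[key])
    return details
```

Because an empty dict is also a valid "found" result, compare the return value against None rather than testing it for truthiness.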


Solution:
1) Stop your slave (it will not start in this state anyway).
2) Make sure the latest CRX hotfix (HF), or at least a hotfix greater than 2.2.0.38, is installed on the master. If not, please contact Adobe to get the latest HF. To check the installed hotfix, go to host:port/crx/index.jsp; the CRX hotfix version (for example 2.2.0.50) is shown at the top of the page.
3) Once you have the latest HF, install it on your master and restart the master. If your cluster has more than two nodes, install the HF on all other nodes as well.
4) Once the master is up, take an online backup of the master instance (http://dev.day.com/docs/en/crx/current/administering/backup_and_restore.html).
4.1) If your master already has the latest HF, you can use any valid backup of the master (for example, the backup from last night).
4.2) If you can afford downtime, take an offline backup of the crx-quickstart folder from the master instead.
5) Copy the backup over to the slave instance.
6) On the slave, open the crx-quickstart/repository/cluster.properties file and add the IP addresses of the master and the slave. You can use a command like
echo "addresses=MASTER_IP_ADDRESS,SLAVE_1_IP_ADDRESS" >> crx-quickstart/repository/cluster.properties
7) Remove the crx-quickstart/repository/cluster_node.id file.
8) If your master had the preferredMaster setting in repository.xml, remove that setting as well from the crx-quickstart/repository/repository.xml file.
9) Delete the logs folder, crx-quickstart/logs.
10) Start the slave.
11) Depending on how many changes were made on the master after the backup was taken, the slave may take some time to start.
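The hotfix check in step 2 and the file changes in steps 6 to 9 can be sketched in Python as follows. This is only an illustration under some assumptions: the function names are made up, the IP addresses are whatever you pass in, and the preferredMaster setting is assumed to sit on a single line of its own in repository.xml (if it does not, edit that file by hand instead). Try it on a copy first.

```python
import shutil
from pathlib import Path

MIN_HOTFIX = "2.2.0.38"  # threshold named in step 2


def hotfix_is_newer(installed, minimum=MIN_HOTFIX):
    """Compare dotted versions numerically, so 2.2.0.9 < 2.2.0.38."""
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))
    return as_tuple(installed) > as_tuple(minimum)


def prepare_slave(quickstart, master_ip, slave_ip):
    """Apply steps 6 to 9 to a backup already copied to the slave."""
    quickstart = Path(quickstart)
    repo = quickstart / "repository"

    # Step 6: append the addresses line, like the echo >> command.
    # A newline is added first if the file does not end with one.
    props = repo / "cluster.properties"
    existing = props.read_text() if props.exists() else ""
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    with open(props, "a") as handle:
        handle.write(f"{separator}addresses={master_ip},{slave_ip}\n")

    # Step 7: remove the node id copied over from the master.
    node_id = repo / "cluster_node.id"
    if node_id.exists():
        node_id.unlink()

    # Step 8: drop lines mentioning preferredMaster.
    # ASSUMPTION: the setting occupies a whole line by itself.
    repo_xml = repo / "repository.xml"
    if repo_xml.exists():
        lines = repo_xml.read_text().splitlines(keepends=True)
        kept = [line for line in lines if "preferredMaster" not in line]
        repo_xml.write_text("".join(kept))

    # Step 9: delete the logs folder.
    shutil.rmtree(quickstart / "logs", ignore_errors=True)
```

For example, hotfix_is_newer("2.2.0.50") is true, and prepare_slave("crx-quickstart", "MASTER_IP_ADDRESS", "SLAVE_1_IP_ADDRESS") would be called with your real addresses after step 5 and before step 10.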


It is important to follow this order when stopping and starting the cluster:
1) While stopping
-- Stop the slave first, and then the master
2) While starting
-- Start the last master first, and then the slave
To identify which node was the last master, please check http://crxcluster.wemblog.com/miscellaneous_questions
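If you script your restarts, the ordering rule above can be captured in two small helpers. This is a sketch only: the node names are placeholders, and how you actually stop or start each node (and how you find the last master) is left to your own tooling.

```python
def stop_order(nodes, master):
    """Return nodes in shutdown order: all slaves first, master last."""
    if master not in nodes:
        raise ValueError("master must be one of the cluster nodes")
    return [node for node in nodes if node != master] + [master]


def start_order(nodes, last_master):
    """Return nodes in startup order: last master first, then slaves."""
    if last_master not in nodes:
        raise ValueError("last_master must be one of the cluster nodes")
    return [last_master] + [node for node in nodes if node != last_master]
```

With nodes ["master", "slave1", "slave2"], stop_order puts "master" at the end of the list and start_order puts it at the front.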

If at least one node in the cluster is still running and you have the latest hotfix, you can stop the other nodes in any order.

DO NOT USE THIS PROCESS IN PRODUCTION WITHOUT CONSULTING ADOBE SUPPORT OR SERVICE TEAM

2 comments:

  1. Hi,
    I see the exception below and am unable to bring up the slave instance. I accidentally stopped the master before shutting down the slave. How can I fix this issue?

    com.day.crx.sling.com.day.crx.sling.client acquireRepository: Problem checking JNDI for virtual-crx (javax.naming.NameNotFoundException) javax.naming.NameNotFoundException
