- Type: Bug
- Status: Closed
- Resolution: Fixed
- Affects Version/s: 6.2.10 EE GA1, 7.2.0 GA1, 7.2.10 DXP GA1
- Fix Version/s: 7.0.0 DXP FP86, 7.0.10.12 DXP SP12, 7.0.X, 7.1.10 DXP FP14, 7.1.X, 7.2.10 DXP FP2, 7.2.10.1 DXP SP1, 7.2.X, Master
- Component/s: Fault Tolerance > Clustering Framework
- Branch Version/s: 7.2.x, 7.1.x, 7.0.x
- Backported to Branch: Committed
- Fix Priority: 3
- Git Pull Request:
Overview
Communication problems in Liferay's jGroups-based clustering can significantly slow down the application server's HTTP threads, which can lead to a situation where the cluster stops responding.
My customer's large Liferay cluster slowed down so much that none of the nodes were responding to traffic. In this case a single node was the root cause, and once that node was killed the cluster healed itself. In the meantime, however, the site was unable to serve any traffic for a significant period of time.
Examining thread dumps from that period tracked the problem down to jGroups.send(..), which was being called from the thread that was serving the user's HTTP request.
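The thread-dump check described above can be sketched in Java. This is only an illustration, not part of the original analysis: it assumes the relevant frames are org.jgroups.JChannel.send and com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain (the names used in the Byteman rules in this ticket), and the class name BlockedSendFinder is hypothetical. To report anything useful it would have to run inside the portal JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: finds threads that are inside JChannel.send
// while serving an HTTP request (same condition as the Byteman IF clause).
final class BlockedSendFinder {

    // Frame names assumed from the Byteman rules in this ticket.
    static final String SEND_CLASS = "org.jgroups.JChannel";
    static final String SEND_METHOD = "send";
    static final String HTTP_MARKER_CLASS =
        "com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain";

    // True if the stack contains both a JChannel.send frame and an InvokerFilterChain frame.
    static boolean isHttpThreadInSend(StackTraceElement[] stack) {
        boolean inSend = false;
        boolean fromHttp = false;
        for (StackTraceElement frame : stack) {
            String className = frame.getClassName();
            if (className.equals(SEND_CLASS) && frame.getMethodName().equals(SEND_METHOD)) {
                inSend = true;
            }
            if (className.equals(HTTP_MARKER_CLASS)) {
                fromHttp = true;
            }
        }
        return inSend && fromHttp;
    }

    // Returns the names of all threads in the given snapshot that match the condition.
    static List<String> findAffectedThreads(Map<Thread, StackTraceElement[]> stacks) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
            if (isHttpThreadInSend(entry.getValue())) {
                names.add(entry.getKey().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Snapshot of the current JVM; outside the portal this list is simply empty.
        List<String> affected = findAffectedThreads(Thread.getAllStackTraces());
        System.out.println("HTTP threads inside JChannel.send: " + affected);
    }
}
```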
This ticket describes how to simulate this kind of situation:
Prepare for the simulated test
Create a Liferay DXP 7.2 (or EE 6.2 clustered fixpack -69) environment with Tomcat. (EE 6.2 might require small changes to the Byteman script.)
Install Byteman (https://byteman.jboss.org/downloads.html) under the `<tomcat>/byteman` directory.
Modify the <tomcat>/bin/setenv.sh script and add the following at the end:
if [ -f "${CATALINA_HOME}/byteman/lib/byteman.jar" ]; then
    BYTEMAN_HOME="${CATALINA_HOME}/byteman"
    CATALINA_OPTS="${CATALINA_OPTS} -Dorg.jboss.byteman.transform.all"
    CATALINA_OPTS="${CATALINA_OPTS} -Dorg.jboss.byteman.allow.config.updates"
    if [ -f "${CATALINA_HOME}/default.btm" ]; then
        echo "Byteman Found with script ${CATALINA_HOME}/default.btm"
        CATALINA_OPTS="${CATALINA_OPTS} -javaagent:${BYTEMAN_HOME}/lib/byteman.jar=script:${CATALINA_HOME}/default.btm,boot:${BYTEMAN_HOME}/lib/byteman.jar,listener:true"
    else
        echo "Byteman Found"
        CATALINA_OPTS="${CATALINA_OPTS} -javaagent:${BYTEMAN_HOME}/lib/byteman.jar=sys:${BYTEMAN_HOME}/lib/byteman.jar,listener:true"
    fi
else
    echo "Booting without ${CATALINA_HOME}/byteman"
fi
With DXP 7.2 / CE 7.2, add the following to portal-ext.properties (if the property already exists there, make sure org.jboss.byteman.* is added as its last entry):
module.framework.properties.org.osgi.framework.bootdelegation=\
    __redirected,\
    com.liferay.aspectj,\
    com.liferay.aspectj.*,\
    com.liferay.expando.kernel.model,\
    com.liferay.portal.servlet.delegate,\
    com.liferay.portal.servlet.delegate*,\
    com.sun.ccpp,\
    com.sun.ccpp.*,\
    com.sun.crypto.*,\
    com.sun.image.*,\
    com.sun.jmx.*,\
    com.sun.jna,\
    com.sun.jndi.*,\
    com.sun.mail.*,\
    com.sun.management.*,\
    com.sun.media.*,\
    com.sun.msv.*,\
    com.sun.org.*,\
    com.sun.syndication,\
    com.sun.tools.*,\
    com.sun.xml.*,\
    com.yourkit.*,\
    javax.validation,\
    javax.validation.*,\
    jdk.*,\
    sun.*,\
    weblogic.jndi,\
    weblogic.jndi.*,\
    org.jboss.byteman.*
Create the Byteman rule file <tomcat>/default.btm with the following content (the file was written for 7.2, so it might need changes for other versions):
# Prints log entry when JChannel.close is called
RULE org.jgroups.JChannel.close
CLASS org.jgroups.JChannel
METHOD close
AT ENTRY
BIND channel:org.jgroups.JChannel = $0;
IF true
DO
    # We open trace every time just in case. It won't re-open it.
    traceOpen("jgroups_log","jgroups.log");
    traceln("jgroups_log","Closing JChannel: " + channel.cluster_name);
ENDRULE

# Prints log entry when JChannel.connect(String) is called
RULE org.jgroups.JChannel.connect
CLASS org.jgroups.JChannel
METHOD connect(String)
AT ENTRY
BIND threadId = ""+Thread.currentThread().getId() + "/" + Thread.currentThread().getName();
IF true
DO
    # We open trace every time just in case. It won't re-open it.
    traceOpen("jgroups_log","jgroups.log");
    traceln("jgroups_log", threadId + "\t" + new java.util.Date() + "\tConnect JChannel: " + $1);
ENDRULE

# Prints log entry when JChannel.send(org.jgroups.Message) is called
RULE org.jgroups.JChannel.send
CLASS org.jgroups.JChannel
METHOD send(org.jgroups.Message)
AT ENTRY
BIND threadId = ""+Thread.currentThread().getId() + "/" + Thread.currentThread().getName();
     channelName = $0.cluster_name + "";
# Tap rule only to Http thread (Sleep also 5000 ms)
IF formatStack().indexOf("com.liferay.portal.kernel.servlet.filters.invoker.InvokerFilterChain") > 0
DO
    Thread.sleep(5000);
    traceln("jgroups_log", threadId + "\t" + new java.util.Date() + "\t" + channelName + "\nSend:\nSTACKTRACE\n" + formatStack());
ENDRULE
Test
1. Start the portal
2. Add a user. When you save the user information, you will notice that it is slow and takes around 15 seconds (due to the Thread.sleep in the Byteman script).
3. Adding a role to the user takes 5+ seconds for the same reason.
4. In <liferay-home>/jgroups.log you can see the places in the code from which send was called. Adding a user blocks 3 times and updating a role blocks once.
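The timings above follow from the number of logged sends multiplied by the 5000 ms sleep in the rule (3 sends give roughly 15 seconds, 1 send gives roughly 5 seconds). A minimal sketch for checking this against the log, assuming each send entry contains a line consisting only of "Send:" as written by the traceln call in the rule; the class name JGroupsSendLogSummary and the default log path are hypothetical:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: counts the send entries written by the Byteman rule
// and estimates the delay the rule injected into HTTP threads.
final class JGroupsSendLogSummary {

    // Matches Thread.sleep(5000) in the org.jgroups.JChannel.send rule.
    static final long INJECTED_DELAY_MS = 5000L;

    // Each traced send writes one line that is exactly "Send:".
    static long countSends(List<String> lines) {
        return lines.stream().filter(line -> line.trim().equals("Send:")).count();
    }

    // Total delay injected by the rule for the sends found in the log.
    static long injectedDelayMs(List<String> lines) {
        return countSends(lines) * INJECTED_DELAY_MS;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to jgroups.log as the first argument (assumed location otherwise).
        Path log = Paths.get(args.length > 0 ? args[0] : "jgroups.log");
        if (!Files.exists(log)) {
            System.out.println("Log file not found: " + log);
            return;
        }
        List<String> lines = Files.readAllLines(log, Charset.defaultCharset());
        System.out.println("Sends from HTTP threads: " + countSends(lines));
        System.out.println("Injected delay (ms): " + injectedDelayMs(lines));
    }
}
```

Run against a log captured while saving a single new user, this would be expected to report 3 sends, in line with step 4.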
Summary
Cluster communication should be isolated from the HTTP threads to increase Liferay's stability in exceptional cases where jGroups slows down.
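One way to picture this isolation is to hand the blocking send to a dedicated worker thread with a bounded queue, so the HTTP thread only enqueues and returns. The sketch below is an illustration under that assumption, not the fix delivered in the listed fix packs; AsyncClusterSender and its methods are hypothetical names, and the blocking send is passed in as a plain callback rather than a real jGroups call. Note that making sends asynchronous changes delivery timing, which may matter for callers that rely on the message having been sent before the request completes.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: decouples a slow, blocking cluster send from the calling thread.
final class AsyncClusterSender<M> implements AutoCloseable {

    private final ExecutorService executor;
    private final Consumer<M> blockingSend;

    AsyncClusterSender(Consumer<M> blockingSend, int queueCapacity) {
        this.blockingSend = blockingSend;
        // A single worker preserves message order; the bounded queue caps memory use.
        this.executor = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            runnable -> new Thread(runnable, "cluster-sender"),
            new ThreadPoolExecutor.AbortPolicy());
    }

    // Queues the message and returns immediately; false means the queue is full.
    boolean offer(M message) {
        try {
            executor.execute(() -> blockingSend.accept(message));
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    // Stops accepting new messages; already queued messages are still sent.
    @Override
    public void close() {
        executor.shutdown();
    }

    public static void main(String[] args) {
        // Simulates a send that blocks for 5 seconds, like the Byteman rule does.
        AsyncClusterSender<String> sender = new AsyncClusterSender<>(message -> {
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("sent: " + message);
        }, 100);

        long start = System.nanoTime();
        boolean queued = sender.offer("user-updated");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        System.out.println("queued=" + queued + ", caller waited " + elapsedMs + " ms");
        sender.close();
    }
}
```

The bounded queue is a deliberate choice: if jGroups stalls for a long time, offer starts returning false instead of letting pending messages pile up, and the caller has to decide whether to drop, retry, or fall back to a synchronous send.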