PUBLIC - Liferay Portal Community Edition
LPS-82603

Race condition on member join in cluster setup during startup



      Cluster communication issues may occur at startup. Below is a brief timeline of what is happening, starting from the "ViewAccepted" log message on the slave node:
      1. JGroupsReceiver#viewAccepted updates the addresses in its channel, and calls addressesUpdated and coordinatorAddressUpdated on the clusterReceiver (I'll follow the addressesUpdated in this case)
      2. BaseClusterReceiver#addressesUpdated starts a new thread by calling execute on AddressesUpdatedRunnable. At this point, cluster initialization continues on the main thread.
      3. BaseClusterReceiver$AddressesUpdatedRunnable#run calls doAddressesUpdated with the addresses
      4. ClusterRequestReceiver#doAddressesUpdated then sends a notifyRequest
      5. ClusterExecutorImpl#notifyRequest sends a ClusterNodeStatus request to all other nodes
      6. On the Master, the request is handled in ClusterExecutorImpl#handleReceivedClusterRequest. At this point, the Master node calls the "memberJoined" method, which adds the clusterNode to the list of available nodes
      7. Still in this execution, Master sends a ClusterNodeResponse back to the slave
      8. On the Slave, ClusterExecutorImpl#handleReceivedClusterNodeResponse adds the Master node to the list of available nodes
      9. It is important to note that steps 3-8 happen asynchronously, mostly as reactions to cluster events
      10. Meanwhile, the main thread reaches the point where it needs the NodeId of the coordinator node. Since the main thread was the one that received the address, we know the address is available, so we query the list of nodes for it.
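The timeline above can be modeled with a short, self-contained sketch. The class and field names below are simplified stand-ins (not Liferay's actual implementation): a background thread plays the role of the asynchronous membership handshake (steps 3-8) that eventually registers the master node, while the main thread performs the lookup of step 10 as soon as it knows the coordinator's address.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the race between the async membership
// handshake and the main thread's coordinator lookup.
public class ClusterJoinRace {

    // Stand-in for the executor's map of live cluster nodes.
    static final Map<String, String> liveNodes = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        String coordinatorAddress = "master-node";

        // Background thread: models the notifyRequest/response exchange
        // that eventually registers the master node (steps 3-8).
        Thread handshake = new Thread(() -> {
            try {
                Thread.sleep(100); // network latency between the nodes
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            liveNodes.put(coordinatorAddress, "master-cluster-node-id");
        });
        handshake.start();

        // Main thread (step 10): the address is already known, but the
        // node entry may not be registered yet -- this lookup can be null.
        String nodeId = liveNodes.get(coordinatorAddress);
        System.out.println("immediate lookup: " + nodeId); // likely null

        handshake.join();
        System.out.println("after handshake: " + liveNodes.get(coordinatorAddress));
    }
}
```

The point of the sketch is that having the coordinator's address is no guarantee that the corresponding node object has been registered: the two are populated by different threads.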

      The problem with #10 above is that, at that point, there is no guarantee that the nodes have finished communicating, so we are effectively looking at a race condition. Since ClusterMasterExecutorImpl#getMasterClusterNodeId runs a while(true) loop, we see a number of error messages until the communication between the two nodes completes.
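The retry behavior described above can be sketched as follows. This is not Liferay's actual code, and the error message text is invented for illustration; it only shows the pattern: poll the node list, log an error on each miss, and retry until the coordinator's entry appears, which is why the startup log contains several errors rather than a hard failure.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of a while(true) lookup loop racing against an async
// registration, in the spirit of getMasterClusterNodeId.
public class MasterNodeIdLookup {

    // Stand-in for the executor's map of live cluster nodes.
    static final Map<String, String> liveNodes = new ConcurrentHashMap<>();

    static String getMasterClusterNodeId(String coordinatorAddress)
            throws InterruptedException {
        while (true) {
            String nodeId = liveNodes.get(coordinatorAddress);
            if (nodeId != null) {
                return nodeId;
            }
            // Each pass through here corresponds to one of the error
            // messages seen during startup (message text is illustrative).
            System.err.println("Unable to get cluster node "
                + coordinatorAddress + ", trying again");
            Thread.sleep(50);
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate the delayed async registration of the master node.
        new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            liveNodes.put("master-node", "master-cluster-node-id");
        }).start();

        System.out.println(getMasterClusterNodeId("master-node"));
    }
}
```

The loop does eventually succeed, so the symptom is noisy logs rather than a broken cluster; a blocking wait (for example on a latch signaled by the handshake) would avoid the noise, but that is a design choice, not what the current code does.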

      We can reproduce this by starting two nodes on different machines (or rather, on different networks, to add some latency) and putting some load on the machines themselves. It is possible to get 1-10 error messages.




            Participants: Raven Song, Istvan Sajtos
            Recent user: Antonio Ortega




                Version Package
                7.0.0 DXP FP59
                7.0.0 DXP SP9
                7.1.10 DXP FP2
                7.1.1 CE GA2