PUBLIC - Liferay Portal Community Edition
LPS-82603

Race condition on member join in cluster setup during startup



      Cluster communication issues may occur at startup. Below is a brief timeline of what is happening, starting from the "ViewAccepted" log message on the slave node:
      1. JGroupsReceiver#viewAccepted updates the addresses in its channel, and calls addressesUpdated and coordinatorAddressUpdated on the clusterReceiver (I'll follow the addressesUpdated in this case)
      2. BaseClusterReceiver#addressesUpdated starts a new thread by calling execute on AddressesUpdatedRunnable. At this point, cluster initialization continues on the main thread.
      3. BaseClusterReceiver$AddressesUpdatedRunnable#run calls doAddressesUpdated with the addresses
      4. ClusterRequestReceiver#doAddressesUpdated then sends a notifyRequest
      5. ClusterExecutorImpl#notifyRequest sends a ClusterNodeStatus request to all other nodes
      6. On the Master, the request is handled in ClusterExecutorImpl#handleReceivedClusterRequest. At this point, the Master node calls the "memberJoined" method, which adds the clusterNode to the list of available nodes
      7. Still in this execution, Master sends a ClusterNodeResponse back to the slave
      8. On the Slave, ClusterExecutorImpl#handleReceivedClusterNodeResponse adds the Master node to the list of available nodes
      9. It is important to note that steps 3-8 happen asynchronously, mostly as reactions to cluster events
      10. Meanwhile, the main thread reaches the point where it needs the NodeId of the coordinator node. Since the main thread was the one that received the address, we know the address is available, so we query the list of nodes for it.
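The timeline above can be modeled with a short, self-contained sketch. The class and field names below are simplified stand-ins (not Liferay's actual implementation): a background thread plays the role of the asynchronous membership handshake (steps 3-8) that eventually registers the master node, while the main thread performs the lookup of step 10 as soon as it knows the coordinator's address.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the race between the async membership
// handshake and the main thread's coordinator lookup.
public class ClusterJoinRace {

    // Stand-in for the executor's map of live cluster nodes.
    static final Map<String, String> liveNodes = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        String coordinatorAddress = "master-node";

        // Background thread: models the notifyRequest/response exchange
        // that eventually registers the master node (steps 3-8).
        Thread handshake = new Thread(() -> {
            try {
                Thread.sleep(100); // network latency between the nodes
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            liveNodes.put(coordinatorAddress, "master-cluster-node-id");
        });
        handshake.start();

        // Main thread (step 10): the address is already known, but the
        // node entry may not be registered yet -- this lookup can be null.
        String nodeId = liveNodes.get(coordinatorAddress);
        System.out.println("immediate lookup: " + nodeId); // likely null

        handshake.join();
        System.out.println("after handshake: " + liveNodes.get(coordinatorAddress));
    }
}
```

The point of the sketch is that having the coordinator's address is no guarantee that the corresponding node object has been registered: the two are populated by different threads.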

      The problem with #10 above is that, at that point, there is no guarantee that the nodes have finished communicating, so we are effectively looking at a race condition. Since ClusterMasterExecutorImpl#getMasterClusterNodeId runs a while(true) loop, we see a number of error messages until the communication between the two nodes completes.
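The retry behavior described above can be sketched as follows. This is not Liferay's actual code, and the error message text is invented for illustration; it only shows the pattern: poll the node list, log an error on each miss, and retry until the coordinator's entry appears, which is why the startup log contains several errors rather than a hard failure.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of a while(true) lookup loop racing against an async
// registration, in the spirit of getMasterClusterNodeId.
public class MasterNodeIdLookup {

    // Stand-in for the executor's map of live cluster nodes.
    static final Map<String, String> liveNodes = new ConcurrentHashMap<>();

    static String getMasterClusterNodeId(String coordinatorAddress)
            throws InterruptedException {
        while (true) {
            String nodeId = liveNodes.get(coordinatorAddress);
            if (nodeId != null) {
                return nodeId;
            }
            // Each pass through here corresponds to one of the error
            // messages seen during startup (message text is illustrative).
            System.err.println("Unable to get cluster node "
                + coordinatorAddress + ", trying again");
            Thread.sleep(50);
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate the delayed async registration of the master node.
        new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            liveNodes.put("master-node", "master-cluster-node-id");
        }).start();

        System.out.println(getMasterClusterNodeId("master-node"));
    }
}
```

The loop does eventually succeed, so the symptom is noisy logs rather than a broken cluster; a blocking wait (for example on a latch signaled by the handshake) would avoid the noise, but that is a design choice, not what the current code does.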

      We can reproduce this by starting two nodes on different machines (or rather, on different networks, to add some latency) and putting some load on the machines themselves. It is possible to get 1-10 error messages.




            Participants: Raven Song, Istvan Sajtos
            Recent user: Antonio Ortega




                Version Package
                7.0.0 DXP FP59
                7.0.0 DXP SP9
                7.1.10 DXP FP2
                7.1.1 CE GA2