Details
Type: Bug
Status: Closed
Resolution: No Longer Reproducible
Affects Version/s: 6.0.11 EE, 6.1.0 CE GA1
Fix Version/s: 6.2.0 CE M4
Component/s: Core Infrastructure, Environments, Environments > Databases, Performance
Labels:
Environment: Red Hat Enterprise Linux Server release 5.6 (Tikanga)
Linux Kernel Version 2.6.18-238.5.1.el5
Tomcat 6.0.29
PostgreSQL 8.4.7
Java SE Runtime build 1.6.0_24-b07
Java HotSpot Server VM build 19.1-b02
Database statistics to note:
22426 rows in the user_ Liferay table
6 rows in users_groups Liferay table
19 rows in role_ Liferay table
22426 rows in the group_ table
44829 rows in the users_roles table
44813 rows in the users_usergroups table
6 rows in the usergroup table
Epic/Theme:
Fix Priority: 3
Description
Users attempt to log in but never authenticate and remain on the sign-in page, with the browser indicating that it is still waiting for a response. Only certain users are affected, and they typically call support to notify us.
User steps to reproduce the problem:
1. Navigate in the browser to the sign-in page.
2. Enter login credentials into the sign-in form.
3. Click sign-in.
Expected: Successful authentication and login.
Actual: 'Waiting for xxx.server.xxx...' in the browser status pane.
This was attempted in several browsers, including Firefox 4.0.1, Chrome 12.0.742.100 and Internet Explorer 8.0.6001.18702, all running on 32-bit Windows.
While attempting to sign in as one of the affected users, we set a debug breakpoint in our custom authenticator code and discovered that the thread ends up waiting for data from Postgres. For background, we developed a custom authenticator and applied it to the 'pre' pipeline, modelling our implementation on the LDAP authenticator that comes prepackaged with Liferay. One of the things we do is import/merge the user's user groups and the user itself into Liferay upon successful authentication (exactly as LDAP does) to keep Liferay in sync with our master-of-record user repository.
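For context, the sketch below shows the general shape of such a 'pre' pipeline authenticator against the Liferay 6.x Authenticator contract. The class name, the portal-ext.properties value and the two private helpers are illustrative placeholders rather than our actual code; the real merge work ends in the UserLocalServiceUtil.updateUser call quoted further down.

import java.util.Map;

import com.liferay.portal.security.auth.AuthException;
import com.liferay.portal.security.auth.Authenticator;

// Registered in portal-ext.properties via something like:
//   auth.pipeline.pre=com.example.auth.RepositoryAuthenticator
// The class name and the two private helpers below are illustrative placeholders.
public class RepositoryAuthenticator implements Authenticator {

    public int authenticateByEmailAddress(
            long companyId, String emailAddress, String password,
            Map<String, String[]> headerMap, Map<String, String[]> parameterMap)
        throws AuthException {

        if (!checkMasterRepository(emailAddress, password)) {
            return FAILURE;
        }

        // Import/merge the user and its user groups into Liferay, in the same
        // way the bundled LDAP authenticator does. This ultimately calls the
        // UserLocalServiceUtil.updateUser(...) overload quoted below, which is
        // the call whose transaction never returned.
        importOrMergeUser(companyId, emailAddress);

        return SUCCESS;
    }

    public int authenticateByScreenName(
            long companyId, String screenName, String password,
            Map<String, String[]> headerMap, Map<String, String[]> parameterMap)
        throws AuthException {

        // Not used in our deployment; defer to the rest of the pipeline.
        return SUCCESS;
    }

    public int authenticateByUserId(
            long companyId, long userId, String password,
            Map<String, String[]> headerMap, Map<String, String[]> parameterMap)
        throws AuthException {

        // Not used in our deployment; defer to the rest of the pipeline.
        return SUCCESS;
    }

    private boolean checkMasterRepository(String emailAddress, String password) {
        // Placeholder for the credential check against our master-of-record repository.
        return true;
    }

    private void importOrMergeUser(long companyId, String emailAddress) {
        // Placeholder for the user/user-group merge described above.
    }
}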
When we paused the thread to determine what exactly was going on, we found it was blocked on an I/O read from the socket connected to Postgres. The stack trace indicated this was happening during the Hibernate session flush (we assume this is when the JDBC connection is being committed), and the block was indefinite. Following the stack trace backwards, we found that the last method invocation in our custom code was to:
UserLocalServiceUtil.updateUser(
    long userId, java.lang.String oldPassword, java.lang.String newPassword1,
    java.lang.String newPassword2, boolean passwordReset,
    java.lang.String reminderQueryQuestion, java.lang.String reminderQueryAnswer,
    java.lang.String screenName, java.lang.String emailAddress, long facebookId,
    java.lang.String openId, java.lang.String languageId, java.lang.String timeZoneId,
    java.lang.String greeting, java.lang.String comments, java.lang.String firstName,
    java.lang.String middleName, java.lang.String lastName, int prefixId, int suffixId,
    boolean male, int birthdayMonth, int birthdayDay, int birthdayYear,
    java.lang.String smsSn, java.lang.String aimSn, java.lang.String facebookSn,
    java.lang.String icqSn, java.lang.String jabberSn, java.lang.String msnSn,
    java.lang.String mySpaceSn, java.lang.String skypeSn, java.lang.String twitterSn,
    java.lang.String ymSn, java.lang.String jobTitle, long[] groupIds,
    long[] organizationIds, long[] roleIds,
    java.util.List<com.liferay.portal.model.UserGroupRole> userGroupRoles,
    long[] userGroupIds, com.liferay.portal.service.ServiceContext serviceContext)
throws com.liferay.portal.kernel.exception.PortalException,
    com.liferay.portal.kernel.exception.SystemException
Via the stack trace we inspected the JDBC statements being executed and found that the thread blocked while the following SQL was running:
update User_ set uuid_=$1, companyId=$2, createDate=$3, modifiedDate=$4, defaultUser=$5, contactId=$6, password_=$7, passwordEncrypted=$8, passwordReset=$9, passwordModifiedDate=$10, digest=$11, reminderQueryQuestion=$12, reminderQueryAnswer=$13, graceLoginCount=$14, screenName=$15, emailAddress=$16, facebookId=$17, openId=$18, portraitId=$19, languageId=$20, timeZoneId=$21, greeting=$22, comments=$23, firstName=$24, middleName=$25, lastName=$26, jobTitle=$27, loginDate=$28, loginIP=$29, lastLoginDate=$30, lastLoginIP=$31, lastFailedLoginDate=$32, failedLoginAttempts=$33, lockout=$34, lockoutDate=$35, agreedToTermsOfUse=$36, active_=$37 where userId=$38 (where $x are parameter placeholders).
While this was happening, we also ran various queries against the Postgres system views to get a sense of which connections were open and from whom. We discovered that for each 'update' corresponding to one of the above API calls, there were approximately 9 database-level table/row locks associated with the "user_", "users_groups", "role_", "group_", "user_", "users_roles", "users_usergroups", "usergroup" and "user_" relations (1 lock per relation). After reading some Postgres literature, we found there were conflicting locks on the user_ table (one user_ lock was an 'ExclusiveLock' and another a 'RowExclusiveLock'); according to http://www.postgresql.org/docs/8.2/interactive/explicit-locking.html these two lock modes conflict with each other. These locks all belonged to a single transaction involving the above-mentioned update statement. This seemed to indicate that a database deadlock had occurred. Many of these transactions had been waiting for at least two days.
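For reference, the sketch below shows one way to reproduce that inspection over plain JDBC by joining pg_locks to pg_class and pg_stat_activity; the JDBC URL, database name and credentials are placeholders, not our actual configuration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of the lock inspection; the JDBC URL and credentials are
// placeholders. On PostgreSQL 8.4, pg_stat_activity exposes the backend process
// id as "procpid" and the running statement as "current_query".
public class LockInspector {

    public static void main(String[] args) throws Exception {
        String sql =
            "select l.pid, c.relname, l.mode, l.granted, a.usename, a.current_query " +
            "from pg_locks l " +
            "join pg_class c on c.oid = l.relation " +
            "join pg_stat_activity a on a.procpid = l.pid " +
            "where c.relname in ('user_', 'users_groups', 'role_', 'group_', " +
            "'users_roles', 'users_usergroups', 'usergroup') " +
            "order by l.pid, c.relname";

        Connection con = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/lportal", "liferay", "liferay");

        try {
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(sql);

            while (rs.next()) {
                System.out.println(
                    rs.getInt("pid") + "  " + rs.getString("relname") + "  " +
                    rs.getString("mode") + "  granted=" + rs.getBoolean("granted") +
                    "  " + rs.getString("usename"));
            }
        }
        finally {
            con.close();
        }
    }
}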
We were able to determine the process id of each of these deadlocked updates and manually issued a UNIX kill against them, after which the affected users could authenticate and log in. The interesting thing to note here is that once we killed the so-called 'primordial' (first) one or two deadlocked processes, the remaining ones unblocked and completed on their own.
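Roughly the SQL-level equivalent of that UNIX kill, sketched below under the same placeholder connection details, is pg_terminate_backend(), available as of PostgreSQL 8.4, passed the blocked backend pid found via the lock inspection above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Terminates one blocked Postgres backend by pid. Connection details are placeholders.
public class BackendTerminator {

    public static void main(String[] args) throws Exception {
        int blockedPid = Integer.parseInt(args[0]);

        Connection con = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/lportal", "liferay", "liferay");

        try {
            PreparedStatement ps = con.prepareStatement(
                "select pg_terminate_backend(?)");

            ps.setInt(1, blockedPid);

            ResultSet rs = ps.executeQuery();

            if (rs.next()) {
                System.out.println("terminated=" + rs.getBoolean(1));
            }
        }
        finally {
            con.close();
        }
    }
}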
Any help would be greatly appreciated.
Thank you,
Michael Melo
Intelliware Development Inc.