      Description

      Steps to reproduce

      1a) Create ~5K organizations with at least 3 teams each
      1b) Or import the attached database dump
      2) Perform a full reindex
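
      Either way, the data set can be sanity-checked before reindexing; a minimal sketch against the standard Liferay schema (the default table names Organization_ and Team are assumed):

      SELECT
        (SELECT COUNT(*) FROM Organization_) AS organizations,
        (SELECT COUNT(*) FROM Team)          AS teams
      FROM dual;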

      Phenomenon #1: Observe that the LuceneIndexer gets stuck while reindexing organizations

      Extract from the attached thread dump (thread_dump.txt):

      "http-bio-9080-exec-5" daemon prio=10 tid=0x0000000042aa6800 nid=0x1ae3 runnable [0x00007f7ebb8f1000]
         java.lang.Thread.State: RUNNABLE
      	at org.hibernate.engine.StatefulPersistenceContext.afterTransactionCompletion(StatefulPersistenceContext.java:276)
      	at org.hibernate.impl.SessionImpl.afterTransactionCompletion(SessionImpl.java:601)
      	at org.hibernate.jdbc.JDBCContext.afterNontransactionalQuery(JDBCContext.java:292)
      	at org.hibernate.impl.SessionImpl.afterOperation(SessionImpl.java:595)
      	at org.hibernate.impl.SessionImpl.load(SessionImpl.java:993)
      	at org.hibernate.impl.SessionImpl.load(SessionImpl.java:978)
      	at com.liferay.portal.dao.orm.hibernate.SessionImpl.load(SessionImpl.java:189)
      	at com.liferay.portal.kernel.dao.orm.ClassLoaderSession.load(ClassLoaderSession.java:323)
      	at com.liferay.portal.dao.orm.common.EntityCacheImpl.loadResult(EntityCacheImpl.java:201)
      	...
      	at com.liferay.portal.kernel.dao.orm.EntityCacheUtil.loadResult(EntityCacheUtil.java:59)
      	at com.liferay.portal.dao.orm.common.FinderCacheImpl._primaryKeyToResult(FinderCacheImpl.java:273)
      	at com.liferay.portal.dao.orm.common.FinderCacheImpl._primaryKeyToResult(FinderCacheImpl.java:260)
      	at com.liferay.portal.dao.orm.common.FinderCacheImpl.getResult(FinderCacheImpl.java:135)
      	...
      	at com.liferay.portal.kernel.dao.orm.FinderCacheUtil.getResult(FinderCacheUtil.java:47)
      	at com.liferay.portal.service.persistence.RolePersistenceImpl.findByCompanyId(RolePersistenceImpl.java:2099)
      	at com.liferay.portal.service.persistence.RolePersistenceImpl.findByCompanyId(RolePersistenceImpl.java:2044)
      	...
      	at com.liferay.portal.service.impl.RoleLocalServiceImpl.getRoles(RoleLocalServiceImpl.java:704)
      	at sun.reflect.GeneratedMethodAccessor491.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	...
      	at com.liferay.portal.security.permission.ResourceActionsImpl.getRoles(ResourceActionsImpl.java:559)
      	...
      	at com.liferay.portal.security.permission.ResourceActionsUtil.getRoles(ResourceActionsUtil.java:245)
      	at com.liferay.portal.search.SearchPermissionCheckerImpl.doAddPermissionFields_6(SearchPermissionCheckerImpl.java:186)
      	at com.liferay.portal.search.SearchPermissionCheckerImpl.addPermissionFields(SearchPermissionCheckerImpl.java:112)
      	at com.liferay.portal.kernel.search.SearchEngineUtil.updateDocuments(SearchEngineUtil.java:740)
      	at com.liferay.portlet.usersadmin.util.OrganizationIndexer.reindexOrganizations(OrganizationIndexer.java:326)
      	at com.liferay.portlet.usersadmin.util.OrganizationIndexer.doReindex(OrganizationIndexer.java:297)
      

      Phenomenon #2: At the database level far too many SQL statements can be observed; the following query is repeated millions of times.

      select roleimpl0_.roleId as roleId356_0_, roleimpl0_.uuid_ as uuid2_356_0_, roleimpl0_.companyId as companyId356_0_, roleimpl0_.userId as userId356_0_, roleimpl0_.userName as userName356_0_, roleimpl0_.createDate as createDate356_0_, roleimpl0_.modifiedDate as modified7_356_0_, roleimpl0_.classNameId as classNam8_356_0_, roleimpl0_.classPK as classPK356_0_, roleimpl0_.name as name356_0_, roleimpl0_.title as title356_0_, roleimpl0_.description as descrip12_356_0_, roleimpl0_.type_ as type13_356_0_, roleimpl0_.subtype as subtype356_0_ from Role_ roleimpl0_ where roleimpl0_.roleId=:1 
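
      The execution count can also be cross-checked on the database side while the cursor is still in the shared pool; a minimal sketch against v$sql (the LIKE pattern simply matches the beginning of the statement text shown above):

      SELECT sql_id,
        executions,
        ROUND(elapsed_time / 1e6) AS elapsed_seconds
      FROM v$sql
      WHERE sql_text LIKE 'select roleimpl0_.roleId as roleId356_0_%';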
      

      1) While the reindex is running, find out which DB session is bound to the process. (Note that 79mvrbr1yscpa is the sql_id of the statement above; it is a hash of the statement text, so it will always be the same for exactly the same statement.)

      SELECT ash.session_id,
        ash.session_serial#,
        p.spid,
        COUNT(*) AS cnt
      FROM v$active_session_history ash,
        v$session s,
        v$process p
      WHERE ash.sql_id   = '79mvrbr1yscpa'
      AND ash.session_id = s.sid
      AND s.paddr        = p.addr
      GROUP BY ash.session_id,
        ash.session_serial#,
        p.spid;
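
      If v$active_session_history is not available (it requires the Diagnostics Pack), a session currently executing the statement can also be located directly; a minimal sketch against v$session:

      SELECT s.sid,
        s.serial#,
        p.spid
      FROM v$session s,
        v$process p
      WHERE s.sql_id = '79mvrbr1yscpa'
      AND s.paddr    = p.addr;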
      

      2) Enable SQL trace

      BEGIN
        -- The session from the list above with the highest CNT is the one to trace
        dbms_monitor.session_trace_enable(session_id => 49, serial_num => 9);
      END;
      /
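
      Once enough of the reindex has been traced, switch tracing off again for the same session with the matching call (same session_id/serial# as above):

      BEGIN
        -- Stop tracing once the trace file is large enough to be representative
        dbms_monitor.session_trace_disable(session_id => 49, serial_num => 9);
      END;
      /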
      

      3) Look for the trace file of that session (the value of the SPID column is the O/S process ID of the server process and appears in the trace file name), then aggregate the data.
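
      If the trace directory is not known up front, it can be queried from the database; a minimal sketch for 11g and later, where v$diag_info exposes the location:

      SELECT value AS trace_directory
      FROM v$diag_info
      WHERE name = 'Diag Trace';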

      $ ls -hgG *6635*.trc
      -rw-r----- 1 1.2G Sep 16 11:15 LPISSUES_ora_6635.trc
      
      $ tkprof LPISSUES_ora_6635.trc LPISSUES_ora_6635.out aggregate=yes waits=yes sort=execnt
      

      As you can see, the SQL above has been executed 2.2 million times so far (and counting). There is nothing wrong with its execution plan; the problem is that this many executions generate an enormous amount of network traffic.

      OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS
      
      call     count       cpu    elapsed       disk      query    current        rows
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      Parse   2226820     21.61      21.42          0          0          0           0
      Execute 2226820     26.10      24.29          0          6          0           0
      Fetch   2226820     21.21      21.27          0    6680610          0     2226820
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      total   6680460     68.94      66.99          0    6680616          0     2226820
      
      Misses in library cache during parse: 2
      Misses in library cache during execute: 2
      
      Elapsed times include waiting on following events:
        Event waited on                             Times   Max. Wait  Total Waited
        ----------------------------------------   Waited  ----------  ------------
        SQL*Net message to client                 2226820        0.00          1.29
        SQL*Net message from client               2226819        0.20        718.74
      

      At the end of the report it can be seen that the top wait event for this session is SQL*Net message from client; so far the DB has waited 718 seconds for the client (Liferay). High values here usually indicate inefficient data processing on the client side, e.g. issuing a query for every record one by one instead of fetching what is needed at once.
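
      Schematically, that is the difference between what the trace shows and a set-based fetch; the statements below are an illustration only (abridged column list), not the actual code path:

      -- Observed pattern: one network round trip per individual role, ~2.2 million times
      SELECT roleId, companyId, name, type_ FROM Role_ WHERE roleId = :1;

      -- Set-based alternative: the same rows in a single round trip per company
      SELECT roleId, companyId, name, type_ FROM Role_ WHERE companyId = :1;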

        Attachments

        1. LPISSUES_ora_6635.out (10 kB)
        2. lps_40214_export.zip (2.79 MB)
        3. thread_dump.txt (58 kB)


        Packages

        Version Package
        6.1.X EE
        6.2.0 CE RC1