PUBLIC - Liferay Portal Community Edition / LPS-139467

It is impossible to search page content on some architectures.

    Details

    • Type: Regression Bug
    • Status: Closed
    • Resolution: Won't Fix
    • Affects Version/s: 7.2.X, 7.3.X, Master
    • Fix Version/s: None
    • Labels:

      Description

      In LPS-124407, we modified the way in which page content is indexed. We added a page crawler that makes GET requests to the page endpoint and indexes the response as the content.

      The page crawler always constructs the page endpoint from the domain it saw in a previous request sent to the application server. Depending on the system architecture, this can be very problematic: for some architectures, it results in an incorrect domain name being used, and there is no way to reconfigure the page crawler to use a different one.

      We should add a new configuration option that lets the user specify the host (domain name or IP address), port, and connection protocol that the page crawler should use. If the configuration has not been set, we can fall back to the current behavior.
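
      The proposed fallback can be sketched roughly as follows. This is a minimal illustration only, not the actual Liferay implementation: the class names, the CrawlerConfig holder, the method signature, and the example hostnames and page path are all hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

public class CrawlerEndpointSketch {

    /** Hypothetical holder for the proposed settings; not a Liferay class. */
    public static final class CrawlerConfig {
        private final String protocol;
        private final String host;
        private final int port;

        public CrawlerConfig(String protocol, String host, int port) {
            this.protocol = protocol;
            this.host = host;
            this.port = port;
        }

        public String getProtocol() { return protocol; }
        public String getHost() { return host; }
        public int getPort() { return port; }
    }

    /**
     * Builds the URI the crawler would request. If a configuration is
     * present, its values are used; otherwise the values observed on a
     * previous request are used (the current behavior).
     */
    public static URI resolvePageUri(
            Optional<CrawlerConfig> config,
            String lastSeenProtocol, String lastSeenHost, int lastSeenPort,
            String pagePath) {

        String protocol = config.map(CrawlerConfig::getProtocol).orElse(lastSeenProtocol);
        String host = config.map(CrawlerConfig::getHost).orElse(lastSeenHost);
        int port = config.map(CrawlerConfig::getPort).orElse(lastSeenPort);

        try {
            return new URI(protocol, null, host, port, pagePath, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid crawler endpoint", e);
        }
    }

    public static void main(String[] args) {
        // Configured: the crawler talks to the local app server directly.
        System.out.println(resolvePageUri(
            Optional.of(new CrawlerConfig("http", "localhost", 8080)),
            "https", "www.example.com", 443, "/web/guest/test-page"));

        // Not configured: fall back to the host seen on an earlier request.
        System.out.println(resolvePageUri(
            Optional.empty(),
            "https", "www.example.com", 443, "/web/guest/test-page"));
    }
}
```

      Keeping the fallback in a single place means existing installations that never set the option would see no change in behavior.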

      Testing Steps
      Due to the nature of this fix, it is difficult to come up with a series of steps that demonstrate the issue without setting up a complex architecture involving web servers. Instead, steps have been provided below so that QA can test the issue to make sure that the fix is behaving as intended.
      1. Navigate to Site Builder > Pages.
      2. Add a new Content Page named "Test Page".
      3. Add a Paragraph to "Test Page" with the default text ("A paragraph is a self-contained unit of a discourse in writing dealing with a particular point or idea. Paragraphs are usually an expected part of formal writing, used to organize longer prose.")
      4. Publish "Test Page".
      5. Navigate to Site Builder > Pages.
      6. Search for the string "self-contained".
      Expected Result: "Test Page" should appear in the results.
      7. Navigate to Control Panel > Instance Settings > Pages (under "Content and Data") > Page Crawler.
      8. Fill out nonsensical values for the settings here, e.g. Hostname or IP Address: fakehostname, Port: 1, Connection Protocol: HTTPS.
      9. Navigate to Control Panel > Search > Index Actions.
      10. Execute a reindex on com.liferay.portal.kernel.model.Layout. Wait until the reindex has completed.
      11. Navigate to Site Builder > Pages.
      12. Search for the string "self-contained".
      Expected Result: "Test Page" should NOT appear in the results (we want to make sure that the nonsensical configuration is being picked up and used).
      13. Navigate to Control Panel > Instance Settings > Pages (under "Content and Data") > Page Crawler.
      14. Fill out reasonable values for the settings here, e.g. Hostname or IP Address: localhost, Port: 8080, Connection Protocol: HTTP.
      15. Navigate to Control Panel > Search > Index Actions.
      16. Execute a reindex on com.liferay.portal.kernel.model.Layout. Wait until the reindex has completed.
      17. Navigate to Site Builder > Pages.
      18. Search for the string "self-contained".
      Expected Result: "Test Page" should appear in the results.

      Systems Architecture Issues

      A client is using Akamai for caching, Apache httpd to terminate SSL, and Tomcat as the app server. Their primary domain resolves to Akamai, which uses an internal public name to reach their exposed httpd instance over SSL; httpd in turn uses mod_proxy to forward requests to Tomcat on port 8080, where Liferay serves the content.

      The address used in the browser (which hits Akamai) is not appropriate for the layout crawler: the request would leave the organization to reach Akamai and might return cached content instead of the real content, and it would use SSL, which is unnecessary and adds complexity.

      Using the internal address to hit httpd is a slightly better option, but since SSL terminates there, it would need to be an SSL connection, and all of the Liferay nodes would need to have the cert installed to access the SSL site.

      Having the LayoutCrawler (LC) use localhost is the ideal solution, as it bypasses all of those other components. However, port 8080 is not exposed anywhere, so the LC may not know the right port to access.

      Being able to configure the connection would allow the client to specify how the LC should connect to the local Liferay instance to fetch the content of the content page.
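
      As a rough illustration of a crawler fetching a page over an explicitly configured connection, consider the sketch below. It is not the LayoutCrawler's actual code: the class and method names are hypothetical, and returning an empty string on failure is an assumption made here to mirror the testing steps, where a nonsensical configuration is expected to leave the page out of the search results.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class LocalPageFetchSketch {

    /**
     * Issues a GET request against the configured protocol/host/port and
     * returns the response body. Returns an empty string if the endpoint
     * is unreachable, malformed, or does not answer with HTTP 200
     * (assumed behavior, for illustration only).
     */
    public static String fetchPageContent(
            String protocol, String host, int port, String path) {

        try {
            URL url = new URL(protocol, host, port, path);
            URLConnection raw = url.openConnection();

            if (!(raw instanceof HttpURLConnection)) {
                return ""; // Not an HTTP(S) endpoint.
            }

            HttpURLConnection connection = (HttpURLConnection) raw;
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(2000);
            connection.setReadTimeout(2000);

            try {
                if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
                    return "";
                }
                try (InputStream in = connection.getInputStream()) {
                    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
            } finally {
                connection.disconnect();
            }
        } catch (IOException e) {
            // Unknown host, refused connection, timeout, bad protocol, etc.
            return "";
        }
    }

    public static void main(String[] args) {
        // Example values matching the "reasonable" settings in the testing steps;
        // the page path is a made-up placeholder.
        String content = fetchPageContent("http", "localhost", 8080, "/web/guest/test-page");
        System.out.println("Fetched " + content.length() + " characters");
    }
}
```

      With plain HTTP to localhost, no certificate needs to be installed on the Liferay nodes and the request never passes through Akamai or httpd.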

              People

              Assignee:
              team-echo Product Team Echo
              Reporter:
              michael.bowerman Michael Bowerman
              Participants of an Issue:
              Recent user:
              Jose Jimenez
              Engineering Assignee:
              Michael Bowerman