Sunday, August 21, 2011

How to Remove .html extension from URL in CQ Or How to Implement New Rewriter Pipeline in CQ / AEM

Note: CQ needs an extension to understand and serve an incoming request. The publish instance itself cannot be extensionless (unless the URL is a vanity URL). You can, however, make your public site extensionless by tweaking the Apache and CQ configuration.

For CQ < 5.5

You can use the Sling Rewriter to remove .html from URLs.
Use this package (with some pom.xml changes).

You need the following configuration changes:

Create a node in the repository at /apps/myapp/config/rewriter/html-remover (the names 'myapp' and 'html-remover' can be whatever you want). This node must have the following properties:

* enabled (Boolean) - set to true
* serializerType (String) - set to htmlwriter
* generatorType (String) - set to htmlparser
* order (Long) - set to a number greater than 0
* contentTypes (String[]) - set to text/html
* transformerTypes (String[]) - set to linkchecker, html-remover

It might also be simpler to copy the node /libs/cq/config/rewriter/default to /apps/myapp/config/rewriter/html-remover and then add html-remover to the multi-valued transformerTypes property.
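
The html-remover transformer itself lives in the package and is not shown in this post. As a rough sketch only, the per-link logic of such a transformer might look like the plain-Java helper below. The class name HtmlExtensionStripper and its rules (internal absolute links only, URLs with selectors left alone) are assumptions for illustration, not the package's actual code. In a real Sling Transformer you would apply something like this to href attribute values inside startElement.

```java
/**
 * Hypothetical helper: strips a trailing ".html" from internal links.
 * Assumptions: only links starting with a single "/" are internal,
 * and links with selectors (e.g. page.print.html) are left unchanged.
 */
final class HtmlExtensionStripper {

    private static final String EXT = ".html";

    private HtmlExtensionStripper() {
    }

    static String strip(String href) {
        // External, relative, and protocol-relative links are not touched.
        if (href == null || !href.startsWith("/") || href.startsWith("//")) {
            return href;
        }
        // Split off the query string and/or fragment, whichever comes first.
        int cut = href.length();
        int query = href.indexOf('?');
        int fragment = href.indexOf('#');
        if (query >= 0) {
            cut = Math.min(cut, query);
        }
        if (fragment >= 0) {
            cut = Math.min(cut, fragment);
        }
        String path = href.substring(0, cut);
        String suffix = href.substring(cut);

        if (!path.endsWith(EXT)) {
            return href;
        }
        String base = path.substring(0, path.length() - EXT.length());
        String lastSegment = base.substring(base.lastIndexOf('/') + 1);
        // An empty name or a remaining dot means selectors: keep as is.
        if (lastSegment.isEmpty() || lastSegment.contains(".")) {
            return href;
        }
        return base + suffix;
    }
}
```

For example, HtmlExtensionStripper.strip("/content/site/page.html?x=1") gives "/content/site/page?x=1", while "/content/site/page.print.html" and "http://example.com/a.html" come back unchanged.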


Some additional questions and answers

Question: What are transformerTypes? I set it to img-src-cdn-prefixer only, without linkrewriter, mobile, etc., and it seems to work. Does transformerTypes define the entire pipeline? If I set it to {"linkchecker", "img-src-cdn-prefixer", "mobile", "mobiledebug"}, will the pipeline be: htmlparser -> linkchecker -> img-src-cdn-prefixer -> mobile -> mobiledebug -> htmlwriter?

Answer: That is correct

Question: What are htmlwriter (serializer) and htmlparser (generator)? I assume these are the default generator and serializer for the default .html pipeline?

Answer: The generator starts the pipeline and the serializer marks its end. So for /libs/cq/config/rewriter/default it would be htmlparser->linkchecker->htmlwriter

and for /libs/cq/config/rewriter/pdf it would be html-generator->htmlparser->xslt->fop
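
To make the ordering concrete, here is a small sketch that lays out the stage order of a pipeline from its generator, transformer list, and serializer. RewriterPipelineOrder is a hypothetical helper, not part of Sling, and it deliberately ignores transformers registered with pipeline.mode = "global".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that shows the order in which pipeline stages run. */
final class RewriterPipelineOrder {

    private RewriterPipelineOrder() {
    }

    /** Generator first, then transformers in configured order, serializer last. */
    static List<String> stages(String generatorType,
                               List<String> transformerTypes,
                               String serializerType) {
        List<String> stages = new ArrayList<>();
        stages.add(generatorType);
        stages.addAll(transformerTypes);
        stages.add(serializerType);
        return stages;
    }

    /** Renders the stage order in the arrow notation used in this post. */
    static String describe(String generatorType,
                           List<String> transformerTypes,
                           String serializerType) {
        return String.join(" -> ", stages(generatorType, transformerTypes, serializerType));
    }
}
```

Calling describe with "htmlparser", the four transformer names from the question, and "htmlwriter" yields the same chain the question spells out.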

Question: Do you recommend defining my own pipeline component (by setting pipeline.type = "img-src-cdn-prefixer")?
Or should I use pipeline.mode = "global"? If I use pipeline.mode = "global", would I be able to disable my component, maybe by setting service.ranking to a negative value?

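Answer: The first option (pipeline.type) defines your transformer as a named component that you reference explicitly in a pipeline's transformerTypes. The second option (pipeline.mode = "global") is for when you already have a pipeline and want to extend it without editing its configuration; the new component is placed in the pipeline based on its service ranking.

This is described at http://sling.apache.org/site/output-rewriting-pipelines-orgapacheslingrewriter.html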


Question: If I wanted to beautify the HTML output, would I implement a TransformerFactory, use tidy HTML or similar, and put it right before the serializer?

Answer: In theory, yes.

Note: This feature will be present OOTB in the next release of CQ. You can also ask for a paid feature pack for this for CQ 5.4.

For CQ > 5.5




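In these versions, enable the "strip HTML extension" option in the Link Checker Transformer OSGi configuration.

Note that your application does not work with extensionless URLs on its own; CQ needs the extension (and any selectors) to find the appropriate resource. The setting above only removes the .html extension from links. To actually serve those pages, the request must still reach CQ with the ".html" extension internally. You can use rewrite rules like the ones below to achieve this.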


Apache Changes 


The configuration above removes the .html extension from embedded links. But your goal is to make incoming URLs extensionless as well. You need to handle the following cases:

1) Someone arrives with the .html extension ----> remove the extension
2) Someone arrives without an extension -----> map to something CQ understands, that is, add .html
3) Someone arrives with selectors before the extension ----> remove only what you want to remove
4) Someone arrives with a URL for which removing the extension is not desirable ----> exclude it from extension removal


# Set up an env variable for extensionless URLs
# If the URL does not start with /etc/designs or /bin/wemblog (where your custom servlets live)
# and does not have any extension, it qualifies as extensionless
RewriteCond %{REQUEST_URI} !^/bin/wemblog(.*) [NC]
RewriteCond %{REQUEST_URI} !^/etc/designs(.*) [NC]
RewriteCond %{REQUEST_URI} !(.*)\.[a-zA-Z0-9-]+$
RewriteRule .* - [E=EXTENSION_LESS_URL:1]

# If the URL has selectors, don't remove the .html extension
# Here we check for two or more dot-separated parts in a row
RewriteCond %{REQUEST_URI} .*(\.[a-zA-Z0-9-]*){2,}
RewriteRule .* - [E=MULTIPLE_EXTENSION_URL:1]

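# To keep things configurable, list the paths that must keep their extension.
# Note: despite its name, EXCLUDE_FROM_EXTENSIONLESS is set for URLs that ARE
# eligible for extension removal (not under a listed path and no selectors).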
RewriteCond %{REQUEST_URI} !^/<some path you don't want to remove extension>(.*) [NC]
RewriteCond %{REQUEST_URI} !^/etc/designs(.*) [NC]
RewriteCond  %{ENV:MULTIPLE_EXTENSION_URL} ^$
RewriteRule .* - [E=EXCLUDE_FROM_EXTENSIONLESS:1]

# If the URL ends with a trailing slash, remove it
RewriteCond %{REQUEST_URI} !^/$ [NC]
RewriteRule ^(.*)/$ $1 [R=301,L]

# Now remove the extension if someone arrives with one

# If someone arrives with .html as the extension, redirect them to the URL without it
RewriteCond  %{ENV:EXCLUDE_FROM_EXTENSIONLESS} !^$
RewriteRule ^(.*)\.html$ https://%{HTTP_HOST}$1 [R=301,L]

# If someone arrives with .htm as the extension, redirect them to the URL without it
RewriteCond  %{ENV:EXCLUDE_FROM_EXTENSIONLESS} !^$
RewriteRule ^(.*)\.htm$ https://%{HTTP_HOST}$1 [R=301,L]

# Now pass extensionless requests on with an extension so that publish understands them

# If the URL has selectors (multiple dot-separated parts) and ends with .html, pass it through unchanged
RewriteCond %{REQUEST_URI} !^/etc/designs(.*) [NC]
RewriteCond %{REQUEST_URI} !^<some more path you want to exclude>(.*) [NC]
RewriteCond  %{ENV:MULTIPLE_EXTENSION_URL} !^$ [NC]
RewriteCond %{REQUEST_URI} .*(\.html)
RewriteRule ^(/.*)$ $1  [L,PT]


# If the request does not have the .html extension, add it
# This assumes /content has already been removed from the URL
RewriteCond %{REQUEST_URI} !^/content/dam(.*) [NC]
RewriteCond %{REQUEST_URI} !^/content(.*) [NC]
RewriteCond  %{ENV:EXTENSION_LESS_URL} !^$
RewriteRule ^(/.*)$ $1.html [L,PT]


# Fallback: catch any extensionless URL missed by the rules above
RewriteCond  %{ENV:EXTENSION_LESS_URL} !^$
RewriteRule ^(/.*)$ $1.html  [L,PT]
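
Before touching the Apache config, it can help to write down what you expect for each kind of incoming URL. The sketch below models the four cases listed above as a plain Java function. It is a simplified model, not a port of the rules: ExtensionlessRouting, its Action names, and treating "/" as pass-through are assumptions for illustration. It looks only at the last path segment, compares prefixes case-sensitively, and ignores the trailing-slash redirect.

```java
import java.util.List;

/** Hypothetical model of the four incoming-URL cases; not a port of the rules. */
final class ExtensionlessRouting {

    enum Action {
        REDIRECT_WITHOUT_EXTENSION, // case 1: 301 to the URL without .html/.htm
        FORWARD_WITH_HTML,          // case 2: append .html before passing to publish
        PASS_THROUGH                // cases 3 and 4: leave the URL alone
    }

    private ExtensionlessRouting() {
    }

    static Action decide(String path, List<String> excludedPrefixes) {
        // Case 4: excluded paths such as /etc/designs keep their extension.
        for (String prefix : excludedPrefixes) {
            if (path.startsWith(prefix)) {
                return Action.PASS_THROUGH;
            }
        }
        // Assumption: the site root is served as is.
        if (path.equals("/")) {
            return Action.PASS_THROUGH;
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        long dots = lastSegment.chars().filter(c -> c == '.').count();
        // Case 2: no extension at all.
        if (dots == 0) {
            return Action.FORWARD_WITH_HTML;
        }
        // Case 1: exactly one dot and an .html or .htm extension.
        boolean html = lastSegment.endsWith(".html") || lastSegment.endsWith(".htm");
        if (dots == 1 && html) {
            return Action.REDIRECT_WITHOUT_EXTENSION;
        }
        // Case 3 (selectors) and any other extension, e.g. .pdf or .css.
        return Action.PASS_THROUGH;
    }
}
```

With "/etc/designs" and "/bin/wemblog" as excluded prefixes, "/products/shoes.html" is redirected, "/products/shoes" is forwarded with .html appended, and "/products/shoes.print.html" passes through untouched.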



Custom generator example for XML: https://github.com/Adobe-Consulting-Services/acs-aem-commons/pull/48


24 comments:

  1. Package is not accessible; it gives a 404. Can you please post the package?

    Replies
    1. Sorry for missing package. Updated new package and some more information.

  2. Thank you Yogesh,
    Is there any way to call this functionality on demand from a component (essentially a JSP or servlet)? I want to write a sample that transforms part of the HTML to absolute URLs, and it needs to be performed on demand.

    Replies
    1. Not sure what you mean by "on demand". I guess for that you could use the Sling resource resolver.

  3. From JSP, I should be able to send a chunk of HTML generated to transformer and get back absolute urls.

  4. The above packages are very useful & simple.
    Thanks yogesh

  5. I will say, this is the first example I have ever found using the transformer factory/transformer in Sling/CQ. I work much better with examples, and this example was just enough for me to follow and replicate.

    Thanks a million for the example, especially the source code. You are a life saver!

  6. Hi. I am trying to implement such functionality but all the time I am getting exception:

    Caused by: java.lang.ClassCastException: (...).service.LinkRewritingSampleTransformerFactory cannot be cast to org.apache.sling.rewriter.TransformerFactory

    CQ 5.5

    Can you help me?

    Thank you in advance.

    Replies
    1. Fixed it. The "org.apache.sling.rewriter" artifact should be declared with scope "provided" in Maven so that it is used only at compile time. Otherwise the dependency gets uploaded into the repository, and that copy and the one already in CQ are loaded by different classloaders, which causes this problem.

  7. This comment has been removed by the author.

  8. CQ link rewriting doesn’t work for extensions like .jpg and .pdf by default. CQ only rewrites URLs that have the .html extension; it doesn’t handle URLs with other extensions.
    Example :
    If we have
    a href="/content/A/B.html"
    And if url in href "/content/A/B.html" is mapped to “/A/B.html” then rewriter will change this to a href="/A/B.html"
    But same is not true for other URLs which are having different extensions
    a href="/content/A/B.pdf " Won’t get converted to a href="/A/B.pdf "

    I am looking for the configuration where we can add other extensions so that the rewriter will pick up these URLs as well.

    Replies
    1. Purmendra,

      The rewriter works based on your entries in the mapping config and in the Sling resource resolver mapping. Check http://helpx.adobe.com/cq/kb/HowToConfigureLinkRewriting.html. If you have an entry like /content/-/ then even your pdf and jpg links will get rewritten, as long as they have that extension and exist in the repository.

      External links never get rewritten.

      Yogesh

  9. No, it's not happening. I have tried it; only links with .htm or .html are getting rewritten.
    All my website links are getting resolved, but only if the extension is .htm or .html,
    and I am talking about internal links only.

    Replies
    1. What version of CQ are you using?

    2. Hello Purnendra,

      Yes, you are right; only links with the .html extension get rewritten. If you want other links to be rewritten as well, you might have to write a custom rewriter as described in the post above. You can also raise an enhancement request with Adobe to handle rewriting of DAM URLs.

      Yogesh

  10. Hi Yogesh,

    I've implemented something very similar to this, in my case its a transformer that strips out specific url's based on a pattern. It has been placed in the pipeline directly before the linkchecker so that the pipeline is like this:
    link-blocker -> linkchecker -> ....

    this has been done as suggested by copying /libs/cq/config/rewriter/default to /apps/myapp/config/rewriter/html-remover (or similar)

    One potential issue is that "link-blocker" may not be available (ie the service or bundle is stopped), and thus the pipeline is broken. Is there some way of injecting my transformer into the pipeline when the bundle is loaded such that if the bundle is deactivated or removed the pipeline doesn't break?

    The only important thing is that my transformer is run before the linkchecker.


    Thanks
    kyle

    Replies
    1. Kyle,

      I think you should raise an enhancement request for this. Ideally your pipeline should not break even if a configured type, such as the serializer type, is not present.

      Yogesh

  11. I have a custom transformer (a global one, with name="pipeline.mode" value="global") that has the logic to rewrite public URLs in HTML and XML responses. It is working for HTML; now I want to extend it to XML as well. How do I do that?
    I need the right configuration.
    Please help me.
    Are the extension and content type alone enough for this to work?

    Replies
    1. See the example under /libs/cq/config/rewriter/pdf with "empty-generator" as the generator type. Note, however, that CQ already has an XML renderer OOTB; I am not sure whether that will conflict with your rewriter.

      Yogesh

  12. This comment has been removed by the author.

  13. Package is not accessible; it gives a 404. Can you please post the package?

    Replies
    1. Hello Chelsea,

      It is working for me. Can you also try https://drive.google.com/file/d/0B3d7-oHroQKdQTZfX2lITWY4WG8/view?usp=sharing

      Yogesh

  14. Hi Yogesh,
    I have the same requirement on our project, and we are working on AEM 6.0 Service Pack 2. I followed the steps mentioned above: I checked "strip of html extension" in the Link Checker Transformer OSGi configuration, restarted AEM, and used the Apache redirect on the Apache server. It removes the .html extension from the end-user URL but does not serve the page. The dispatcher is also unable to cache the page, since it receives an extensionless URL. I would appreciate a quick response on this.