CloudFront is Amazon’s Content Delivery Network (CDN) offering. It gives web developers an easy, cost-effective way of serving content from edge locations around the world, with the aim of speeding up website delivery.
This isn’t a guide to using CloudFront – if that’s what you’re after, I’d suggest taking a look at this article.
In this article I’ll offer a solution to one of the few issues I’ve had when setting up a CloudFront distribution: duplicate content.
When you create a distribution in CloudFront, it’s possible to use your own website as the ‘origin’ for the distribution. This essentially gives you a complete mirrored copy of your website at another location.
So an image which is available at:
http://18a.co/img/logo.png
Would now also be available at:
http://d3uy7bfospl44.cloudfront.net/img/logo.png
(For illustrative purposes only – these URLs don’t work)
While this makes using CloudFront really easy (all you need to do is update your website to reference the files via the CloudFront URL), it isn’t so great if Google gets its greedy mitts on it: it can end up indexing both copies, leaving you with duplicate content.
There are 3 ways I’ve found for dealing with this issue:
- Theoretically, if you don’t link to the CloudFront version anywhere, Google will never index it – but that’s risky
- Adding canonical URLs to your pages will tell Google which version of a page is the original, and it ‘should’ take note of this hint. However, according to Matt Cutts this is only a hint and not to be relied upon 100% (see the snippet after this list).
- The final technique I’ve come across (or come up with, I’m not sure) is viewed by some as a little extreme considering the two aforementioned options available to you. However, I don’t like to take any chances when it comes to Google, so I think it’s worth it to avoid any possible duplicate content issues.
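For the canonical option, the hint is just a tag in the head of each page pointing back at the original URL. A minimal sketch, using the illustrative domain from earlier (the page path is made up):

<link rel="canonical" href="http://18a.co/about/">

Every page served from the CloudFront copy would then point back at the 18a.co version as the one to index.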
The objective here is to use a robots.txt file to tell Google not to index the version of your site on the CDN. The main problem is that if you set your website document root to be the origin of the CDN distribution, everything will be mirrored exactly as it is on your site, and that includes your robots.txt file. Unfortunately, at the time of writing, CloudFront doesn’t allow you to edit the robots.txt file served by your distribution (which would get around the problem), so you have to be a little bit creative.
Create an alternative version of your site available via a subdomain, for example http://static.18a.co. The idea is that all your content is available at both http://18a.co and http://static.18a.co. This might seem counter-intuitive as you now have 3 versions of your site, but bear with me.
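How you set up the subdomain will depend on your hosting, but on Apache one way is to point a DNS record for static.18a.co at the same server and add a ServerAlias to the existing virtual host. A rough sketch, with illustrative domain and paths:

# Serve the same files for both the main domain and the static subdomain
<VirtualHost *:80>
    ServerName 18a.co
    ServerAlias static.18a.co
    DocumentRoot /var/www/18a.co
</VirtualHost>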
Now edit your .htaccess file and add a rule that serves up a different robots.txt file depending on the hostname the site is being requested through; something like this should do the trick:
# This attempts to serve a custom robots.txt to the CDN subdomain
RewriteEngine On
RewriteCond %{HTTP_HOST} ^static\.18a\.co$ [NC]
RewriteRule ^robots\.txt$ robots_cdn.txt [L]
Create a new robots_cdn.txt file which contains the following:
User-agent: *
Disallow: /
Now if you visit http://18a.co/robots.txt you should see something like this (or whatever is in your original robots.txt file):
User-agent: *
Disallow: /cgi-bin/
Disallow: /a/
Disallow: /min/
However, if you visit http://static.18a.co/robots.txt you should see something like this:
User-agent: *
Disallow: /
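If you’d rather check from the command line, a couple of curl requests (against the illustrative domains above) should show the two different responses:

curl http://18a.co/robots.txt
curl http://static.18a.co/robots.txt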
With that done, create a distribution using static.18a.co as the origin, and it should mirror everything you want, including the special robots.txt directive asking Google to kindly ignore everything on that domain.
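Once the distribution has deployed, it’s worth confirming that the CDN copy really is serving the blocking file. Using the illustrative CloudFront hostname from earlier:

curl http://d3uy7bfospl44.cloudfront.net/robots.txt

If the rewrite is working, that should return the ‘Disallow: /’ version rather than your normal robots.txt.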
Any feedback please leave in the comments.