Do search engines index the same page more than once?

Posted on May 1, 2008 in SEO


I ran into an interesting issue recently while trying to SEO a WordPress blog for the first time. I noticed that the site linked to certain parts of the blog from pages outside WordPress, but used slightly different URLs, which in the eyes of Google et al are completely different URLs. For example:

your-site.com/address/
your-site.com/address/index.html
www.your-site.com/address/
www.your-site.com/address
www.your-site.com/address/index.html

Provided you have an index.html file in that folder, all five URLs above point to the same page. However, search engines treat them as five different pages, and will most likely penalize you for duplicate content.
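As a taster of the fix, the index.html variant on its own can be collapsed with a single rewrite rule. This is a minimal sketch, assuming Apache with mod_rewrite enabled and the rule living in the .htaccess at your document root (the rule is illustrative, not lifted from anyone's production config):

RewriteEngine On
# Hypothetical rule: permanently redirect any .../index.html request
# to the clean directory URL, e.g. /address/index.html -> /address/
RewriteRule ^(.*/)?index\.html$ /$1 [R=301,L]

The R=301 matters: a permanent redirect tells the search engines to consolidate everything onto the target URL, whereas the default temporary (302) redirect does not.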

WordPress has plenty of other issues with duplicate content, as the same post turns up on loads of different URLs (the home page, category and date archives, and feeds, for instance), which is dangerous to your PageRank. Rather than explain it all here, Oleg Ishenko does a very good job showing How to Make a WordPress Blog Duplicate Content Safe, with some very practical ideas for improving your search engine rankings.

Oleg also gives a great example to solve the problem I highlighted above, which simply involves amending your .htaccess file slightly (I'm assuming you're on Apache here; if you want a Windows URL rewriting alternative, try Pete Freitag's blog for some IIS options) and adding the following:

# Redirect any non-www hostname to the canonical www version (301 = permanent)
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.yoursite\.com$ [NC]
RewriteRule ^(.*)$ http://www.yoursite.com/$1 [R=301,L]

# Standard WordPress front controller: anything that isn't a real file
# or directory gets handled by index.php
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

Combined with something like the index.html rule sketched earlier, this funnels all five of those links to the same page, which would be www.your-site.com/address/. My advice is to always go back through your site and change your links to point to exactly the same address, and to adopt a stricter naming convention when linking to a page.
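One edge case worth mentioning is the missing trailing slash (www.your-site.com/address). Apache's mod_dir normally adds the slash for real directories on its own, and WordPress's canonical redirect handles its own permalinks; if neither kicks in for your setup, a rule along these lines is a rough, untested sketch of the fix:

# Hypothetical: if the request maps to a real directory but lacks a
# trailing slash, 301-redirect to the slashed form
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.+[^/])$ /$1/ [R=301,L]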

One very useful tool I have found to help with duplicate content is GSiteCrawler, a free sitemap generator (initially aimed at Google, it now supports Yahoo! and generic sitemap formats too). It crawls your site like a search engine bot and builds an index of your pages in order to generate the sitemap. The beauty is that it also produces reports on how the crawl went, including broken links and duplicate content. These are things the likes of Dreamweaver will miss, because GSiteCrawler sees your site from the outside rather than the inside, and therefore only indexes pages that are actually linked from within your site. Anyway, I'm going off topic here… the point is that the duplicate content report is a great way to see what you've missed and will help you weed out those duplicate pages and addresses.

Hope this helps!