Monday, June 23, 2008

Duplicate Content Issue – Important Google Tips


Duplicate content is one of the hottest topics among webmasters. Every webmaster (including me) wonders how to get rid of duplicate content and what search engines actually consider duplicate content.

Sven Naumann from the Google Search Quality Team gave some tips on duplicate content and on what is considered duplicate content. Here you will find some basic tips that Google recommends.

Sven Naumann mentioned that there are two main types of duplicate content:

Same-domain duplicate content: If a website has the same content on multiple pages, Google considers it duplicate content.
What will Google do once it finds duplicate content? The Google blog states clearly that if Google finds duplicate content while crawling, it filters out the duplicate pages and shows only one of them in its search results.
You can avoid duplicate content issues by letting Google crawl the pages you want listed in search results and blocking the others with a robots.txt file or a noindex meta tag.
The same content written in a different language does not count as duplicate content.
If you have restructured your site, use 301 redirects through your .htaccess file. You can also set your preferred domain using the feature available in Webmaster Tools.
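
Before deciding which pages to block or redirect, it can help to list the pages on your own site that carry the same text. The sketch below is only an illustration and is not how Google detects duplicates: it assumes you already have each page's text in a dictionary, and the function name find_duplicate_pages and the sample URLs are made up for the example.

```python
import hashlib
from collections import defaultdict


def find_duplicate_pages(pages):
    """Group URLs whose text is identical after whitespace and case normalisation."""
    groups = defaultdict(list)
    for url, text in pages.items():
        # Collapse runs of whitespace and ignore case so trivial differences
        # do not hide an otherwise identical page.
        normalised = " ".join(text.lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Keep only the groups that contain more than one URL.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical sample data: two URLs serving the same text, one unique page.
pages = {
    "/shoes": "Red running shoes, size 42.",
    "/shoes?print=1": "Red  running shoes,  size 42.",
    "/about": "About our shop.",
}
print(find_duplicate_pages(pages))
```

Each group it prints is a set of URLs from which you would keep one and block or redirect the rest. It only catches exact copies; near-duplicates would need a fuzzier comparison.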

Cross-domain duplicate content:
If someone copies your content directly and places it on their website, Google will probably look for the original copy and give more weight to the original site than to the copy.
Also, if you syndicate your content, ask your syndication partners to link back to your original content.

Bhavesh Goswami.

Friday, June 6, 2008

Wild Card Support ($ and *) - Google, Yahoo and MSN Robots.txt Exclusion Protocol

Hi All SEOs,

I am sure all webmasters (SEOs) reading this blog know about robots.txt and how to use it. With robots.txt you can block any URL, path or directory that you don’t want search engines to crawl. You can even block search crawlers from your entire site. A few weeks ago the major search engines, Google, Yahoo and MSN, announced that they all now support wild cards. Here I want to discuss what a wild card is, how it is useful, and how to use it.

$ Wild Card Support – This tells the crawler to match the end of a URL. With the $ wild card, webmasters can block URLs of a given type by pattern, so you no longer need to list every file you want to block in robots.txt. For example, you can specify a file extension such as .pdf in your robots.txt file, and search engines will not access those files or include them in their database.

The $ sign is used to block certain file types. For example, to block files with the .pdf extension, write the following in your robots.txt file:

User-agent: Googlebot
Disallow: /*.pdf$
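
To see what the $ changes, here is a rough sketch of how a pattern like the one above could be matched against URL paths. It is not any search engine's actual code: the helper names robots_pattern_to_regex and is_blocked are made up for this example, and it only handles the * and trailing $ wild cards.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    # A trailing '$' means the pattern must match up to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Every character is literal except '*', which stands for any sequence.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_blocked(path, disallow_pattern):
    """Return True if the path matches the Disallow pattern from its first character."""
    return robots_pattern_to_regex(disallow_pattern).match(path) is not None


print(is_blocked("/docs/guide.pdf", "/*.pdf$"))             # True
print(is_blocked("/docs/guide.pdf?download=1", "/*.pdf$"))  # False
print(is_blocked("/docs/guide.pdf?download=1", "/*.pdf"))   # True
```

With the $, only URLs that end in .pdf are blocked. Without it, the pattern also blocks any URL that merely contains .pdf somewhere after the first slash.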

* Wild Card Support – This tells the crawler to match any sequence of characters. The * wild card blocks URL patterns, for example when you don’t want search engines to crawl URLs with session IDs or other extraneous parameters. From now on you just specify, using the wild card, the parameters you don’t want search engines to crawl, and you are done; there is no need to create a long list of URLs.

You can use the * sign to block URLs with session IDs. For example, write the following in your robots.txt file:

User-agent: *
Disallow: /*?

This will block every URL that contains a question mark, which includes URLs that carry session IDs as query parameters.
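
Note that the pattern /*? matches any URL containing a question mark, not just session IDs. The small sketch below shows the difference between it and a narrower pattern. The function blocked_paths, the parameter name sessionid and the sample paths are all made up for illustration, and the sketch handles only the * wild card.

```python
import re


def blocked_paths(paths, disallow):
    """Return the paths that a Disallow pattern with '*' wild cards would block."""
    # '*' becomes '.*'; every other character, including '?', is taken literally.
    regex = re.compile(".*".join(re.escape(part) for part in disallow.split("*")))
    return [path for path in paths if regex.match(path)]


# Hypothetical URL paths on one site.
paths = [
    "/products/shoes",
    "/products/shoes?sessionid=abc123",
    "/search?q=shoes",
]

print(blocked_paths(paths, "/*?"))
print(blocked_paths(paths, "/*sessionid="))
```

The first call returns both URLs that have a query string, while the second returns only the one with the session ID. If other query parameters on your site should stay crawlable, a narrower pattern along the lines of the second one may be the safer choice.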

Bhavesh Goswami.