Duplicate content and penalties: when and why
Google wanted to demystify the notion of a duplicate content penalty, because webmasters tend to misrepresent it. First, it stated that a "duplicate content penalty," strictly speaking, does not exist.
But things have changed since the Panda update. Instead, the notion of duplicate content has been extended to sites that have too many similar pages, which are then given an overall negative score, devaluing their ranking as a whole.
An analysis of a case of similar, but not strictly copied, content was provided by a Google employee. In that case, the site dropped 50 positions.
Penalties
They exist in practice in the sense that a site may not be indexed or ranked if:
- It repeats the content of one of its pages on another, or points two domains at the same site.
- It reproduces the content of a third-party site on its own pages.
- It republishes an already published article without significant additions.
- It has too many pages that resemble each other.
Accidental duplication
One of the most frequent and annoying cases is when two domain names point to the same site. The webmaster imagines that this brings back users who remember a different domain, for example one in .com and one in .fr, but for search engine robots these are two different sites with the same content, and both cannot be indexed.
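When both domains are kept, the usual remedy is a permanent (301) redirect from the secondary domain to the main one. Here is a minimal PHP sketch; the domain www.example.com is a hypothetical placeholder, not a name from this article:

<?php
// Minimal sketch: permanently redirect any secondary domain to the main one,
// preserving the requested path, so robots see a single site.
// www.example.com is a hypothetical placeholder for the real main domain.
if ($_SERVER['HTTP_HOST'] !== 'www.example.com') {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: https://www.example.com' . $_SERVER['REQUEST_URI']);
    exit;
}
?>

This must run before any output is sent, for example at the top of the script that serves the pages.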
The same problem can occur if robots reach both a dynamic URL, such as http://www.scriptol.com?x=5, and a more meaningful URL built from the post title but pointing to the same page, as can happen with a CMS.
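The same kind of redirect can resolve the dynamic URL case, sending the query-string form to the meaningful form. A hedged sketch, where the parameter name x and the slug table are assumptions made for illustration:

<?php
// Minimal sketch: 301-redirect the dynamic form of a page (?x=5) to its
// meaningful form. The parameter 'x' and the slug table are hypothetical.
$slugs = array(5 => 'duplicate-content-and-penalties');
if (isset($_GET['x']) && isset($slugs[(int) $_GET['x']])) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: /' . $slugs[(int) $_GET['x']] . '.html');
    exit;
}
?>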
Such duplicate pages will not be penalized, but they will suffer from filtering by search engines, which do not want multiple copies of the same page in their index.
What happens then is determined by three points:
- Two pages with the same content are detected.
- One of them is selected as the best URL. If one of them is in the sitemap and the other is not, the first is kept.
- Then factors that confirm or contradict this choice are taken into account, mainly the number of backlinks to this URL.
As Matt Cutts said in an interview with a group of webmasters, the URL chosen for the index is considered the original and is the one with the most backlinks.
If two pages contain the same information without being strictly identical, and if one of them has a link to the other, then the page being linked to will be considered the original.
The canonical tag
To avoid penalizing legitimate duplication created by the author of the pages, Google introduced a tag, placed in the <head> section, that specifies which URL should be considered for the page when it is available at several different addresses:
<link rel="canonical" href="URL of the page" />
See how to create a canonical tag in PHP.
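As an illustration of what such PHP code might look like, here is a minimal sketch; deriving the canonical URL from the request path stripped of its query string, and the domain www.example.com, are assumptions for the example, not the article's actual code:

<?php
// Minimal sketch: emit a canonical tag pointing to the query-less form of
// the current URL. www.example.com is a hypothetical placeholder.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$canonical = 'https://www.example.com' . $path;
echo '<link rel="canonical" href="' . htmlspecialchars($canonical) . '" />';
?>

Placed in the <head> of every page, this tells robots which of several equivalent addresses to keep.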
Conclusion
The presence of duplicate content on a site can hurt it in many ways, even without a penalty formally imposed by search engines. If a duplicate is not detected, PageRank is diluted between the two pages; if it is detected, only one of the two is indexed, with no guarantee that it is the right one.
Still, we should not worry too much if we find that we have duplicate content accessible to robots: removing the duplicate content, or simply making it unavailable to them, is enough to make the negative effects disappear.
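One way to make a duplicate unavailable to robots without deleting it is a noindex meta tag on the duplicate versions. A minimal sketch, where the list of duplicate paths is a hypothetical example:

<?php
// Minimal sketch: mark known duplicate pages as noindex so robots drop them
// from the index while human visitors can still read them.
// The paths listed here are hypothetical examples.
$duplicates = array('/print/article-5.html', '/old/article-5.html');
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
if (in_array($path, $duplicates, true)) {
    echo '<meta name="robots" content="noindex, follow" />';
}
?>

The tag must be emitted inside the <head> section of the duplicate pages.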
Links
- Article in Google Webmaster Central (from Google).
- Another article about duplicate content, covering the case where a third-party site copies your content.
- See also the Webmaster Guidelines on duplicate content.