Author: Lazarina Stoy
In a landscape where online presence is paramount, taking proactive steps to optimize your website by eliminating duplicate content can significantly improve your brand’s success at multiple stages of the customer journey.
For instance, at the start of their journey, users browse search results for the most relevant, high-quality websites, but duplicate content can reduce your brand’s search visibility and competitiveness.
And, as users move down the marketing funnel, duplicate content threatens conversion once again by fracturing the user experience, making it difficult to find all the information about a given product, service, or topic on a single page.
In this guide, I’ll walk you through the nuances of content duplication, show you how to identify duplication on your website (and on other properties across the web), and share fixes you can implement to resolve any content overlap. This article primarily focuses on the type of duplicate content you have control over (on your own domain); however, external duplicate content can also diminish your search performance, so I’ll address that as well.
What is duplicate content?
Duplicate content refers to content that’s very similar (or identical) to other content on the web, either on another website (i.e., external duplicate content) or another page of the same website (i.e., internal duplicate content).
You can flag a page as duplicative if you see that it has one (or more) of the following characteristics:
Significant content overlap — The content portions of a web page are either exact copies or very similar versions of the original content.
Structural and semantic similarities — There are substantial similarities in the overall structure of the page, including the on-page metadata (e.g., URLs, headings, subheadings, paragraph structure), as well as semantic similarities (e.g., entities mentioned, arguments made). When minor subtleties are the only differentiator, the pages may be trying to appear different while actually serving the same purpose.
Lack of thought originality — The content is informed by the same sources and presents similar perspectives.
Similarities in ranking queries — Both pages are ranking for an identical set of keywords (signaling a lack of unique content) and target the same intent and audience.
Types of duplicate content
In addition to the internal and external duplicate content classifications I mentioned above, you can also examine duplicate content through the lens of similarity (i.e., exact duplicate or near duplicate).
| Duplicate content type | Exact duplicate | Near duplicate |
| --- | --- | --- |
| Internal duplicate content | Identical content repeated across pages of the same website (e.g., one product page served at multiple parameterized URLs) | Very similar content across pages of the same website (e.g., overlapping structure, phrasing, and target queries) |
| External duplicate content | Content copied word for word from another website (e.g., scraped or republished without attribution) | Content paraphrased from another website, with similar structure, sources, and arguments |
It’s useful to assess whether a page is an exact duplicate or a near duplicate for the sake of prioritization, but also to understand how Google will perceive and rank the page.
To elaborate, exact duplicates (when left unmanaged) will almost certainly result in lower rankings, a worse user experience, and weaker overall search performance for your website. By contrast, near duplicates are evaluated with more nuance, depending on the context and the degree of similarity.
When is duplicate content a problem?
Duplicate content isn’t always a cause for concern. In some instances, content duplication can be harmless and even intentional, like in the case of news syndication.
In general, problems occur when content duplication is malicious or misleading, hinders the user experience, or serves no specific purpose. When duplication is responsible and strategic, it can help you promote your content to a wider audience or serve other business functions.
Here’s a breakdown of when content duplication is (and isn’t) problematic.
Content duplication can be harmless when:
Syndication and repurposing are enabled — This allows you to share content across platforms (like social media or news aggregators) with permission and proper attribution. This can expand your content’s reach and enhance your overall brand.
Fair use is implemented — Duplicating content under fair use or licensing agreements is acceptable (e.g., quoting source material, educational materials, etc).
Canonicalization and URL management govern distribution — Use canonical tags and proper URL management to specify the original version for search engines. This applies to both re-publishing content on a platform like Medium and managing URLs with parameters internally.
Content duplication is problematic when:
Copy/pasted (or exact duplicate) content accounts for the majority of the content on a page — Word-for-word repetition across multiple pages or websites hinders search engine indexation and confuses users. Deliberately copying content from other websites without attribution can also expose you to legal action.
Paraphrased (or near duplicative) content accounts for the majority of the content on a page — Similar content with overlapping phrases, similar structure, or semantically-related arguments can frustrate users and worsen the search landscape.
Republishing is unauthorized or attribution is missing — Copying content without attribution or permission violates copyright laws and Google has said it “reserves” the right to penalize such websites.
“Duplicate content does happen. Now, that said, it’s certainly the case that if you do nothing but duplicate content, and you’re doing it in an abusive, deceptive, or malicious, or manipulative way, we do reserve the right to take action on spam.” — Matt Cutts, Google
Duplicate content metrics
There are several metrics that can show you whether your website’s content is duplicated internally or externally. Here’s what to look out for.
In Google Search Console, keep an eye on the following:

| Metric | Why |
| --- | --- |
| Pages indexed | Indexable pages that Google chooses not to serve can indicate poor content quality (i.e., duplicate content). Look out for high numbers of statuses like "Duplicate without user-selected canonical," "Duplicate, Google chose different canonical than user," "Crawled – currently not indexed," and "Discovered – currently not indexed." |
| Erratic or declining traffic performance | Traffic fluctuations or sudden drops can indicate duplicate content, cannibalization, or external content surpassing your own. Monitor for drastic changes in positions and clicks from search to identify potentially affected pages. |
| Ranking query overlap for internal pages | Significant overlap in ranking queries between internal pages can indicate a lack of unique content and perspectives, diminishing search traffic for one or both pages. |
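If you'd rather check index and canonical status at scale (instead of page by page in the GSC interface), the URL Inspection API exposes the same information. Here's a minimal Apps Script sketch, assuming a verified GSC property and the webmasters.readonly OAuth scope in your script's manifest; the URLs are placeholders:

```javascript
/**
 * Sketch: check how Google indexes a page via the URL Inspection API.
 * The response's indexStatusResult includes the coverage state and the
 * Google-selected vs. user-declared canonical, which is where duplicate
 * content problems tend to show up. URLs below are placeholders.
 */
function inspectUrl() {
  const endpoint = 'https://searchconsole.googleapis.com/v1/urlInspection/index:inspect';
  const payload = {
    inspectionUrl: 'https://example.com/some-page',
    siteUrl: 'https://example.com/', // must match a verified GSC property
  };

  const response = UrlFetchApp.fetch(endpoint, {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer ' + ScriptApp.getOAuthToken() },
    payload: JSON.stringify(payload),
  });

  const result = JSON.parse(response.getContentText()).inspectionResult.indexStatusResult;
  Logger.log('Coverage: %s', result.coverageState);
  Logger.log('Google-selected canonical: %s', result.googleCanonical);
  Logger.log('User-declared canonical: %s', result.userCanonical);
}
```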
In GA4 (or your preferred analytics platform), monitor these engagement signals:

| Metric | Why |
| --- | --- |
| Bounce rate on pages with similar terms or structure | A high bounce rate on these pages could suggest that users are struggling to differentiate between them, causing a poor user experience. |
| Session duration declines for pages with similar terms or structure | Noticeable drops in session duration for pages in the same topic cluster could signal duplicate content problems. |
| Time on page | Track the average time users spend on each page. A longer average suggests users are engaged and find the content valuable; declines across similar pages can point to duplication. |
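If you track these metrics in GA4, you can also pull them programmatically via the Analytics Data API. Here's a hedged Apps Script sketch using the AnalyticsData advanced service (which you'd need to enable under Services in the Apps Script editor); the property ID is a placeholder:

```javascript
/**
 * Sketch: pull per-page engagement metrics from GA4 via the
 * Analytics Data API advanced service. The property ID below is a
 * placeholder; replace it with your own GA4 property ID.
 */
function pullEngagementMetrics() {
  const property = 'properties/123456789'; // placeholder
  const request = {
    dimensions: [{ name: 'pagePath' }],
    metrics: [{ name: 'bounceRate' }, { name: 'averageSessionDuration' }],
    dateRanges: [{ startDate: '28daysAgo', endDate: 'yesterday' }],
  };

  const report = AnalyticsData.Properties.runReport(request, property);

  // Log one line per page: path, bounce rate, average session duration
  (report.rows || []).forEach((row) => {
    Logger.log('%s | bounce: %s | avg session: %ss',
        row.dimensionValues[0].value,
        row.metricValues[0].value,
        row.metricValues[1].value);
  });
}
```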
How to identify duplicate content
To identify duplicate content internally, you’ll need to evaluate your website’s pages against one another. To check for external duplicate content, you’ll compare your pages against other pages on the web. In the following sections, I’ll take you through the process of doing exactly that.
How to check for internal metadata duplication
If you're concerned about duplicate pages on your website, the quickest way to validate those concerns is to run a duplication analysis on your metadata. Check for similarities in elements like page titles, headings and subheadings, meta descriptions, and URLs.
This straightforward analysis can be performed with freemium crawlers like Screaming Frog, which extract all the aforementioned elements from every page on your site.
To scan for duplicate metadata on your website, download and install Screaming Frog and start a crawl. Once the crawl is complete, export all internal data to a Google Sheet and paste it into the first sheet of this template (you’ll need to make a copy of the template first).
Next, click Extensions > Apps Script in the top-level menu, select findanddisplayduplicateswithURLs.gs from the list, and click Run. The script automatically sorts the pages to show duplicate titles, headings, and URLs, along with a summary, in the sheet titled “Summary of on-page duplication.”
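The template's script does the heavy lifting for you, but if you want to adapt the idea to your own sheet, here's a minimal sketch of how such a duplicate-finder can work in Apps Script. The sheet name, column layout, and function name here are illustrative assumptions, not the template's actual code:

```javascript
/**
 * Minimal sketch: group crawl rows by page title and report titles
 * that appear on more than one URL. Assumes a sheet named "Crawl"
 * with URLs in column A and titles in column B (illustrative only).
 */
function findDuplicateTitles() {
  const ss = SpreadsheetApp.getActiveSpreadsheet();
  const rows = ss.getSheetByName('Crawl').getDataRange().getValues().slice(1); // skip header

  // Map each normalized title to the list of URLs that use it
  const titleToUrls = {};
  rows.forEach(([url, title]) => {
    const key = String(title).trim().toLowerCase();
    if (!key) return;
    (titleToUrls[key] = titleToUrls[key] || []).push(url);
  });

  // Keep only titles shared by two or more URLs
  const duplicates = Object.entries(titleToUrls)
      .filter(([, urls]) => urls.length > 1)
      .map(([title, urls]) => [title, urls.length, urls.join(', ')]);

  // Write a summary to a separate sheet
  const out = ss.getSheetByName('Summary of on-page duplication') ||
      ss.insertSheet('Summary of on-page duplication');
  out.clearContents();
  out.getRange(1, 1, 1, 3).setValues([['Title', 'Count', 'URLs']]);
  if (duplicates.length) {
    out.getRange(2, 1, duplicates.length, 3).setValues(duplicates);
  }
}
```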
You can also take your analysis a step further by analyzing your metadata for semantic similarities. This is useful for identifying patterns that might qualify as near duplicates, like abusing title or heading patterns, which can become tedious for users. While Screaming Frog doesn’t currently offer this semantic analysis for on-page data, there are workarounds you can use in Google Sheets.
Once you’ve entered your data into the Google Sheets template I provided above, the template will automatically run a fuzzy matching (string similarity) formula to surface titles and headings that are similar but not exact. You can review these semantically related on-page elements in the “Near Duplicates” tab, alongside the number of potential matches and the most similar alternatives. This can help you pinpoint whether you’re overusing any particular title or heading format, e.g., “# Ways to connect {tool} with Looker Studio” or “Ultimate guide to X.”
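For the curious, here's a sketch of how fuzzy matching can be implemented as a custom Apps Script function, using a plain Levenshtein-based similarity ratio (the template's own formula may differ):

```javascript
/**
 * Sketch of a custom similarity function for Google Sheets:
 * returns a 0-1 score based on Levenshtein edit distance.
 * Usage in a cell: =TITLESIMILARITY(A2, A3)
 */
function TITLESIMILARITY(a, b) {
  a = String(a).toLowerCase();
  b = String(b).toLowerCase();
  const dist = levenshtein(a, b);
  const maxLen = Math.max(a.length, b.length) || 1;
  return 1 - dist / maxLen; // 1 = identical, 0 = completely different
}

// Standard dynamic-programming edit distance
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
          dp[i - 1][j] + 1,      // deletion
          dp[i][j - 1] + 1,      // insertion
          dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}
```

You can then call =TITLESIMILARITY(A2, A3) directly in a cell to score any pair of titles or headings.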
This type of duplication analysis is suitable for sites of any size or age. Ideally, you should conduct this analysis as part of regularly scheduled reporting processes, especially in organizations where:
There is more than one person (or team) creating content
There is no tracking in place for new content/web pages
The website/publication is more than a year old
These situations can lead to duplicate content (as a result of human error or a lack of coordination and tracking), so you’ll need to be more vigilant if your website (or your clients’) falls into one of them.
How to check for internal duplicate content
If you have concerns about duplicate content on your own website, go a step beyond metadata analysis (mentioned above) and evaluate the written content on your website for self-plagiarism and duplication.
Here are my favorite beginner-friendly, readily available methods for quickly identifying duplicate content on the same domain (choose according to your website’s size and budget):
Method 1: Similarity assessment based on ranking queries
Using the Search Analytics for Sheets extension (free, with a paid option for larger exports), pull a report of organic ranking queries, grouped by Query and Page.
Ideally, extract the report into the same Google Sheets template, in the tab titled “Query-based duplication.”
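Alternatively, if you'd rather skip the extension, you can pull the same Query/Page report straight from the Search Console API. Here's a hedged Apps Script sketch; the site URL is a placeholder, and your script's manifest would need the spreadsheets and webmasters.readonly OAuth scopes:

```javascript
/**
 * Sketch: pull a Query/Page report from the Search Console API.
 * Assumes a URL-prefix property; domain properties use the
 * 'sc-domain:example.com' form instead. Site URL is a placeholder.
 */
function pullQueryPageReport() {
  const siteUrl = encodeURIComponent('https://example.com/'); // placeholder
  const endpoint = 'https://www.googleapis.com/webmasters/v3/sites/' +
      siteUrl + '/searchAnalytics/query';

  const response = UrlFetchApp.fetch(endpoint, {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer ' + ScriptApp.getOAuthToken() },
    payload: JSON.stringify({
      startDate: '2024-01-01',
      endDate: '2024-03-31',
      dimensions: ['query', 'page'],
      rowLimit: 5000,
    }),
  });

  const rows = JSON.parse(response.getContentText()).rows || [];
  // Each row: keys = [query, page], plus clicks, impressions, ctr, position
  const values = rows.map((r) => [r.keys[0], r.keys[1], r.clicks, r.impressions]);
  if (values.length) {
    SpreadsheetApp.getActiveSpreadsheet()
        .getSheetByName('Query-based duplication')
        .getRange(2, 1, values.length, 4)
        .setValues(values);
  }
}
```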
Ensure that the OverlapSummary sheet is empty, then navigate to Extensions > Apps Script (top-level menu) and select CalculateQueryOverlap.gs. Run the script. The template will automatically apply a formula that sorts the data to show only the duplicated queries, the number of pages ranking for each, and the corresponding URLs.
In the sheet titled “OverlapSummary,” a custom Apps Script formula will populate a summary of the pages that overlap on ranking queries. Review pages that share more than 80-90% of their ranking queries: as is, they aren’t ranking for unique queries, so you can either consolidate their content or optimize them further to differentiate them.
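Conceptually, the overlap check boils down to comparing the query sets of every pair of pages. Here's a sketch of that logic; the sheet name and column layout are assumptions, and the script is an illustration rather than the template's actual CalculateQueryOverlap.gs code:

```javascript
/**
 * Sketch: compute pairwise ranking-query overlap between pages.
 * Assumes a sheet named "Query-based duplication" with queries in
 * column A and page URLs in column B (illustrative layout only).
 */
function calculateQueryOverlapSketch() {
  const rows = SpreadsheetApp.getActiveSpreadsheet()
      .getSheetByName('Query-based duplication')
      .getDataRange().getValues().slice(1); // skip header

  // Build a set of ranking queries per page
  const pageQueries = {};
  rows.forEach(([query, page]) => {
    (pageQueries[page] = pageQueries[page] || new Set()).add(query);
  });

  // Compare every pair of pages: overlap = shared queries / smaller set
  const pages = Object.keys(pageQueries);
  const results = [];
  for (let i = 0; i < pages.length; i++) {
    for (let j = i + 1; j < pages.length; j++) {
      const a = pageQueries[pages[i]];
      const b = pageQueries[pages[j]];
      const shared = [...a].filter((q) => b.has(q)).length;
      const overlap = shared / Math.min(a.size, b.size);
      if (overlap >= 0.8) { // flag pairs above the 80% threshold
        results.push([pages[i], pages[j], Math.round(overlap * 100) + '%']);
      }
    }
  }
  Logger.log(results); // or write to the "OverlapSummary" sheet
}
```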
Method 2: Run your website’s content through a plagiarism checker
Use a tool like Siteliner (paid, with a free tier for smaller sites) to get a self-plagiarism report on your entire website. Siteliner provides a summary based on each page’s importance (how many internal links it has) and the degree of copied content on it.
Siteliner also shows you the exact content that is duplicated on each page and the URLs of other pages containing the same content. You can even scale this approach for larger websites via Siteliner’s API.
Method 3: Duplicate content analysis via crawling tools
You can also check for duplicate (and near duplicate) content with Screaming Frog. By default, the crawler identifies exact duplicate pages. For smaller websites (under 500 URLs), this detection is free and automatic for every crawl.
Before starting a crawl, you can instruct the tool to also detect near-duplicate pages, based on a threshold of your choice (i.e., a percentage number, such as above 90% similarity). This feature requires a paid license and will return a custom duplication report.
Other crawler tools, like Sitebulb, offer similar duplicate content detection features.
Here’s a summary of the methods you can use for internal duplicate content detection, with regard to their respective costs and advantages:
| Method | Tool | Cost | Advantages |
| --- | --- | --- | --- |
| Query-based analysis | Google Search Console | Free | Uses your own search data; surfaces intent and keyword overlap even when page copy differs; easy to automate in Google Sheets |
| Plagiarism checker | Siteliner | Paid (free for sites under 250 URLs) | Shows the exact duplicated passages and the URLs that share them; weighs results by page importance; scales via API |
| Crawling tool | Screaming Frog, Sitebulb, or any other crawling tool | Can depend on the tool, but generally free for smaller sites and exact duplicates | Detects exact duplicates automatically; can find near duplicates at a configurable similarity threshold; fits into routine crawl audits |
How to check for external duplicate content
In addition to auditing for duplicate content on your own website, you should also actively search for instances where your content has been used (or copied entirely) elsewhere without your permission.
While external content duplication can sometimes be unintentional and harmless to your organic search performance (e.g., someone directly quoting a snippet from your blog), it can also be malicious in cases where the creator of the duplicate does not add any new information, is monetizing the copied work, or has not credited your original work.
One tool that I like to use for external content duplication analysis is Copyscape, which offers a straightforward and low-cost way to discover if your web content appears on other sites. There are several ways to use the service, like submitting a URL or sitemap, pasting a batch of URLs for analysis, or even via an API.
The tool then scans the web for pages that contain similar or identical content, reporting the number of matches for each page, the degree of risk, and the top-matching sites, and flagging any errors it encounters with the URLs you provide.
It not only identifies full copies (i.e., exact duplicates) of your content but also finds instances where parts of your text have been used. You can also set up automatic monitoring, which alerts you when the tool finds new copies of your content.
By proactively monitoring for duplicate content, you protect not only your intellectual property but also your site’s rankings on search engines, as duplicated content can chip away at your organic search performance.
How to fix duplicate content issues
Once you’ve identified duplicate content (using any of the methods listed above), you can take the following steps to reduce similarities in your website’s metadata or content.
Fix internal metadata duplication
Use unique heading tags: Headings help organize the content of a page and provide context for search engines. Similar to title tags, heading tags (H1, H2, H3, etc.) should also be unique for each page.
Canonicalize product URLs: For eCommerce websites, product URLs with parameter variations (like color, size, etc.) should be canonicalized to the main product URL. This tells search engines that the main product page is the authoritative source for product information; a quick way to spot-check your canonicals is sketched below.
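Here's that spot-check as a minimal Apps Script sketch using UrlFetchApp. The product URLs are hypothetical placeholders, and the regex is a naive check rather than a full HTML parser:

```javascript
/**
 * Sketch: spot-check whether parameterized product URLs declare the
 * expected canonical. URLs below are hypothetical placeholders.
 */
function checkCanonicals() {
  const checks = [
    { url: 'https://example.com/product?color=red', expected: 'https://example.com/product' },
    { url: 'https://example.com/product?size=xl', expected: 'https://example.com/product' },
  ];

  checks.forEach(({ url, expected }) => {
    const html = UrlFetchApp.fetch(url, { muteHttpExceptions: true }).getContentText();
    // Naive extraction of <link rel="canonical" href="...">
    const match = html.match(/<link[^>]+rel=["']canonical["'][^>]+href=["']([^"']+)["']/i);
    const canonical = match ? match[1] : '(none found)';
    Logger.log('%s -> canonical: %s (%s)', url, canonical,
        canonical === expected ? 'OK' : 'MISMATCH');
  });
}
```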
Fix internal duplicate content
Aim for mostly unique content: Ensure the content on each page is at least 70% unique. While some repurposing is expected in a website, there shouldn’t be patterns of content abuse with the aim of gaming the search algorithm.
Consolidate and redirect, don’t delete: When merging content, implement a 301 redirect instead of deleting the obsolete page to avoid loss of link equity and 404 page status errors.
Diversify your content portfolio: Avoid writing in the same style, structure, or format for all your website content. Diversify the types of content you create by experimenting with listicles, case studies, tutorials, and so on.
Fix external duplicate content
Signal originality to Google: Implement canonical tags on your website and make sure all content has a publishing date, revision dates (if any), and E-E-A-T-establishing information.
Improve your content further: Consider further enhancing your content by adding new perspectives and insights as a way to set it apart from the duplicated instances.
Get in touch with the other website’s owner: Some website owners would prefer to simply remove your content when requested, instead of risking escalation (more on that below).
Report infringing content to Google: In a video on content duplication, Google stated that, in certain circumstances, you can file a DMCA (Digital Millennium Copyright Act) takedown request. This only applies if copying your content is illegal (as with music, for example).
Take legal action: In case you have evidence of intellectual theft, take legal action under the DMCA or European Union Copyright Directive (EUCD) (depending on your location).
The goal of all these content deduplication initiatives is to improve your website’s organic search performance by raising content quality and enhancing user experience. As you progress toward that goal, you should also see improvements in the metrics I discussed earlier.
How to audit your website for duplicate content: The takeaways
Content duplication, while not always problematic, can be difficult to identify and manage, especially on larger, more established websites. It can manifest in several ways: internally or externally, through exact or partial duplication, and sometimes even unintentionally. In all cases, duplicate content should be monitored and, where possible, addressed to ensure optimal search performance and user experience.
To identify duplicate content issues, review your traffic patterns, search performance, and indexation in GSC data, as well as your page experience, engagement rates, and time on site in GA4.
Regularly audit your website for duplicate on-page metadata like titles and headings, but also utilize tools like Siteliner for internal duplicate content analysis and Copyscape to identify copies of your content on the web.
If you identify duplicate content internally, implement the necessary actions to make pages and their respective content unique. In the case of malicious copies of your content elsewhere, try to get in touch with the website owner before reporting it to Google or the relevant authorities in your area.
By implementing systems to monitor and audit your content for duplication, and taking corrective action when needed by producing unique and helpful content, you are ensuring that your website and business are operating sustainably now and into the future.
Lazarina is an organic marketing consultant specializing in SEO, CRO, and data science. She's worked with countless teams in B2B, SaaS, and big tech to improve their organic positioning. As an advocate of SEO automation, Lazarina speaks on webinars and at conferences and creates helpful resources for fellow SEOs to kick off their data science journey. Twitter | Linkedin