Audit and fix duplicate content: A guide to helping Google choose what to rank

Lazarina Stoy
Jan 10, 2024
Updated: Sep 27, 2024

In a landscape where online presence is paramount, taking proactive steps to optimize your website by eliminating duplicate content can significantly improve your brand’s success at multiple stages of the customer journey.

For instance, at the start of their journey, users browse search results for the most relevant, high-quality websites, but duplicate content can reduce your brand’s search visibility and competitiveness.

And, as users proceed down the marketing funnel, duplicate content threatens the conversion once again due to a fractured/frustrating user experience, making it difficult to locate all the information about a given product, service, or topic on a single page.

In this guide, I’ll walk you through the nuances of content duplication, show you how to identify duplication on your website (and other properties on the web), and share some fixes you can implement to resolve any content overlap. This article primarily focuses on the type of duplicate content you have control over (the content on your own domain). However, external duplicate content can also diminish your search performance, so I'll address that as well.

Table of contents:

What is duplicate content?
- Types of duplicate content
- When is duplicate content a problem?
Duplicate content metrics
How to identify duplicate content
How to fix duplicate content issues
Takeaways

What is duplicate content?

A graphic that says “Signs of duplicate content: significant content overlap, structural and semantic similarities, lack of thought originality, similar ranking queries.”

Duplicate content refers to content that’s very similar (or identical) to other content on the web, either on another website (i.e., external duplicate content) or another page of the same website (i.e., internal duplicate content).

You can flag a page as duplicative if you see that it has one (or more) of the following characteristics:

Significant content overlap — The content portions of a web page are either exact copies or very similar versions of the original content.
Structural and semantic similarities — There are substantial similarities in the overall structure of the page, including the on-page metadata (e.g., URLs, headings, subheadings, paragraph structure, etc.), as well as semantic similarities (e.g., entities mentioned, arguments made, etc). Content in which minor subtleties are the only differentiator can signal that the pages are trying to appear different, but actually serve the same purpose.
Lack of thought originality — The content is informed by the same sources and presents similar perspectives.
Similarities in ranking queries — Both pages are ranking for an identical set of keywords (signaling a lack of unique content) and target the same intent and audience.

Types of duplicate content

In addition to the internal and external duplicate content classifications I mentioned above, you can also examine duplicate content through the lens of similarity (i.e., exact duplicate or near duplicate).

Duplicate content type	Exact duplicate	Near duplicate
Internal duplicate content	Duplicate on-page metadata; parameter URLs that are indexable and non-canonicalized, leading to the same page; feed content pages (e.g., blog archive, tag pages, category pages)	Paraphrasing copy/pasted content; content cannibalization; boilerplate content, like on-page banners and CTAs, product descriptions, or info boxes, that is very similar or identical across many pages; using a similar page and heading structure across multiple pages; overlapping phrasing, sentence, and paragraph structure over multiple pages
External duplicate content	Syndicated content published on multiple sites; unauthorized republishing, scraped, or “cloned” content (copied without permission or proper attribution); cross-domain duplication	Partial copying (e.g., when affiliate websites use similar or identical product descriptions, borrowing from the original website’s product description); paraphrased content, content structures, and website architecture (e.g., websites created to manipulate search engine results and not add value for users); fully AI-generated content, which is also inherently unoriginal, especially when paired with site structure copied from another website

It’s useful to assess whether a page is an exact duplicate or a near duplicate for the sake of prioritization, but also to understand how Google will perceive and rank the page.

To elaborate, exact duplicates (when left unmanaged) will almost certainly lower your rankings, worsen the user experience, and harm your website’s overall search performance. Near duplicates, by contrast, are considered with nuance, depending on the context and the degree of similarity.

When is duplicate content a problem?

Duplicate content isn’t always a cause for concern. In some instances, content duplication can be harmless and even intentional, like in the case of news syndication.

In general, problems occur when content duplication is malicious, misleading, hinders the user experience, and/or does not serve a specific purpose. When duplication is responsible and strategic, it can help you promote your content to a wider audience or serve other business functions.

Here’s a breakdown of when content duplication is (and isn’t) problematic.

Content duplication can be harmless when:

Syndication and repurposing are enabled — This allows you to share content across platforms (like social media or news aggregators) with permission and proper attribution. This can expand your content’s reach and enhance your overall brand.
Fair use is implemented — Duplicating content under fair use or licensing agreements is acceptable (e.g., quoting source material, educational materials, etc).
Canonicalization and URL management govern distribution — Use canonical tags and proper URL management to specify the original version for search engines. This applies to both re-publishing content on a platform like Medium and managing URLs with parameters internally.

Content duplication is problematic when:

Copy/pasted (or exact duplicate) content accounts for the majority of the content on a page — Word-for-word repetition across multiple pages or websites hinders search engine indexation and confuses users. Purposefully copying content from other websites without attribution is also punishable by law.
Paraphrased (or near duplicative) content accounts for the majority of the content on a page — Similar content with overlapping phrases, similar structure, or semantically-related arguments can frustrate users and worsen the search landscape.
Republishing is unauthorized or attribution is missing — Copying content without attribution or permission violates copyright laws and Google has said it “reserves” the right to penalize such websites.

“Duplicate content does happen. Now, that said, it’s certainly the case that if you do nothing but duplicate content, and you’re doing it in an abusive, deceptive, or malicious, or manipulative way, we do reserve the right to take action on spam.” — Matt Cutts, Google

Duplicate content metrics

There are several metrics that can show you whether your website’s content is duplicated internally or externally. Here’s what to look out for.

In Google Search Console:

Metric	Why
Pages indexed	Indexable pages that Google chooses not to serve can indicate poor content quality (i.e., duplicate content). Look out for high numbers of the following status issues: “Duplicate, Google chose different canonical than user” and “Crawled - currently not indexed”
Erratic or declining traffic performance	Traffic fluctuations or sudden drops can indicate duplicate content, cannibalization, or external content surpassing your own website’s content. Monitor for drastic changes in positions and clicks from search to help identify pages that are potentially affected.
Ranking query overlap for internal pages	Significant overlaps in ranking queries for internal pages can indicate lack of unique content and perspectives, causing diminished search traffic for one or both pages.

Screengrab from Google Search Console, demonstrating the status errors “Crawled - currently not indexed” and “Duplicate, Google chose different canonical than user” in the page indexing report. High numbers of these status issues could indicate duplicate content on your website.

In Google Analytics 4:

Metric	Why
Bounce rate on pages with similar terms or structure	High bounce rate on these pages could suggest that users are struggling to differentiate between the pages, causing a poor user experience.
Session duration declines for pages with similar terms or structure	Noticeable drops in session duration for pages in the same topic cluster could signal duplicate content problems.
Time on page	Track the average amount of time users spend on each page of your website. A longer average time on page suggests that users are engaged with the content and find it valuable.

Screengrab from GA4, showing the metrics and dimensions to monitor that can signal duplicate content issues, organized in a custom report. — Declining session duration and time on page can signal duplicate content issues.

How to identify duplicate content

To identify duplicate content internally, you’ll need to evaluate your website’s pages against one another. To check for external duplicate content, you’ll compare your pages against other pages on the web. In the following sections, I’ll take you through the process of doing exactly that.

How to check for internal metadata duplication

If you are concerned about duplicate pages on your website, the quickest way to validate the issue is to run a duplication analysis on your metadata. Check for similarities in elements like page titles, page headings and subheadings, meta descriptions, and URLs.

This is a simple, straightforward duplication analysis and can be performed with freemium crawlers like Screaming Frog, which extracts data for all the aforementioned elements on all your site’s pages.

To scan for duplicate metadata on your website, download and install Screaming Frog and start a crawl. Once the crawl is complete, export all internal data to a Google Sheet and paste it into the first sheet of this template (you’ll need to make a copy of the template first).

Next, click on Extensions > Apps Script in the top-level menu, select findanddisplayduplicateswithURLs.gs from the list, and click Run. The script automatically sorts the pages to show duplicate titles, headings, and URLs, as well as a summary, in the sheet titled “Summary of on-page duplication.”

Screengrab from the Google Sheets template “On-page Meta Data Internal Duplication Analysis Template - Lazarina Stoy,” sheet “Summary of on-page duplication,” after the function findanddisplayduplicateswithURLs.gs is called, showing the duplicate titles, meta descriptions, and headings identified, alongside the URLs where the duplicate metadata appears.
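If you'd rather script this check than run it in a spreadsheet, here is a minimal Python sketch of the same idea: group URLs by a normalized metadata value and keep only the groups with more than one URL. This is not the template's script, and the column names (`Address`, `Title 1`) are assumptions about what your crawl export looks like, so adjust them to match your file.

```python
from collections import defaultdict


def find_exact_duplicates(rows, field, url_field="Address"):
    """Return {normalized value: [urls]} for values shared by two or more URLs."""
    groups = defaultdict(list)
    for row in rows:
        # Normalize case and surrounding whitespace so trivial differences still match.
        value = (row.get(field) or "").strip().lower()
        if value:  # skip pages where the element is missing
            groups[value].append(row[url_field])
    return {value: urls for value, urls in groups.items() if len(urls) > 1}


# Hypothetical rows, shaped like a crawl export loaded with csv.DictReader.
rows = [
    {"Address": "https://example.com/a", "Title 1": "Ultimate guide to SEO"},
    {"Address": "https://example.com/b", "Title 1": "ultimate guide to SEO "},
    {"Address": "https://example.com/c", "Title 1": "Pricing"},
]

duplicates = find_exact_duplicates(rows, "Title 1")
for title, urls in duplicates.items():
    print(title, "->", urls)
```

You can call the same function once per element (titles, meta descriptions, H1s) by changing the `field` argument.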

You can also take your analysis a step further by analyzing your metadata for semantic similarities. This is useful for identifying patterns that might qualify as near duplicates, like abusing title or heading patterns, which can become tedious for users. While Screaming Frog doesn’t currently offer this semantic analysis for on-page data, there are workarounds you can use in Google Sheets.

Once you have entered your data in the Google Sheets template I provided above, the template will automatically run a fuzzy matching (string similarity) formula to surface titles or headings that are similar but not identical. You can review these closely related on-page elements in the “Near Duplicates” tab, alongside the number of potential matches and most similar alternatives. This can help you pinpoint whether you’re overusing any particular title or heading formats or structures, e.g., “# Ways to connect {tool} with Looker Studio” or “Ultimate guide to X,” like below.

Screengrab from template in Google Sheets, in Near Duplicates tab, that shows the title, and associated URL, alongside the most similar title from the list, and its similarity score.
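To make the idea of fuzzy matching more concrete, here is a small Python sketch that finds the closest other title for each title in a list. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, which may differ from the formula the template uses, and the 0.8 threshold is an arbitrary assumption you would tune for your own site.

```python
from difflib import SequenceMatcher


def most_similar(titles, threshold=0.8):
    """For each title, return its closest other title if the score meets the threshold."""
    results = {}
    for i, title in enumerate(titles):
        best_match, best_score = None, 0.0
        for j, other in enumerate(titles):
            if i == j:
                continue  # don't compare a title with itself
            score = SequenceMatcher(None, title.lower(), other.lower()).ratio()
            if score > best_score:
                best_match, best_score = other, score
        if best_score >= threshold:
            results[title] = (best_match, round(best_score, 2))
    return results


# Hypothetical titles that follow the same pattern, plus one unrelated title.
titles = [
    "5 Ways to connect Sheets with Looker Studio",
    "7 Ways to connect BigQuery with Looker Studio",
    "How to file a DMCA takedown",
]

near_duplicates = most_similar(titles)
for title, (match, score) in near_duplicates.items():
    print(f"{title!r} is closest to {match!r} (score {score})")
```

This compares every title with every other title, which is fine for a few thousand pages but gets slow on very large sites.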

This type of duplication analysis is suitable for sites of any size/age. Ideally, you should conduct this analysis as part of regularly scheduled reporting processes, especially in organizations where:

There is more than one person (or team) creating content
There is no tracking in place for new content/web pages
The website/publication is more than a year old

These scenarios can lead to duplicate content (as a result of human error or lack of coordination and tracking), so you’ll need to be more vigilant if your website (or your clients’) falls under one of these scenarios.

How to check for internal duplicate content

If you have concerns about duplicate content on your own website, go a step beyond metadata analysis (mentioned above) and evaluate the written content on your website for self-plagiarism and duplication.

Here are my favorite, beginner-friendly, and readily available methods to quickly identify duplicate content on the same domain (according to website size and budget):

Method 1: Similarity assessment based on ranking queries

Using the Search Analytics for Sheets Google Sheets extension (free, with paid option for larger exports), pull a report for organic ranking queries, grouped by Query and Page.

Screengrab from Search Analytics for Sheets extension, showing the settings for doing an export of search performance report with results grouped by query and page

Ideally, set the report to be extracted into the same Google Sheet template, in the Sheet tab titled “Query-based duplication,” as shown in the image above.

Ensure that the OverlapSummary sheet is empty, then navigate to Extensions (top-level menu) > Apps Script > CalculateQueryOverlap.gs and run the script. It sorts the data to show only the duplicated queries, the number of pages ranking for each of these, and the corresponding URLs.

The sheet titled “OverlapSummary” will then populate with a summary of the pages that overlap on ranking queries. Review pairs of pages that share more than 80-90% of their ranking queries: as is, they are not ranking for unique queries, so you can either consolidate their content or optimize each page further.

Screengrab from the Google Sheets template: On-page Meta Data Internal Duplication Analysis Template, sheet: OverlapSummary, after the function CalculateQueryOverlap.gs has been executed, showing page pairs and their query overlap percentage.
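For a rough sense of what a query overlap calculation involves, here is a Python sketch that works on (query, page) pairs like the ones in the export. The article doesn't spell out how the template defines its overlap percentage, so this sketch makes an assumption: overlap is the number of shared queries divided by the size of the smaller page's query set.

```python
from collections import defaultdict
from itertools import combinations


def query_overlap(rows, min_overlap=0.8):
    """Return page pairs whose shared queries cover at least min_overlap of the smaller set."""
    queries_by_page = defaultdict(set)
    for query, page in rows:
        queries_by_page[page].add(query.strip().lower())

    pairs = []
    for page_a, page_b in combinations(sorted(queries_by_page), 2):
        set_a, set_b = queries_by_page[page_a], queries_by_page[page_b]
        shared = set_a & set_b
        # Assumption: measure overlap against the page with fewer ranking queries.
        overlap = len(shared) / min(len(set_a), len(set_b))
        if overlap >= min_overlap:
            pairs.append((page_a, page_b, round(overlap, 2)))
    return pairs


# Hypothetical (query, page) rows from a search performance export.
rows = [
    ("duplicate content", "/guide"),
    ("duplicate content seo", "/guide"),
    ("fix duplicate content", "/guide"),
    ("duplicate content", "/checklist"),
    ("duplicate content seo", "/checklist"),
    ("canonical tag", "/canonicals"),
]

overlaps = query_overlap(rows)
print(overlaps)
```

Dividing by the smaller set flags a short page that is fully covered by a longer one, which is usually the consolidation candidate you want to see.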

Method 2: Run your website’s content through a plagiarism checker

Use a paid tool like Siteliner to get a self-plagiarism report on your entire website. Siteliner provides a summary based on each page’s importance (how many internal links it has) and the degree of copied content on it.

Screengrab from Siteliner Premium results summary page, which shows crawled internal URLs, alongside their title, the number of match words found on other pages, the number of match words found on other pages as a percentage of the overall content of the page, the number of match pages, and the degree of importance of the crawled page.

Siteliner also shows you the exact content that is duplicated on each page and the URLs of other pages containing the same content. You can even scale this approach for larger websites via Siteliner’s API.

Screengrab from Siteliner Premium results crawl, showing on a selected page the text that is duplicated internally, and the URLs of pages, where the same text is found.

Method 3: Duplicate content analysis via crawling tools

You can also check for duplicate (and near duplicate) content with Screaming Frog. By default, the crawler identifies exact duplicate pages. For smaller websites (under 500 URLs), duplicate content detection is free and automatic for every crawl.

Before starting a crawl, you can instruct the tool to also detect near-duplicate pages, based on a threshold of your choice (i.e., a percentage number, such as above 90% similarity). This feature requires a paid license and will return a custom duplication report.

Other crawler tools, like Sitebulb, offer similar duplicate content detection features.
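To illustrate what a similarity threshold means in practice, here is a simple Python sketch that scores two pieces of text by the share of three-word sequences (shingles) they have in common. This is a generic technique (Jaccard similarity over word shingles) and only an assumption about how such scoring could work; it is not a description of how Screaming Frog or Sitebulb calculate their scores.

```python
import re


def shingles(text, size=3):
    """Split text into a set of overlapping word n-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical page copy: two product variants and one unrelated page.
page_a = "Our red running shoes are light, durable, and built for long distances."
page_b = "Our blue running shoes are light, durable, and built for long distances."
page_c = "Read our returns policy before you send an item back to the warehouse."

similarity_ab = jaccard_similarity(page_a, page_b)
similarity_ac = jaccard_similarity(page_a, page_c)
print(f"a vs b: {similarity_ab:.2f}, a vs c: {similarity_ac:.2f}")
```

Note that on very short snippets like these, a single changed word removes several shingles at once, so scores drop quickly. On full-length pages, one changed word barely moves the score, which is why thresholds around 90% are workable there.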

Here’s a summary of the methods you can use for internal duplicate content detection, with regard to their respective costs and advantages:

Method	Tool	Cost	Advantages
Query-based analysis	Google Search Console	Free	Quick and easy way to identify non-direct duplication; offers a view of your site as Google sees it
Plagiarism checker	Siteliner	Paid (free for sites under 250 URLs)	Great for getting a snapshot of issues on medium- or large-sized websites; detailed analysis; hands-on interface; good for ad-hoc analysis
Crawling tool	Screaming Frog, Sitebulb, or any other crawling tool	Depends on the tool, but generally free for smaller sites and exact duplicates	Great for getting a snapshot of issues on medium- or large-sized websites; detailed analysis; great for enhancing ongoing reporting; additional metrics enable a more holistic analysis; historical comparison

How to check for external duplicate content

In addition to auditing for duplicate content on your own website, you should also actively search for instances where your content has been used (or copied entirely) elsewhere without your permission.

While external content duplication can sometimes be unintentional and harmless to your organic search performance (e.g., someone directly quoting a snippet from your blog), it can also be malicious in cases where the creator of the duplicate does not add any new information, is monetizing the copied work, or has not credited your original work.

One tool that I like to use for external content duplication analysis is Copyscape, which offers a straightforward and low-cost way to discover if your web content appears on other sites. There are several ways to use the service, like submitting a URL or sitemap, pasting a batch of URLs for analysis, or even via an API.

The tool then scans the web for any pages that contain similar or identical content, identifying the number of matches for each page, the degree of risk for the given page, top-matching sites, and even flagging any errors it encounters with the URLs provided.

Screencapture from Copyscape Premium's interface showing the results from the analysis, including URL crawled, number of matches, and degree of risk. The service also shows top matching websites and crawled URLs with errors, and allows you to download the results.

It not only identifies full copies (i.e., exact duplicates) of your content but also finds instances where parts of your text have been used. You can also set up automatic monitoring, which alerts you when the tool finds new copies of your content.

Screencapture from Copyscape Premium's result page from the external content duplication analysis of one URL, showing matching URLs from other websites, the level of similarity of the content, and associated duplicated text.

By proactively monitoring for duplicate content, you protect not only your intellectual property but also your site’s rankings on search engines, as duplicated content can chip away at your organic search performance.

How to fix duplicate content issues

Once you’ve identified duplicate content (using any of the methods listed above), you can take the following steps to reduce similarities in your website’s metadata or content.

Fix internal metadata duplication

Use unique heading tags: Headings help organize the content of a page and provide context for search engines. Similar to title tags, heading tags (H1, H2, H3, etc.) should also be unique for each page.
Canonicalize product URLs: For eCommerce websites, product URLs with parameter variations (like color, size, etc.) should be canonicalized to the main product URL. This tells search engines that the main product page is the authoritative source for product information.
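
As an illustration of the canonicalization point above, here is a minimal Python sketch that strips variation parameters from a product URL and builds the matching canonical link element. The parameter names in `VARIANT_PARAMS` are hypothetical, and which parameters are safe to drop depends entirely on how your store builds its URLs.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that only change the product variation.
VARIANT_PARAMS = {"color", "size"}


def canonical_url(url, variant_params=VARIANT_PARAMS):
    """Drop variation parameters and the fragment so variants share one canonical URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in variant_params
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Build the <link rel="canonical"> element for the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'


print(canonical_url("https://example.com/shoes?color=red&size=9"))
print(canonical_link_tag("https://example.com/shoes?color=red&page=2"))
```

Parameters that change what the page actually lists (like `page` in the second call) are kept, since removing them would point the canonical at different content.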

Fix internal duplicate content

Aim for mostly unique content: Ensure the content on each page is at least 70% unique. While some repurposing is expected in a website, there shouldn’t be patterns of content abuse with the aim of gaming the search algorithm.
Consolidate and redirect, don’t delete: When merging content, implement a 301 redirect instead of deleting the obsolete page to avoid loss of link equity and 404 page status errors.
Diversify your content portfolio: Avoid writing in the same style, the same structure, or format for all your website content. Diversify the types of content you create by experimenting with listicles, case studies, tutorials, and so on.
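
If you want a quick way to sanity-check the 70% guideline above, here is a rough Python sketch. It assumes one possible definition of "unique" (the share of a page's sentences that don't appear verbatim on any other page you compare it with), which is only an approximation and won't catch paraphrased copy.

```python
import re


def sentences(text):
    """Split text into normalized sentences (a rough split on ., !, and ?)."""
    parts = re.split(r"[.!?]+", text.lower())
    return [" ".join(part.split()) for part in parts if part.strip()]


def unique_share(page_text, other_pages):
    """Fraction of a page's sentences that appear on none of the other pages."""
    seen_elsewhere = set()
    for other in other_pages:
        seen_elsewhere.update(sentences(other))
    own = sentences(page_text)
    if not own:
        return 0.0
    unique = [sentence for sentence in own if sentence not in seen_elsewhere]
    return len(unique) / len(own)


# Hypothetical product pages that share one boilerplate sentence.
page = (
    "Free shipping on all orders. This trail shoe has a grippy sole. "
    "It weighs 240 grams. The upper dries quickly."
)
others = ["Free shipping on all orders. This road shoe is cushioned."]

share = unique_share(page, others)
print(f"{share:.0%} unique", "(OK)" if share >= 0.7 else "(review)")
```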

Fix external duplicate content

Signal originality to Google: Implement canonical tags on your website and make sure all content has a publishing date, revision dates (if any), and E-E-A-T-establishing information.
Improve your content further: Consider further enhancing your content by adding new perspectives and insights as a way to set it apart from the duplicated instances.
Get in touch with the other website’s owner: Some website owners would prefer to simply remove your content when requested, instead of risking escalation (more on that below).
Report infringing content to Google: In a video on content duplication, Google stated that, in certain circumstances, you can file a DMCA (Digital Millennium Copyright Act) takedown request. This only applies if copying your content is illegal (as in the case of music, for example).
Take legal action: If you have evidence of intellectual property theft, take legal action under the DMCA or European Union Copyright Directive (EUCD), depending on your location.

The goal for all of these content deduplication initiatives is improving your website’s organic search performance (by improving your content quality and improving user experience). As you progress towards that goal, you should also see improvements in the metrics I discussed earlier.

How to audit your website for duplicate content: The takeaways

Content duplication, while not always problematic, can be difficult to identify or manage, especially on larger and more established websites. It can manifest in several ways: internally or externally, as exact or partial duplication, and sometimes even unintentionally. In all cases, duplicate content issues should be monitored and, where possible, addressed to ensure optimal search performance and user experience.

To identify duplicate content issues, review your traffic patterns, search performance, and indexation in GSC data, as well as your page experience, engagement rates, and time on site in GA4.

Regularly audit your website for duplicate on-page metadata like titles and headings, but also utilize tools like Siteliner for internal duplicate content analysis and Copyscape to identify copies of your content on the web.

If you identify duplicate content internally, implement the necessary actions to make pages and their respective content unique. In the case of malicious copies of your content elsewhere, try to get in touch with the website owner before reporting it to Google or the relevant authorities in your area.

By implementing systems to monitor and audit your content for duplication, and taking corrective action when needed by producing unique and helpful content, you are ensuring that your website and business are operating sustainably now and into the future.

Lazarina Stoy - SEO & Data Science Consultant

Lazarina is an organic marketing consultant specializing in SEO, CRO, and data science. She's worked with countless teams in B2B, SaaS, and big tech to improve their organic positioning. As an advocate of SEO automation, Lazarina speaks on webinars and at conferences and creates helpful resources for fellow SEOs to kick off their data science journey.