Detect Duplicate Web Pages Using Google Drive and Postgres

(200+ Reviews)
Regular price £10.99
⬇ Instant Digital Download
∞ Unlimited Downloads
★ Lifetime Access in Your Account

Detect Duplicate Web Pages with Google Drive and Postgres

This n8n workflow finds duplicate and near-duplicate web pages among the HTML files stored in a Google Drive folder, using Postgres with the pgvector extension for storage and similarity search. It automates the comparison of HTML content and produces CSV reports at both the chunk level and the page level. It is aimed at web content managers and SEO specialists who need to keep content unique across their sites.

What this workflow does:

  • Starts from a manual trigger and clears previous data from the Postgres tables that hold scraped pages and stored vectors.
  • Lists the files in a chosen Google Drive folder and filters them down to HTML documents.
  • Downloads each HTML file, extracts and cleans the visible text, then upserts it into a Postgres table.
  • Splits the cleaned text into chunks, generates an embedding for each chunk with Ollama, and stores the chunk vectors in PGVector together with their metadata.
  • Builds an HNSW index in Postgres and runs a similarity search to produce a chunk-match report, exported as a downloadable CSV.
  • Calculates a centroid embedding per page to flag probable duplicates and exports a page-level similarity report as CSV.
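To make the steps above more concrete, here is a minimal Python sketch of the text-cleaning, chunking, and page-level comparison logic. It is an illustration under stated assumptions, not the workflow's actual node code: the function names, the word-based chunking with `size` and `overlap`, the use of cosine similarity, and the `0.95` threshold are all hypothetical choices, and the embeddings are passed in as plain lists rather than fetched from Ollama.

```python
from html.parser import HTMLParser
import math


class VisibleText(HTMLParser):
    """Collects text that sits outside script/style-like tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def visible_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def chunk_words(text, size=200, overlap=40):
    """Split text into overlapping word windows (sizes are assumptions)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
        start += size - overlap
    return chunks


def centroid(vectors):
    """Element-wise mean of a page's chunk embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def probable_duplicates(page_vectors, threshold=0.95):
    """Return (page_a, page_b, similarity) for centroid pairs above threshold."""
    centroids = {page: centroid(vecs) for page, vecs in page_vectors.items()}
    pages = sorted(centroids)
    pairs = []
    for i, first in enumerate(pages):
        for second in pages[i + 1:]:
            score = cosine_similarity(centroids[first], centroids[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs
```

Comparing centroids keeps the page-level pass cheap, since each page contributes a single vector, while the chunk-level report remains the place to see which passages actually overlap.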

Use cases:

  • Web developers ensuring content uniqueness across a client's site portfolio.
  • SEO teams needing to quickly detect and eliminate duplicate content that could harm search rankings.
  • Content administrators monitoring regular updates to web pages for inadvertent duplication.

Technical details:

  • Connects to Google Drive with OAuth2 credentials to read the HTML files.
  • Uses Postgres to store and process the text and vector data; the pgvector extension must be installed.
  • Built from standard n8n nodes, including Set, Code, HTML, Filter, Postgres, and Sticky Note.
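For readers unfamiliar with pgvector, the following Python sketch shows the kind of SQL such a setup typically involves. The table name `page_chunks`, its columns, the embedding dimension `768`, and the psycopg-style named placeholders are assumptions for illustration; the workflow's own schema may differ. The functions only build strings, so nothing is executed against a database.

```python
def pgvector_setup_sql(table="page_chunks", dim=768):
    """Return setup statements for a hypothetical chunk table with an HNSW index."""
    return [
        # pgvector registers itself under the extension name "vector".
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id bigserial PRIMARY KEY, "
        "page_id text NOT NULL, "
        "chunk_text text NOT NULL, "
        f"embedding vector({dim}));",
        # vector_cosine_ops makes the index serve cosine-distance queries.
        f"CREATE INDEX IF NOT EXISTS {table}_embedding_hnsw "
        f"ON {table} USING hnsw (embedding vector_cosine_ops);",
    ]


def nearest_chunks_sql(table="page_chunks"):
    """Return a nearest-neighbour query; <=> is pgvector's cosine distance."""
    return (
        "SELECT id, page_id, "
        "1 - (embedding <=> %(query)s::vector) AS similarity "
        f"FROM {table} "
        "WHERE page_id <> %(page_id)s "
        "ORDER BY embedding <=> %(query)s::vector "
        "LIMIT %(k)s;"
    )


def to_vector_literal(values):
    """Format a list of numbers as a pgvector text literal, e.g. [1.0,0.5]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

Ordering by the distance expression with a `LIMIT` is the query shape that lets Postgres use the HNSW index; the `page_id <> ...` filter is there so a chunk is not matched against its own page.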

Once configured, the workflow runs end to end from a single manual trigger, so you can audit a folder of pages for duplicate content without comparing them by hand. Use the chunk-level and page-level CSV reports to decide which pages to merge, rewrite, or remove.
