Detect Duplicate Web Pages: Google Drive & Postgres Workflow

Regular price £17.99
Instant digital download
Unlimited downloads
Lifetime access in your account

Uncover Duplicate Content with the Google Drive & Postgres Workflow

Streamline web content management with the Detect Duplicate Web Pages: Google Drive & Postgres Workflow. It identifies semantically duplicate HTML pages stored in your Google Drive and flags them for review using PGVector similarity search. Vector embeddings are generated locally with Ollama. Because matching is based on embeddings rather than exact text, the workflow can surface pages that say the same thing in different words.

What this workflow does

  • Begins by manually clearing the existing PGVector embeddings and scraped page text tables in Postgres.
  • Lists the HTML files in a specified Google Drive folder and queues them for batch processing.
  • Downloads, extracts, and cleans the main body text from each HTML document before upserting it into a Postgres table for scraped pages.
  • Retrieves the cleaned text from Postgres, splits it into overlapping chunks, and appends associated metadata such as sheet_id, file_name, and file_url.
  • Generates embeddings locally using Ollama, deduplicates processed pages, and updates Postgres with chunk vectors and metadata via PGVector.
  • Builds an HNSW index in Postgres to compute similarity matches, producing a comprehensive pairwise page report in CSV format.
  • Calculates page-level centroid embeddings to identify highly similar page pairs and exports these findings in a detailed CSV duplicate report.
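
The chunking and centroid steps above can be illustrated with a small Python sketch. This is not the workflow's own code: the workflow runs these steps in n8n with Ollama and PGVector, whereas the sketch below uses plain Python lists as stand-in embeddings. The function names, the character-based chunk size and overlap, and the 0.95 similarity threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping character windows (sizes are assumed values)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def centroid(vectors):
    """Average a page's chunk embeddings into one page-level vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(page_vectors, threshold=0.95):
    """Return (page_a, page_b, score) for page pairs at or above the threshold.

    page_vectors maps a file name to its list of chunk embeddings.
    The threshold is a hypothetical cut-off, not the workflow's setting.
    """
    centroids = {name: centroid(vecs) for name, vecs in page_vectors.items()}
    pairs = []
    for a, b in combinations(sorted(centroids), 2):
        score = cosine_similarity(centroids[a], centroids[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy 2-dimensional "embeddings" standing in for real Ollama vectors.
pages = {
    "a.html": [[1.0, 0.0], [0.9, 0.1]],
    "b.html": [[1.0, 0.0], [0.95, 0.05]],
    "c.html": [[0.0, 1.0]],
}
duplicates = similar_pairs(pages)
```

In this toy data only a.html and b.html point in nearly the same direction, so they are the only pair reported. Comparing every pair of centroids is fine for a small folder; the HNSW index in the workflow exists to avoid that exhaustive comparison at larger scale.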

Use cases

  • SEO Optimization: Ensure unique content for SEO effectiveness by detecting duplicated or similar web pages.
  • Website Migration: During migrations, identify duplicate pages to streamline and reduce content redundancy.
  • Content Management: Aid content managers in maintaining diverse and distinct content across web portfolios.

Technical details

  • Integrations used: Google Drive, Postgres, PGVector, Ollama
  • Nodes employed: set, code, html, filter, postgres, sticky note