{"product_id":"detect-duplicate-web-pages-google-drive-postgres-workflow","title":"Detect Duplicate Web Pages: Google Drive \u0026 Postgres Workflow","description":"\u003ch3\u003eUncover Duplicate Content Effortlessly with Google Drive \u0026amp; Postgres Workflow\u003c\/h3\u003e\n\u003cp\u003eStreamline web content management with the Detect Duplicate Web Pages: Google Drive \u0026amp; Postgres Workflow. It identifies semantically duplicate HTML web pages stored in your Google Drive and flags them for review using PGVector similarity search. Embeddings are generated locally with Ollama, and the results are exported as CSV reports that make duplicate content easier to review and manage.\u003c\/p\u003e\n\n\u003ch3\u003eWhat this workflow does\u003c\/h3\u003e\n\u003cul\u003e\n  \u003cli\u003eBegins with a manual step that clears the existing PGVector embeddings and scraped page text tables in Postgres.\u003c\/li\u003e\n  \u003cli\u003eLists the HTML files in a specified Google Drive folder and selects the target documents for batch processing.\u003c\/li\u003e\n  \u003cli\u003eDownloads each HTML document, extracts and cleans its main body text, and upserts the text into a Postgres table for scraped pages.\u003c\/li\u003e\n  \u003cli\u003eRetrieves the cleaned text from Postgres, splits it into overlapping chunks, and attaches metadata such as sheet_id, file_name, and file_url to each chunk.\u003c\/li\u003e\n  \u003cli\u003eGenerates embeddings locally using Ollama, deduplicates processed pages, and stores chunk vectors and metadata in Postgres via PGVector.\u003c\/li\u003e\n  \u003cli\u003eBuilds an HNSW index in Postgres, computes similarity matches, and produces a pairwise page report in CSV format.\u003c\/li\u003e\n  \u003cli\u003eCalculates page-level centroid embeddings to identify highly similar page pairs and exports them as a CSV duplicate report.\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch3\u003eUse cases\u003c\/h3\u003e\n\u003cul\u003e\n  \u003cli\u003e\n\u003cstrong\u003eSEO Optimization:\u003c\/strong\u003e Detect duplicated or near-duplicate web pages so that each page offers unique content.\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eWebsite Migration:\u003c\/strong\u003e Identify duplicate pages during a migration to reduce content redundancy.\u003c\/li\u003e\n  \u003cli\u003e\n\u003cstrong\u003eContent Management:\u003c\/strong\u003e Help content managers keep content distinct across web portfolios.\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch3\u003eTechnical details\u003c\/h3\u003e\n\u003cul\u003e\n  \u003cli\u003eIntegrations used: \u003cstrong\u003eGoogle Drive, Postgres, PGVector, Ollama\u003c\/strong\u003e\n\u003c\/li\u003e\n  \u003cli\u003eNodes employed: \u003cstrong\u003eset, code, html, filter, postgres, sticky note\u003c\/strong\u003e\n\u003c\/li\u003e\n\u003c\/ul\u003e","brand":"N8N Commerce","offers":[{"title":"Default Title","offer_id":45590339911859,"sku":"N8N-16540","price":17.99,"currency_code":"GBP","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0749\/6279\/6723\/files\/foOmBojo1WZVnBL4nFcH4_264c99c7cecf4eb5981dc43110754001.jpg?v=1782119004","url":"https:\/\/buyflowscripts.com\/products\/detect-duplicate-web-pages-google-drive-postgres-workflow","provider":"N8N Commerce","version":"1.0","type":"link"}