{"product_id":"detect-duplicate-web-pages-using-google-drive-and-postgres","title":"Detect Duplicate Web Pages Using Google Drive and Postgres","description":"\u003ch3\u003eDetect Duplicate Web Pages with Google Drive and Postgres\u003c\/h3\u003e\n\n\u003cp\u003eUse n8n to identify duplicate web pages among the HTML files stored in a Google Drive folder, with Postgres as the data store. This workflow automates the detection of similar HTML content and produces CSV reports from a single manual run. It suits web content managers and SEO specialists who need to keep content unique across their sites.\u003c\/p\u003e\n\n\u003ch3\u003eWhat this workflow does:\u003c\/h3\u003e\n\u003cul\u003e\n  \u003cli\u003eStarts from a manual trigger and clears past data from the Postgres tables that hold scraped pages and stored vectors.\u003c\/li\u003e\n  \u003cli\u003eLists and filters HTML documents in a chosen Google Drive folder.\u003c\/li\u003e\n  \u003cli\u003eDownloads each HTML file, extracts and cleans the visible text, then upserts it into a Postgres table.\u003c\/li\u003e\n  \u003cli\u003eGenerates embeddings with Ollama from cleaned text chunks and populates PGVector with these chunk vectors alongside relevant metadata.\u003c\/li\u003e\n  \u003cli\u003eBuilds an HNSW index in Postgres and runs a similarity search to produce a chunk-match report, available as a downloadable CSV.\u003c\/li\u003e\n  \u003cli\u003eCalculates per-page centroid embeddings to flag probable duplicates and exports a page-level similarity report in CSV format.\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch3\u003eUse cases:\u003c\/h3\u003e\n\u003cul\u003e\n  \u003cli\u003eWeb developers ensuring content uniqueness across a client's site portfolio.\u003c\/li\u003e\n  \u003cli\u003eSEO teams needing to quickly detect and eliminate duplicate content that could harm search 
rankings.\u003c\/li\u003e\n  \u003cli\u003eContent administrators monitoring regular updates to web pages for inadvertent duplication.\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003ch3\u003eTechnical details:\u003c\/h3\u003e\n\u003cul\u003e\n  \u003cli\u003eConnects to Google Drive with OAuth2 credentials to read the HTML files.\u003c\/li\u003e\n  \u003cli\u003eStores and processes the text and vector data in Postgres, which requires the pgvector extension.\u003c\/li\u003e\n  \u003cli\u003eUses the n8n Set, Code, HTML, Filter, Postgres, and Sticky Note nodes.\u003c\/li\u003e\n\u003c\/ul\u003e\n\n\u003cp\u003eThis n8n workflow replaces manual duplicate checks with a repeatable run, helping you keep your web content distinct.\u003c\/p\u003e","brand":"N8N Commerce","offers":[{"title":"Default Title","offer_id":45649479172275,"sku":"N8N-16835","price":10.99,"currency_code":"GBP","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0749\/6279\/6723\/files\/ZrvXgnxZbcG6pgZAIeNvT_a8a9b29a84fc43f88a0590e21a16b4ca.jpg?v=1783242356","url":"https:\/\/buyflowscripts.com\/products\/detect-duplicate-web-pages-using-google-drive-and-postgres","provider":"N8N Commerce","version":"1.0","type":"link"}