How do you handle duplicate news articles at scale?
Quick Answer
Duplicate detection at scale uses text embeddings with cosine similarity for semantic matching, MinHash/LSH for near-duplicate detection, and title normalization. Articles above a similarity threshold (typically 0.85) are grouped, and the most informative version is selected as the primary.
Detailed Answer
Deduplication Strategies for News Aggregation
When 50+ outlets publish articles about the same Bitcoin price movement, you need robust deduplication.
The Problem at Scale
| Duplication Type | Example | % of Daily Articles |
|---|---|---|
| Exact copies | Syndicated wire stories (Reuters, AP) | 5-10% |
| Near-duplicates | Same story, slightly rewritten | 20-30% |
| Same-event coverage | Different angles on same news | 30-40% |
| Unique content | Original analysis, exclusive reporting | 20-30% |
Technique 1: Title-Based Matching (Fast, Basic)
- Normalize titles (lowercase, remove punctuation, strip common prefixes)
- Jaccard similarity on word sets
- Good for catching exact and near-exact copies
- Limitation: Different titles can cover the same story
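As a minimal sketch of the title-matching steps above, the following normalizes two titles and compares their word sets with Jaccard similarity. The prefix list, the 0.8 threshold, and the function names are illustrative assumptions, not values from this answer.

```python
import re

# Hypothetical prefix list; tune it to the feeds you actually ingest.
COMMON_PREFIXES = ("breaking:", "update:", "exclusive:", "just in:")


def normalize_title(title: str) -> str:
    """Lowercase, strip common prefixes, remove punctuation, collapse spaces."""
    t = title.lower().strip()
    stripped = True
    while stripped:  # a title may carry more than one prefix
        stripped = False
        for prefix in COMMON_PREFIXES:
            if t.startswith(prefix):
                t = t[len(prefix):].strip()
                stripped = True
    t = re.sub(r"[^\w\s]", "", t)  # drop punctuation and symbols
    return " ".join(t.split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two normalized titles."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def titles_match(t1: str, t2: str, threshold: float = 0.8) -> bool:
    """True when the normalized titles overlap at or above the threshold."""
    return jaccard(normalize_title(t1), normalize_title(t2)) >= threshold
```

For example, `titles_match("BREAKING: Bitcoin Surges Past $70,000!", "Bitcoin surges past $70,000")` returns `True`, because both titles normalize to the same string.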
Technique 2: Embedding Similarity (Recommended)
- Generate embeddings for each article (title + first 500 words)
- Store in a vector database (pgvector, Pinecone, Qdrant)
- For each new article, query top-K similar articles from last 48 hours
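- If cosine similarity > 0.85, flag as duplicate (tune the threshold for your embedding model)
Performance: With approximate nearest neighbor (ANN) search, each new article is compared against a small candidate set rather than the whole corpus. The 10,000+ comparisons per second cited here is a rough figure; actual throughput depends on the index type, embedding dimension, and hardware.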
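The lookup described above can be sketched in plain Python. This version does a brute-force scan over an in-memory list, which stands in for the vector database and its ANN index. The `IndexedArticle` type and `find_duplicates` function are hypothetical names, and the embeddings are assumed to come from whichever model you use.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class IndexedArticle:
    article_id: str
    published_at: datetime
    embedding: List[float]  # produced elsewhere from title + first 500 words


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(
    new_embedding: List[float],
    now: datetime,
    index: List[IndexedArticle],
    top_k: int = 5,
    threshold: float = 0.85,
    window: timedelta = timedelta(hours=48),
) -> List[Tuple[str, float]]:
    """Return up to top_k recent articles whose similarity exceeds the threshold."""
    cutoff = now - window
    scored = [
        (cosine_similarity(new_embedding, art.embedding), art.article_id)
        for art in index
        if art.published_at >= cutoff  # only compare against the last 48 hours
    ]
    scored.sort(reverse=True)  # most similar first
    return [(aid, score) for score, aid in scored[:top_k] if score > threshold]
```

In production, the scan inside `find_duplicates` would be replaced by a top-K query against the vector store, with the 48-hour window applied as a filter on the publication timestamp.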
Technique 3: MinHash / LSH (Scalable)
- Generate MinHash signatures from article n-grams
- Use Locality-Sensitive Hashing for fast approximate matching
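- Near-constant lookup cost per query (a fixed number of hash-table probes), although the number of candidates to verify grows as buckets fill up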
- Best for exact and near-duplicate detection
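A rough, standard-library-only sketch of MinHash with banded LSH follows. The parameters (64 hashes, 16 bands of 4 rows, word 3-grams) and all names are assumptions chosen for illustration. A dedicated library would normally replace this code.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

NUM_HASHES = 64  # signature length (assumed)
BANDS = 16       # number of LSH bands (assumed)
ROWS = NUM_HASHES // BANDS


def shingles(text: str, n: int = 3) -> Set[str]:
    """Word n-grams of the text; short texts become a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle: str, seed: int) -> int:
    """Seeded 64-bit hash, using the BLAKE2b salt as the seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str) -> List[int]:
    grams = shingles(text)
    return [min(_hash(g, seed) for g in grams) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


class LSHIndex:
    """Articles sharing at least one identical band land in the same bucket."""

    def __init__(self) -> None:
        self.buckets: Dict[Tuple[int, Tuple[int, ...]], Set[str]] = defaultdict(set)

    def _keys(self, signature: List[int]) -> Iterator[Tuple[int, Tuple[int, ...]]]:
        for band in range(BANDS):
            yield band, tuple(signature[band * ROWS:(band + 1) * ROWS])

    def add(self, article_id: str, signature: List[int]) -> None:
        for key in self._keys(signature):
            self.buckets[key].add(article_id)

    def candidates(self, signature: List[int]) -> Set[str]:
        found: Set[str] = set()
        for key in self._keys(signature):
            found |= self.buckets.get(key, set())
        return found
```

A bucket collision only nominates a candidate. Comparing the two signatures with `estimated_jaccard` before treating the pair as near-duplicates keeps false positives down.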
Deduplication Pipeline
New Article
  ↓
[Title normalization] → Exact match? → Skip
  ↓
[Generate embedding] → Similarity > 0.85? → Group with existing
  ↓
[MinHash signature] → LSH bucket collision? → Flag for review
  ↓
[Unique article] → Add to index
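The routing order of the pipeline can be sketched as a single function. The three checks are passed in as callables so the sketch stays independent of any particular title matcher, vector store, or LSH index. `Decision` and `route_article` are hypothetical names.

```python
from enum import Enum
from typing import Any, Callable


class Decision(Enum):
    SKIP = "skip"
    GROUP = "group with existing"
    REVIEW = "flag for review"
    ADD = "add to index"


def route_article(
    article: Any,
    is_exact_title_match: Callable[[Any], bool],
    max_embedding_similarity: Callable[[Any], float],
    has_lsh_collision: Callable[[Any], bool],
    threshold: float = 0.85,
) -> Decision:
    """Apply the checks in pipeline order and stop at the first one that fires."""
    if is_exact_title_match(article):
        return Decision.SKIP
    if max_embedding_similarity(article) > threshold:
        return Decision.GROUP
    if has_lsh_collision(article):
        return Decision.REVIEW
    return Decision.ADD
```

Running the cheap title check first means the embedding and MinHash steps only run for articles that the title check did not already resolve.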
Selecting the "Best" Version
When multiple articles cover the same story, score by:
- Content depth: Longer, more detailed articles win
- Source reputation: Higher-authority sources preferred
- Freshness: First to publish gets a small boost
- Original reporting: Exclusive quotes, data, or analysis
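One way to combine these four criteria is a weighted sum, sketched below. The weights, the 2,000-word cap, the 0-to-1 authority scale, and the `Candidate` fields are all assumptions for illustration. The answer itself only says that freshness should be a small boost.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    url: str
    word_count: int
    source_authority: float       # assumed scale: 0.0 (unknown) to 1.0 (top tier)
    published_first: bool
    has_original_reporting: bool  # exclusive quotes, data, or analysis


# Hypothetical weights; freshness is kept small on purpose.
WEIGHTS = {"depth": 0.4, "reputation": 0.3, "freshness": 0.1, "original": 0.2}


def score(c: Candidate, max_words: int = 2000) -> float:
    """Weighted score in [0, 1]; depth is capped so length alone cannot dominate."""
    depth = min(c.word_count, max_words) / max_words
    return (
        WEIGHTS["depth"] * depth
        + WEIGHTS["reputation"] * c.source_authority
        + WEIGHTS["freshness"] * (1.0 if c.published_first else 0.0)
        + WEIGHTS["original"] * (1.0 if c.has_original_reporting else 0.0)
    )


def pick_primary(group: List[Candidate]) -> Candidate:
    """Select the highest-scoring article in a duplicate group."""
    return max(group, key=score)
```

With these weights, a short wire story that published first can still lose to a longer piece with original analysis, which matches the intent of giving freshness only a small boost.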

