Building an AI-Powered News Aggregator: How We Filter 10,000+ Crypto Articles Down to What Actually Matters
Most crypto news aggregators dump everything into a single feed. The signal-to-noise ratio is terrible — 90% of articles are rewritten press releases, duplicate coverage of the same event, or irrelevant content that somehow made it past editorial review.
We built a two-stage AI pipeline that ingests raw RSS feeds and outputs a curated, scored, and enriched news feed. The system processes thousands of articles daily, filters out ~70% as noise, deduplicates aggressively at three levels, and tags the rest with sentiment, importance scores, tickers, and categories — all before a single human ever sees the feed.
Here's how it works under the hood.
The Problem
If you subscribe to CoinDesk, The Block, Decrypt, and a dozen other crypto/AI sources, you'll notice:
- Same story, 15 versions. Bitcoin ETF news gets rewritten by every outlet within minutes
- Low-quality filler. Sponsored content, listicles, and opinion pieces with no new information
- No prioritization. A protocol hack and a minor token listing get the same visual weight
- Zero context. No sentiment analysis, no ticker enrichment, no importance scoring
We needed a system that could answer: Is this article relevant? Is it a duplicate? How important is it? What's the market sentiment? Which tokens are affected?
Architecture Overview
```
RSS Sources (CoinDesk, The Block, etc.)
        │
        ▼
┌─────────────┐
│ RSS Parser  │ ← rss-parser, normalized extraction
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Dedup L1:   │ ← SHA-256 URL hash (exact match)
│ URL Hash    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Dedup L2:   │ ← Dice coefficient on trigrams (>70% = dupe)
│ Trigram Sim │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────────────┐
│ Stage 1: BitNet (fast, local)               │
│ 4 parallel checks:                          │
│ ├── Relevance (AI/crypto/DeFi? yes/no)      │
│ ├── Category (ai|crypto|defi|ai_crypto)     │
│ ├── Sentiment (bullish|bearish|neutral)     │
│ └── Importance (1-10 scale)                 │
└──────────────┬──────────────────────────────┘
               │ Gate: relevant=true AND importance >= 4
               ▼
┌─────────────┐
│ Dedup L3:   │ ← BitNet semantic comparison
│ Semantic    │   "Same story?" yes/no
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────────────┐
│ Stage 2: Claude Sonnet 4                    │
│ Structured JSON output:                     │
│ ├── summary (2-3 sentences)                 │
│ ├── keyTakeaways (5 bullets)                │
│ ├── category, sentiment, importance         │
│ ├── tags (max 10)                           │
│ ├── isActionable (should traders act?)      │
│ └── metaTitle + metaDescription (SEO)       │
└──────────────┬──────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────┐
│ Stage 3: Ticker Enrichment                  │
│ ├── Detect ticker symbols (BTC, ETH, SOL…)  │
│ └── Fetch current prices + 24h change       │
└──────────────┬──────────────────────────────┘
               │
               ▼
Published Feed (finalScore = 70% Claude + 30% BitNet)
```
Tech Stack
| Layer | Tech | Why |
|---|---|---|
| Backend framework | NestJS 10 | Modular, decorator-driven, great for pipeline services |
| Database | MongoDB 8 + Mongoose | Flexible schema for evolving article fields, text indexes |
| Job queue | Bull 4 (Redis-backed) | Retry logic, exponential backoff, job cleanup |
| Fast AI model | BitNet (local) | Sub-second inference, zero API cost, runs on CPU |
| Deep AI model | Claude Sonnet 4 (API) | Structured JSON extraction, reliable categorization |
| Frontend | Next.js 15 (App Router) | SSR, ISR for SEO, React Server Components |
| Scheduling | NestJS Schedule | Cron-based RSS fetch intervals |
Stage 1: BitNet — The Fast Gatekeeper
BitNet is a local model that runs on CPU. It's not as smart as Claude, but it's fast — we run 4 checks in parallel and get results in under a second.
```typescript
const [relevance, classification, sentiment, importanceResult] = await Promise.all([
  this.bitnetService.checkRelevance(text),
  this.bitnetService.classify(text),
  this.bitnetService.analyzeSentiment(text),
  this.bitnetService.scoreImportance(text),
]);
```
Each check sends a structured prompt to the BitNet completion endpoint and parses the response:
- Relevance → "Yes" or "No" + confidence percentage
- Category → one of `ai`, `crypto`, `defi`, or `ai_crypto` (mixed)
- Sentiment → `bullish`, `neutral`, or `bearish`, based on market impact
- Importance → numeric score from 1 to 10
The gate logic is simple: if `isRelevant === false` or `importance < 4`, the article stops here. It gets marked `isPublished: false` and never reaches Stage 2. This saves Claude API costs by filtering out ~60-70% of articles at the cheapest possible layer.
Fail-open design: if BitNet is down or errors out, we default to `isRelevant: true` and `importance: 5` — letting Claude make the final call. We'd rather process a few extra articles than miss breaking news.
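The gate plus the fail-open default can be sketched as one pure function. This is an illustrative sketch, not the production code — `passesGate`, `StageOneResult`, and the inline threshold are names we've assumed for clarity:

```typescript
interface StageOneResult {
  isRelevant: boolean;
  importance: number;
}

// Sketch of the Stage 1 gate. A null result means BitNet was down or
// returned something unparsable; we then fail open with borderline
// defaults (relevant, importance 5) so Claude can make the final call.
function passesGate(result: StageOneResult | null): boolean {
  const { isRelevant, importance } = result ?? { isRelevant: true, importance: 5 };
  return isRelevant && importance >= 4;
}
```

Note that the fail-open default of 5 deliberately sits above the gate threshold of 4, so an outage never silently drops articles.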
Three Levels of Deduplication
Crypto news has a severe duplication problem. When the SEC approves a Bitcoin ETF, you'll get 40+ articles in an hour. We handle this at three levels:
Level 1: URL Hash
```
SHA-256(normalize(url)) → unique constraint in MongoDB
```
Catches exact duplicates from the same source. Zero-cost, handled at insert time.
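A minimal sketch of that key, assuming normalization strips fragments, common tracking parameters, and trailing slashes — the production rule set may differ:

```typescript
import { createHash } from 'crypto';

// Hypothetical normalizer: the goal is that syndicated links to the
// same page collapse to a single hash value.
function urlHash(rawUrl: string): string {
  const url = new URL(rawUrl);
  url.hash = ''; // drop #fragment
  for (const p of ['utm_source', 'utm_medium', 'utm_campaign', 'ref']) {
    url.searchParams.delete(p); // drop tracking noise
  }
  const normalized = url.toString().replace(/\/+$/, '');
  return createHash('sha256').update(normalized).digest('hex');
}
```

Because the hash column carries a unique MongoDB index, a duplicate simply fails at insert time — no lookup query needed.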
Level 2: Trigram Similarity
We compute the Dice coefficient on character trigrams of article titles:
```typescript
private trigramSimilarity(a: string, b: string): number {
  const trigramsA = this.getTrigrams(a);
  const trigramsB = this.getTrigrams(b);
  let intersection = 0;
  for (const trigram of trigramsA) {
    if (trigramsB.has(trigram)) intersection++;
  }
  // Dice coefficient: 2·|A ∩ B| / (|A| + |B|)
  return (2 * intersection) / (trigramsA.size + trigramsB.size);
}
```
Threshold: >70% similarity = duplicate. Candidates come from the past 48 hours, excluding the current article by `_id` rather than by title — so articles with identical titles from different sources are still correctly detected. This catches the "same headline, different outlet" pattern that dominates crypto media.
Level 3: Semantic Similarity
For candidates that pass a loose trigram filter (>35% match), we ask BitNet directly:
```
Title A: "SEC Approves First Spot Bitcoin ETF in Historic Decision"
Title B: "Bitcoin ETF Finally Gets Green Light from US Regulators"
Answer: YES (same story) or NO (different stories)
```
This catches semantically identical stories with completely different wording — something trigram matching alone can't handle.
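In practice the model's reply needs defensive parsing. A sketch of that flow, with `askBitNet` standing in for the real completion-endpoint wrapper (the name and prompt wiring are assumptions):

```typescript
// Anything that doesn't clearly start with YES counts as "not a dupe" —
// consistent with the fail-open philosophy: when uncertain, keep the article.
function parseSameStoryAnswer(raw: string): boolean {
  return raw.trim().toUpperCase().startsWith('YES');
}

async function isSemanticDuplicate(
  titleA: string,
  titleB: string,
  askBitNet: (prompt: string) => Promise<string>, // hypothetical wrapper
): Promise<boolean> {
  const prompt =
    `Title A: "${titleA}"\n` +
    `Title B: "${titleB}"\n` +
    `Answer: YES (same story) or NO (different stories)`;
  return parseSameStoryAnswer(await askBitNet(prompt));
}
```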
Defense in Depth: Query-Time & Frontend Dedup
Even with three ingestion-time levels, legacy duplicates may already exist in the database. We add two more safety nets:
- Query-time dedup — the `findAll` endpoint over-fetches articles and deduplicates by `originalTitle` before returning results. The first (highest-ranked) article per title wins.
- Frontend dedup — the segment page filters articles by title before rendering, ensuring no visual duplicates even if the API returns them.
Stage 2: Claude — The Deep Analyst
Articles that survive the BitNet gate and all three dedup checks get sent to Claude Sonnet 4 for structured analysis.
The prompt asks Claude to return a JSON object with:
| Field | Type | Purpose |
|---|---|---|
| `summary` | string | 2-3 sentence factual summary |
| `keyTakeaways` | string[] | 5 bullet points |
| `category` | enum | `ai` \| `crypto` \| `defi` \| `ai_crypto` |
| `sentiment` | enum | `bullish` \| `bearish` \| `neutral` |
| `importance` | 1-10 | 1-3 minor, 4-5 noteworthy, 6-7 significant, 8-9 breaking, 10 defining |
| `tags` | string[] | Up to 10 lowercase tags |
| `isActionable` | boolean | Should traders consider taking action? |
| `metaTitle` | string | SEO-optimized title (<70 chars) |
| `metaDescription` | string | SEO meta description (<160 chars) |
Post-processing validates and sanitizes everything: caps importance to 1-10, validates sentiment against the enum, limits tags to 10, truncates SEO fields to character limits.
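That validation layer is mostly clamping and truncation. A sketch under assumed field names (the real schema may differ slightly):

```typescript
const SENTIMENTS = ['bullish', 'bearish', 'neutral'] as const;

interface SanitizedAnalysis {
  sentiment: (typeof SENTIMENTS)[number];
  importance: number;
  tags: string[];
  metaTitle: string;
  metaDescription: string;
}

// Never trust model output: clamp numbers, whitelist enums, cap lengths.
function sanitizeAnalysis(raw: Record<string, unknown>): SanitizedAnalysis {
  const sentiment = SENTIMENTS.includes(raw.sentiment as any)
    ? (raw.sentiment as SanitizedAnalysis['sentiment'])
    : 'neutral'; // unknown values fall back to neutral
  const importance = Math.min(10, Math.max(1, Math.round(Number(raw.importance) || 5)));
  const tags = (Array.isArray(raw.tags) ? raw.tags : [])
    .map((t) => String(t).toLowerCase())
    .slice(0, 10);
  return {
    sentiment,
    importance,
    tags,
    metaTitle: String(raw.metaTitle ?? '').slice(0, 70),
    metaDescription: String(raw.metaDescription ?? '').slice(0, 160),
  };
}
```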
Fallback strategy: if Claude fails (rate limit, timeout, malformed response), we publish with BitNet data only. The article still gets out — just with less enrichment.
Final Score
```typescript
finalScore = Math.round(claudeImportance * 0.7 + bitnetImportance * 0.3);
```
Why 70/30? Claude's analysis is more nuanced and considers the full article body. BitNet only sees the first 1,500 characters. But BitNet provides a useful sanity check — if both models agree an article is important, we're more confident in the score.
Stage 3: Ticker Enrichment
After AI processing, we scan the full article text for known ticker symbols (BTC, ETH, SOL, AVAX, etc.) and fetch current prices + 24h change from market data APIs.
This enables two features in the frontend:
- Inline ticker context — readers see that an article mentioning ETH was written when ETH was at $3,200 (+4.2%)
- Trending tickers — aggregate ticker mentions across articles from the past 7 days, ranked by mention count
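Ticker detection itself is a dictionary scan, not AI. A simplified version — the real symbol list is much longer than this excerpt:

```typescript
// Small excerpt of the known-symbol dictionary, for illustration only.
const KNOWN_TICKERS = new Set(['BTC', 'ETH', 'SOL', 'AVAX']);

// Match word-bounded upper-case runs (optionally $-prefixed) and keep
// only known symbols, which filters out look-alikes such as "SEC" or "CEO".
function detectTickers(text: string): string[] {
  const found = new Set<string>();
  for (const m of text.matchAll(/\$?\b([A-Z]{2,6})\b/g)) {
    if (KNOWN_TICKERS.has(m[1])) found.add(m[1]);
  }
  return [...found];
}
```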
Query & Filtering API
The REST API exposes flexible filtering for the frontend:
```
GET /api/articles?segment=crypto&sentiment=bullish&minImportance=7&sort=importance&limit=20
```
| Parameter | Type | Description |
|---|---|---|
| `segment` | `ai` \| `crypto` \| `all` | Maps to category groups |
| `category` | string | Exact category override |
| `sentiment` | `bullish` \| `bearish` \| `neutral` | Market sentiment filter |
| `ticker` | string | Filter by mentioned ticker (e.g., `BTC`) |
| `tag` | string | Filter by tag |
| `search` | string | Full-text search (weighted: title 10x, summary 5x, tags 3x) |
| `featured` | boolean | Only featured articles |
| `minImportance` | 1-10 | Minimum importance threshold |
| `sort` | enum | Sort order (e.g., `importance`) |
Segment-to-Category Mapping
The `segment` parameter is a UX convenience — users think in terms of "AI news" or "Crypto news", but the database has more granular categories:
```typescript
SEGMENT_CATEGORIES = {
  ai: ['ai', 'ai_crypto'],
  crypto: ['crypto', 'defi', 'ai_crypto'],
  // 'all' applies no filter
};
```
Notice that `ai_crypto` appears in both segments. An article about "AI agents trading on-chain" is relevant to both audiences.
MongoDB Indexes
Performance at scale requires the right indexes. We use compound indexes that match our most common query patterns:
```typescript
{ isPublished: 1, publishedAt: -1 }                 // Default chronological feed
{ isPublished: 1, category: 1, publishedAt: -1 }    // Segment-filtered feed
{ isPublished: 1, finalScore: -1 }                  // Trending sort
{ isPublished: 1, isFeatured: 1, publishedAt: -1 }  // Featured articles
{ isPublished: 1, tickers: 1, publishedAt: -1 }     // Ticker filter
{ isPublished: 1, tags: 1, publishedAt: -1 }        // Tag filter

// Weighted text index for search
{ originalTitle: 'text', summary: 'text', tags: 'text' }
// weights: { originalTitle: 10, summary: 5, tags: 3 }
```
Every compound index starts with `isPublished: 1` because the public API never queries unpublished articles.
Frontend Integration
The Next.js frontend uses a custom `useFilters` hook that syncs filter state with URL search params:
```typescript
const { sentiment, importance, sort, setSentiment, setImportance, setSort } = useFilters();

// Changes URL to /news?sentiment=bullish&minImportance=7&sort=importance
setSentiment('bullish');
setImportance(7);
setSort('importance');
```
This makes filtered views shareable and bookmarkable. Opening a link like `/news?sentiment=bearish&sort=importance` shows exactly the filtered view the sender intended.
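The hook itself wraps Next.js router state, but the shareability comes from the URL-serialization half, which is pure and easy to sketch (function and field names here are assumptions, not the actual hook internals):

```typescript
interface Filters {
  sentiment?: string;
  minImportance?: number;
  sort?: string;
}

// Serialize filter state into a shareable /news URL.
function buildFilterUrl(filters: Filters): string {
  const params = new URLSearchParams();
  if (filters.sentiment) params.set('sentiment', filters.sentiment);
  if (filters.minImportance !== undefined) {
    params.set('minImportance', String(filters.minImportance));
  }
  if (filters.sort) params.set('sort', filters.sort);
  const qs = params.toString();
  return qs ? `/news?${qs}` : '/news';
}
```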
Results
After running this pipeline for several weeks:
- ~70% of articles filtered at Stage 1 — mostly irrelevant content, sponsored posts, and articles below importance threshold 4
- ~15% caught by deduplication — across all three levels, with Level 2 (trigram) catching the most
- ~15% published — the final curated feed
- Average processing time: ~200ms for BitNet (Stage 1), ~2-3s for Claude (Stage 2)
- Cost: ~$0.002 per article that reaches Claude. Articles filtered by BitNet cost nothing beyond compute
Key Takeaways
- Two-stage AI is the sweet spot. A fast local model as gatekeeper + a powerful API model for deep analysis gives you cost efficiency without sacrificing quality
- Deduplication needs multiple strategies. URL hash alone misses 80% of duplicates. Trigram + semantic catches almost everything at ingestion time, and query-time + frontend dedup handles any legacy data
- Fail-open is better than fail-closed for news. Missing a breaking story is worse than showing a borderline article
- Weighted scoring builds trust. When two independent models agree on importance, the signal is stronger
- Segment abstraction matters for UX. Users don't think in database categories — they think in topics
Who This Is For
- Teams building content aggregation products who need more than RSS + chronological sort
- Crypto/fintech startups looking for a production-tested AI pipeline architecture
- Engineers evaluating local vs. API models — this is a practical example of using both
The full implementation runs as part of y0.exchange, an AI-to-blockchain infrastructure platform. The news module is one piece of a larger system that connects AI agents to crypto wallets.
Built with NestJS, MongoDB, Bull, BitNet, and Claude Sonnet 4. Running in production at news.y0.exchange.