Building an AI-Powered News Aggregator: How We Filter 10,000+ Crypto Articles Down to What Actually Matters
Most crypto news aggregators dump everything into a single feed. The signal-to-noise ratio is terrible — 90% of articles are rewritten press releases, duplicate coverage of the same event, or irrelevant content that somehow made it past editorial review.
We built a two-stage AI pipeline that ingests raw RSS feeds and outputs a curated, scored, and enriched news feed. The system processes thousands of articles daily, filters out ~70% as noise, deduplicates aggressively at three levels, and tags the rest with sentiment, importance scores, tickers, and categories — all before a single human ever sees the feed.
Here's how it works under the hood.
The Problem
If you subscribe to CoinDesk, The Block, Decrypt, and a dozen other crypto/AI sources, you'll notice:
- Same story, 15 versions. Bitcoin ETF news gets rewritten by every outlet within minutes
- Low-quality filler. Sponsored content, listicles, and opinion pieces with no new information
- No prioritization. A protocol hack and a minor token listing get the same visual weight
- Zero context. No sentiment analysis, no ticker enrichment, no importance scoring
We needed a system that could answer: Is this article relevant? Is it a duplicate? How important is it? What's the market sentiment? Which tokens are affected?
Architecture Overview
```
RSS Sources (CoinDesk, The Block, etc.)
        │
        ▼
┌─────────────┐
│ RSS Parser  │ ← rss-parser, normalized extraction
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Dedup L1:   │ ← SHA-256 URL hash (exact match)
│ URL Hash    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Dedup L2:   │ ← Dice coefficient on trigrams (>70% = dupe)
│ Trigram Sim │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────────────┐
│ Stage 1: BitNet (fast, local)               │
│ 4 parallel checks:                          │
│ ├── Relevance (AI/crypto/DeFi? yes/no)      │
│ ├── Category (ai|crypto|defi|ai_crypto)     │
│ ├── Sentiment (bullish|bearish|neutral)     │
│ └── Importance (1-10 scale)                 │
└──────────────┬──────────────────────────────┘
               │ Gate: relevant=true AND importance >= 4
               ▼
┌─────────────┐
│ Dedup L3:   │ ← BitNet semantic comparison
│ Semantic    │   "Same story?" yes/no
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────────────┐
│ Stage 2: Claude Sonnet 4                    │
│ Structured JSON output:                     │
│ ├── summary (2-3 sentences)                 │
│ ├── keyTakeaways (5 bullets)                │
│ ├── category, sentiment, importance         │
│ ├── tags (max 10)                           │
│ ├── isActionable (should traders act?)      │
│ └── metaTitle + metaDescription (SEO)       │
└──────────────┬──────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────┐
│ Stage 3: Ticker Enrichment                  │
│ ├── Detect ticker symbols (BTC, ETH, SOL…)  │
│ └── Fetch current prices + 24h change       │
└──────────────┬──────────────────────────────┘
               │
               ▼
Published Feed (finalScore = 70% Claude + 30% BitNet)
```
Tech Stack
| Layer | Tech | Why |
|---|---|---|
| Backend framework | NestJS 10 | Modular, decorator-driven, great for pipeline services |
| Database | MongoDB 8 + Mongoose | Flexible schema for evolving article fields, text indexes |
| Job queue | Bull 4 (Redis-backed) | Retry logic, exponential backoff, job cleanup |
| Fast AI model | BitNet (local) | Sub-second inference, zero API cost, runs on CPU |
| Deep AI model | Claude Sonnet 4 (API) | Structured JSON extraction, reliable categorization |
| Frontend | Next.js 15 (App Router) | SSR, ISR for SEO, React Server Components |
| Scheduling | NestJS Schedule | Cron-based RSS fetch intervals |
Stage 1: BitNet — The Fast Gatekeeper
BitNet is a local model that runs on CPU. It's not as smart as Claude, but it's fast — we run 4 checks in parallel and get results in under a second.
```typescript
const [relevance, classification, sentiment, importanceResult] = await Promise.all([
  this.bitnetService.checkRelevance(text),
  this.bitnetService.classify(text),
  this.bitnetService.analyzeSentiment(text),
  this.bitnetService.scoreImportance(text),
]);
```
Each check sends a structured prompt to the BitNet completion endpoint and parses the response:
- Relevance → "Yes" or "No" + confidence percentage
- Category → one of `ai`, `crypto`, `defi`, or `ai_crypto` (mixed)
- Sentiment → `bullish`, `neutral`, or `bearish`, based on market impact
- Importance → numeric score from 1 to 10
The gate logic is simple: if `isRelevant === false` or `importance < 4`, the article stops here. It gets marked `isPublished: false` and never reaches Stage 2. This saves Claude API costs by filtering out ~60-70% of articles at the cheapest possible layer.
Fail-open design: if BitNet is down or errors out, we default to `isRelevant: true` and `importance: 5` — letting Claude make the final call. We'd rather process a few extra articles than miss breaking news.
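The gate plus the fail-open default can be sketched as one pure function. This is an illustrative sketch, not the production code — `passesGate`, `StageOneResult`, and the inline threshold are names we've assumed for clarity:

```typescript
interface StageOneResult {
  isRelevant: boolean;
  importance: number;
}

// Sketch of the Stage 1 gate. A null result means BitNet was down or
// returned something unparsable; we then fail open with borderline
// defaults (relevant, importance 5) so Claude can make the final call.
function passesGate(result: StageOneResult | null): boolean {
  const { isRelevant, importance } = result ?? { isRelevant: true, importance: 5 };
  return isRelevant && importance >= 4;
}
```

Note that the fail-open default of 5 deliberately sits above the gate threshold of 4, so an outage never silently drops articles.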
Three Levels of Deduplication
Crypto news has a severe duplication problem. When the SEC approves a Bitcoin ETF, you'll get 40+ articles in an hour. We handle this at three levels:
Level 1: URL Hash
```
SHA-256(normalize(url)) → unique constraint in MongoDB
```
Catches exact duplicates from the same source. Zero-cost, handled at insert time.
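A minimal sketch of that key, assuming normalization strips fragments, common tracking parameters, and trailing slashes — the production rule set may differ:

```typescript
import { createHash } from 'crypto';

// Hypothetical normalizer: the goal is that syndicated links to the
// same page collapse to a single hash value.
function urlHash(rawUrl: string): string {
  const url = new URL(rawUrl);
  url.hash = ''; // drop #fragment
  for (const p of ['utm_source', 'utm_medium', 'utm_campaign', 'ref']) {
    url.searchParams.delete(p); // drop tracking noise
  }
  const normalized = url.toString().replace(/\/+$/, '');
  return createHash('sha256').update(normalized).digest('hex');
}
```

Because the hash column carries a unique MongoDB index, a duplicate simply fails at insert time — no lookup query needed.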
Level 2: Trigram Similarity
We compute the Dice coefficient on character trigrams of article titles:
```typescript
private trigramSimilarity(a: string, b: string): number {
  const trigramsA = this.getTrigrams(a);
  const trigramsB = this.getTrigrams(b);
  let intersection = 0;
  for (const trigram of trigramsA) {
    if (trigramsB.has(trigram)) intersection++;
  }
  // Dice coefficient: 2·|A ∩ B| / (|A| + |B|)
  return (2 * intersection) / (trigramsA.size + trigramsB.size);
}
```
Threshold: >70% similarity = duplicate. Candidates come from the past 48 hours, excluding the current article by `_id` rather than by title — so articles with identical titles from different sources are still correctly detected. This catches the "same headline, different outlet" pattern that dominates crypto media.
Level 3: Semantic Similarity
For candidates that pass a loose trigram filter (>35% match), we ask BitNet directly:
```
Title A: "SEC Approves First Spot Bitcoin ETF in Historic Decision"
Title B: "Bitcoin ETF Finally Gets Green Light from US Regulators"
Answer: YES (same story) or NO (different stories)
```
This catches semantically identical stories with completely different wording — something trigram matching alone can't handle.
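In practice the model's reply needs defensive parsing. A sketch of that flow, with `askBitNet` standing in for the real completion-endpoint wrapper (the name and prompt wiring are assumptions):

```typescript
// Anything that doesn't clearly start with YES counts as "not a dupe" —
// consistent with the fail-open philosophy: when uncertain, keep the article.
function parseSameStoryAnswer(raw: string): boolean {
  return raw.trim().toUpperCase().startsWith('YES');
}

async function isSemanticDuplicate(
  titleA: string,
  titleB: string,
  askBitNet: (prompt: string) => Promise<string>, // hypothetical wrapper
): Promise<boolean> {
  const prompt =
    `Title A: "${titleA}"\n` +
    `Title B: "${titleB}"\n` +
    `Answer: YES (same story) or NO (different stories)`;
  return parseSameStoryAnswer(await askBitNet(prompt));
}
```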
Defense in Depth: Query-Time & Frontend Dedup
Even with three ingestion-time levels, legacy duplicates may already exist in the database. We add two more safety nets:
- Query-time dedup — the `findAll` endpoint over-fetches articles and deduplicates by `originalTitle` before returning results. The first (highest-ranked) article per title wins.
- Frontend dedup — the segment page filters articles by title before rendering, ensuring no visual duplicates even if the API returns them.
Stage 2: Claude — The Deep Analyst
Articles that survive the BitNet gate and all three dedup checks get sent to Claude Sonnet 4 for structured analysis.
The prompt asks Claude to return a JSON object with:
| Field | Type | Purpose |
|---|---|---|
| `summary` | string | 2-3 sentence factual summary |
| `keyTakeaways` | string[] | 5 bullet points |
| `category` | enum | `ai` \| `crypto` \| `defi` \| `ai_crypto` |
| `sentiment` | enum | `bullish` \| `bearish` \| `neutral` |
| `importance` | 1-10 | 1-3 minor, 4-5 noteworthy, 6-7 significant, 8-9 breaking, 10 defining |
| `tags` | string[] | Up to 10 lowercase tags |
| `isActionable` | boolean | Should traders consider taking action? |
| `metaTitle` | string | SEO-optimized title (<70 chars) |
| `metaDescription` | string | SEO meta description (<160 chars) |
Post-processing validates and sanitizes everything: caps importance to 1-10, validates sentiment against the enum, limits tags to 10, truncates SEO fields to character limits.
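That validation layer is mostly clamping and truncation. A sketch under assumed field names (the real schema may differ slightly):

```typescript
const SENTIMENTS = ['bullish', 'bearish', 'neutral'] as const;

interface SanitizedAnalysis {
  sentiment: (typeof SENTIMENTS)[number];
  importance: number;
  tags: string[];
  metaTitle: string;
  metaDescription: string;
}

// Never trust model output: clamp numbers, whitelist enums, cap lengths.
function sanitizeAnalysis(raw: Record<string, unknown>): SanitizedAnalysis {
  const sentiment = SENTIMENTS.includes(raw.sentiment as any)
    ? (raw.sentiment as SanitizedAnalysis['sentiment'])
    : 'neutral'; // unknown values fall back to neutral
  const importance = Math.min(10, Math.max(1, Math.round(Number(raw.importance) || 5)));
  const tags = (Array.isArray(raw.tags) ? raw.tags : [])
    .map((t) => String(t).toLowerCase())
    .slice(0, 10);
  return {
    sentiment,
    importance,
    tags,
    metaTitle: String(raw.metaTitle ?? '').slice(0, 70),
    metaDescription: String(raw.metaDescription ?? '').slice(0, 160),
  };
}
```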
Fallback strategy: if Claude fails (rate limit, timeout, malformed response), we publish with BitNet data only. The article still gets out — just with less enrichment.
Final Score
```typescript
finalScore = Math.round(claudeImportance * 0.7 + bitnetImportance * 0.3);
```
Why 70/30? Claude's analysis is more nuanced and considers the full article body. BitNet only sees the first 1,500 characters. But BitNet provides a useful sanity check — if both models agree an article is important, we're more confident in the score.
Stage 3: Ticker Enrichment
After AI processing, we scan the full article text for known ticker symbols (BTC, ETH, SOL, AVAX, etc.) and fetch current prices + 24h change from market data APIs.
This enables two features in the frontend:
- Inline ticker context — readers see that an article mentioning ETH was written when ETH was at $3,200 (+4.2%)
- Trending tickers — aggregate ticker mentions across articles from the past 7 days, ranked by mention count
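Ticker detection itself is a dictionary scan, not AI. A simplified version — the real symbol list is much longer than this excerpt:

```typescript
// Small excerpt of the known-symbol dictionary, for illustration only.
const KNOWN_TICKERS = new Set(['BTC', 'ETH', 'SOL', 'AVAX']);

// Match word-bounded upper-case runs (optionally $-prefixed) and keep
// only known symbols, which filters out look-alikes such as "SEC" or "CEO".
function detectTickers(text: string): string[] {
  const found = new Set<string>();
  for (const m of text.matchAll(/\$?\b([A-Z]{2,6})\b/g)) {
    if (KNOWN_TICKERS.has(m[1])) found.add(m[1]);
  }
  return [...found];
}
```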
Query & Filtering API
The REST API exposes flexible filtering for the frontend:
```
GET /api/articles?segment=crypto&sentiment=bullish&minImportance=7&sort=importance&limit=20
```
| Parameter | Type | Description |
|---|---|---|
| `segment` | `ai` \| `crypto` \| `all` | Maps to category groups |
| `category` | string | Exact category override |
| `sentiment` | `bullish` \| `bearish` \| `neutral` | Market sentiment filter |
| `ticker` | string | Filter by mentioned ticker (e.g., `BTC`) |
| `tag` | string | Filter by tag |
| `search` | string | Full-text search (weighted: title 10x, summary 5x, tags 3x) |
| `featured` | boolean | Only featured articles |
| `minImportance` | 1-10 | Minimum importance threshold |
| `sort` | enum | Sort order (e.g., `importance`) |
Segment-to-Category Mapping
The `segment` parameter is a UX convenience — users think in terms of "AI news" or "Crypto news", but the database has more granular categories:
```typescript
SEGMENT_CATEGORIES = {
  ai: ['ai', 'ai_crypto'],
  crypto: ['crypto', 'defi', 'ai_crypto'],
  // 'all' applies no filter
};
```
Notice that `ai_crypto` appears in both segments. An article about "AI agents trading on-chain" is relevant to both audiences.
MongoDB Indexes
Performance at scale requires the right indexes. We use compound indexes that match our most common query patterns:
```typescript
{ isPublished: 1, publishedAt: -1 }                 // Default chronological feed
{ isPublished: 1, category: 1, publishedAt: -1 }    // Segment-filtered feed
{ isPublished: 1, finalScore: -1 }                  // Trending sort
{ isPublished: 1, isFeatured: 1, publishedAt: -1 }  // Featured articles
{ isPublished: 1, tickers: 1, publishedAt: -1 }     // Ticker filter
{ isPublished: 1, tags: 1, publishedAt: -1 }        // Tag filter

// Weighted text index for search
{ originalTitle: 'text', summary: 'text', tags: 'text' }
// weights: { originalTitle: 10, summary: 5, tags: 3 }
```
Every compound index starts with `isPublished: 1` because the public API never queries unpublished articles.
Frontend Integration
The Next.js frontend uses a custom `useFilters` hook that syncs filter state with URL search params:
```typescript
const { sentiment, importance, sort, setSentiment, setImportance, setSort } = useFilters();

// Changes URL to /news?sentiment=bullish&minImportance=7&sort=importance
setSentiment('bullish');
setImportance(7);
setSort('importance');
```
This makes filtered views shareable and bookmarkable. Opening a link like `/news?sentiment=bearish&sort=importance` shows exactly the filtered view the sender intended.
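The hook itself wraps Next.js router state, but the shareability comes from the URL-serialization half, which is pure and easy to sketch (function and field names here are assumptions, not the actual hook internals):

```typescript
interface Filters {
  sentiment?: string;
  minImportance?: number;
  sort?: string;
}

// Serialize filter state into a shareable /news URL.
function buildFilterUrl(filters: Filters): string {
  const params = new URLSearchParams();
  if (filters.sentiment) params.set('sentiment', filters.sentiment);
  if (filters.minImportance !== undefined) {
    params.set('minImportance', String(filters.minImportance));
  }
  if (filters.sort) params.set('sort', filters.sort);
  const qs = params.toString();
  return qs ? `/news?${qs}` : '/news';
}
```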
Results
After running this pipeline for several weeks:
- ~70% of articles filtered at Stage 1 — mostly irrelevant content, sponsored posts, and articles below importance threshold 4
- ~15% caught by deduplication — across all three levels, with Level 2 (trigram) catching the most
- ~15% published — the final curated feed
- Average processing time: ~200ms for BitNet (Stage 1), ~2-3s for Claude (Stage 2)
- Cost: ~$0.002 per article that reaches Claude. Articles filtered by BitNet cost nothing beyond compute
Key Takeaways
- Two-stage AI is the sweet spot. A fast local model as gatekeeper + a powerful API model for deep analysis gives you cost efficiency without sacrificing quality
- Deduplication needs multiple strategies. URL hash alone misses 80% of duplicates. Trigram + semantic catches almost everything at ingestion time, and query-time + frontend dedup handles any legacy data
- Fail-open is better than fail-closed for news. Missing a breaking story is worse than showing a borderline article
- Weighted scoring builds trust. When two independent models agree on importance, the signal is stronger
- Segment abstraction matters for UX. Users don't think in database categories — they think in topics
Who This Is For
- Teams building content aggregation products who need more than RSS + chronological sort
- Crypto/fintech startups looking for a production-tested AI pipeline architecture
- Engineers evaluating local vs. API models — this is a practical example of using both
The full implementation runs as part of y0.exchange, an AI-to-blockchain infrastructure platform. The news module is one piece of a larger system that connects AI agents to crypto wallets.
Built with NestJS, MongoDB, Bull, BitNet, and Claude Sonnet 4. Running in production at news.y0.exchange.