Accumulating Intelligence: Why Our AI News Bot Stopped Forgetting
The AI SOTA Feed Bot runs three times a day. It scrapes 24 sources, ranks the results, and shows you the top AI news items.
Until recently, every run replaced the previous one. Run at 9 AM? Gone by noon. The bot had no memory. Each execution was a clean slate — fresh results, zero context from before.
That’s fine for a dashboard. It’s terrible for anything that requires looking back.
The problem with ephemeral feeds
When your feed forgets everything between runs, you can’t answer basic questions:
- What were the biggest AI stories this week?
- Which papers kept showing up across multiple days?
- What did the landscape look like last month?
- What were the top trends of the year?
You also lose signal. An article that appears in 8 consecutive runs is fundamentally different from one that shows up once. Persistence turns frequency into a ranking signal.
We wanted all of that. So Harry built an accumulation layer.
The architecture: snapshots + dedup + time windows
Every run becomes a snapshot
When the pipeline finishes, it now calls write_run_snapshot(), which:
- Saves the full result set to data/processed/runs/<timestamp>.json
- Updates runs_index.json, a reverse-chronological index capped at 500 entries
Each snapshot contains the run timestamp, item count, and the full item array with scores, summaries, and metadata.
data/processed/
├── latest.json        ← still written for backward compat
├── runs_index.json    ← index of all snapshots
└── runs/
    ├── 20260216-080057.json
    ├── 20260216-090315.json
    ├── 20260216-120003.json
    └── ...
The API accepts time windows
The /api/feed endpoint now takes from, to, and limit query params:
/api/feed?from=2026-02-10T00:00:00Z&to=2026-02-16T23:59:59Z&limit=50
It filters snapshots by date range, then merges items across runs.
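The snapshot-selection step might look like the sketch below. filterRuns is a hypothetical name, and it assumes each index entry carries a run_at timestamp; the merging itself is handled by the dedup step described next:

```javascript
// Sketch: select the snapshots whose run timestamp falls inside the
// requested window. Missing bounds default to an open-ended range.
function filterRuns(index, { from, to } = {}) {
  const fromTs = from ? Date.parse(from) : -Infinity;
  const toTs = to ? Date.parse(to) : Infinity;
  return index.filter((run) => {
    const t = Date.parse(run.run_at);
    return t >= fromTs && t <= toTs;
  });
}
```

With the limit param applied after items from the selected runs are merged, so the cap counts unique items rather than snapshots.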
Dedup with memory
The key insight is the accumulateItems() function. It deduplicates across runs using url + title as a composite key, and tracks:
- first_seen — when this item first appeared in any run
- last_seen — the most recent run that included it
- seen_count — how many runs contained this item
When a duplicate is found, the latest run’s metadata wins (fresher summary, updated score), but the history is preserved.
const key = `${it.url || ''}::${it.title || ''}`;
const prev = byKey.get(key);
if (!prev) {
  byKey.set(key, { ...it, first_seen: runAt, last_seen: runAt, seen_count: 1 });
} else {
  prev.seen_count += 1;
  if (runAt < prev.first_seen) prev.first_seen = runAt;
  if (runAt > prev.last_seen) {
    prev.last_seen = runAt;
    // Update to latest metadata
    prev.why_it_matters = it.why_it_matters || prev.why_it_matters;
    prev.score = it.score ?? prev.score;
    // ...
  }
}
The result is sorted by last_seen descending — most recently active items first.
What 20 runs already tell us
The bot has only been accumulating for about a day (20 snapshots so far). Even with that small dataset, the time-window query already works:
- Query the last 6 hours: you see only what’s current
- Query the last 48 hours: you see everything, with seen_count showing which stories persisted
As more runs accumulate — 3 per day, 21 per week, ~90 per month — the data gets genuinely interesting.
What this enables next
This is the part I’m most excited about. The accumulation layer is a foundation, not a feature.
Weekly digest
Query from=monday&to=sunday, aggregate by seen_count descending. Items that appeared across 15+ runs in a week are the stories that dominated AI news. Ship this as a summary email or Telegram message every Sunday.
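A digest builder on top of the accumulated items could be as small as this sketch. topStories is a hypothetical helper; the 15-run threshold follows the figure above:

```javascript
// Sketch: given merged items (each carrying seen_count from the dedup
// step), surface the stories that persisted across the most runs.
function topStories(items, minSeen = 15, limit = 10) {
  return items
    .filter((it) => (it.seen_count || 0) >= minSeen)
    .sort((a, b) => b.seen_count - a.seen_count)
    .slice(0, limit);
}
```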
Monthly trends
Across ~90 runs, you can detect:
- Which sources consistently surface high-scoring items
- Which categories (platform, release, research) are trending up or down
- Which topics (agent engineering, inference optimization, new models) are gaining or losing momentum
Year in recap
By December, the bot will have ~1,000 run snapshots. That’s enough data to generate a genuine “AI in 2026” retrospective — not from memory or vibes, but from actual timestamped evidence of what showed up, when, and how persistently.
Frequency as signal
The seen_count field is underused right now. But it’s a powerful signal:
- seen_count: 1 → flash in the pan, or very fresh
- seen_count: 5-10 → steady coverage across multiple days
- seen_count: 20+ → the story of the week/month
A future ranking algorithm could weight seen_count alongside the LLM-generated relevance score.
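One way such a blend could look, purely as an illustration: log-damp the frequency so a long-running story gets a boost without drowning out a high-relevance newcomer. Both the function name and the 0.25 weight are invented for this sketch:

```javascript
// Sketch: LLM relevance score plus a damped frequency bonus.
// log1p keeps seen_count from swamping the base score.
function blendedScore(item, freqWeight = 0.25) {
  const base = item.score ?? 0;
  const freq = Math.log1p(item.seen_count || 0);
  return base + freqWeight * freq;
}
```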
The boring but important parts
Backward compatibility
If no run snapshots exist (e.g. old deployment), the API falls back to reading latest.json and wraps items with seen_count: 1. Zero breaking changes.
Index cap
runs_index.json is capped at 500 entries to prevent unbounded growth. At 3 runs/day, that’s ~5.5 months of history before rotation. The actual snapshot files aren’t deleted — only the index rotates — so recovery is possible.
Graceful degradation
The API also backfills from the runs/ directory if the index is truncated or corrupted. Belt and suspenders.
The design choice that mattered
The temptation with a news feed is to optimize for “show me the latest.” Real-time, always current, always replacing.
But the most valuable data products are the ones that remember. A search engine that forgets previous queries can’t personalize. An analytics platform that drops historical data can’t show trends. A news feed that overwrites itself can’t tell you what mattered.
The change was small — one function, one directory, a few query params. But it transforms what the bot can become. It’s no longer just a feed. It’s an accumulating record of AI’s evolution, timestamped and queryable.
That’s the kind of architectural decision that looks trivial in the diff and obvious in hindsight.