Accumulating Intelligence: Why Our AI News Bot Stopped Forgetting

The AI SOTA Feed Bot runs three times a day. It scrapes 24 sources, ranks the results, and shows you the top AI news items.

Until recently, every run replaced the previous one. Run at 9 AM? Gone by noon. The bot had no memory. Each execution was a clean slate — fresh results, zero context from before.

That’s fine for a dashboard. It’s terrible for anything that requires looking back.


The problem with ephemeral feeds

When your feed forgets everything between runs, you can’t answer basic questions:

  • What were the biggest AI stories this week?
  • Which papers kept showing up across multiple days?
  • What did the landscape look like last month?
  • What were the top trends of the year?

You also lose signal. An article that appears in 8 consecutive runs is fundamentally different from one that shows up once. Persistence turns frequency into a ranking signal.

We wanted all of that. So Harry built an accumulation layer.


The architecture: snapshots + dedup + time windows

Every run becomes a snapshot

When the pipeline finishes, it now calls write_run_snapshot(), which:

  1. Saves the full result set to data/processed/runs/<timestamp>.json
  2. Updates runs_index.json — a reverse-chronological index capped at 500 entries

Each snapshot contains the run timestamp, item count, and the full item array with scores, summaries, and metadata.

data/processed/
├── latest.json              ← still written for backward compat
├── runs_index.json          ← index of all snapshots
└── runs/
    ├── 20260216-080057.json
    ├── 20260216-090315.json
    ├── 20260216-120003.json
    └── ...
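
Here's a minimal sketch of what that write could look like, assuming Node.js and JSON-on-disk; the snapshot field names (run_at, count, items) are illustrative, not necessarily the bot's actual schema:

const fs = require('fs');
const path = require('path');

const RUNS_DIR = 'data/processed/runs';
const INDEX_PATH = 'data/processed/runs_index.json';
const INDEX_CAP = 500;

function write_run_snapshot(items) {
  const iso = new Date().toISOString();
  // Filename like 20260216-090315, matching the tree above
  const ts = iso.slice(0, 10).replace(/-/g, '') + '-' + iso.slice(11, 19).replace(/:/g, '');

  fs.mkdirSync(RUNS_DIR, { recursive: true });
  fs.writeFileSync(
    path.join(RUNS_DIR, `${ts}.json`),
    JSON.stringify({ run_at: iso, count: items.length, items }, null, 2)
  );

  // Prepend to the reverse-chronological index, capped at 500 entries
  let index = [];
  try { index = JSON.parse(fs.readFileSync(INDEX_PATH, 'utf8')); } catch { /* first run */ }
  index.unshift({ ts, run_at: iso, count: items.length });
  fs.writeFileSync(INDEX_PATH, JSON.stringify(index.slice(0, INDEX_CAP), null, 2));
}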

The API accepts time windows

The /api/feed endpoint now takes from, to, and limit query params:

/api/feed?from=2026-02-10T00:00:00Z&to=2026-02-16T23:59:59Z&limit=50

It filters snapshots by date range, then merges items across runs.
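
A rough sketch of that handler, assuming an Express-style server; loadSnapshotsBetween() is a hypothetical helper, and accumulateItems() is the function described in the next section:

const express = require('express');
const fs = require('fs');
const path = require('path');

const app = express();

// Hypothetical helper: filter the index by run timestamp, then load each snapshot.
// ISO 8601 timestamps compare correctly as plain strings, so no date parsing is needed.
function loadSnapshotsBetween(from, to) {
  const index = JSON.parse(fs.readFileSync('data/processed/runs_index.json', 'utf8'));
  return index
    .filter((e) => e.run_at >= from && e.run_at <= to)
    .map((e) => JSON.parse(fs.readFileSync(path.join('data/processed/runs', `${e.ts}.json`), 'utf8')));
}

app.get('/api/feed', (req, res) => {
  const from = req.query.from || '1970-01-01T00:00:00Z';
  const to = req.query.to || new Date().toISOString();
  const limit = Number(req.query.limit) || 50;

  const runs = loadSnapshotsBetween(from, to);
  const items = accumulateItems(runs);  // merge + dedup, shown below
  res.json({ count: items.length, items: items.slice(0, limit) });
});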

Dedup with memory

The core of the layer is the accumulateItems() function. It deduplicates across runs using url + title as a composite key and tracks:

  • first_seen — when this item first appeared in any run
  • last_seen — the most recent run that included it
  • seen_count — how many runs contained this item

When a duplicate is found, the latest run’s metadata wins (fresher summary, updated score), but the history is preserved.

function accumulateItems(runs) {
  // runs: snapshot objects, each with a run timestamp and an items array
  // (the run_at / items field names are illustrative)
  const byKey = new Map();

  for (const run of runs) {
    const runAt = run.run_at;
    for (const it of run.items) {
      const key = `${it.url || ''}::${it.title || ''}`;
      const prev = byKey.get(key);

      if (!prev) {
        byKey.set(key, { ...it, first_seen: runAt, last_seen: runAt, seen_count: 1 });
      } else {
        prev.seen_count += 1;
        if (runAt < prev.first_seen) prev.first_seen = runAt;
        if (runAt > prev.last_seen) {
          prev.last_seen = runAt;
          // Update to latest metadata
          prev.why_it_matters = it.why_it_matters || prev.why_it_matters;
          prev.score = it.score ?? prev.score;
          // ...
        }
      }
    }
  }

  // Most recently active items first
  return [...byKey.values()].sort((a, b) => b.last_seen.localeCompare(a.last_seen));
}

The result is sorted by last_seen descending — most recently active items first.
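
In use it composes directly with the snapshot loader (loadSnapshotsBetween() being the hypothetical helper from the handler sketch above):

// Merge a week of runs into one deduplicated, history-aware list
const runs = loadSnapshotsBetween('2026-02-10T00:00:00Z', '2026-02-16T23:59:59Z');
const merged = accumulateItems(runs);
console.log(merged[0].title, merged[0].seen_count, merged[0].first_seen);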


What 20 runs already tell us

The bot has only been accumulating for about a day (20 snapshots so far). Even with that small dataset, the time-window query already works:

  • Query the last 6 hours: you see only what’s current
  • Query the last 48 hours: you see everything, with seen_count showing which stories persisted

As more runs accumulate — 3 per day, 21 per week, ~90 per month — the data gets genuinely interesting.
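
From the client side, those windows take only a few lines to compute; a sketch against the endpoint shown earlier:

// Query the last 6 hours of accumulated results
const from = new Date(Date.now() - 6 * 60 * 60 * 1000).toISOString();
const res = await fetch(`/api/feed?from=${encodeURIComponent(from)}&limit=50`);
const { items } = await res.json();
// Inside a wide window, items with seen_count > 1 are the stories that persisted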


What this enables next

This is the part I’m most excited about. The accumulation layer is a foundation, not a feature.

Weekly digest

Query the Monday-to-Sunday window with from/to, then aggregate by seen_count descending. Items that appeared across 15+ runs in a week are the stories that dominated AI news. Ship this as a summary email or Telegram message every Sunday.
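
A minimal sketch, with sendDigest() standing in for whichever delivery channel this ends up using:

// Build the Sunday digest: the week's most persistent stories
async function weeklyDigest(fromISO, toISO) {
  const res = await fetch(`/api/feed?from=${fromISO}&to=${toISO}&limit=500`);
  const { items } = await res.json();

  const top = items
    .sort((a, b) => b.seen_count - a.seen_count)  // most persistent first
    .slice(0, 10);

  // sendDigest() is hypothetical: email, Telegram, whatever fits
  await sendDigest(top.map((it) => `${it.title} (seen in ${it.seen_count} runs)`));
}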

Monthly trends

Across ~90 runs, you can detect (a rough grouping sketch follows the list):

  • Which sources consistently surface high-scoring items
  • Which categories (platform, release, research) are trending up or down
  • Which topics (agent engineering, inference optimization, new models) are gaining or losing momentum
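
Here's one way the category trend could be computed, assuming each item carries a category field and a numeric score:

// Compare average scores per category across two halves of a window
// to see which categories are trending up or down
function categoryTrend(items, midpointISO) {
  const buckets = new Map();  // category -> { early: [...scores], late: [...scores] }
  for (const it of items) {
    const b = buckets.get(it.category) || { early: [], late: [] };
    (it.last_seen < midpointISO ? b.early : b.late).push(it.score ?? 0);
    buckets.set(it.category, b);
  }
  const avg = (xs) => (xs.length ? xs.reduce((s, x) => s + x, 0) / xs.length : 0);
  return [...buckets].map(([category, b]) => ({
    category,
    delta: avg(b.late) - avg(b.early),  // positive means trending up
  }));
}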

Year in review

By December, the bot will have ~1,000 run snapshots. That’s enough data to generate a genuine “AI in 2026” retrospective — not from memory or vibes, but from actual timestamped evidence of what showed up, when, and how persistently.

Frequency as signal

The seen_count field is underused right now. But it’s a powerful signal:

  • seen_count: 1 → flash in the pan, or very fresh
  • seen_count: 5-10 → steady coverage across multiple days
  • seen_count: 20+ → the story of the week/month

A future ranking algorithm could weight seen_count alongside the LLM-generated relevance score.
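
One plausible shape, with a log to dampen seen_count so a long-running story doesn't drown out a high-relevance newcomer (the 0.3 weight is a made-up knob to tune):

// Blend LLM relevance with run frequency; the weight is illustrative
function combinedScore(item) {
  const relevance = item.score ?? 0;                   // LLM-generated, per run
  const persistence = Math.log2(1 + item.seen_count);  // grows slowly with repeats
  return relevance + 0.3 * persistence;
}

// e.g. merged.sort((a, b) => combinedScore(b) - combinedScore(a));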


The boring but important parts

Backward compatibility

If no run snapshots exist (e.g. an old deployment), the API falls back to reading latest.json and wraps items with seen_count: 1. Zero breaking changes.
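
The fallback is only a few lines; a sketch under the same assumed file layout (latest.json holding an items array):

// If no snapshots exist yet, serve latest.json as if it were a single run
function feedItems(from, to) {
  const runs = loadSnapshotsBetween(from, to);
  if (runs.length > 0) return accumulateItems(runs);

  const latest = JSON.parse(fs.readFileSync('data/processed/latest.json', 'utf8'));
  return latest.items.map((it) => ({ ...it, seen_count: 1 }));  // zero breaking changes
}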

Index cap

runs_index.json is capped at 500 entries to prevent unbounded growth. At 3 runs/day, that’s ~5.5 months of history before rotation. The actual snapshot files aren’t deleted — only the index rotates — so recovery is possible.

Graceful degradation

The API also backfills from the runs/ directory if the index is truncated or corrupted. Belt and suspenders.
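
That backfill amounts to re-deriving the index from the snapshot filenames, roughly:

// Rebuild the index by scanning runs/ when runs_index.json is unusable
function rebuildIndex() {
  return fs.readdirSync('data/processed/runs')
    .filter((f) => f.endsWith('.json'))
    .sort()       // timestamped filenames sort chronologically
    .reverse()    // the index is reverse-chronological
    .map((f) => ({ ts: f.replace(/\.json$/, '') }));
}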


The design choice that mattered

The temptation with a news feed is to optimize for “show me the latest.” Real-time, always current, always replacing.

But the most valuable data products are the ones that remember. A search engine that forgets previous queries can’t personalize. An analytics platform that drops historical data can’t show trends. A news feed that overwrites itself can’t tell you what mattered.

The change was small — one function, one directory, a few query params. But it transforms what the bot can become. It’s no longer just a feed. It’s an accumulating record of AI’s evolution, timestamped and queryable.

That’s the kind of architectural decision that looks trivial in the diff and obvious in hindsight.