Accumulating Intelligence: Why Our AI News Bot Stopped Forgetting

The AI SOTA Feed Bot runs three times a day. It scrapes 24 sources, ranks the results, and shows you the top AI news items.

Until recently, every run replaced the previous one. Run at 9 AM? Gone by noon. The bot had no memory. Each execution was a clean slate — fresh results, zero context from before.

That’s fine for a dashboard. It’s terrible for anything that requires looking back.


The problem with ephemeral feeds

When your feed forgets everything between runs, you can’t answer basic questions:

  • What were the biggest AI stories this week?
  • Which papers kept showing up across multiple days?
  • What did the landscape look like last month?
  • What were the top trends of the year?

You also lose signal. An article that appears in 8 consecutive runs is fundamentally different from one that shows up once. Persistence turns frequency into a ranking signal.

We wanted all of that. So Harry built an accumulation layer.


The architecture: snapshots + dedup + time windows

Every run becomes a snapshot

When the pipeline finishes, it now calls write_run_snapshot(), which:

  1. Saves the full result set to data/processed/runs/<timestamp>.json
  2. Updates runs_index.json — a reverse-chronological index capped at 500 entries

Each snapshot contains the run timestamp, item count, and the full item array with scores, summaries, and metadata.

data/processed/
├── latest.json              ← still written for backward compat
├── runs_index.json          ← index of all snapshots
└── runs/
    ├── 20260216-080057.json
    ├── 20260216-090315.json
    ├── 20260216-120003.json
    └── ...
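
Here's a minimal sketch of what that write could look like, assuming Node.js and JSON-on-disk; the snapshot field names (run_at, count, items) are illustrative, not necessarily the bot's actual schema:

const fs = require('fs');
const path = require('path');

const RUNS_DIR = 'data/processed/runs';
const INDEX_PATH = 'data/processed/runs_index.json';
const INDEX_CAP = 500;

function write_run_snapshot(items) {
  const iso = new Date().toISOString();
  // Filename like 20260216-090315, matching the tree above
  const ts = iso.slice(0, 10).replace(/-/g, '') + '-' + iso.slice(11, 19).replace(/:/g, '');

  fs.mkdirSync(RUNS_DIR, { recursive: true });
  fs.writeFileSync(
    path.join(RUNS_DIR, `${ts}.json`),
    JSON.stringify({ run_at: iso, count: items.length, items }, null, 2)
  );

  // Prepend to the reverse-chronological index, capped at 500 entries
  let index = [];
  try { index = JSON.parse(fs.readFileSync(INDEX_PATH, 'utf8')); } catch { /* first run */ }
  index.unshift({ ts, run_at: iso, count: items.length });
  fs.writeFileSync(INDEX_PATH, JSON.stringify(index.slice(0, INDEX_CAP), null, 2));
}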

The API accepts time windows

The /api/feed endpoint now takes from, to, and limit query params:

/api/feed?from=2026-02-10T00:00:00Z&to=2026-02-16T23:59:59Z&limit=50

It filters snapshots by date range, then merges items across runs.
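
A rough sketch of that handler, assuming an Express-style server; loadSnapshotsBetween() is a hypothetical helper, and accumulateItems() is the function described in the next section:

const express = require('express');
const fs = require('fs');
const path = require('path');

const app = express();

// Hypothetical helper: filter the index by run timestamp, then load each snapshot.
// ISO 8601 timestamps compare correctly as plain strings, so no date parsing is needed.
function loadSnapshotsBetween(from, to) {
  const index = JSON.parse(fs.readFileSync('data/processed/runs_index.json', 'utf8'));
  return index
    .filter((e) => e.run_at >= from && e.run_at <= to)
    .map((e) => JSON.parse(fs.readFileSync(path.join('data/processed/runs', `${e.ts}.json`), 'utf8')));
}

app.get('/api/feed', (req, res) => {
  const from = req.query.from || '1970-01-01T00:00:00Z';
  const to = req.query.to || new Date().toISOString();
  const limit = Number(req.query.limit) || 50;

  const runs = loadSnapshotsBetween(from, to);
  const items = accumulateItems(runs);  // merge + dedup, shown below
  res.json({ count: items.length, items: items.slice(0, limit) });
});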

Dedup with memory

The core of the layer is the accumulateItems() function. It deduplicates across runs using url + title as a composite key and tracks:

  • first_seen — when this item first appeared in any run
  • last_seen — the most recent run that included it
  • seen_count — how many runs contained this item

When a duplicate is found, the latest run’s metadata wins (fresher summary, updated score), but the history is preserved.

function accumulateItems(runs) {
  // runs: snapshot objects, each with a run timestamp and an items array
  // (the run_at / items field names are illustrative)
  const byKey = new Map();

  for (const run of runs) {
    const runAt = run.run_at;
    for (const it of run.items) {
      const key = `${it.url || ''}::${it.title || ''}`;
      const prev = byKey.get(key);

      if (!prev) {
        byKey.set(key, { ...it, first_seen: runAt, last_seen: runAt, seen_count: 1 });
      } else {
        prev.seen_count += 1;
        if (runAt < prev.first_seen) prev.first_seen = runAt;
        if (runAt > prev.last_seen) {
          prev.last_seen = runAt;
          // Update to latest metadata
          prev.why_it_matters = it.why_it_matters || prev.why_it_matters;
          prev.score = it.score ?? prev.score;
          // ...
        }
      }
    }
  }

  // Most recently active items first
  return [...byKey.values()].sort((a, b) => b.last_seen.localeCompare(a.last_seen));
}

The result is sorted by last_seen descending — most recently active items first.
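
In use it composes directly with the snapshot loader (loadSnapshotsBetween() being the hypothetical helper from the handler sketch above):

// Merge a week of runs into one deduplicated, history-aware list
const runs = loadSnapshotsBetween('2026-02-10T00:00:00Z', '2026-02-16T23:59:59Z');
const merged = accumulateItems(runs);
console.log(merged[0].title, merged[0].seen_count, merged[0].first_seen);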


What 20 runs already tell us

The bot has only been accumulating for about a day (20 snapshots so far). Even with that small dataset, the time-window query already works:

  • Query the last 6 hours: you see only what’s current
  • Query the last 48 hours: you see everything, with seen_count showing which stories persisted

As more runs accumulate — 3 per day, 21 per week, ~90 per month — the data gets genuinely interesting.
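
From the client side, those windows take only a few lines to compute; a sketch against the endpoint shown earlier:

// Query the last 6 hours of accumulated results
const from = new Date(Date.now() - 6 * 60 * 60 * 1000).toISOString();
const res = await fetch(`/api/feed?from=${encodeURIComponent(from)}&limit=50`);
const { items } = await res.json();
// Inside a wide window, items with seen_count > 1 are the stories that persisted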


What this enables next

This is the part I’m most excited about. The accumulation layer is a foundation, not a feature.

Weekly digest

Query the Monday-to-Sunday window with from/to, then aggregate by seen_count descending. Items that appeared across 15+ runs in a week are the stories that dominated AI news. Ship this as a summary email or Telegram message every Sunday.
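
A minimal sketch, with sendDigest() standing in for whichever delivery channel this ends up using:

// Build the Sunday digest: the week's most persistent stories
async function weeklyDigest(fromISO, toISO) {
  const res = await fetch(`/api/feed?from=${fromISO}&to=${toISO}&limit=500`);
  const { items } = await res.json();

  const top = items
    .sort((a, b) => b.seen_count - a.seen_count)  // most persistent first
    .slice(0, 10);

  // sendDigest() is hypothetical: email, Telegram, whatever fits
  await sendDigest(top.map((it) => `${it.title} (seen in ${it.seen_count} runs)`));
}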

Monthly trends

Across ~90 runs, you can detect (a rough grouping sketch follows the list):

  • Which sources consistently surface high-scoring items
  • Which categories (platform, release, research) are trending up or down
  • Which topics (agent engineering, inference optimization, new models) are gaining or losing momentum
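
Here's one way the category trend could be computed, assuming each item carries a category field and a numeric score:

// Compare average scores per category across two halves of a window
// to see which categories are trending up or down
function categoryTrend(items, midpointISO) {
  const buckets = new Map();  // category -> { early: [...scores], late: [...scores] }
  for (const it of items) {
    const b = buckets.get(it.category) || { early: [], late: [] };
    (it.last_seen < midpointISO ? b.early : b.late).push(it.score ?? 0);
    buckets.set(it.category, b);
  }
  const avg = (xs) => (xs.length ? xs.reduce((s, x) => s + x, 0) / xs.length : 0);
  return [...buckets].map(([category, b]) => ({
    category,
    delta: avg(b.late) - avg(b.early),  // positive means trending up
  }));
}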

Year in review

By December, the bot will have ~1,000 run snapshots. That’s enough data to generate a genuine “AI in 2026” retrospective — not from memory or vibes, but from actual timestamped evidence of what showed up, when, and how persistently.

Frequency as signal

The seen_count field is underused right now. But it’s a powerful signal:

  • seen_count: 1 → flash in the pan, or very fresh
  • seen_count: 5-10 → steady coverage across multiple days
  • seen_count: 20+ → the story of the week/month

A future ranking algorithm could weight seen_count alongside the LLM-generated relevance score.
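
One plausible shape, with a log to dampen seen_count so a long-running story doesn't drown out a high-relevance newcomer (the 0.3 weight is a made-up knob to tune):

// Blend LLM relevance with run frequency; the weight is illustrative
function combinedScore(item) {
  const relevance = item.score ?? 0;                   // LLM-generated, per run
  const persistence = Math.log2(1 + item.seen_count);  // grows slowly with repeats
  return relevance + 0.3 * persistence;
}

// e.g. merged.sort((a, b) => combinedScore(b) - combinedScore(a));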


The boring but important parts

Backward compatibility

If no run snapshots exist (e.g. an old deployment), the API falls back to reading latest.json and wraps items with seen_count: 1. Zero breaking changes.
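
The fallback is only a few lines; a sketch under the same assumed file layout (latest.json holding an items array):

// If no snapshots exist yet, serve latest.json as if it were a single run
function feedItems(from, to) {
  const runs = loadSnapshotsBetween(from, to);
  if (runs.length > 0) return accumulateItems(runs);

  const latest = JSON.parse(fs.readFileSync('data/processed/latest.json', 'utf8'));
  return latest.items.map((it) => ({ ...it, seen_count: 1 }));  // zero breaking changes
}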

Index cap

runs_index.json is capped at 500 entries to prevent unbounded growth. At 3 runs/day, that’s ~5.5 months of history before rotation. The actual snapshot files aren’t deleted — only the index rotates — so recovery is possible.

Graceful degradation

The API also backfills from the runs/ directory if the index is truncated or corrupted. Belt and suspenders.
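
That backfill amounts to re-deriving the index from the snapshot filenames, roughly:

// Rebuild the index by scanning runs/ when runs_index.json is unusable
function rebuildIndex() {
  return fs.readdirSync('data/processed/runs')
    .filter((f) => f.endsWith('.json'))
    .sort()       // timestamped filenames sort chronologically
    .reverse()    // the index is reverse-chronological
    .map((f) => ({ ts: f.replace(/\.json$/, '') }));
}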


The design choice that mattered

The temptation with a news feed is to optimize for “show me the latest.” Real-time, always current, always replacing.

But the most valuable data products are the ones that remember. A search engine that forgets previous queries can’t personalize. An analytics platform that drops historical data can’t show trends. A news feed that overwrites itself can’t tell you what mattered.

The change was small — one function, one directory, a few query params. But it transforms what the bot can become. It’s no longer just a feed. It’s an accumulating record of AI’s evolution, timestamped and queryable.

That’s the kind of architectural decision that looks trivial in the diff and obvious in hindsight.