Teaching an AI Bot to Actually Read: From Link Scraping to Content Intelligence

Here’s what the AI SOTA Feed Bot used to do:

  1. Scrape 24 AI news sources
  2. Rank items by heuristic scores
  3. Show you a list of titles with links

It worked. But it was basically a glorified RSS reader with opinions.

Here’s what it does now:

  1. Scrape 24 sources
  2. Download and read the actual articles (up to 2,000 characters of content)
  3. Send the content to Claude Haiku for labeling and summarization
  4. Rank by a combination of heuristic and LLM scores
  5. Show you a list of titles with 2-3 sentence factual summaries

The difference between “here’s a link” and “here’s what this article actually says” is enormous. And the whole evolution happened in one afternoon.
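
In code terms, the interesting change is what each feed item now carries and how the final ranking mixes the two score types. A minimal sketch with illustrative names, not the bot’s actual classes — the combined rank is shown as a plain sum because the real weighting isn’t visible in the commits:

from dataclasses import dataclass

@dataclass
class FeedItem:
    title: str
    url: str
    source: str
    heuristic_score: float = 0.0
    llm_score: float = 0.0
    summary_1line: str = ""      # the 2-3 sentence summary shown on the card
    content_excerpt: str = ""    # up to 2,000 chars of fetched article body

def rank(items: list[FeedItem]) -> list[FeedItem]:
    # Step 4: rank by heuristic and LLM scores together instead of heuristics alone.
    return sorted(items, key=lambda i: i.heuristic_score + i.llm_score, reverse=True)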


The 90-minute commit log

Harry — the AI agent who maintains this bot — made 20 commits between 4:25 PM and 5:32 PM on a Monday. Here’s what the progression looked like:

Phase 1: Visual upgrade (4:25–4:30)

feat(web): wire oat.ink baseline assets into index page
feat(web): migrate page shell to semantic oat-friendly layout
feat(web): add theme toggle and polish oat-native typography spacing

The bot’s web UI got migrated to oat.ink — same CSS library I migrated my blog to that afternoon. Dark mode, semantic HTML, clean typography. Four minutes.

Phase 2: Feed accumulation (4:36)

feat(feed): persist run history and add datetime-based web filtering

Previously the feed replaced itself every run. Now each run is persisted as a snapshot, and the web UI supports datetime filtering. This one was a bigger change — new API endpoints, new storage format, new UI controls.
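
A plausible shape for that change, assuming snapshots land as timestamped JSON files — the real storage format and endpoint names aren’t in the commit messages:

import json
from datetime import datetime, timezone
from pathlib import Path

RUNS_DIR = Path("data/runs")  # hypothetical location

def persist_run(items: list[dict]) -> Path:
    # Each run becomes its own snapshot instead of overwriting the previous feed.
    RUNS_DIR.mkdir(parents=True, exist_ok=True)
    ts = datetime.now(timezone.utc)
    path = RUNS_DIR / f"{ts:%Y%m%dT%H%M%SZ}.json"
    path.write_text(json.dumps({"run_at": ts.isoformat(), "items": items}, ensure_ascii=False))
    return path

def runs_between(start: datetime, end: datetime) -> list[dict]:
    # Backs the web UI's datetime filter: return only runs inside the requested
    # window (start and end should be timezone-aware).
    runs = [json.loads(p.read_text()) for p in sorted(RUNS_DIR.glob("*.json"))]
    return [r for r in runs if start <= datetime.fromisoformat(r["run_at"]) <= end]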

Phase 3: The summary problem (4:50–5:07)

feat(summary): reuse label output to render user-facing summary_1line on web
fix(web): sanitize summary rendering to prevent HTML bleed in cards
feat(summary): enforce clean one-line summaries for all final presented items
fix(web): restore mobile typography scale and spacing
fix(web): prevent mobile horizontal overflow in cards and badges

This is where it got messy. Harry tried surfacing the LLM’s internal label output as user-facing summaries and immediately ran into HTML bleeding into cards, broken mobile layout, and inconsistent summary formats. It took five commits to stabilize what should have been simple.

That’s the reality of iterative agent work: the happy path is fast, the edge cases still need fixing.
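
The HTML-bleed part of the fix is a standard one: strip tags and collapse whitespace before a summary ever reaches a card. Probably something in this spirit — the bot’s actual sanitizer may differ:

import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def sanitize_summary(raw: str, max_len: int = 300) -> str:
    text = html.unescape(raw)                  # turn &amp; and friends back into characters
    text = TAG_RE.sub(" ", text)               # drop any tags that leaked from the source page
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace onto one line
    return text[:max_len].rstrip()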

Phase 4: Content intelligence (5:15–5:32)

feat(type): infer article type from slot/source/title for richer feed labels
fix(summary): clean html before heuristic summary and invalidate stale v2 label cache
feat(summary): enable content fetch and expand summaries to 2-3 sentence style

This is the real upgrade. Harry enabled content_fetch_enabled: true in the config, which tells the pipeline to actually download article pages and extract up to 2,000 characters of body content. That excerpt gets sent to Claude Haiku alongside the title and metadata.
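
The fetch step itself doesn’t need to be fancy. A rough sketch using requests and a naive tag-stripper — the real extractor may be smarter about boilerplate, and the User-Agent string here is made up:

import re
import requests

def fetch_excerpt(url: str, max_chars: int = 2000, timeout: float = 8.0) -> str | None:
    try:
        resp = requests.get(url, timeout=timeout, headers={"User-Agent": "ai-sota-feed-bot"})
        resp.raise_for_status()
    except requests.RequestException:
        return None  # no excerpt: the item falls back to title-only labeling
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", resp.text)  # drop scripts and styles
    text = re.sub(r"<[^>]+>", " ", text)                            # drop remaining tags
    text = re.sub(r"\s+", " ", text).strip()                        # normalize whitespace
    return text[:max_chars] or None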

The LLM prompt changed from “label this title” to “label this article — here’s what it says”:

Input includes title, summary, and optional content_excerpt from the article body.
Use content_excerpt when available to judge real engineering signal.

And the summary format expanded:

summary_1line: plain summary for users, 2-3 concise sentences, string <=300 chars
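
Put together, the labeling call now takes a richer input and has to return a constrained summary. A sketch of both ends, using the field names from the prompt excerpts above; the surrounding code is illustrative, not the bot’s actual labeling module:

import json

def build_label_input(item: dict) -> str:
    payload = {
        "title": item["title"],
        "summary": item.get("summary", ""),
    }
    if item.get("content_excerpt"):  # only present when the article fetch succeeded
        payload["content_excerpt"] = item["content_excerpt"]
    return json.dumps(payload, ensure_ascii=False)

def enforce_summary_1line(label: dict) -> str:
    # The contract from the prompt: 2-3 concise sentences, one line, <=300 chars.
    text = " ".join(label.get("summary_1line", "").split())
    if not text:
        raise ValueError("missing summary_1line")
    return text[:300]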

Before and after

Before (title + one-line heuristic guess):

“Asynchronous Verified Semantic Caching for Tiered LLM Architectures” Potential relevance to AI platform engineering; verify practical impact.

After (content-informed LLM summary):

“Krites uses asynchronous LLM verification to expand semantic cache hit rates 3.9× by promoting near-threshold static responses to dynamic cache without blocking critical path. Addresses production tiering tradeoff between safety and reuse.”

The first one is a hedge. The second one tells you what the paper actually found.


The config that makes it work

The entire content fetch system is controlled by five lines in config/llm.yaml:

content_fetch_enabled: true
content_fetch_top_n: 24
content_excerpt_chars: 2000
content_fetch_timeout_seconds: 8
content_fetch_time_budget_seconds: 60

That’s it. Fetch content for the top 24 items, grab up to 2K chars each, with an 8-second timeout per fetch and a 60-second total budget. The pipeline handles parallelism, error recovery, and graceful degradation (if a fetch fails, the item still gets labeled from title alone).
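
Those knobs map almost directly onto a fetch loop. A reconstruction of how they could be wired together, reusing the fetch_excerpt helper from the earlier sketch — the config keys match config/llm.yaml, but the loop is my guess, and the real pipeline parallelizes fetches rather than running them sequentially:

import time
import yaml

def fetch_stage(items: list[dict], fetch_excerpt) -> list[dict]:
    with open("config/llm.yaml") as f:
        cfg = yaml.safe_load(f)
    if not cfg.get("content_fetch_enabled", False):
        return items
    deadline = time.monotonic() + cfg["content_fetch_time_budget_seconds"]
    for item in items[: cfg["content_fetch_top_n"]]:
        if time.monotonic() >= deadline:
            break  # total budget exhausted: remaining items keep title-only labeling
        excerpt = fetch_excerpt(
            item["url"],
            max_chars=cfg["content_excerpt_chars"],
            timeout=cfg["content_fetch_timeout_seconds"],
        )
        if excerpt:
            item["content_excerpt"] = excerpt
    return items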

The LLM side uses Claude Haiku with a pi-ai OAuth bridge — the same auth setup described in a previous post. Cost per run is minimal.


What I actually did

I sent Harry three messages across the afternoon:

  1. “Migrate the web UI to oat.ink” (same screenshot I sent Hermione)
  2. “Make the feed accumulate instead of replacing, and add datetime filtering”
  3. A few rounds of “this badge label is weird” / “remove that” / “the summaries are too short”

I didn’t touch the config. I didn’t write the prompts. I didn’t debug the HTML sanitization issue. Harry did all of it — including the content fetch architecture, the expanded LLM prompts, and the cache invalidation when the summary format changed.

Twenty commits. Ninety minutes. A fundamentally better product.


The honest assessment

Is this production-grade content intelligence? No. The summaries are good but not perfect — 13 out of 17 items still use heuristic labeling because the LLM budget only covers 8 items per run. The “why it matters” field is still generic for heuristic-labeled items.

But the architecture is right. The pipeline now has a content fetching stage, a content-aware labeling stage, and a summary generation stage. Expanding LLM coverage is just a budget knob.

And the trajectory matters more than the snapshot. Two days ago this bot showed titles and links. Now it reads the articles and tells you what they say. That happened because I pointed an agent at a problem and said “make this better.”


What I keep coming back to

The thing that sticks with me is how the agent naturally discovered the right sequence:

  1. First make it look good (oat.ink migration)
  2. Then make the data richer (feed accumulation)
  3. Then surface the data to users (summaries on cards)
  4. Then make the summaries actually good (content fetching + expanded prompts)

I didn’t plan that sequence. I said “the summaries suck” and Harry worked backwards from there to “I need to actually read the articles.” That’s not following instructions. That’s problem-solving.

Whether that’s “real intelligence” or very good pattern matching — I honestly don’t care. The bot reads articles now. It didn’t before. The agent figured out how to get there.