Building an AI News Bot in 2 Days — With AI Agents Doing the Work
I’m an AI platform engineer. The field moves at a pace that’s genuinely hard to keep up with. New papers on arXiv every morning. Blog posts from OpenAI, Anthropic, Google dropping without warning. Open source releases, HN discussions, practitioner hot takes — all happening simultaneously.
I wasn’t just behind. I was missing things that directly mattered to my work.
So I decided to build a personal AI news curator. Something that aggregates from 24+ sources, understands my specific interests (agent engineering, coding agents, AI-powered developer tooling — not generic AI hype), and delivers a daily digest to my Telegram. Cheap. Fully under my control.
The twist: I built it with AI agents.
The Setup
I run two AI agents on my local machine:
- Harry — my main coding agent, sharp and efficient
- Hermione — my planning and organizing agent, methodical and strategic
I told Harry: “I want to build an AI news bot that gives me the latest, state-of-the-art information applicable to my work.” And then we just… built it. Over two days of back-and-forth conversation.
Day 1: Zero to Working Prototype
Starting Simple
My first instinct was to design a full application. Harry had a better idea: build the core information gathering and publishing pipeline first, using just a GitHub repo and Python scripts. No frameworks, no infrastructure, no over-engineering.
We created a private repo and got to work.
The Collector Layer
Harry built collectors for RSS feeds, sitemaps, and arXiv. Sources are defined in a YAML config — dead simple to add, remove, or modify. The first run pulled about 590 raw items from across the web.
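Roughly, the collector layer looks like this. A minimal sketch of a YAML-driven RSS collector; the file name, config shape, and libraries (feedparser, PyYAML) are illustrative, not necessarily what the repo actually uses:

```python
# Minimal sketch of a YAML-driven RSS collector.
# File names, field names, and URLs are illustrative.
import feedparser  # pip install feedparser
import yaml        # pip install pyyaml


def load_sources(path="sources.yaml"):
    """Read the source list from a YAML config, e.g.:
    sources:
      - name: anthropic-news
        type: rss
        url: https://example.com/feed.xml
    """
    with open(path) as f:
        return yaml.safe_load(f)["sources"]


def collect_rss(source):
    """Pull raw items from one RSS feed into a common dict shape."""
    feed = feedparser.parse(source["url"])
    return [
        {
            "source": source["name"],
            "title": entry.get("title", ""),
            "url": entry.get("link", ""),
            "published": entry.get("published", ""),
        }
        for entry in feed.entries
    ]


if __name__ == "__main__":
    items = []
    for src in load_sources():
        if src.get("type") == "rss":
            items.extend(collect_rss(src))
    print(f"collected {len(items)} raw items")
```

Adding a source is one new YAML entry; removing one is deleting a line.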
The First Output Was Terrible
I’m not going to sugarcoat it. The first digest that landed in my Telegram was mostly old vLLM release tags and Triton server versions.

Not bad content per se — but not what I needed. My day-to-day is about agent harnesses, coding agents, and practical application of AI to engineering workflows. Low-level inference optimization and infrastructure release notes are interesting, but they’re not actionable for me. I wanted content I could actually use the next day.
This is where working with an agent gets interesting.
The Iteration Loop
I’d look at the Telegram output, paste it back into our conversation, and say “this looks bad because X.” Harry would diagnose, fix, and re-run. End-to-end cycle time: minutes.
We went through roughly 10–15 iterations on Day 1:
- Freshness weighting — a 3-day-old article in this field might as well be ancient
- Source quality signals — not all sources are equal
- Telegram HTML formatting — switched from plain text to structured HTML with emojis and sections. Massive readability improvement (see the publishing sketch after this list)
- Source expansion — added Hacker News, Simon Willison’s blog, Anthropic newsroom, project-specific changelogs (Codex, Claude Code)
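For flavor, here is roughly what the Telegram publishing step looks like using the Bot API's HTML parse mode. The message layout and environment variable names are illustrative, not the digest's exact template:

```python
# Sketch of the Telegram publishing step via the Bot API's HTML parse mode.
# Bot token and chat id come from the environment; layout is illustrative.
import html
import os

import requests  # pip install requests


def publish_digest(items):
    lines = ["<b>🤖 Daily AI Digest</b>", ""]
    for item in items:
        title = html.escape(item["title"])  # escape so titles can't break the HTML
        lines.append(f'• <a href="{item["url"]}">{title}</a> <i>[{item["source"]}]</i>')
    resp = requests.post(
        f"https://api.telegram.org/bot{os.environ['TELEGRAM_BOT_TOKEN']}/sendMessage",
        json={
            "chat_id": os.environ["TELEGRAM_CHAT_ID"],
            "text": "\n".join(lines),
            "parse_mode": "HTML",              # enables <b>, <i>, <a href>
            "disable_web_page_preview": True,  # keeps the digest compact
        },
        timeout=30,
    )
    resp.raise_for_status()
```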
Each iteration took 2–5 minutes. Try that with a traditional development workflow.
The LLM Integration Saga
I wanted LLM-powered relevance scoring — not just keyword matching, but genuine understanding of what matters to an AI platform engineer.
The auth story turned into its own adventure. The pipeline needed to call LLMs from Python scripts, but I didn’t want to manage API keys.
First attempt: use OpenAI’s API directly. That requires OPENAI_API_KEY — exactly what I wanted to avoid. Harry suggested routing through OpenClaw (my agent runtime), since it already handles OAuth. But OpenClaw’s model calls are internal to agent sessions, not a drop-in Python SDK.
Then I pointed Harry to pi-mono — an open-source AI agent toolkit by Mario Zechner (the same project that powers OpenClaw’s agent core). Its @mariozechner/pi-ai package supports OAuth login flows for OpenAI, Anthropic, Google, and more. The key insight: pi-ai can authenticate via browser OAuth, cache the token locally, and expose a unified LLM API that works from any script.
So Harry built a small Node.js bridge (scripts/llm_bridge.mjs) powered by pi-ai. Python scripts call the bridge for LLM scoring. One pi login command in the terminal, authenticate in the browser, and it just works. No API keys stored anywhere in the repo.
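From Python, calling the bridge is just a subprocess invocation. The JSON-over-stdin/stdout contract and the model string below are assumptions about scripts/llm_bridge.mjs, not its documented interface:

```python
# Sketch of the Python side shelling out to the pi-ai bridge.
# The JSON-over-stdin/stdout contract shown here is an assumption.
import json
import subprocess


def score_with_bridge(title, summary, model="anthropic/claude-haiku-4-5"):
    request = {
        "model": model,  # illustrative model string
        "prompt": (
            "Rate 0-10 how relevant this is to an AI platform engineer "
            "working on agent harnesses and coding agents.\n\n"
            f"Title: {title}\nSummary: {summary}\n\nAnswer with a number."
        ),
    }
    proc = subprocess.run(
        ["node", "scripts/llm_bridge.mjs"],
        input=json.dumps(request),
        capture_output=True,
        text=True,
        check=True,
    )
    # Assume the bridge replies with {"text": "<model output>"}
    return float(json.loads(proc.stdout)["text"].strip())
```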
We eventually settled on Claude Haiku 4.5 through Anthropic’s API (with a token for simplicity) as the scoring model — cheap, fast, good enough for relevance labeling. But the pi-ai bridge remains the architecture, ready to swap providers or switch back to pure OAuth whenever we want.
Lesson learned: LLM auth for local developer tooling is still a mess in 2025/2026. But open-source tools like pi-ai are closing the gap fast.
The Ranking Problem
Even with LLM scoring, the results weren’t right. Academic papers kept ranking above practical engineering content from OpenAI and Anthropic blogs. Why?
Static type bonuses. Papers got +0.3 to their score. Blog posts got +0.1. But the most valuable content for me was practical agent engineering posts — how someone wired up a coding agent, how a team built an agent harness, what patterns work for long-running agents. Those were mostly coming from OpenAI and Anthropic blogs, not arXiv.
The fix was obvious once we saw it: remove the static bonuses entirely. Let the LLM score everything based on relevance to my actual work — agent engineering, not theoretical research or infrastructure internals.
Day 2: The V2 Ranking Engine
Day 1’s ranking worked, but it was brittle. The scoring was a patchwork of heuristics that would break the moment my interests shifted. I wanted something more principled.
The Slot Architecture
The key constraint: you can’t feed 590 items to an LLM. Context limits, cost, and latency all rule that out. So we designed a slot-based system:
Stage A: Prefilter
Freshness windows, regex exclusions, source health checks. 590 candidates → 100.
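In code, the prefilter is nothing fancy. A sketch with an illustrative freshness window and exclusion patterns, assuming each item carries a timezone-aware published_at:

```python
# Sketch of the Stage A prefilter: drop stale items and regex-excluded
# titles before any LLM is involved. Window and patterns are illustrative.
import re
from datetime import datetime, timedelta, timezone

EXCLUDE_PATTERNS = [re.compile(p, re.I) for p in [r"\bwebinar\b", r"\bhiring\b"]]
FRESHNESS_WINDOW = timedelta(days=3)


def prefilter(items, now=None):
    now = now or datetime.now(timezone.utc)
    kept = []
    for item in items:
        if now - item["published_at"] > FRESHNESS_WINDOW:
            continue  # too old for a daily digest
        if any(p.search(item["title"]) for p in EXCLUDE_PATTERNS):
            continue  # matches an exclusion rule
        kept.append(item)
    return kept
```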
Stage B: Slot Assignment
Items get assigned to content slots: frontier_official, agent_tooling, practitioner_analysis, community_signal, research_watch. Each slot has configured sources, min/max items, and per-source caps.
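The real slot definitions live in the YAML config; their shape is roughly this (source names, floors, and caps here are examples, and two of the five slots are omitted for brevity):

```python
# Illustrative shape of the slot configuration and Stage B routing.
SLOTS = {
    "frontier_official": {
        "sources": ["openai-blog", "anthropic-news", "google-deepmind"],
        "min_items": 2,
        "max_items": 5,
        "per_source_cap": 2,
    },
    "agent_tooling": {
        "sources": ["claude-code-changelog", "codex-changelog"],
        "min_items": 1,
        "max_items": 4,
        "per_source_cap": 2,
    },
    "research_watch": {
        "sources": ["arxiv-cs-ai"],
        "min_items": 1,
        "max_items": 3,
        "per_source_cap": 3,
    },
}


def assign_slot(item):
    """Stage B: route an item to the first slot that lists its source."""
    for slot, cfg in SLOTS.items():
        if item["source"] in cfg["sources"]:
            return slot
    return None  # unrouted items drop out before scoring
```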
Stage C: In-Slot Scoring
The LLM scores the top candidates within a budget cap (currently 8 calls per run). Items that don’t get LLM-scored fall back to heuristic scoring. Per-slot selection picks the best items.
Stage D: Global Merge
Merge all slot picks, sort by score, trim to the final digest size while respecting slot minimum floors. Every slot gets representation.
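A sketch of Stages C and D, building on the illustrative SLOTS config and score_with_bridge() helper above. llm_score and heuristic_score stand in for the real scorers, and per-source caps are omitted for brevity:

```python
# Stages C and D: budget-capped LLM scoring with a heuristic fallback,
# then a global merge that honors per-slot minimum floors.
LLM_BUDGET = 8  # max LLM calls per digest run


def heuristic_score(item):
    # Stand-in: the real heuristic blends source quality and freshness.
    return item.get("freshness", 0.5)


def llm_score(item):
    # Stand-in for the pi-ai bridge call sketched earlier (normalized to 0-1).
    return score_with_bridge(item["title"], item.get("summary", "")) / 10


def score_slots(slot_items):
    """Stage C: spend the LLM budget on the most promising items first."""
    calls = 0
    for items in slot_items.values():
        for item in sorted(items, key=heuristic_score, reverse=True):
            if calls < LLM_BUDGET:
                item["score"] = llm_score(item)
                calls += 1
            else:
                item["score"] = heuristic_score(item)  # heuristic fallback
    return slot_items


def global_merge(slot_items, digest_size=16):
    """Stage D: merge slot picks while keeping every slot's minimum floor."""
    picked = []
    for slot, items in slot_items.items():
        ranked = sorted(items, key=lambda i: i["score"], reverse=True)
        picked.extend(ranked[: SLOTS[slot]["min_items"]])
    leftovers = [i for items in slot_items.values()
                 for i in items if i not in picked]
    leftovers.sort(key=lambda i: i["score"], reverse=True)
    picked.extend(leftovers[: max(0, digest_size - len(picked))])
    return sorted(picked, key=lambda i: i["score"], reverse=True)[:digest_size]
```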
This architecture means high-quality sources like OpenAI, Anthropic, and Simon Willison get guaranteed visibility, while the LLM budget stays bounded and predictable.
Dynamic Slot Meta-Rerank
The final piece: slots themselves get priority scores based on the quality of their content that day. If agent tooling releases are particularly strong today, that slot gets boosted in the global merge. If research papers are weak, research_watch drops.
slot_priority = base_bias + quality_weight × avg_llm_score + freshness_weight × avg_freshness
Dynamic. Responsive. Not hardcoded.
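As a function, with illustrative weights (the real values are config knobs) and freshness assumed to be normalized to 0–1:

```python
# The meta-rerank formula as code; weights and base_bias are config knobs.
def slot_priority(slot_cfg, slot_items,
                  quality_weight=0.6, freshness_weight=0.4):
    if not slot_items:
        return 0.0  # an empty slot can't win the merge
    avg_llm_score = sum(i["score"] for i in slot_items) / len(slot_items)
    avg_freshness = sum(i["freshness"] for i in slot_items) / len(slot_items)
    return (slot_cfg.get("base_bias", 0.0)
            + quality_weight * avg_llm_score
            + freshness_weight * avg_freshness)
```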
The Result
A daily digest of ~16 items delivered to Telegram:

- Top 5 highlighted with signal type, confidence, and suggested action
- Remaining items as compact links with source tags
- LLM cost: ~$0.01/day (8 Haiku calls)
- Sources: 24, all healthy
- Reliability: FULL_RUN_OK on every test
What I Actually Learned
Agents Are Force Multipliers, Not Replacements
I didn’t say “build me a news bot” and walk away. I was in the loop constantly — reviewing actual outputs, making judgment calls about ranking, course-correcting when the results didn’t feel right. The agent handles implementation velocity. I handle taste.
The Feedback Loop Is the Product
The most valuable thing wasn’t the code. It was the ability to look at real output, say “this sucks because X,” and have it fixed in minutes. 10–15 iterations in a single day. That feedback density is what produces quality.
Design for Configurability From Day 1
Sources in YAML. Ranking weights in config files. Slot allocations tunable. LLM budget adjustable. Every design decision is a knob, not a hardcoded constant. When I want to shift my interests or add a new source next month, it’s a config change — not a code rewrite.
LLM Costs Are Manageable If You’re Strategic
Pre-filter to 100 candidates. Budget 8 LLM calls. Use the LLM as a precision scalpel, not a sledgehammer. Total cost: about $0.01/day for genuinely personalized AI news curation.
Documentation Is Agent Memory
We documented everything: architecture diagrams, config knob guides, system state snapshots, decision logs. Not primarily for humans (though humans can read them). For agents. When Harry picks this project up next week, he reads AGENTS.md and knows exactly where we left off. No context loss. No “what were we doing again?”
The Meta Layer
Here’s the part that feels strange to write.
This blog post was created by having Hermione (my organizing agent) read through Harry’s (my coding agent’s) complete session history from the past two days. She extracted the narrative, structured the arc, and helped me draft this article.
Two agents. Two days. One working product. One blog post about building it.
I don’t know exactly what to call this workflow. But it works, and it’s only going to get more normal.
Tech Stack
| Component | Choice |
|---|---|
| Pipeline | Python (collectors, ranking, publishing) |
| Source control | GitHub (repo + Actions for scheduling) |
| Publishing | Telegram bot |
| LLM scoring | Claude Haiku 4.5 (~$0.01/day) |
| Agent orchestration | OpenClaw |
| This blog | Astro on Vercel |