
Building an AI Intelligence Layer in Pure Go

Scout gathers intelligence from RSS, GitHub, Reddit, and Google News in parallel, scores it with a weighted formula, and deduplicates with Jaccard similarity — all in ~3,000 lines of Go with zero dependencies.

ai · golang · intelligence · scout · via

TL;DR

Scout is a pure Go CLI that gathers intelligence from RSS feeds, GitHub trending repos, Reddit, and Google News in parallel across 4 configurable topic presets. Every item gets scored on a weighted formula (40% relevance, 30% recency, 30% engagement) and deduplicated using Jaccard title similarity at a 0.6 threshold. The whole thing is ~3,000 lines with zero external dependencies.


The Tab Problem

I had 47 browser tabs open. arXiv papers I meant to read. A TechCrunch article about a funding round that might matter. Three Reddit threads about a new model release, all saying slightly different things. A GitHub repo someone mentioned in a Discord that I bookmarked and never went back to.

This is how I consumed AI news for months. Manual, scattered, always behind. By the time I read something, the conversation had moved on. Worse, I was building an orchestration system that could route tasks to dozens of agents — but none of those agents knew what was happening in the field they were working in. My agents could write code, research APIs, and plan architectures, but they had no idea what models shipped last week or what tools were trending on GitHub.

I needed a sensory input layer. Something that could answer "what's happening right now?" across the domains I care about, score what matters, and pipe it into the plugin ecosystem for downstream consumption.

I built Scout in a day.

Four Sources, One Interface

Scout gathers from four source types: RSS/Atom feeds, GitHub's search API, Reddit's public JSON API, and Google News via RSS. Each source maps to a different signal. RSS gives me curated editorial content — OpenAI's blog, arXiv papers, TechCrunch. GitHub surfaces what developers are actually building. Reddit captures community sentiment and discussion. Google News catches everything else.

The architectural trick is that every gatherer produces the same output type:

internal/gather/types.go
type IntelItem struct {
  ID         string    `json:"id"`
  Title      string    `json:"title"`
  Content    string    `json:"content"`
  SourceURL  string    `json:"source_url"`
  Timestamp  time.Time `json:"timestamp"`
  Score      float64   `json:"score,omitempty"`
  Engagement int       `json:"engagement,omitempty"`
}

No formal Gatherer interface is declared anywhere. Each gatherer just happens to have the same Gather(searchTerms []string) ([]IntelItem, error) signature. This is idiomatic Go — if the contract is satisfied, you don't need to spell it out. The dispatch happens in a switch statement, and the compiler catches mismatches at call sites.
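
The dispatch itself is tiny. Here's a minimal sketch of the shape, with hypothetical names (gatherSource, SourceConfig, and the concrete gatherer types are assumptions, not Scout's actual identifiers):

// Hypothetical sketch: route a configured source to the matching
// gatherer. Every branch returns the same ([]IntelItem, error) shape,
// so callers never care which source produced an item.
func gatherSource(src SourceConfig, terms []string) ([]IntelItem, error) {
  switch src.Type {
  case "rss":
      return rssGatherer{feeds: src.Feeds}.Gather(terms)
  case "github":
      return githubGatherer{}.Gather(terms)
  case "reddit":
      return redditGatherer{subreddits: src.Subreddits}.Gather(terms)
  case "web":
      return webGatherer{}.Gather(terms)
  default:
      return nil, fmt.Errorf("unknown source type: %q", src.Type)
  }
}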

What "engagement" means varies by source. For Reddit, it's post.Score + post.NumComments. For GitHub, it's the star count. For RSS feeds, it's zero — editorial content doesn't have a public engagement metric. The scoring formula handles this gracefully.

Two-Level Parallelism

Gathering runs in parallel at two levels. Topics run concurrently, and the sources within each topic also run concurrently, so a full run with the default presets can have as many as nine goroutines fetching simultaneously:

internal/cmd/gather.go
var (
  wg      sync.WaitGroup
  mu      sync.Mutex
  results []gather.IntelItem
)
for _, topic := range topics {
  wg.Add(1)
  // Pass the loop variable as an argument so each goroutine
  // captures its own topic (needed before Go 1.22).
  go func(t gather.TopicConfig) {
      defer wg.Done()
      topicResults := gatherTopic(t, jsonOutput)
      mu.Lock()
      results = append(results, topicResults...)
      mu.Unlock()
  }(topic)
}
wg.Wait()

I used sync.WaitGroup + sync.Mutex instead of channels. For a fan-out/fan-in pattern where you just need to collect results into a slice, this is simpler and more readable than a channel pipeline. The same pattern repeats inside gatherTopic — each source gets its own goroutine.

The Reddit gatherer is the only one that can't fully parallelize. Reddit's public JSON API needs a 2-second sleep between subreddit searches to avoid rate limiting. So while the RSS and GitHub gatherers run concurrently with each other, the Reddit gatherer walks through subreddits sequentially with time.Sleep(2 * time.Second) between each call. Honest concurrency — you parallelize what you can and rate-limit what you must.
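
Concretely, the Reddit path is an ordinary sequential loop. A sketch under those constraints; redditGatherer and searchSubreddit are hypothetical names, not Scout's:

// Walk subreddits one at a time, sleeping between requests so
// Reddit's public JSON API doesn't rate-limit us.
func (g redditGatherer) Gather(terms []string) ([]IntelItem, error) {
  var items []IntelItem
  for i, sub := range g.subreddits {
      if i > 0 {
          time.Sleep(2 * time.Second) // stay under the rate limit
      }
      results, err := searchSubreddit(sub, terms) // hypothetical helper
      if err != nil {
          return nil, err
      }
      items = append(items, results...)
  }
  return items, nil
}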

Scoring What Matters

Every gathered item gets scored on three dimensions, each normalized to 0.0–1.0 and weighted into a final 0–100 score:

internal/gather/score.go
// Relevance: matched search terms / total terms
relevance := float64(matches) / float64(len(searchTerms))

// Recency: linear decay from 1.0 to 0.0 over 720 hours (30 days)
recency := math.Max(0, 1-hoursSince/720)

// Engagement: log scale, capped at 10,000
engagement := math.Min(1, math.Log10(float64(item.Engagement)+1)/4)

score := (relevance*0.4 + recency*0.3 + engagement*0.3) * 100

Relevance gets the highest weight at 40% because an old but highly relevant item is more useful than a fresh but irrelevant one. Recency decays linearly over 30 days — a 15-day-old item scores 0.5 on recency. Engagement uses a logarithmic scale capped at 10,000 interactions, which means the difference between 10 and 100 upvotes matters more than the difference between 5,000 and 10,000. This prevents viral Reddit posts from dominating the feed.

The scoring is designed to be re-runnable. When scout brief or scout context reads stored intel, it re-scores every item with time.Now(). Items don't go stale in the database — they go stale at read time. This means a gathering run from three days ago produces different scores today than it did then, without re-fetching anything.
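
In code, read-time re-scoring is just a pass over the loaded items before sorting. A sketch, assuming a scoreItem helper that wraps the formula above (the name is mine, not Scout's):

// Re-score at read time so recency decays against "now" rather than
// against whenever the item was originally gathered.
now := time.Now()
for i := range items {
  items[i].Score = scoreItem(items[i], searchTerms, now)
}
sort.Slice(items, func(a, b int) bool {
  return items[a].Score > items[b].Score
})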

Jaccard Deduplication

The same story appears on Reddit, in an RSS feed, and in Google News with slightly different titles. "OpenAI Releases GPT-5" and "GPT-5 Released by OpenAI" are the same story. Without deduplication, the top 10 list becomes the top 3, repeated.

Scout uses Jaccard similarity on title words, with a threshold of 0.6:

internal/gather/dedupe.go
func jaccardSimilarity(a, b string) float64 {
  wordsA := uniqueWords(a)
  wordsB := uniqueWords(b)
  intersection := 0
  for w := range wordsA {
      if wordsB[w] {
          intersection++
      }
  }
  union := len(wordsA) + len(wordsB) - intersection
  if union == 0 {
      return 0 // neither title has any significant words
  }
  return float64(intersection) / float64(union)
}

The uniqueWords function extracts lowercase words longer than 2 characters — stripping articles, prepositions, and punctuation noise. Two titles sharing 60% of their significant words are considered duplicates. When a duplicate is found, the higher-scored version wins.
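
The post doesn't reproduce uniqueWords itself; a minimal sketch matching that description could look like this:

// uniqueWords lowercases the title, splits on anything that isn't a
// letter or digit, and keeps only words longer than two characters.
func uniqueWords(s string) map[string]bool {
  words := make(map[string]bool)
  split := strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
      return !unicode.IsLetter(r) && !unicode.IsDigit(r)
  })
  for _, w := range split {
      if len(w) > 2 {
          words[w] = true
      }
  }
  return words
}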

I chose title-only comparison deliberately. Titles are ~10 words. Content can be hundreds of words with filler that dilutes the signal. Comparing "OpenAI releases GPT-5" against "GPT-5 released by OpenAI" is a clean signal. Comparing their full article bodies would introduce noise from boilerplate paragraphs that appear in every tech article.

From Raw Intel to Actionable Output

Gathering is only half the system. The other half is synthesis. scout brief produces a human-readable summary with top stories, trending themes, and recent items. scout context exports structured data for downstream consumption.

The pipeline is the same for both: load intel files, re-score with current time, deduplicate, sort by score, detect themes.

Theme detection works by tokenizing titles, counting word frequency across the top items, and grouping items that share significant words. If "claude" appears in 4 of the top 10 titles, that's a theme. The algorithm caps at 5 themes and avoids double-counting items across themes.
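
A sketch of that counting pass, reusing the dedupe tokenizer (names are illustrative, and the double-counting guard is omitted for brevity):

// detectThemes counts significant words across titles and keeps the
// five most frequent words that span at least two items.
func detectThemes(items []IntelItem) []string {
  counts := make(map[string]int)
  for _, item := range items {
      for w := range uniqueWords(item.Title) {
          counts[w]++
      }
  }
  var themes []string
  for w, n := range counts {
      if n >= 2 {
          themes = append(themes, w)
      }
  }
  sort.Slice(themes, func(a, b int) bool {
      return counts[themes[a]] > counts[themes[b]]
  })
  if len(themes) > 5 {
      themes = themes[:5]
  }
  return themes
}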

The context command is where Scout connects to the rest of Via's plugin ecosystem. It outputs markdown or JSON that any downstream tool can consume:

## Scout Context: ai-models (Feb 12, 2026)

### Top 10 by score
1. **Claude 4.5 Sonnet Benchmarks** (score: 87) — blog.anthropic.com
2. **GPT-5 API Now Available** (score: 82) — openai.com
...

### Themes
- claude: 4 items
- benchmark: 3 items

### Sources scanned
- RSS: 4 feeds (12 files)
- Reddit: 3 subreddits (8 files)
- Web: Google News (4 files)

A planned YouTube Shorts pipeline would consume this context to turn trending AI stories into video scripts. That pipeline is in progress, not production. But the data layer is ready — scout context ai-models --json produces a structured payload that any future consumer can parse.

The Zero-Dependency Choice

Scout uses only Go's standard library. encoding/json for Reddit and GitHub APIs. encoding/xml for RSS and Atom feeds. net/http for every network call. crypto/sha256 for deterministic ID generation. sync for concurrency primitives. No Cobra for CLI parsing — hand-rolled argument loops. No goquery for HTML — a 15-line stripTags function that walks characters.
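
The post doesn't show stripTags, but a character-walking version in that spirit fits in about fifteen lines (a sketch, not the exact implementation):

// stripTags walks the string one rune at a time and copies only the
// characters that fall outside angle brackets.
func stripTags(s string) string {
  var b strings.Builder
  inTag := false
  for _, r := range s {
      switch {
      case r == '<':
          inTag = true
      case r == '>':
          inTag = false
      case !inTag:
          b.WriteRune(r)
      }
  }
  return b.String()
}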

This is a conscious tradeoff. The RSS parser handles 8 date formats and auto-detects RSS 2.0 vs Atom 1.0 by trying both:

internal/gather/rss.go
func parseFeed(data []byte) ([]IntelItem, error) {
  var rss rssFeed
  if err := xml.Unmarshal(data, &rss); err == nil && len(rss.Channel.Items) > 0 {
      return rssItemsToIntel(rss.Channel.Items), nil
  }
  var atom atomFeed
  if err := xml.Unmarshal(data, &atom); err == nil && len(atom.Entries) > 0 {
      return atomEntriesToIntel(atom.Entries), nil
  }
  return nil, fmt.Errorf("unrecognized feed format")
}
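
The eight date formats aren't enumerated in the post, but the fallback pattern is standard: try each layout in order until one parses. The candidate list below is an assumption based on what RSS and Atom feeds commonly emit:

// parseDate tries each known layout and returns the first that matches.
func parseDate(s string) (time.Time, error) {
  layouts := []string{
      time.RFC1123Z, // "Mon, 02 Jan 2006 15:04:05 -0700", common in RSS 2.0
      time.RFC1123,
      time.RFC822Z,
      time.RFC822,
      time.RFC3339, // Atom's required format
      "2006-01-02T15:04:05",
      "2006-01-02 15:04:05",
      "2006-01-02",
  }
  for _, layout := range layouts {
      if t, err := time.Parse(layout, strings.TrimSpace(s)); err == nil {
          return t, nil
      }
  }
  return time.Time{}, fmt.Errorf("unrecognized date format: %q", s)
}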

A library like gofeed would handle more edge cases. But for the feeds I'm actually consuming — arXiv, OpenAI's blog, TechCrunch, HuggingFace — the stdlib approach works. And the web gatherer reuses this same parseFeed function to parse Google News results. The entire web gatherer is 56 lines because it just constructs a Google News RSS URL and calls the existing RSS parser. That kind of composability only works when you control the full stack.
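
A sketch of that reuse; the exact query parameters are an assumption about Google News's RSS endpoint:

// gatherWeb builds a Google News RSS search URL and feeds the response
// through the same parseFeed used for ordinary RSS sources.
func gatherWeb(terms []string) ([]IntelItem, error) {
  query := url.QueryEscape(strings.Join(terms, " "))
  resp, err := http.Get("https://news.google.com/rss/search?q=" + query)
  if err != nil {
      return nil, err
  }
  defer resp.Body.Close()
  data, err := io.ReadAll(resp.Body)
  if err != nil {
      return nil, err
  }
  return parseFeed(data)
}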

The binary compiles to a single static executable. No runtime dependencies. No Docker. go build and copy to ~/.local/bin/.

The Honest Limitations

No semantic deduplication. Jaccard catches "GPT-5 Released by OpenAI" vs "OpenAI Releases GPT-5" but misses "OpenAI's latest model" vs "GPT-5 launches today." Same story, different words. The learnings system uses Gemini embeddings for semantic dedup — Scout should too, but that would break the zero-dependency constraint.

No persistent state across runs. Each gathering session writes fresh JSON files. There's no database tracking what I've already read or which items were useful. The brief command re-processes everything from scratch every time. For ~1,200 items this takes milliseconds, but it won't scale forever.

Theme detection is naive. Single-word frequency counting misses multi-word concepts. "foundation model" is two separate words that individually mean nothing, but together represent a theme. N-gram analysis or even simple bigram counting would catch this.
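
Even the bigram fix is small. A sketch of counting adjacent word pairs (illustrative only; Scout doesn't implement this yet):

// bigrams returns adjacent lowercase word pairs from a title so that
// multi-word concepts like "foundation model" count as single units.
func bigrams(title string) []string {
  words := strings.Fields(strings.ToLower(title))
  var pairs []string
  for i := 0; i+1 < len(words); i++ {
      pairs = append(pairs, words[i]+" "+words[i+1])
  }
  return pairs
}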

Four sources, not five. The task spec and help text mention X/Twitter as a source, but no implementation exists. Reddit's public API carries most of the community signal, so the gap isn't critical — but it's a gap.

The system works despite these gaps. I run scout gather daily, scout brief ai-models when I need a quick status check, and scout context when I need to feed current intelligence into a downstream pipeline. The intel is fresh, scored, and deduplicated enough to be useful. The rest is refinement.

Next: What 1,600+ AI Learnings Reveal

