Why Build Another AI Assistant?
In a world with ChatGPT, Claude, and Gemini, building yet another AI assistant seems redundant. But here's what existing solutions don't solve well: multi-client context management for virtual assistants.
A VA managing 10+ executives needs more than a chat interface. They need an AI that understands "when I'm working with Sarah, I prefer morning meetings" while simultaneously knowing "John's budget approvals need his CFO cc'd." Generic AI assistants treat every conversation as isolated. EA treats them as interconnected contexts.
This article isn't a tutorial on calling the Claude API. It's a deep dive into the architectural decisions that shaped a production AI system—decisions that took 18 months and several expensive mistakes to get right.
The Stack: Unexpected Choices
Here's what EA runs on:
| Layer | Choice | Not This |
|---|---|---|
| Backend Framework | Hono | Express, Fastify |
| Real-time | Native WebSocket | Socket.io, Pusher |
| AI Integration | Vercel AI SDK + Claude | Direct API, LangChain |
| Storage | Upstash Redis + Vector | Self-managed Redis, Pinecone |
| Frontend | Next.js 16 + React 19 | Remix, SvelteKit |
| Auth | Clerk | Auth0, NextAuth |
Each choice has a story. Let me tell you the ones that matter.
Why Hono Over Express
Express has been my default for a decade. But when you're streaming AI responses over WebSocket, every millisecond counts.
```typescript
// Hono: ~12 lines for WebSocket + streaming
// (upgradeWebSocket comes from the runtime adapter, e.g. hono/cloudflare-workers or @hono/node-ws)
app.get('/ws', upgradeWebSocket((c) => ({
  onMessage: async (event, ws) => {
    const stream = await generateStream(event.data)
    for await (const chunk of stream) {
      ws.send(chunk)
    }
  }
})))
```

Compare this to Express with `ws` or Socket.io:
```typescript
// Express + ws: 40+ lines, separate HTTP and WS servers
const server = createServer(app)
const wss = new WebSocketServer({ server })

wss.on('connection', (ws) => {
  ws.on('message', async (data) => {
    // Manual upgrade handling, heartbeat management,
    // connection state tracking...
  })
})
```

The difference isn't just lines of code. Hono is built for edge runtimes: it starts in milliseconds, handles thousands of concurrent connections efficiently, and its TypeScript types are first-class. For an AI application where response latency directly impacts user experience, these gains compound.
The numbers: In our benchmarks, Hono handled WebSocket upgrades 3.2x faster than Express + ws, with 40% lower memory footprint. For a system serving 100+ concurrent AI conversations, that translates to real cost savings.
The Streaming Architecture
AI responses shouldn't feel like waiting for a page to load. They should feel like someone typing to you. This requires streaming at every layer:
```
User Input → WebSocket → Claude API (streaming) → Token-by-token → User
                ↓               ↓                        ↓
              ~50ms       Response starts        First token visible
                          immediately            within 200ms
```

Here's the pattern that works:
```typescript
// Server: Stream Claude responses through WebSocket
async function handleMessage(ws: WebSocket, message: string) {
  const stream = await anthropic.messages.stream({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024, // required by the Messages API
    messages: [{ role: 'user', content: message }],
    system: buildContextualPrompt(ws.clientId) // ws carries the active clientId for context lookup
  })

  for await (const event of stream) {
    if (event.type === 'content_block_delta') {
      ws.send(JSON.stringify({
        type: 'token',
        content: event.delta.text
      }))
    }
  }

  ws.send(JSON.stringify({ type: 'complete' }))
}
```

The key insight: don't batch tokens. Send each token as it arrives. The 2-3ms overhead per message is invisible to users, but the perceived responsiveness is dramatically better than waiting for sentence boundaries.
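For completeness, here is a minimal sketch of the receiving side, assuming the `token`/`complete` message shapes from the server snippet above. The endpoint URL and DOM element are placeholders, not EA's actual client code:

```typescript
// Client sketch: render each token as it arrives, no buffering
const ws = new WebSocket('wss://example.com/ws') // hypothetical endpoint
const output = document.getElementById('assistant-reply')!

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data) as { type: string; content?: string }
  if (msg.type === 'token') {
    // Append immediately for a "someone is typing" feel
    output.append(msg.content ?? '')
  } else if (msg.type === 'complete') {
    // Response finished: re-enable input, persist the transcript, etc.
  }
}
```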
Why Upstash Over Self-Managed Redis
I've run Redis clusters in production. They're reliable once configured correctly. But for EA, I needed something different: vector search alongside traditional caching.
Upstash offers both in a single managed service:
```typescript
// Traditional caching
await redis.set(`session:${userId}`, sessionData, { ex: 3600 })

// Semantic search for conversation history
const similar = await vector.query({
  vector: await embed(userQuery),
  topK: 5,
  includeMetadata: true,
  filter: `clientId = '${currentClient}'` // Upstash Vector metadata filters are SQL-like strings
})
```

This enables EA's killer feature: semantic context retrieval. When a user asks "what did we discuss about the Q3 budget?", we don't search for keywords. We find semantically similar past conversations.
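Retrieval like this only works if conversation turns are indexed as they happen. A minimal sketch of that write path, assuming the `@upstash/vector` SDK and the same `embed()` helper used in the query above:

```typescript
import { Index } from '@upstash/vector'

const vector = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
})

// Index a conversation turn so semantic queries can find it later
async function indexMessage(clientId: string, messageId: string, text: string) {
  await vector.upsert({
    id: messageId,
    vector: await embed(text),    // same embedding model as at query time
    metadata: { clientId, text }, // clientId makes the per-client filter possible
  })
}
```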
We also use hybrid search (dense + sparse vectors) for best-of-both-worlds retrieval:
```typescript
// Hybrid search: semantic + keyword matching
async findRelevantMessages(query: string, topK: number = 10) {
  const queryEmbedding = await this.generateEmbedding(query)

  // Sparse vector for keyword matching (BM25-style)
  const sparseVector = generateSparseVector(query)

  const results = await vectorStore.query({
    vector: queryEmbedding,
    sparseVector, // Reciprocal Rank Fusion combines both
    topK,
    filter: `userId = '${this.userId}'`,
  })

  return results
}
```

Cost comparison:
- Self-managed Redis + Pinecone: ~$150/month + operational overhead
- Upstash Redis + Vector: ~$40/month, zero ops
For a product in early stages, reducing operational complexity is worth more than any optimization.
The 4-Tier Skill Resolution System
EA's most complex architectural decision was the skill resolution system. Each client can have different preferences for how the AI behaves—preferred meeting times, email tone, task categorization rules.
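To make the tiers concrete, here is roughly the shape of data being merged. The field names are illustrative, not EA's actual schema:

```typescript
// Illustrative skill/preference bundle; each tier stores a partial version of this
interface Skills {
  scheduling?: { preferredMeetingHours?: [start: number, end: number]; bufferMinutes?: number }
  email?: { tone?: 'formal' | 'casual'; alwaysCc?: string[] }
  tasks?: { defaultCategory?: string }
}
```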
The resolution order:
```
System Defaults  →  User Defaults   →  Template Defaults  →  Client Overrides
       ↓                  ↓                    ↓                      ↓
 Base behavior      "My preferences"    "Software client       "Sarah specifically
 for all clients                         standard config"       prefers X"
```

This required careful data modeling:
```typescript
interface SkillResolution {
  // Resolve skills in <10ms (target: 3.5ms achieved)
  resolve(userId: string, clientId?: string): Promise<ResolvedSkills>
}

// Implementation: up to four parallel lookups (three Redis keys plus the client's template)
async resolve(userId: string, clientId?: string) {
  const [system, user, template, client] = await Promise.all([
    redis.get('skills:system'),
    redis.get(`skills:user:${userId}`),
    this.getTemplateForClient(clientId),
    clientId ? redis.get(`skills:client:${clientId}`) : null
  ])
  return deepMerge(system, user, template, client)
}
```

Why this matters: A VA managing 50 clients can't reconfigure the AI for each one. Templates let them define "software client standard config" once, then override only the specifics per client.
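The merge semantics matter: a client override should replace only the keys it specifies, leaving everything else from lower tiers intact. A sketch of the behavior `deepMerge` needs, using the illustrative `Skills` shape above (EA's actual helper may differ):

```typescript
// Later tiers override earlier ones, key by key, without clobbering sibling keys
function deepMerge(...tiers: (Record<string, any> | null)[]): Record<string, any> {
  return tiers.reduce<Record<string, any>>((acc, tier) => {
    if (!tier) return acc
    for (const [key, value] of Object.entries(tier)) {
      acc[key] =
        value && typeof value === 'object' && !Array.isArray(value)
          ? deepMerge(acc[key] ?? {}, value)
          : value
    }
    return acc
  }, {})
}

// Template says software clients get a formal email tone;
// Sarah's override changes only the meeting hours.
const resolved = deepMerge(
  { email: { tone: 'formal' }, scheduling: { bufferMinutes: 15 } }, // template
  { scheduling: { preferredMeetingHours: [9, 12] } }                // client override
)
// => { email: { tone: 'formal' },
//      scheduling: { bufferMinutes: 15, preferredMeetingHours: [9, 12] } }
```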
The Confidence Scoring Pattern
Most AI assistants are binary: either they execute actions or they ask for permission. EA implements a confidence spectrum:
```typescript
interface ProposedAction {
  type: 'calendar_create' | 'email_draft' | 'task_add'
  confidence: number // 0.0 - 1.0
  details: ActionDetails
}

// User configures thresholds per action type
const autonomyConfig = {
  calendar_create: { autoExecute: 0.85, requireApproval: 0.5 },
  email_draft: { autoExecute: 0.95, requireApproval: 0.7 },
  task_add: { autoExecute: 0.6, requireApproval: 0.3 }
}
```

This creates three zones (made explicit in the sketch after the list):
- High confidence (>threshold): Execute automatically
- Medium confidence: Propose with one-click approval
- Low confidence (<approval threshold): Ask for details
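A small helper makes the zone mapping explicit, reusing `ProposedAction` and `autonomyConfig` from above. The function name and zone labels are illustrative, not EA's API:

```typescript
type Zone = 'auto_execute' | 'propose' | 'clarify'

// Map a proposed action to a zone using the per-type thresholds
function classify(action: ProposedAction, config = autonomyConfig): Zone {
  const thresholds = config[action.type]
  if (action.confidence >= thresholds.autoExecute) return 'auto_execute'
  if (action.confidence >= thresholds.requireApproval) return 'propose'
  return 'clarify'
}
```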
The implementation considers both confidence and action risk:
```typescript
// Execute based on confidence, autonomy mode, and risk level
function shouldAutoExecute(type: string, confidence: number, config: any) {
  const isRisky = RISKY_ACTIONS.includes(type) // e.g. email_send, calendar_delete
  const mode = config.autonomy || 'assist'     // suggest | assist | automate

  switch (mode) {
    case 'suggest':
      return false // Always draft, never execute
    case 'assist':
      if (isRisky) return false // Draft risky actions
      return confidence >= config.threshold
    case 'automate':
      return confidence >= config.threshold // Execute everything above threshold
    default:
      return false // Unknown mode: fail safe, never auto-execute
  }
}
```

The confidence itself comes from Claude's structured output:
```typescript
const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024, // required by the Messages API
  messages: [/* conversation */],
  system: `When proposing actions, include a confidence score (0.0-1.0) based on:
  - Clarity of user intent
  - Availability of required information
  - Historical accuracy for similar requests`
})
```

Token Optimization: Full vs Lite Prompts
AI API costs scale with token usage. EA implements a two-tier prompt system:
```typescript
// Full prompt: Complete context for complex queries
const fullPrompt = buildFullPrompt({
  systemContext: true,
  clientProfile: true,
  recentMessages: 20,
  semanticContext: true // Vector search results
})

// Lite prompt: Minimal context for simple operations
const litePrompt = buildLitePrompt({
  systemContext: true,
  recentMessages: 5
})

// Dynamic selection based on query complexity
const prompt = estimateComplexity(userMessage) > 0.7 ? fullPrompt : litePrompt
```

Impact: Lite prompts use 60-70% fewer tokens. For simple queries like "schedule a meeting tomorrow at 2pm," the full semantic context is unnecessary. This reduced our Claude API costs by approximately 40% without impacting quality where it matters.
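The snippets above leave `estimateComplexity` abstract. A cheap first pass, purely illustrative, can combine message length with signals that the query needs history; the 0.7 cutoff matches the selection snippet:

```typescript
// Illustrative heuristic: 0 = trivial, 1 = needs full context.
// A real system might use a small classifier or a cheap LLM call instead.
function estimateComplexity(message: string): number {
  const referencesHistory = /\b(we discussed|last time|earlier|previous|again)\b/i.test(message)
  const mentionsMultipleEntities = (message.match(/\b[A-Z][a-z]+\b/g) ?? []).length > 2
  const lengthScore = Math.min(message.length / 400, 1) // longer messages tend to be complex

  let score = lengthScore * 0.5
  if (referencesHistory) score += 0.4
  if (mentionsMultipleEntities) score += 0.2
  return Math.min(score, 1)
}
```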
Lessons Learned
What Worked Well
1. Betting on Hono early. The ecosystem has matured significantly, and our choice looks prescient now. Edge-first frameworks are becoming the default.
2. Upstash for everything. Single vendor for caching and vector search simplified operations dramatically. The Redis protocol compatibility meant zero learning curve.
3. Streaming from day one. Retrofitting streaming into a request-response architecture is painful. Designing for it upfront made everything cleaner.
What Was Challenging
1. Context window management. Claude's 200K context window seems infinite until you're managing 50 client contexts. We had to implement aggressive summarization and semantic retrieval instead of naive context stuffing.
2. Voice synthesis costs. ElevenLabs at $4-6 per active user per month eats into margins. Voice became a premium feature, not default.
3. Multi-tenant complexity. The 4-tier resolution system took three iterations to get right. The initial version was simpler but couldn't handle agency use cases.
What I'd Do Differently
1. Usage-based pricing research earlier. We designed for flat monthly pricing, then discovered our heaviest users cost 10x to serve. Usage-based tiers should have been a day-one decision.
2. Voice as premium from the start. We offered voice to all users initially, then had to "take it away" when costs became unsustainable. Launching premium-only would have avoided user frustration.
3. Less architectural ambition initially. The template system is elegant but complex. For the first 6 months, simple per-client settings would have been sufficient.
The Honest Assessment
Building EA taught me that 80% of what makes an AI assistant valuable is already solved by ChatGPT and Claude. The remaining 20%—multi-client context, confidence-based autonomy, real-time streaming—is where the real engineering challenge lies.
If you're building an AI product, don't compete on "has AI." Compete on the workflow-specific features that generic assistants can't provide. For EA, that's multi-client context switching. For your product, it's something else.
The architecture decisions outlined here aren't universally correct. They're correct for EA's specific requirements: real-time streaming, multi-tenant contexts, and cost-conscious AI usage. Your requirements will differ. But the decision-making framework—evaluate trade-offs explicitly, benchmark claims, and optimize for your actual bottlenecks—applies everywhere.
What's Next
EA is actively evolving. Current focus areas:
- Voice transcription for hands-free operation
- Calendar and email integration for automated action execution
- Agency dashboards for VA teams managing dozens of clients
If you're building something similar, I'm happy to discuss architecture decisions. Find me on GitHub or LinkedIn.
This article is part of a series on building AI products. Next up: "AI Integration Patterns: Semantic Context, Confidence Scoring, and Token Optimization."
Related Reading
- Why I Built a Multi-LLM Orchestration System — The developer workflow that powers how I build systems like EA
- LifeOS: Building an AI-Powered Personal Operating System — Another take on AI-powered systems, applied to personal knowledge management