Most recent: Architected a multi-tenant RAG pipeline serving 50K+ queries/day with <200ms p99 latency. Fine-tuned domain-specific language models that reduced manual review time by 73%. Built MLOps infrastructure that took model deployment from 3 weeks to 4 hours.
Production ML infrastructure that handles real traffic, real failures, and real business value. Built for reliability, observability, and iteration speed.
A 90% accurate model that deploys reliably beats a 95% accurate model that breaks in production. I care about latency budgets, error handling, monitoring, and what happens when your database goes down at 3 AM.
Fast feedback loops matter more than perfect architecture. I build prototypes that fail quickly, then productionize what works. Every pipeline I write has observability baked in from day one.
Attention mechanisms are elegant. But deployment scripts, error handling, and load testing are what separate demos from products.
Tools I use daily to ship ML systems that work in production environments.
ReAct patterns, function calling, tool use, and memory systems that hold up under production traffic.
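A minimal sketch of the ReAct loop these agent systems build on: the model alternates thought, action, and observation until it emits a final answer. The `fake_llm` stub and the `calculator` tool are illustrative stand-ins for a real model call and tool registry.

```python
import re

# Hypothetical tool registry -- names and implementations are illustrative.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def fake_llm(scratchpad: str) -> str:
    """Stand-in for a real model call: first turn requests a tool,
    the next turn answers using the observation."""
    if "Observation:" not in scratchpad:
        return "Thought: I need arithmetic.\nAction: calculator[6 * 7]"
    return "Final Answer: 42"

def react_loop(question: str, max_steps: int = 5) -> str:
    scratchpad = f"Question: {question}"
    for _ in range(max_steps):
        step = fake_llm(scratchpad)
        scratchpad += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        match = re.search(r"Action: (\w+)\[(.+)\]", step)
        if match:
            name, arg = match.groups()
            observation = TOOLS[name](arg)  # execute the requested tool
            scratchpad += f"\nObservation: {observation}"
    return "no answer within step budget"

print(react_loop("What is 6 * 7?"))  # -> 42
```

The production-hardening work lives in the parts this sketch elides: the step budget, tool-call parsing, and what happens when a tool raises.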
Vision-language models, CLIP embeddings, cross-modal retrieval, and unified text-image representations.
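The core idea behind cross-modal retrieval, sketched with toy vectors: a text encoder and an image encoder project into the same space, so retrieval is nearest-neighbor search over unit vectors. The hard-coded embeddings below stand in for real CLIP encoder outputs.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Toy embeddings standing in for CLIP image-tower outputs -- in practice
# both encoders map into the same d-dimensional space.
IMAGE_INDEX = {
    "dog.jpg": normalize([0.9, 0.1, 0.0]),
    "cat.jpg": normalize([0.1, 0.9, 0.0]),
    "car.jpg": normalize([0.0, 0.1, 0.9]),
}

def embed_text(query: str):
    # Hypothetical text encoder: a real system calls the CLIP text tower.
    table = {"a photo of a dog": [1.0, 0.0, 0.1],
             "a photo of a cat": [0.0, 1.0, 0.1]}
    return normalize(table[query])

def retrieve(query: str) -> str:
    q = embed_text(query)
    # On unit vectors, cosine similarity reduces to a dot product.
    score = lambda v: sum(a * b for a, b in zip(q, v))
    return max(IMAGE_INDEX, key=lambda k: score(IMAGE_INDEX[k]))

print(retrieve("a photo of a dog"))  # -> dog.jpg
```

At production scale the `max` over a dict becomes an approximate nearest-neighbor index, but the similarity math is unchanged.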
Speculative decoding, continuous batching, PagedAttention, and serving models at scale efficiently.
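Speculative decoding in miniature: a cheap draft model proposes k tokens, the expensive target model verifies them in one pass, and mismatches are corrected with the target's own token. This sketch uses a simplified greedy-match acceptance rule rather than the probabilistic min(1, p/q) acceptance of the full algorithm; both "models" are hard-coded lookup stubs.

```python
def draft_model(prefix):
    # Cheap draft model: proposes the next token greedily (stubbed here).
    guesses = {"the": "quick", "quick": "brown", "brown": "fox", "fox": "jumps"}
    return guesses.get(prefix[-1], "<eos>")

def target_model(prefix):
    # Expensive target model: the distribution we must match.
    truth = {"the": "quick", "quick": "brown", "brown": "dog", "dog": "barks"}
    return truth.get(prefix[-1], "<eos>")

def speculative_decode(prompt, k=3, max_tokens=8):
    tokens = list(prompt)
    while len(tokens) < max_tokens and tokens[-1] != "<eos>":
        # 1. Draft k tokens cheaply, autoregressively.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify all k in one (conceptual) target forward pass: keep the
        #    longest matching prefix, then take the target's correction.
        accepted, ctx = [], list(tokens)
        for t in draft:
            want = target_model(ctx)
            if t == want:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(want)  # target's token replaces the miss
                break
        else:
            accepted.append(target_model(ctx))  # bonus token on full accept
        tokens.extend(accepted)
    if "<eos>" in tokens:
        tokens = tokens[:tokens.index("<eos>") + 1]
    return tokens

print(speculative_decode(["the"]))
```

The speedup comes from step 2: one target pass can accept several draft tokens, so the expensive model runs far fewer times than tokens generated.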
Chain-of-thought reasoning, few-shot learning, structured outputs, and reliability patterns.
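One of the reliability patterns this refers to, sketched: parse the model's reply as JSON, validate it against the expected keys, and re-prompt on failure instead of trusting the first response. The `stub_llm` function and the schema keys are illustrative; a real system would call an actual model and a proper schema validator.

```python
import json

SCHEMA_KEYS = {"sentiment", "confidence"}  # required output fields

def stub_llm(prompt: str, attempt: int) -> str:
    # Stub model: the first reply is malformed, the retry is valid JSON.
    if attempt == 0:
        return "Sure! The sentiment is positive."
    return '{"sentiment": "positive", "confidence": 0.92}'

def get_structured(prompt: str, max_attempts: int = 3) -> dict:
    """Parse, validate, and retry -- never trust the first reply."""
    for attempt in range(max_attempts):
        raw = stub_llm(prompt, attempt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed -> retry with the same prompt
        if SCHEMA_KEYS <= set(data):
            return data  # all required keys present
    raise ValueError("no valid structured output within retry budget")

result = get_structured("Classify: 'Great product!' Respond as JSON "
                        "with keys sentiment and confidence.")
print(result["sentiment"])  # -> positive
```

The retry budget is the same idea as the latency budget above: a bounded, observable failure mode instead of an unbounded one.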
Interested in Staff/Principal ML Engineer roles building LLM infrastructure, or technical leadership in teams solving hard NLP/GenAI problems at scale.