Field Notes & Perspectives

Thoughts from the Engineering Trenches

Real findings from building AI systems at scale — open source models, GPU infrastructure, LLM quirks, and lessons learned the hard way in production healthcare environments.

Jagadesh Rajendran · VP of Engineering
18 Articles
Open Source · 02
Mistral vs LLaMA vs OpenLLaMA — What I Actually Found When Running Them On-Prem
I've run all three in production annotation pipelines for healthcare data. The benchmarks don't tell the full story — here's what actually matters when you can't send data to the cloud.
GPU & Compute · 03
NVIDIA A10 vs A100: Choosing the Right GPU When You're Not Google
Most GPU comparison articles assume you have a hyperscaler budget. We ran our training workloads on an A10 with 24GB VRAM — here's what we learned and when it's enough.
AI & LLMs · 04
RAG Is Not Magic — The Real Failure Modes Nobody Talks About
After deploying RAG pipelines into our healthcare Decision Engine, I hit every failure mode in the book. Retrieval hallucinations, stale embeddings, chunk boundary disasters — and how we fixed them.
Production · 05
The Moment Our LLM Pipeline Went Down in Prod — And What We Learned
A dead-letter queue overflow at 2am took out our claims classification service. This is a post-mortem of that incident, the fix, and why idempotency is not optional when you're processing 5M+ claims a month.
AI & LLMs · 06
Prompt Engineering Is an Engineering Discipline — Treat It Like One
We went from 61% to 94% classification accuracy on insurance denial codes by treating prompts as versioned, tested artifacts — not one-off experiments. Here's the system we built.
Open Source · 07
Why We Chose OpenLLaMA Over GPT-4 for Healthcare Data Labeling
HIPAA compliance, data sovereignty, and cost — three reasons we couldn't send our training data to OpenAI. How we made open-source LLMs work for large-scale annotation at 96% accuracy.
Infrastructure · 08
Kafka for AI Pipelines: What Changes When Your Consumer Is an LLM
Running LLM inference inside a Kafka consumer is not like performing a database write. Backpressure, token limits, latency spikes: here's the architecture pattern that actually held up at scale.
GPU & Compute · 09
Mixed Precision Training Saved Us 40% Memory — Here's the Actual Setup
FP16 mixed precision isn't just a flag you flip. We hit NaN losses, gradient underflow, and silent accuracy regressions before getting it right. This is the configuration that finally worked on our A10.
AI & LLMs · 10
Concept Drift in Production LLMs Is Real — And Sneaky
Our threat detection model was quietly degrading for three weeks before we caught it. The data hadn't changed — the language had. This is how we built a drift detection loop that now catches it in under 24 hours.
Open Source · 11
Apache Airflow for ML Pipelines: The Good, the Bad, and the DAG That Broke at 3am
We orchestrate our entire retraining pipeline with Airflow. After a year in production, here's an honest take — what it does beautifully, where it falls apart, and the pattern we use to avoid weekend on-call nightmares.
Infrastructure · 12
On-Prem AI Is Back — And the Cloud Vendors Are Nervous
Healthcare, finance, and government can't put sensitive data in someone else's cloud. After building our entire AI stack on-prem, I'm convinced the pendulum is swinging back. Here's why — and what the architecture looks like.
AI & LLMs · 13
TF-IDF Still Beats Transformers in Some Real-World Cases — Here's the Data
We benchmarked TF-IDF, FastText, BERT embeddings, and transformer classifiers on 1.4M healthcare news headlines. The winner wasn't what anyone expected — and the compute difference was enormous.
Production · 14
Class Imbalance at 3.7% — How We Stopped Our Model from Ignoring Rare Threats
When only 3.7% of your 1.6M training examples are the class you actually care about, your model learns to be lazy. Oversampling, class weighting, and precision-recall tradeoffs — the exact approach that got us to 96% accuracy.
Infrastructure · 15
Chaos Engineering an AI Platform: What Breaks First (It's Not What You Think)
We ran zone failure simulations on our AI inference cluster. The model serving layer held. What collapsed first was the feature store cache. A breakdown of our chaos engineering program and what it found before production did.
GPU & Compute · 16
CUDA Out of Memory: The Error That Taught Me Everything About GPU Memory Management
Every ML engineer has stared at "CUDA out of memory" at midnight. I've hit it more times than I can count. Gradient checkpointing, batch size tricks, memory profiling — the definitive playbook from someone who's been through it.
Open Source · 17
FastAPI + Docker + Model Serving: The Stack That Actually Works at Scale
We serve LLM inference endpoints via FastAPI wrapped in Docker, orchestrated with Kubernetes, behind an API gateway. Two years in production serving healthcare clients — here's what we'd change and what we'd keep.
Production · 18
Building a Human-in-the-Loop Feedback System That Engineers Actually Maintain
Human-in-the-loop sounds great on paper. In practice, annotators get tired, labels drift, and the feedback loop breaks. Here's the lightweight system we built that has stayed running for 18 months without becoming a maintenance nightmare.