Blog — Jagadesh Rajendran


GPU & Compute 01

Why GPU Training Cost Us $0.014 While the Same LLM Would Have Cost $83,000

When we benchmarked our NewsGuard-LLM framework against training a full LLaMA 3.2 7B model, the numbers were staggering. Here's the real breakdown from our IEEE paper — and what it means for teams trying to run AI on a budget.

Jun 2025 Read more →

Open Source 02

Mistral vs LLaMA vs OpenLLaMA — What I Actually Found When Running Them On-Prem

I've run all three in production annotation pipelines for healthcare data. The benchmarks don't tell the full story — here's what actually matters when you can't send data to the cloud.

May 2025 Read more →

GPU & Compute 03

NVIDIA A10 vs A100: Choosing the Right GPU When You're Not Google

Most GPU comparison articles assume you have a hyperscaler budget. We ran our training workloads on an A10 with 24GB VRAM — here's what we learned and when it's enough.

May 2025 Read more →

AI & LLMs 04

RAG Is Not Magic — The Real Failure Modes Nobody Talks About

After deploying RAG pipelines into our healthcare Decision Engine, I hit every failure mode in the book. Retrieval hallucinations, stale embeddings, chunk boundary disasters — and how we fixed them.

Apr 2025 Read more →

Production 05

The Moment Our LLM Pipeline Went Down in Prod — And What We Learned

A dead-letter queue overflow at 2am took out our claims classification service. This is a post-mortem of that incident, the fix, and why idempotency is not optional when you're processing 5M+ claims a month.

Apr 2025 Read more →

AI & LLMs 06

Prompt Engineering Is an Engineering Discipline — Treat It Like One

We went from 61% to 94% classification accuracy on insurance denial codes by treating prompts as versioned, tested artifacts — not one-off experiments. Here's the system we built.

Mar 2025 Read more →

Open Source 07

Why We Chose OpenLLaMA Over GPT-4 for Healthcare Data Labeling

HIPAA compliance, data sovereignty, and cost — three reasons we couldn't send our training data to OpenAI. How we made open-source LLMs work for large-scale annotation at 96% accuracy.

Mar 2025 Read more →

Infrastructure 08

Kafka for AI Pipelines: What Changes When Your Consumer Is an LLM

Running LLM inference as a Kafka consumer is not like running a database write. Backpressure, token limits, latency spikes — here's the architecture pattern that actually held up at scale.

Feb 2025 Read more →

GPU & Compute 09

Mixed Precision Training Saved Us 40% Memory — Here's the Actual Setup

FP16 mixed precision isn't just a flag you flip. We hit NaN losses, gradient underflow, and silent accuracy regressions before getting it right. This is the configuration that finally worked on our A10.

Feb 2025 Read more →

AI & LLMs 10

Concept Drift in Production LLMs Is Real — And Sneaky

Our threat detection model was quietly degrading for three weeks before we caught it. The data hadn't changed — the language had. This is how we built a drift detection loop that now catches it in under 24 hours.

Jan 2025 Read more →

Open Source 11

Apache Airflow for ML Pipelines: The Good, the Bad, and the DAG That Broke at 3am

We orchestrate our entire retraining pipeline with Airflow. After a year in production, here's an honest take — what it does beautifully, where it falls apart, and the pattern we use to avoid weekend on-call nightmares.

Jan 2025 Read more →

Infrastructure 12

On-Prem AI Is Back — And the Cloud Vendors Are Nervous

Healthcare, finance, and government can't put sensitive data in someone else's cloud. After building our entire AI stack on-prem, I'm convinced the pendulum is swinging back. Here's why — and what the architecture looks like.

Dec 2024 Read more →

AI & LLMs 13

TF-IDF Still Beats Transformers in Some Real-World Cases — Here's the Data

We benchmarked TF-IDF, FastText, BERT embeddings, and transformer classifiers on 1.4M healthcare news headlines. The winner wasn't what anyone expected — and the compute difference was enormous.

Dec 2024 Read more →

Production 14

Class Imbalance at 3.7% — How We Stopped Our Model from Ignoring Rare Threats

When only 3.7% of your 1.6M training examples are the class you actually care about, your model learns to be lazy. Oversampling, class weighting, and precision-recall tradeoffs — the exact approach that got us to 96% accuracy.

Nov 2024 Read more →

Infrastructure 15

Chaos Engineering an AI Platform: What Breaks First (It's Not What You Think)

We ran zone failure simulations on our AI inference cluster. The model serving layer held. What collapsed first was the feature store cache. A breakdown of our chaos engineering program and what it found before production did.

Nov 2024 Read more →

GPU & Compute 16

CUDA Out of Memory: The Error That Taught Me Everything About GPU Memory Management

Every ML engineer has stared at "CUDA out of memory" at midnight. I've hit it more times than I can count. Gradient checkpointing, batch size tricks, memory profiling — the definitive playbook from someone who's been through it.

Oct 2024 Read more →

Open Source 17

FastAPI + Docker + Model Serving: The Stack That Actually Works at Scale

We serve LLM inference endpoints via FastAPI wrapped in Docker, orchestrated with Kubernetes, behind an API gateway. Two years in production serving healthcare clients — here's what we'd change and what we'd keep.

Oct 2024 Read more →

Production 18

Building a Human-in-the-Loop Feedback System That Engineers Actually Maintain

Human-in-the-loop sounds great on paper. In practice, annotators get tired, labels drift, and the feedback loop breaks. Here's the lightweight system we built that has stayed running for 18 months without becoming a maintenance nightmare.

Sep 2024 Read more →

Thoughts from the Engineering Trenches
