Running AI Locally Changed How I Think About Software
What happens when you stop renting GPUs and start feeling constraints. A story about Colab bills, CUDA crashes, and why friction is the best teacher.
I kept recharging my Colab subscription until I realized I was renting confusion.
The Cloud Trap
I started with Google Colab Pro.
The goal was simple: train a binary classifier, then work my way up to fine-tuning small LLMs using HuggingFace.
Colab made it easy to start. GPU in the browser. No setup. No hardware. Just code.
But the sessions kept dying.
I'd be mid-training, tweaking hyperparameters, watching loss curves - and the runtime would disconnect. Progress lost. Re-upload. Re-run. Wait.
Then the compute units ran out.
I recharged.
Ran out again.
Recharged again.
Somewhere around the third top-up, I stopped and did the math.
I was paying cloud rent for a GPU I didn't control, on a timeline I couldn't predict, debugging problems I couldn't see.
So I switched to my own machine.
The Crash
I tried fine-tuning Mistral 7B with LoRA using the HuggingFace PEFT library.
The script started.
Tokenizer loaded.
Model loaded.
Then:
CUDA out of memory. Tried to allocate 2.14 GiB.
GPU 0 has a total capacity of 8.00 GiB...
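The arithmetic behind that error is unforgiving. A back-of-envelope sketch makes it obvious (the ~7.2B parameter count is a rough assumption, not an exact figure for any particular build):

```python
# Rough VRAM needed just to hold a Mistral-7B-class model's weights at fp16.
# ~7.2B parameters is an assumption; exact counts vary by build.
PARAMS = 7.2e9
BYTES_PER_PARAM_FP16 = 2   # fp16/bf16 store 2 bytes per parameter

weights_gib = PARAMS * BYTES_PER_PARAM_FP16 / 2**30
print(f"weights alone: {weights_gib:.1f} GiB")  # well past 8 GiB
```

Half-precision weights alone overshoot an 8 GiB card before a single activation or optimizer tensor is allocated.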
I stared at the terminal.
Googled the error.
Every answer said the same thing:
Reduce batch size. Use gradient checkpointing. Or get more VRAM.
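That advice maps onto a handful of concrete knobs. Sketched as a plain dict in HuggingFace TrainingArguments terms (the key names follow the HF API; the values are assumptions for an 8 GiB card, not a verified recipe):

```python
# Low-memory training settings, sketched as a dict. Key names mirror
# HuggingFace TrainingArguments; values are assumptions, not a tested recipe.
low_memory_args = {
    "per_device_train_batch_size": 1,   # smallest real batch the GPU sees
    "gradient_accumulation_steps": 16,  # simulate an effective batch of 16
    "gradient_checkpointing": True,     # recompute activations instead of storing them
    "fp16": True,                       # halve tensor sizes versus fp32
}

effective_batch = (low_memory_args["per_device_train_batch_size"]
                   * low_memory_args["gradient_accumulation_steps"])
print(f"effective batch size: {effective_batch}")
```

Gradient accumulation trades wall-clock time for memory: the optimizer steps once per 16 forward passes, so training behaves close to a batch of 16 while the GPU only ever holds one sample's activations.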
I tried the first two.
Still crashed.
That's when I understood something that no tutorial had taught me:
The model doesn't care about your ambition. It cares about memory.
What the Crash Taught Me
Running locally stripped away every abstraction I'd been hiding behind.
VRAM matters more than compute.
You can have a powerful GPU and still fail because:
- The model doesn't fit
- Memory fragments over time
- Multiple processes compete silently
Raw FLOPS mean nothing if there's nowhere to put the tensors.
There's no autoscaling to save you.
In Colab, when something crashed, I blamed the platform. Restarted. Tried again.
Locally, the crash was mine. The constraints were mine. There was nowhere to hide.
Constraints are teachers.
Once I stopped fighting the memory limits and started designing around them, everything changed.
Smaller batch sizes. Gradient checkpointing. Quantization. Offloading.
Each workaround taught me more than the original tutorial.
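Quantization is the workaround with the biggest payoff, and the same parameter arithmetic shows why (the precision sizes are standard; the ~7.2B count remains an assumption, and 4-bit formats carry small scaling overheads ignored here):

```python
# Weight footprint of a ~7B-parameter model at common precisions.
# 4-bit formats like NF4 add small scaling-factor overheads, ignored here.
PARAMS = 7.2e9
GIB = 2**30

for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    print(f"{name:>5}: {PARAMS * bytes_per_param / GIB:5.1f} GiB")
```

At 4 bits the weights finally leave room for activations and adapter state on an 8 GiB card, which is the niche QLoRA-style setups occupy.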
I'd Learned This Before
The funny thing is - this wasn't new.
Years earlier, at WalletHub, I'd learned the same lesson in a different domain.
We had databases, Elasticsearch, SSR pages, aggressive SEO targets, and constant security pressure.
I thought:
- The database stores data
- Elasticsearch makes search faster
- SSR makes pages visible
- Security is something you patch
Reality was harsher.
Indexing wasn't storage - it was a system. Search was adversarial. SEO was a feedback loop. Security fixes regressed unless the architecture was defensive by design.
Our sites were attacked constantly. Bugcrowd reports. HackerOne submissions. Open-source CVEs every month.
I didn't fail because I wrote bad code.
I failed because I misunderstood how hostile real systems are.
Local AI sharpened a lesson I'd already paid for:
Abstractions hide complexity. But complexity doesn't disappear. It waits.
Before vs After
Before running AI locally:
- Assumed resources were elastic
- Blamed platforms for failures
- Treated constraints as bugs
After:
- Resources are finite
- Failures are information
- Constraints are curriculum
These lessons apply far beyond AI.
They shape how I think about checkout flows, API rate limits, frontend performance, infrastructure contracts - anything that runs under pressure.
Why I Keep Doing It
I don't run AI locally to replace the cloud.
I do it to stay honest.
To remember that:
- Abstractions have costs
- Performance is contextual
- Systems fail physically
- And understanding the machine still matters
That perspective has made me a better engineer - everywhere else.
If You're Curious
You don't need a data center.
You don't need the latest GPU.
You just need enough hardware to feel:
- Constraints
- Friction
- Trade-offs
Start small.
Run Stable Diffusion locally. Or try a LoRA fine-tune on a 7B model. Push it until it crashes.
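If the LoRA route tempts you, the parameter math explains why it works on modest hardware. For a single weight matrix (the 4096x4096 shape is an illustrative assumption; rank 8 is a common default):

```python
# Trainable parameters for one weight matrix: full fine-tune vs LoRA.
# Shapes are illustrative; rank 8 is a common starting point.
d_in, d_out, rank = 4096, 4096, 8

full_params = d_in * d_out            # full fine-tune: update W directly
lora_params = rank * (d_in + d_out)   # LoRA: train A (d_in x r) and B (r x d_out)

print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"reduction: {full_params // lora_params}x")
```

Only the small A and B matrices get gradients and optimizer state, which is why the fine-tune can fit even when the base weights barely do.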
That crash is the curriculum.
That's where real learning starts.