Super short blog post today, just something I was talking about with a buddy a couple weeks ago that I wanted to mention here.
The ability to run large language models on local hardware, like the distilled DeepSeek models, using tools like Ollama is really cool. The tech is fun to play around with, and solutions like LM Studio make it easier than ever. I’ve cut a video on that. But here’s the thing – the bottleneck is the graphics card. You need a GPU with enough VRAM to load the model if you want reasonable performance.
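For a sense of what "running locally" looks like in practice, both Ollama and LM Studio expose an OpenAI-compatible local server, so you can point the standard OpenAI Python client at it. This is just a minimal sketch, not my exact setup – the port and the model tag (`deepseek-r1:8b`) are assumptions and depend on what you've actually pulled onto your machine.

```python
from openai import OpenAI

# Point the OpenAI client at a local Ollama server instead of api.openai.com.
# Ollama's OpenAI-compatible endpoint normally listens on localhost:11434/v1;
# the api_key is required by the client but ignored by the local server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-r1:8b",  # assumes you've done `ollama pull deepseek-r1:8b`
    messages=[{"role": "user", "content": "Classify the sentiment of: 'I love this.'"}],
)
print(response.choices[0].message.content)
```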
Sure, there are small models you can run inside a virtual machine, and for basic stuff like sentiment analysis a small model might be fine. But if you want quality you’ll actually be happy with for most tasks, be very careful about assuming that spending money on a graphics card is going to save you money in the long run.
I just did a bunch of processing using the OpenAI API with the gpt-4o model. I don’t have API access to the o3-mini model yet (which is annoying since I’ve upgraded to the $200-a-month plan), but I checked my usage and I’ve racked up something like a 6 cent bill. That’s 6 US pennies.
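For context, a batch job like that is really just a loop of chat-completion calls. Here's a rough sketch of the kind of thing I mean – the input list and the prompt are made-up examples, not the actual job:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Example inputs only – in practice this would be whatever you're processing.
texts = ["Great service, would buy again.", "The update broke everything."]

for text in texts:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Reply with one word: positive, negative, or neutral."},
            {"role": "user", "content": text},
        ],
    )
    print(text, "->", response.choices[0].message.content)
```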
When you crunch the numbers on buying a graphics card – we’re talking $1,000+ for a good NVIDIA card – versus using the API at pennies per job (sometimes literally a penny), the hardware really doesn’t make a lot of sense. Plus, you’re probably getting better quality and faster results from the API.
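To make that concrete, here's the back-of-the-envelope math. Both numbers are rough assumptions, not quotes:

```python
# Back-of-the-envelope break-even: how many API batches equal one GPU purchase?
gpu_cost = 1000.00         # a decent NVIDIA card with enough VRAM, in USD (assumed)
api_cost_per_batch = 0.06  # roughly what my gpt-4o batch cost, in USD

break_even_batches = gpu_cost / api_cost_per_batch
print(f"Break-even after ~{break_even_batches:,.0f} batches like that one")
# -> Break-even after ~16,667 batches like that one
```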
For me, the decision to use a local model versus the API comes down purely to OPSEC. If there’s sensitive information or anything I don’t want to hand over to OpenAI or Claude, then using a local model is absolutely fine. Otherwise, the API is honestly probably the far more cost-effective solution long term.